Monday, December 20, 2010

PostgreSQL Performance vs. Microsoft SQL Server

A recent poster to the pgsql-performance mailing list enquired as to the relative performance of Microsoft SQL Server vs. PostgreSQL.  It's a reasonable question.  Switching databases can be a major project, and you certainly wouldn't want to do it and then find out at the end that you'd taken a huge performance hit and had to throw all your work away and switch back.  The good news is that this scenario is fairly unlikely.

Thursday, December 16, 2010

Two Hundred Commits

My first patch (add a separate TRUNCATE privilege) was committed to the PostgreSQL source code repository (at the time, CVS) by Tom Lane on September 8, 2008.  I became a committer a little over a year later, and the first commit I did myself was on December 10, 2009 (a patch from Marcin Mank to fix levenshtein_with_costs).  Of course, I screwed it up: the release team was in the middle of wrapping a minor release, and I back-patched the fix in the brief window after the release notes were written and before the release went out.

Wednesday, December 15, 2010

SE-Linux for PostgreSQL: Part 3

Back in September, I wrote two blog posts on SE-Linux integration for PostgreSQL.  In part 1, I discussed the work that had already been committed as of that time.  In part 2, I discussed what I learned at the meeting, and planned next steps.  Since then, a considerable amount of progress has been made, so it seems like a good time to revisit the issue.

Monday, December 13, 2010

Crunch Time for PostgreSQL 9.1

According to the PostgreSQL 9.1 development plan, the final CommitFest for PostgreSQL 9.1 development will begin in 33 days.  Approximately 30 days later, we'll stamp our final alpha release and begin preparations for PostgreSQL 9.1 beta.  This is exciting news, because I'm really looking forward to PostgreSQL 9.1.  It's also scary news, because there is a lot of work left to be done between now and then, and at least in the United States, Christmas is going to take a bite out of that time.

We have a number of very interesting, very significant features which were submitted to the 2010-11 CommitFest.  These include SQL/MED, extensions, synchronous replication, writeable CTEs, per-column collation, MERGE, checkpoint improvements, further infrastructure for SE-Linux integration, and my own work on unlogged tables.  While we've made significant progress on most of these features during this CommitFest, major work still remains to be done on nearly all of them, and none of them can be considered a sure thing for PostgreSQL 9.1.  It's possible - maybe even likely - that even more worthwhile features will be added to the queue between now and mid-January.

So it's crunch time.  We have about two months to define what PostgreSQL 9.1 will be.  Let's make the most of it.

Monday, December 06, 2010

The Cost of Page Faults

Over Thanksgiving, I wrote a bit about some work I did profiling PostgreSQL, and about squeezing a bit of overhead out of the backend shutdown process.  After making that change, I did some further profiling of connection startup/teardown, and was dismayed to see that the revised profile looked pretty mundane, with most of the time being taken up by functions like memset() and memcpy() that are typically hard to optimize.

As it turns out, this profile wasn't really showing what I thought it was showing.  Andres Freund and Tom Lane theorized that the reason why memset() and memcpy() showed up so high in the profile was not because those operations were intrinsically expensive, but because those functions were triggering page faults.  Page faults occur when a process attempts to access a portion of its address space it hasn't previously touched, and the kernel must arrange to map that chunk of address space to an actual chunk of physical memory.  As it turns out, it appears that Andres and Tom were right: processing a page fault is 2 or 3 times more expensive than zeroing a page of memory.

I found this a bit surprising, because I'm in the habit of thinking of process startup on UNIX-like systems as being very cheap, but it appears that in this case there's so little actual work going on that the page faults actually become the dominant cost.  This means that if we want to make a significant further reduction in our connection overhead, we're probably going to have to avoid starting a new process for each new connection.  I posted a few ideas on this topic, to which Tom Lane responded.  In short, there may be some benefit in making PostgreSQL follow a model more like Apache, where workers are spawned before they are actually needed, rather than on demand.  I don't presently have time to follow up on this, but I think it's got potential.

Friday, December 03, 2010

Troubleshooting Database Unresponsiveness

From time to time, we get complaints on the pgsql-performance mailing list about a PostgreSQL database that - for the most part - performs reasonably well, but every once in a while it becomes unresponsive for some number of seconds or even minutes; and then eventually recovers.  What's going on?

Monday, November 29, 2010

MySQL vs. PostgreSQL, Part 1: Table Organization

I'm going to be starting an occasional series of blog postings comparing MySQL's architecture to PostgreSQL's architecture.  Regular readers of this blog will already be aware that I know PostgreSQL far better than MySQL, having last used MySQL a very long time ago when both products were far less mature than they are today.  So, my discussion of how PostgreSQL works will be based on first-hand knowledge, but discussion of how MySQL works will be based on research and - insofar as I can make it happen - discussion with people who know it better than I do.  (Note: If you're a person who knows MySQL better than I do and would like to help me avoid making stupid mistakes, drop me an email.)

Thursday, November 25, 2010

Profiling PostgreSQL

I did a little bit of work Tuesday night and Wednesday profiling PostgreSQL.  I ran two different tests.  The first test was designed just to measure the overhead of repeatedly connecting to the database without doing anything, while the second test looked at running pgbench with 36 concurrent threads.  The best thing that can happen to you when you fire up the profiler is to have something pop up in the profile that you never would have expected.  At least in my experience, when you see what you expect to see, that typically means it's something you've already thought about optimizing, and is therefore probably reasonably efficient already.  When you see something totally unexpected, it's probably something you've never thought about optimizing, and of course the first optimization is always the easiest.

Anyhow, that's what happened to me with the "repeated connections" test.  It turns out that a big chunk of the CPU time was actually being spent during backend exit, rather than (as I had anticipated) backend startup.  We had a check in there to forcibly release any buffer pins that hadn't been cleaned up properly during normal execution.  Originally, this was probably a necessary and valuable check, but we've subsequently added much more thorough and robust cleanup mechanisms which should mean that this code never finds anything to release.  If it does find something, gracefully cleaning up the pin is the wrong idea: we want the code to yell and scream, so that we find and fix the underlying bug.

So, after some discussion with Tom Lane, I ripped this code out and replaced it with some code that will run only in assert-enabled builds (which are typically used only for development and debugging) that will check for leftover buffer pins and fail an assertion if any are found, which will hopefully make it easier to find any current or future bugs in this area.  In non-assert-enabled builds, we no longer do anything at all here (the best kind of optimization!).

Unfortunately, this was the only really surprising thing that popped up in the profiling results.  Further improvements are going to take a bit more work.

Happy Thanksgiving!

Monday, November 22, 2010

Index-Only Scans

There seems to be a lot of interest in the as-yet-unimplemented performance feature called index-only scans, so I thought it would be useful to explain a little bit more about what this feature is, how it will help PostgreSQL, and where the bodies are buried.

First, the name.  What do we mean by an index-only scan?  In PostgreSQL today, an index scan always accesses both the index itself and the underlying table.  You might think this unnecessary.  For example, if you have the query SELECT name FROM table WHERE id = 10, and there is an index on (id, name), you might assume that we could use the index to check for tuples with id = 10, and then, if one is found, return the name directly from the index tuple, without consulting the underlying table.  Unfortunately, this does not work, because that tuple might not actually be one that the SELECT statement can see.  If the tuple was inserted by a transaction which began after the SELECT statement took its MVCC snapshot, or deleted by a transaction which committed before the SELECT statement took its MVCC snapshot, then the SELECT statement must not return it.  If it did, we would quickly get very surprising wrong answers out of the database.  So PostgreSQL first looks at the index tuple, and then the heap (table) tuple, decides what the right thing to do is, and does it.  By an index ONLY scan, we mean one which will look at just the index, and not at the corresponding table; the trick is to figure out how to make that happen without returning wrong answers.

Thursday, November 18, 2010

Best Patches of 9.1CF3

Back in July, I wrote a blog post on the best patches submitted for the first CommitFest for PostgreSQL 9.1 development (so far, the first two out of the three have been committed).  I didn't end up writing a similar post for the second CommitFest, because there wasn't a lot of stuff that really grabbed my attention, but the third CommitFest is here now, and there are a ton of exciting patches.

Wednesday, November 17, 2010

When Your Data Isn't Made of Gold

Josh Berkus' recent blog posting on What We Should Be Learning from MySQL, part 2 includes the following quote: "We tend to treat all data in Postgres as if it were made of gold, and not all data is equally valuable."  He goes on to wonder what we can do to better provide for the case where your data isn't made of gold.

Wednesday, November 10, 2010

Rob Wultsch's MySQL Talk at PostgreSQL West

I thought this talk deserved a blog post of its own, so here it is.  I have to admit that I approach this topic with some trepidation.  The MySQL vs. PostgreSQL debate is one of those things that people get touchy about.  Still, I'm pleased that not only Rob, but a number of other MySQL community members who I did not get a chance to meet, came to the conference, and it sounds like it will be our community's turn to visit their conference in April of next year.  Rob was kind enough to offer to introduce me to some of the MySQL community members who were there, and I, well, I didn't take him up on it.  That's something I'd like to rectify down the road, but unfortunately this was a very compressed trip for me, and the number of people I had time to talk to and meet with was much less than what I would have liked.

Monday, November 08, 2010

PostgreSQL West Talks

As I blogged about before the conference, I gave two talks this year at PostgreSQL West.  The first was a talk on the query optimizer, which I've given before, and the second talk was on using the system catalogs, which was new.  While the second one was well-attended, the first one was packed.  I keep hoping I'll think of something to talk about that people find even more interesting than the query planner, but so far no luck.  Slides for both presentations are now posted; I've added two slides to the system catalogs presentation that weren't there when I gave the talk, but probably should have been.

Nearly all the talks I attended were good.  Some of the best were Greg Smith's talk on Righting Your Writes (slides), Gabrielle Roth's talk on PostgreSQL monitoring tools, and Joe Conway's talk on Building an Open Geospatial Technology Stack (which was actually given in part by Jeff Hamann, who has a company, and a book).  All three of these, and a number of the others, were rich with the sort of anecdotal information that it's hard to get out of the documentation: How exactly do you set this up? How well does it actually work?  What are its best and worst points?

Another memorable talk was Rob Wultsch's talk entitled "MySQL: The Elephant in the Room".  But that talk really deserves a blog post all of its own.  Stay tuned.

Thursday, October 28, 2010

Here Comes PostgreSQL West

In just a few days, I'll be off to PostgreSQL West.  I've attended PostgreSQL East and PGCon both of the last two years, but this will be my first trip out to PG West.  As with past conferences, this will be a good opportunity for me to catch up with people I normally speak with only via the Internet.  But, there's something that's a little different about this one.  Take a look at the agenda.

Monday, October 25, 2010

WAL Reliability

I recently learned, somewhat to my chagrin, that operating systems are pathological liars, and in particular that they habitually lie about whether data has actually been written to disk.  If you use any database product, you should care about this, because it can result in unfixable, and in some cases undetected, corruption of your database.  First, a question.  On which of the following operating systems do fsync() and related calls behave properly out of the box?

A. Linux
B. Windows
C. MacOS

Thursday, October 14, 2010

Choosing a Datastore

In thinking about which database might be best for any particular job, it's easy to get lost in the PR. Advocates of traditional relational database systems like Oracle and PostgreSQL tend to focus on the fact that systems are feature-rich and provide features such as atomicity, consistency, isolation, and durability (ACID), while advocates of document databases (like MongoDB) and key-value stores (memcached, Dynamo, Riak, and many others) tend to focus on performance, horizontal scalability, and ease of configuration.  This is obviously an apples-and-oranges comparison, and a good deal of misunderstanding and finger-pointing can result.  Of course, the real situation is a bit more complicated: everyone really wants to have all of these features, and any trade-off between them is bound to be difficult.

Wednesday, October 06, 2010

Down To Six

From early July until the beginning of this week, the PostgreSQL project has been maintaining eight active branches: 7.4, 8.0, 8.1, 8.2, 8.3, 8.4, 9.0, and the master branch (9.1devel).   As a result, a significant number of bug fixes and security updates had to be back-patched into all of those releases.  At least for me, the recent switch to git has made back-patching, at least for simple cases, a whole lot simpler.  But it's still a fair amount of work - some parts of the code have changed a good deal since 2003, when 7.4 was released.

Monday, October 04, 2010


Bruce Momjian and I spent Thursday and Friday of last week in Baltimore, attending Surge.  It was a great conference.  I think the best speakers were Bryan Cantrill of Joyent (@bcantrill), John Allspaw of Etsy (@allspaw), and Artur Bergman of Wikia (@crucially), but there were many other good talks as well.  The theme of the conference was scalability, and a number of speakers discussed how they'd tackled scalability challenges.  Most seem to have started out with an infrastructure based on MySQL or PostgreSQL and added other technologies around the core database to improve scalability, especially Lucene and memcached.  But there were some interesting exceptions, such as a talk by Mike Malone wherein he described building a system to manage spatial data (along the lines of PostGIS) on top of Apache Cassandra.

Some general themes I took away from the conference:

Tuesday, September 28, 2010

Stupid Git Tricks for PostgreSQL

Even before PostgreSQL switched to git, we had a git mirror of our old CVS repository.  So I suppose I could have hacked up these scripts any time.  But I didn't get around to it until we really did the switch.  Here's the first one.  It's a one-liner.  For some definition of "one line".

git log --format='%H' --shortstat `git merge-base REL9_0_STABLE master`..master | perl -ne 'chomp; if (/^[0-9a-f]/) { print $_, " "; } elsif (/files changed/) { s/^\s+//; my @a = split /\s+/; print $a[3] + $a[5], "\n" }' | sort -k2 -n -r | head | cut -d' ' -f1 | while read commit; do git log --shortstat -n 1 $commit | cat; echo ""; done

This will show you the ten "biggest" commits since the REL9_0_STABLE branch was created, according to number of lines of code touched.  Of course, this isn't a great proxy for significance, as the output shows.  Heavily abbreviated, largest first:

66424a284879b Fix indentation of verbatim block elements (Peter Eisentraut)
9f2e211386931 Remove cvs keywords from all files (Magnus Hagander)
4d355a8336e0f Add a SECURITY LABEL command (Robert Haas)
c10575ff005c3 Rewrite comment code for better modularity, and add necessary locking (Robert Haas)
53e757689ce94 Make NestLoop plan nodes pass outer-relation variables into their inner relation using the general PARAM_EXEC executor parameter mechanism, rather than the ad-hoc kluge of passing the outer tuple down through ExecReScan (Tom Lane)
5194b9d04988a Spell and markup checking (Peter Eisentraut)
005e427a22e3b Make an editorial pass over the 9.0 release notes. (Tom Lane)
3186560f46b50 Replace doc references to install-win32 with install-windows (Robert Haas)
debcec7dc31a9 Include the backend ID in the relpath of temporary relations (Robert Haas)
2746e5f21d4dc Introduce latches. A latch is a boolean variable, with the capability to wait until it is set (Heikki Linnakangas)

Of course, some of these are not-very-interesting commits that happen to touch a lot of lines of code, but a number of them represented significant refactoring work that can be expected to lead to good things down the line.  In particular, latches are intended to reduce replication latency and eventually facilitate synchronous replication; and Tom's PARAM_EXEC refactoring is one step towards support for the SQL construct LATERAL().
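The ranking step buried in the one-liner above can also be pulled out into a small function, which makes it easier to see what is being counted.  What follows is only a sketch, not the script used to produce the list: rank_commits is a made-up name, and it assumes the usual English --shortstat summary line ("N files changed, X insertions(+), Y deletions(-)").  Unlike the perl version, it also counts commits that touch a single file or that only delete lines.

```shell
#!/bin/sh
# rank_commits (hypothetical helper): read the output of
#   git log --format='%H' --shortstat <range>
# on stdin and print "<lines-touched> <hash>" per commit, largest first.
rank_commits() {
    awk '
        # A line consisting only of hex digits is a commit hash.
        /^[0-9a-f]+$/ { hash = $0; next }

        # The --shortstat summary line: add up every number that is
        # immediately followed by "insertion(s)" or "deletion(s)".
        / files? changed/ {
            total = 0
            for (i = 1; i < NF; i++)
                if ($(i + 1) ~ /^(insertion|deletion)/) total += $i
            print total, hash
        }
    ' | sort -k1,1nr
}
```

Something like git log --format='%H' --shortstat "$BP"..master | rank_commits | head would then give the ten largest commits, where BP is the branch point found with git merge-base.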

OK, one more.


BP=`git merge-base master REL9_0_STABLE`

git log --format='format:%an' $BP..master | sort -u |
while read author; do
    # printf prints the name without a trailing newline in any POSIX shell;
    # echo "...\c" only works in some of them.
    printf '%s: ' "$author"
    git log --author="$author" --numstat $BP..master |
    awk '/^[0-9]/ { P += $1; M += $2 }
         /^commit/ { C++ }
         END { print C " commits, " P " additions, " M " deletions, " (P+M) " total"}'
done

This one shows you the total number of lines of code committed to 9.1devel, summed up by committer.  It has the same problem as the previous script, which is that sometimes you change a lot of lines of code without actually doing anything terribly important.  It has a further problem, too: it only takes into account the committer, rather than other important roles, including reporter, authors, and reviewers.  Unfortunately, that information can't easily be extracted from the commit logs in a structured way.  I would like to see us address that defect in the future, but we're going to need something more clever than git's Author field.  Most non-trivial patches, in the form in which they are eventually committed, are the work of more than one person; and, at least IMO, crediting only the main author (if there even is one) would be misleading and unfair in many cases.
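The same per-author totals can be computed in a single pass over the log, rather than by running git log once per author.  The sketch below is one possible way to do that, under some assumptions: tally_by_author is a made-up name, and the "author:" prefix is just a marker chosen here so that awk can tell author lines apart from --numstat lines.

```shell
#!/bin/sh
# tally_by_author (hypothetical helper): read the output of
#   git log --format='author:%an' --numstat <range>
# on stdin and print one summary line per author, sorted by name.
tally_by_author() {
    awk -F'\t' '
        # Marker line written by --format: starts a new commit.
        /^author:/ { who = substr($0, 8); commits[who]++; next }

        # --numstat line: "<added>TAB<removed>TAB<path>".  Binary files
        # show "-" instead of numbers and are skipped by the digit test.
        NF == 3 && $1 ~ /^[0-9]+$/ { added[who] += $1; removed[who] += $2 }

        END {
            for (a in commits)
                printf "%s: %d commits, %d additions, %d deletions, %d total\n",
                       a, commits[a], added[a], removed[a], added[a] + removed[a]
        }
    ' | sort
}
```

It would be run as git log --format='author:%an' --numstat "$BP"..master | tally_by_author.  Besides being faster on a long history, grouping on the exact author string avoids a quirk of --author, which does pattern matching and so can lump together two people when one name is contained in the other.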

I think the most interesting tidbit I learned from playing around with this stuff is that git merge-base can be used to find the branch point for a release.  That's definitely handy.

Friday, September 24, 2010

Enjoying Git

OK, I admit it.  This is awesome.  I'm still getting used to committing to PostgreSQL with git rather than CVS, but it's sort of like the feeling of being let out of the dungeon.  Wow, sunlight, what am I supposed to do about that?

Actually, I've never really been into CVS bashing; it's an OK system for what it does.  And compared to RCS, which I actually used once or twice a long time ago, it's positively phenomenal.  But git, despite its imperfections, is just a lot better.

There are two major things that caused problems for me when committing to CVS.  First, it was painfully slow.  Second, since I was doing all of my development work on git, that meant extracting the patch, applying it to CVS, making sure to CVS add/rm any new/deleted files, retyping (or copying) the commit message, and double-checking that I hadn't messed anything up while moving the patch around.

$ git commit
$ git show
$ git push

Nice!  I feel like someone gave me an easy button.

Saturday, September 18, 2010

Git Conversion, Take Two

The PostgreSQL project will be making its second attempt to migrate from CVS to git this coming Monday.  In a previous blog post, I talked about some of the difficulties we've had getting a clean conversion of our project history to git.  I was surprised that a number of people suggested throwing out our development history and just moving the head of each branch to git; and I agree with some of the later comments that this would be a bad idea.  I refer back to our development history fairly frequently, for a variety of reasons: to determine when particular features were introduced, to determine what patch last touched a particular area of the code, to see how old a particular bit of code is, and sometimes even to model a new patch on a previous patch that added a similar feature.  So I'd find it very painful to lose convenient access to all of that history.  Even a somewhat messed-up conversion would be better than no conversion at all.

Fortunately, it looks like we're going to end up with a pretty darn good conversion.  Tom Lane spent most of last weekend cleaning up most of the remaining infelicities.  The newest conversions are a huge improvement over both our current, incrementally-updated conversion (which is what I use for day to day development) as well as earlier attempts at a final conversion.  Only a handful of minor artifacts remain, mostly because of wacky things that were done in CVS many years ago.  Our use of CVS in recent years has been quite disciplined, which is why such a clean conversion is possible.

Tuesday, September 14, 2010

SE-Linux For PostgreSQL: Part 2

In part 1 of this blog post, I talked about my recent visit to BWPUG to discuss SE-Linux and PostgreSQL, and reviewed the work that has been done so far, as well as a few upcoming patches.  In this second part of the post, I'm going to review what I learned at the meeting and where the SE-Linux folks told me they'd like to see us go with this.

One of the most interesting concepts we discussed was the idea of a type transition.  This may be old hat for experienced SE-Linux users, but it was a new idea for me.  I'm not sure that I understand this concept in its full generality, but Joshua Brindle explained two specific applications of it to me.  First, when objects are created, SE-Linux apparently allows the context of that object to depend not only on the context of the creator, but also on where the object was created.  For example, if Apache is running in a context called apache_t and creates a temporary file in /tmp, the context of the new file might be apache_tmp_t.  Similarly, if a PostgreSQL client creates a table, the SE-Linux folks would like to be able to determine the security label for the table based on a combination of the client's context and the containing schema's label.

The second application of type transitions which we discussed in the meeting related to what KaiGai Kohei has been calling a trusted procedure.  The idea here seems to be that when a certain function is executed, SE-Linux should have the option based on the user's context and the function's context to transition to a new security context for just the period of time during which the function is executing.  This doesn't involve a kernel call: it's just internal recordkeeping.  I'm imagining that SE-Linux support for PostgreSQL will be provided by a loadable module, so essentially we'd need a bullet-proof way of allowing the SE-Linux module to gain control briefly at function entry and exit time (and that would be certain to be called even if, say, we exit the function due to an error condition).

We also talked about SE-Linux control of operations other than DML, which is what the ExecCheckRTPerms hook I talked about in part 1 of this posting will support.  In particular, Joshua Brindle and David Quigley were very concerned about proper control over the LOAD statement.  It looks like this can be easily accomplished using the existing ProcessUtility_hook.  They were also concerned about DDL, but again it seems like the existing ProcessUtility_hook would be sufficient to impose coarse-grained restrictions.  Ultimately, that may not be the best way to go forward, as it may not provide easy access to all the bits they care about - in particular, I think we will need one or more special-purpose hooks in ALTER TABLE - but it may be enough to do something crude.

Another very large hole that will need to be plugged is control over large objects.  These will need security labels and appropriate access checks.

Finally, the SE-Linux folks indicated that in the long run they would really like to have row-level access control, but they believe that they can accomplish useful things with an implementation which does not include that capability, as long as they have the "trusted procedure" facility discussed above.

I'm not sure how far we're going to get with this work during the PostgreSQL 9.1 time frame.  KaiGai Kohei has poured a tremendous amount of time into this work over the last several years, but progress has been slow.  I think one of the big reasons for that is that doing this work in a way that is acceptable to the PostgreSQL community can sometimes require significant refactoring of the existing code.  It's not always obvious how to accomplish that, and many of the people who are in the best position to carry it off successfully can't put a lot of time into it unless there is funding attached.  So far, no one has stepped forward in this area; if that changes, I expect to see much more rapid progress.

Sunday, September 12, 2010

So, Why Isn't PostgreSQL Using Git Yet?

Just over a month ago, I wrote a blog posting entitled Git is Coming to PostgreSQL, in which I stated that we planned to move to git sometime in the next several weeks.  But a funny thing happened on the way to the conversion.  After we had frozen the CVS repository and while Magnus Hagander was in the process of performing the migration, using cvs2git, I happened to notice - just by coincidence - that the conversion had big problems.  cvs2git had interpreted some of the cases where we'd back-patched commits from newer branches into older branches as merges, and generated merge commits.  This made the history look really weird: the merge commits pulled in the entire history of the branch behind them, with the result that newer commits appeared in the commit logs of older branches, even though we didn't commit them there and the changes were not present there.
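A problem of this kind is easy to check for once you know to look, because a history that was never merged should contain no merge commits at all.  The snippet below is a rough sketch of such a check and not something that was part of the actual conversion process; count_merges is a made-up name, and it relies only on the --merges and --count options of git rev-list.

```shell
#!/bin/sh
# count_merges (hypothetical helper): print the number of merge commits
# reachable from a ref.  Arguments: <repo-dir> <ref>
count_merges() {
    git -C "$1" rev-list --merges --count "$2"
}
```

For a converted back branch, count_merges . REL8_4_STABLE printing anything other than 0 would be a sign that the converter had manufactured merges.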

Fortunately, Max Bowsher and Michael Haggerty of the cvs2git project were able to jump in and help us out, first by advising us not to panic, and secondly by making it possible to run cvs2git in a way that doesn't generate merge commits.  Once this was done, Magnus reran the conversion.  The results looked a lot better, but there were still a few things we weren't quite happy with.  There were a number of "manufactured commits" in the history, for a variety of reasons.  Some of these were the result of spurious revisions in the CVS history of generated files that were removed from CVS many years ago; Max Bowsher figured out how to fix this for us.  Others represented cases where a file was deleted from the trunk and then later re-added to a back branch.  But because we are running a very old version of CVS (shame on us!), not enough information was recorded in the RCS files that make up the CVS repository to reconstruct the commit history correctly.  Tom Lane, again with help from the cvs2git folks, has figured out how to fix this.  We also end up with a few spurious branches (which are easily deleted), and there are some other manufactured commits that Tom is still investigating.

In spite of the difficulties, I'm feeling optimistic again.  We seem to have gotten past the worst of the issues, and seem to be making progress on the ones that remain.  It seems likely that we may decide to postpone the migration until after the upcoming CommitFest is over (get your patches in by September 14!) so it may be a bit longer before we get this done - but we're making headway.

Friday, September 10, 2010

SE-Linux For PostgreSQL: Part 1

I made the trip down to OmniTI headquarters just south of Baltimore, MD this Wednesday for BWPUG. This month's topic was the ongoing project to integrate SE-Linux with PostgreSQL. Besides myself, Stephen Frost, Greg Smith, and Robert Treat were all there from the PostgreSQL community, along with David Quigley and Joshua Brindle from the SE-Linux community. It was a very productive meeting and I learned a lot about what the SE-Linux community is looking for from PostgreSQL.

We first discussed the current status of the project. Following discussions with Stephen Frost, KaiGai Kohei, and Greg Smith at PGCon 2010, I wrote and committed four patches which have, I think, helped to clear the way for an eventual loadable module implementing basic SE-Linux support for PostgreSQL; and I also committed a fifth patch by KaiGai Kohei. These were, in order of commit:

1. Add a hook in ExecCheckRTPerms(). This is just a very simple hook to allow a loadable module to gain control at the time DML relation permissions are checked. Whenever a SELECT, INSERT, UPDATE, or DELETE statement is invoked, this function gets a list of the relation OIDs and can choose to allow the statement to proceed or throw an error. (It could also potentially do other things, like write to a log file, if that were useful for some reason.)

2. Centralize DML permissions-checking logic. KaiGai Kohei spotted the fact that the previous patch didn't actually work for a couple of important cases. In particular, COPY did not previously go through ExecCheckRTPerms(), and there is some hairy code inside the foreign key stuff that also needed adjustment to work properly with this hook. This patch, by KaiGai Kohei, cleaned all of that up. So as far as I know, we now have a single point for all DML permissions checking, and a hook function at that point. Yeah!

Unfortunately, in order to do label-based security, a simple hook function is not enough. You also need a place to store the labels, and ideally that place should be a PostgreSQL system catalog. I had initially thought that we would add a security label column to the system catalog for each object type, but that would require fairly invasive changes across the whole system and carry some minor performance penalty even for people who did not use it. At PGCon, we came up with the idea of storing all security labels for all objects in a separate catalog. Security labels are, in essence, just strings, which we don't try to interpret but which have some meaning (the details of which we need not understand) to an external security provider such as SE-Linux.

As luck would have it, we already have a model for such a facility: the COMMENT statement already knows how to store arbitrary strings which it does not attempt to interpret for arbitrary database objects, using a catalog (actually two catalogs) dedicated to that purpose. Unfortunately, the comment code is quite large, and, as it turned out, buggy, so it didn't seem like a good idea to copy-and-paste it into a new file and then hack it up from there, as I had initially hoped to do. So that led to three more patches.

3. Standardize get_whatever_oid functions for object types with unqualified names. As it turns out, one of the things that the comment code needed to do over and over again was examine the parse tree representation of an object and convert it to an OID by looking up the name in a system catalog. But there wasn't any standard way to do this, and in some cases the code was quite lengthy and already duplicated in multiple places throughout our source base. This patch cleaned that up, by introducing a standard API and adjusting the existing OID-getter functions, or adding new ones, for tablespaces, databases, roles, schemas, languages, and access methods, to conform to that API.

4. Standardize get_whatever_oid functions for other object types. More of the same, this time for text search parsers, dictionaries, templates, and configs; as well as for conversions, constraints, operator classes, operator families, rules, triggers, and casts.

5. Rewrite comment code for better modularity, and add necessary locking. This patch took the refactoring in the previous two patches one step further. The functions in the previous patches provide a way to translate a named object of a particular type to an OID. This patch creates a more general API that can be passed an object type and a parse tree and return an ObjectAddress, which is an internal representation that can point to a database object of any type. The ObjectAddress representation is used for management of dependencies between database objects (e.g. you can't drop a table if there's a view using it, unless you also drop the view) as well as by the comment code, and it will be useful for security label support as well.

This new facility also fixes a longstanding locking bug in the COMMENT code, which still exists (and likely won't be fixed) in 9.0 and all prior releases. An object that is dropped concurrently with a COMMENT operation on that same object could lead to an orphaned comment in the pg_description or pg_shdescription catalog. If another object of the same type is subsequently assigned the same OID, it will inherit the orphaned comment. This is fairly unlikely and, for comments, fairly innocuous, but it would obviously create a potential security hole for security labels.
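The race described above can be modeled with a toy catalog. This is a sketch under simplifying assumptions, not PostgreSQL's implementation: the Catalog class and its method names are hypothetical, and an existence recheck stands in for taking a lock on the object.

```python
class Catalog:
    """Toy catalog: objects keyed by OID, comments kept in a separate map."""

    def __init__(self):
        self.objects = {}    # oid -> object name
        self.comments = {}   # oid -> comment text (stand-in for pg_description)

    def create(self, oid, name):
        self.objects[oid] = name

    def drop(self, oid):
        # Dropping removes the object and any comment stored for it.
        self.objects.pop(oid, None)
        self.comments.pop(oid, None)

    def comment_unlocked(self, oid, text):
        # Buggy variant: the OID was resolved earlier and is not rechecked,
        # so the comment is stored even if the object has since been dropped.
        self.comments[oid] = text

    def comment_locked(self, oid, text):
        # Fixed variant: verify the object still exists before storing.
        if oid not in self.objects:
            raise LookupError(f"object {oid} does not exist")
        self.comments[oid] = text


catalog = Catalog()
catalog.create(1001, "old_table")
# Session A resolves "old_table" to OID 1001; session B then drops the table.
catalog.drop(1001)
catalog.comment_unlocked(1001, "label meant for old_table")  # orphaned entry
catalog.create(1001, "new_table")                            # OID is reused
inherited = catalog.comments.get(1001)
```

Here inherited holds the text intended for the dropped object. With comments that is merely confusing; with security labels it would mean the new object silently picks up someone else's label.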

With these preliminary patches in place, I think we're now well-positioned to introduce the major piece of functionality which we will need to support SE-Linux integration: an in-core security label facility for use by SE-Linux and perhaps other label-based security systems. Stephen Frost, KaiGai Kohei, and I have had extensive discussions about the design of this facility and there are currently two pending patches by KaiGai Kohei which are intended to implement that design: one adds the basic security facility and commands for manually applying labels, and the other adds hooks at table creation time to allow enhanced security providers to automatically set a label on newly created tables. I have not yet reviewed these patches in detail, but I hope to see them committed - likely with some modifications - within the next month.

In the second part of this blog post, I'll go over what I learned from David and Joshua (who were extremely helpful in explaining SE-Linux to me), the additional facilities which they felt would be necessary for a minimally useful SE-Linux integration, and what they'd like to see over the longer term.

Tuesday, August 24, 2010

Version Numbering

Over the last few days, there's been a debate raging on pgsql-hackers on the subject of version numbering. There are many thoughtful (and some less-thoughtful) opinions on the thread that you may wish to read, but I thought the most interesting was a link posted by Thom Brown to a blog post called The Golden Rules of Version Naming. If you haven't seen it, it's definitely worth a read.

Monday, August 16, 2010

Why We're Conservative With PostgreSQL Minor Releases

Last week, a PostgreSQL user filed bug #5611, complaining about a performance regression in PostgreSQL 8.4 as compared with PostgreSQL 8.2. The regression occurred because PostgreSQL 8.4 is capable of inlining SQL functions, while PostgreSQL 8.2 is not. The bug report was surprising to me, because in my experience, inlining SQL functions has always improved performance, often dramatically. But this user was unlucky enough to hit a case where the opposite is true: inlining caused a function which had previously been evaluated just once to be evaluated multiple times. Fortunately, there is an easy workaround: writing the function using the "plpgsql" language rather than the "sql" language defeats inlining.
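One general way inlining can multiply work is when an argument expression gets substituted for each reference to a parameter. The Python sketch below illustrates that mechanism only; it is not a reconstruction of the exact query in bug #5611, and all function names are hypothetical.

```python
calls = {"count": 0}


def expensive():
    """Stand-in for a costly function call; counts its own invocations."""
    calls["count"] += 1
    return 21


def double(value):
    # The parameter is referenced twice, but the caller evaluated it once.
    return value + value


def run_not_inlined():
    calls["count"] = 0
    result = double(expensive())
    return result, calls["count"]


def run_inlined():
    # After inlining, the argument expression replaces each parameter
    # reference, so it is evaluated once per reference.
    calls["count"] = 0
    result = expensive() + expensive()
    return result, calls["count"]
```

Both variants return the same result, but the inlined form calls expensive() twice instead of once.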

Although the bug itself is interesting (let's face it, I'm a geek), what I found even more interesting was that I totally failed to appreciate the possibility that inlining an SQL function could ever fail to be a performance win. Prior to last week, if someone had asked me whether that was possible, I would have said that I didn't think so, but you never know...

And that is why the PostgreSQL project maintains stable branches for each of our major releases for about five years. Stable branches don't get new features; they don't get performance enhancements; they don't even get tweaks for things we wish we'd done differently or corrections to behavior of doubtful utility. What they do get is fixes for bugs (like: without this fix, your data might get corrupted; or, without this fix, the database might crash), security issues, and a smattering of documentation and translation updates. When we release a new major release (or actually about six months prior to when we actually release), development on that major release is over. Any further changes go into the next release.

On the other hand, we don't abandon our releases once they're out the door, either. We are just now in the process of ceasing to support PostgreSQL 7.4, which was released in November 2003. For nearly seven years, any serious bugs or security vulnerabilities which we have discovered either in that version or any newer version have been addressed by releasing a new version of PostgreSQL 7.4; the current release is 7.4.29. Absent a change in project policy, 7.4.30 will be the last 7.4.x release.

If you're running PostgreSQL 8.3 or older, and particularly if you're running PostgreSQL 8.2 or older, you should consider an upgrade, especially once PostgreSQL 9.0 comes out. Each release of PostgreSQL includes many exciting new features: new SQL constructions, sometimes new data types or built-in functions, and performance and manageability enhancements. Of course, before you upgrade to PostgreSQL 8.4 (or 9.0), you should carefully test your application to make sure that everything still works as you expect. For the most part, things tend to go pretty smoothly, but as bug #5611 demonstrates, not always.

Of course, this upgrade path is not for everyone. Application retesting can be difficult and time-consuming, especially for large installations. There is nothing wrong with staying on the major release of PostgreSQL that you are currently using. But it is very wise to upgrade regularly to the latest minor version available for that release. The upgrade process is generally as simple as installing the new binaries and restarting the server (but see the release notes for your version for details), and the PostgreSQL community is firmly committed to making sure that each of these releases represents an improvement to performance and stability rather than a step backwards.

Friday, August 13, 2010

How I Hack on PostgreSQL

Today's post by Dimitri Fontaine gave me the idea of writing a blog posting about the tools I use for PostgreSQL development. I'm not saying that what I do is the best way of doing it (and it's certainly not the only way of doing it), but it's one way of doing it, and I've had good luck with it.

What commands do I use? The following list shows the ten commands that occur most frequently in my shell history.

[rhaas pgsql]$ history  | awk '{print $2}' | sort | uniq -c | sort -rn | head
250 git
57 vi
31 %%
25 cd
24 less
20 up
18 make
13 pg_ctl
10 ls
8 psql

Wow, that's a lot of git. I didn't realize that approximately half of all the commands I type are git commands. Let's see some more details.

[rhaas pgsql]$ history  | awk '$2 == "git" { print $3}' | sort | uniq -c | sort -rn | head
93 diff
91 grep
15 log
10 commit
8 checkout
7 add
6 reset
5 clean
4 pull
3 branch

As you can see, I use git diff and git grep far more often than any other commands. The most common things I do with git diff are just plain git diff, which displays the unstaged changes in my working tree (so I can see what I've changed, or what a patch I've just applied has changed) and git diff master (which shows all the differences between my working tree and the master branch; this is because I frequently use git branches to hack on a patch I'm working on). A great deal of the work of writing a good patch - or reviewing one - consists in looking at the code over and over again and thinking about whether every change can be justified and proven correct.

git grep does a recursive grep starting at the current directory, but only examines files checked into git (not build products, for example). I use this as a way to find where a certain function is defined (by grepping for the name of the function at the start of a line) and as a way to find all occurrences of an identifier in the code (which is an absolutely essential step in verifying the correctness of your own patch, or someone else's).

As you can also see, my preferred editor is vi (really vim). This might not be the best choice for everyone, but I've been using it for close to 20 years, so it's probably too late to learn something else now. I think Dimitri Fontaine said it well in the post linked above: the best editor you can find is the one you master. Having said that, if you do even a small amount of programming, you're likely to spend a lot of time in whatever editor you pick, so it's probably worth the time it takes to learn a reasonably powerful one.

Saturday, August 07, 2010

Git is Coming to PostgreSQL

As discussed at the PGCon 2010 Developer Meeting, PostgreSQL is scheduled to adopt git as its version control system some time in the next few weeks. Andrew Dunstan, who maintains the PostgreSQL build farm, has adapted the build farm code to work with either CVS or git; meanwhile, Magnus Hagander has done a trial conversion so that we can all see what the new repository will look like. My small contribution was to write some documentation for the PostgreSQL committers, which has subsequently been further edited by Heikki Linnakangas (the link here is to his personal web page, whose one complete sentence is one of the funnier things I've read on the Internet).

I don't think the move to git is going to be a radical change; indeed, we're taking some pains to make sure that it isn't. But it will make my life easier in several small ways. First, the existing git clone of the PostgreSQL CVS repository is flaky and unreliable. The back-branches have had severe problems in this area for some time (some don't build), and the master branch (aka CVS HEAD) has periodic issues as well. At present, for example, the regression tests for contrib/dblink fail on a build from git, but pass on a build from CVS. While we might be able to fix (or minimize) these issues by fixing bugs in the conversion code, switching to git should eliminate them. Also, since I do my day-to-day PostgreSQL work using git, it will be nice to be able to commit that way also - it should be both faster (CVS is very slow by comparison) and less error-prone (no cutting and pasting the commit message, no forgetting to add a file in CVS that you already added in git).

Thursday, July 29, 2010

Multi-Tenancy and Virtualization

In a recent blog post on Gigaom, Simeon Simeonov argues that virtualization is on the way out, and discusses VMware's move toward platform-as-a-service computing. In a nutshell, his argument is that virtualization is inefficient, and is essentially a last resort when legacy applications can't play nicely together in the same sandbox. In other words, the real goal for IT shops and service providers is not virtualization per se, but multi-tenancy, cost-effective use of hardware, and high availability. Find any two servers in the average corporate data center and ask why they're not running on the same machine. It's a good bet you'll get one of the following four answers: (1) machine A is running a piece of software that misbehaves if run on the same machine as some piece of software running on machine B, (2) a single server couldn't handle the load, (3) one of those servers provides redundancy for the other, or (4) no particular reason, but we haven't gotten around to consolidating them yet. In my experience, the first answer is probably the most common. But as Simeonov points out, the ideal solution is not virtualization, but better software - specifically, platforms that can transparently service multiple customers.

PostgreSQL is very strong in this area. Hosting providers provision databases for multiple customers onto a single PostgreSQL instance; and here at EnterpriseDB, we support several customers who do much the same thing. Databases in PostgreSQL provide a high degree of isolation: many configuration parameters can be set on a per-database basis, extensions can be installed into a single database without affecting other databases that are part of the same instance, and each database can in turn contain multiple schemas. The ability to have multiple databases, each containing multiple schemas, makes the PostgreSQL model more flexible than Oracle or MySQL, which have only a single-tier system. In the upcoming PostgreSQL 9.0 release, the new GRANT ... ON ALL ... IN SCHEMA and default privileges features will further simplify user administration in multi-user and multi-tenant environments. Behind the scenes, a PostgreSQL instance uses a single buffer pool which can be efficiently shared among any number of databases without excessive lock contention. This is critical. Fragmenting memory into many small buffer pools prevents databases from scaling up (using more memory) when under heavy load, and at the same time prevents databases from scaling down (using less memory) when not in use. By managing all databases out of a single pool, PostgreSQL can allow a single database to use every block in the buffer pool - if no other databases are in use - or no blocks at all - if the database is completely idle.

Simeonov seems to feel that virtualization has already nearly run its course, and predicts that the market will hit its peak within three years. That doesn't seem likely to me. I think there is an awful lot of crufty hardware and software out there that could benefit from virtualization, but it's working right now, so no one is eager to make changes that might break something. As the physical equipment starts to fail, IT administrators will think about virtualization, but hardware that isn't touched can sometimes run for a surprisingly long time, so I don't expect server consolidation projects to disappear any time soon. More importantly, Simeonov seems to assume that all new applications will be developed using platform-as-a-service architectures such as Google App Engine, Bungee, Engine Yard, and Heroku. While some certainly will be, it seems unlikely that the traditional model of application development, using a dedicated web server and a dedicated database running on a physical or virtual machine will disappear overnight. For one thing, choosing one of those vendors means being locked into that vendor's API - and choice of programming language. Bungee and Heroku are Ruby environments, for example, while Google App Engine offers Java and Python. Good luck making the switch!

So, if plain old virtual machines are going to be around for a while, how does PostgreSQL stack up in that environment? Not too bad. Of course, write-intensive workloads will suffer from the generalized slowness of virtualized I/O. But PostgreSQL is designed to run well even in a very small memory footprint, to take good advantage of the OS buffer cache and process scheduler, and to be portable across a wide variety of platforms. If your database is small enough to fit in memory, performance should be good. And if your database isn't small enough to fit in memory, there's not much point in virtualizing it: you're going to need a dedicated machine either way.

Sunday, July 25, 2010

Google and our Documentation

As I mentioned in a previous blog post, trying to find pages in the PostgreSQL documentation using Google doesn't work very well: most often, one gets links to older versions.

A recent thread on pgsql-performance (somewhat off-topic for that mailing list, but that's where it was) suggested that perhaps we could use Google's canonical URL feature to work around this problem.

Another suggestion was that we ask people who link to our docs to link to (or some sub-page) rather than linking to a specific version (e.g. the same URL with 8.4 in place of current). That way, as new versions come out, everyone's links will still be pointing at the latest version of the docs, helping the new versions accumulate "Google karma" more quickly than they would otherwise. Or at least, that's the idea: I have no idea whether it would actually work.

Thursday, July 22, 2010

Best Patches of 9.1CF1

Although PostgreSQL 9.0 isn't out yet, we began the first CommitFest for PostgreSQL 9.1 development on July 15, 2010. Our goal is to review every patch submitted by then before August 15. While we're only a week into the CommitFest, I already have some favorite patches. None of them are committed yet, so they might die, get withdrawn, be changed, etc. But here are my top picks.

1. Simon Riggs wrote a very nice patch to reduce the lock level required for various DDL statements. We haven't yet come up with clearly workable ideas for allowing multiple DDL statements to execute on the same table at the same time, but what this patch will do is allow certain DDL commands to run in parallel with DML. Some versions of ALTER TABLE will lock out everything (as they all do, presently), some will lock out INSERT/UPDATE/DELETE/VACUUM statements but allow SELECT to run in parallel, and some will only lock out concurrent DDL and VACUUM operations (like ALTER TABLE ... SET WITHOUT CLUSTER). This should be really nice for anyone administering a busy database.

2. My employer, EnterpriseDB, asked me to write a patch that reduces the size of most numeric values on disk. This was based on a design proposal from Tom Lane a few years ago, and turned out to be pretty simple to code up. Currently, our numeric data type always has a 4-byte header specifying the weight of the first digit and display scale. For the values people typically do store, that's overkill. This patch allows a 2-byte header to be used opportunistically, when we can cram everything in; but the old format can still be understood, so it doesn't break pg_upgrade. It'll be interesting to see whether people can see a noticeable change in the size of their disk footprint when this patch is used. And maybe we could even get by with a 1-byte header sometimes... but that's a thought for another patch.

3. Kevin Grittner posted a patch to implement true serializability. I haven't studied the code in detail, and I'm not sure how soon we can hope to see this committed, but it's pretty cool. Our current serialization techniques are pretty good, but this should be a whole lot better for anyone whose application logic relies heavily on the absence of serialization anomalies.
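To illustrate the idea behind the numeric patch in item 2, here is a sketch of opportunistic header packing in Python. The bit widths, flag values, and byte order are assumptions made for the example and should not be read as the actual on-disk layout; pack_header is a hypothetical name.

```python
import struct

DSCALE_BITS = 6   # assumed width of the display scale in the short header
WEIGHT_BITS = 7   # assumed width of the signed weight in the short header


def pack_header(negative: bool, dscale: int, weight: int) -> bytes:
    """Return a 2-byte header when the fields fit, else a 4-byte header."""
    weight_min = -(1 << (WEIGHT_BITS - 1))
    weight_max = (1 << (WEIGHT_BITS - 1)) - 1
    fits = 0 <= dscale < (1 << DSCALE_BITS) and weight_min <= weight <= weight_max
    if fits:
        word = 0x8000                               # top bit marks the short form
        word |= (1 if negative else 0) << 13        # sign bit
        word |= dscale << WEIGHT_BITS               # display scale
        word |= weight & ((1 << WEIGHT_BITS) - 1)   # two's-complement weight
        return struct.pack(">H", word)
    # Long form: 16 bits of sign plus display scale, then a 16-bit weight.
    sign_dscale = (0x4000 if negative else 0) | dscale
    return struct.pack(">Hh", sign_dscale, weight)
```

A reader can tell the two forms apart by the top bit of the first word, which is why values written in the old format remain readable.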

Wednesday, July 07, 2010

Distributed Serialization Anomalies

One of the more difficult responsibilities of a database is to provide you with the illusion that transactions on the system are executed sequentially, one after another, while in fact allowing as much parallelism as possible. PostgreSQL's MVCC implementation does this using "snapshots": each statement (or, if you choose the serializable isolation level, each transaction), upon first access to the database, records which transactions have committed as of that moment, and everything it does afterwards will see the effect of those transactions, but not any transactions committed later. (There are some exceptions to this rule when using READ COMMITTED mode with INSERT, UPDATE, or DELETE statements.)

This produces, more or less, the illusion that SQL statements execute sequentially, with each one completing its work before the next one begins. This illusion is extremely important and valuable in many real-world applications. For example, if you transfer money from your checking account to your savings account, a banking application might insert two new rows into the "banking_transactions" table: one to show the debit from checking, and another to show the credit to savings. It wouldn't be good if some other query saw just one of these two new rows: it would look as if the money disappeared from checking but did not appear in savings, or perhaps as if it had appeared in savings without disappearing from checking. You'd be unhappy about the first scenario, and the bank would be unhappy about the second one. This type of scenario is called a serialization anomaly, and databases are responsible for preventing them. In this case, it's pretty easy to make sure this problem can't happen: just do both inserts within a single transaction, and then commit it.
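As a minimal sketch of "do both inserts within a single transaction", the example below uses SQLite from Python's standard library as a stand-in for PostgreSQL. The column names and the transfer function are assumptions for illustration; only the table name comes from the text above.

```python
import sqlite3


def transfer(conn, amount):
    """Record a debit and a credit atomically: both rows appear, or neither."""
    with conn:  # commits on success, rolls back if an exception escapes
        conn.execute(
            "INSERT INTO banking_transactions (account, amount) VALUES (?, ?)",
            ("checking", -amount),
        )
        conn.execute(
            "INSERT INTO banking_transactions (account, amount) VALUES (?, ?)",
            ("savings", amount),
        )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE banking_transactions (account TEXT, amount INTEGER)")
transfer(conn, 100)
rows = conn.execute(
    "SELECT account, amount FROM banking_transactions ORDER BY account"
).fetchall()
```

Any other query sees either both rows or neither, so the sum over the transfer is always zero.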

Things get a little trickier when there's more than one database involved. Suppose that I'm moving money from my account (at one branch) to my friend Magnus's account (at a different branch of the same bank). As before, we must make two transaction entries: one showing the debit to my account, and the other showing the credit to his account. We can start transactions on both nodes and do the inserts, but it's not possible to commit both transactions at the very same instant: there could always be a crash after one transaction commits, but before the other one commits.

We can work around this problem to some extent using a protocol called two-phase commit: we'll issue a "PREPARE TRANSACTION" command in both transactions, which should be enough to guarantee that a subsequent "COMMIT PREPARED" command, even after an intervening crash, has no chance of failure. So, we start a transaction on each database, do an insert on each database, prepare both transactions, and then commit both transactions. If there's a crash (or loss of connectivity) after either transaction is prepared but before both transactions are committed, we can still get things back to a consistent state once things are back up again. How? We look to see if either transaction committed; if so, we commit the other one. If not, we see whether both transactions were successfully prepared; if so, we can commit or abort both; if not, we must abort both.
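The recovery rule in the last few sentences can be written down as a small decision function. This is a sketch of the logic only: the State enum and recovery_action are hypothetical names, and a real coordinator would query each database for the status of its prepared transaction.

```python
from enum import Enum


class State(Enum):
    ACTIVE = "active"        # started but never prepared (lost in the crash)
    PREPARED = "prepared"
    COMMITTED = "committed"


def recovery_action(a: State, b: State) -> str:
    """Decide what to do with the two branches of a transfer after a crash."""
    states = (a, b)
    if State.COMMITTED in states:
        # One side already committed, so the other must be committed too.
        return "commit"
    if all(s is State.PREPARED for s in states):
        # Both prepared, neither committed: either outcome is consistent.
        return "commit or abort"
    # At least one side never reached the prepared state.
    return "abort"
```

For example, recovery_action(State.COMMITTED, State.PREPARED) returns "commit", while recovery_action(State.PREPARED, State.ACTIVE) returns "abort".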

This solves the problem of making sure that no money can be permanently lost (or created), but there will still be a period of time during which we can see inconsistent views of the system as a whole. Imagine that the bank auditor comes along and runs a report across all bank branches adding up the bank's assets and liabilities. It's possible that he'll query one of the two databases involved in our hypothetical funds transfer before the transaction commits on that node, but by the time he visits the other one, it's committed - therefore he'll see the transferred funds either in both accounts, or in neither one, depending on the order in which he hits the different branches. This is a distributed serialization anomaly.
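The auditor's problem can be reproduced with a few lines of arithmetic. The sketch below is a simplified model with hypothetical branch names and balances: the transfer commits on both nodes between the auditor's two reads.

```python
def audit_total(visit_order, transfer=100):
    """Sum balances branch by branch while a transfer commits mid-audit."""
    # Committed balances before the transfer from branch_a to branch_b.
    balances = {"branch_a": 500, "branch_b": 300}
    first, second = visit_order
    total = balances[first]          # auditor reads the first branch
    # The transfer now commits on both nodes, between the two reads.
    balances["branch_a"] -= transfer
    balances["branch_b"] += transfer
    total += balances[second]        # auditor reads the second branch
    return total
```

The true total is 800 throughout. Visiting branch_a first yields 900 (the funds are counted in both accounts), and visiting branch_b first yields 700 (they are counted in neither).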

Distributed serialization anomalies are much harder to avoid than regular serialization anomalies (which are a hard problem all by themselves). One method - which is used by Postgres-XC - is to have a single authority (which Postgres-XC calls a global transaction manager) which hands out snapshots and transaction IDs across all nodes in the cluster; regrettably, there is a potential for this to become a bottleneck, or a single point of failure (see Postgres-XC_Write-Scalable_Cluster.pptx, slides 10 and following).

Unfortunately, there may not be many good alternatives. There is a technology called commitment ordering which seems to have a long paper trail[1] in the academic literature, and which has been studied in relation to MVCC. The good news is that commitment ordering does not require a global coordinator of any kind; each node operates independently and does not even need to know the identities of the other nodes, or even how many exist. It requires no additional communication of any kind. The bad news is that it operates by aborting potentially problematic transactions, and it might end up aborting quite a lot of them. The rule is simply that the serialization order must match the commit order; so if transaction A reads x and writes y, transaction B reads y, and then transaction A commits, the system will abort B (because there could be a read-write dependency cycle between A and B involving another database).

Another alternative is to build up a commit-order dependency graph that spans all the databases involved in the transaction. That is, we imagine a graph with each unaborted transaction as a vertex. If A reads or updates a row and B subsequently updates it, we add an edge from A to B. If A updates a row and B subsequently reads the updated version (or a later version), we also add an edge from A to B. If, at any time, adding an edge to the graph would create a cycle, we abort one of the constituent transactions. Kevin Grittner and Emmanuel Cecchet pointed out a paper by Michael Cahill on this topic[2]; one of the advantages of this approach is that it is possible to prevent all serialization anomalies, which our current approach does not. Kevin and Dan Ports have proposed a patch for 9.1 which would implement true serializability for a single PostgreSQL database, but it's not clear that this would scale well to a distributed system.
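The graph-based approach boils down to refusing any edge that would close a cycle. Here is a minimal single-process sketch in Python; the DependencyGraph class is hypothetical, and it leaves out everything that makes the distributed case hard (sharing the graph across databases, pruning finished transactions, choosing a victim).

```python
class DependencyGraph:
    """Directed graph of transactions; an edge A -> B means A precedes B."""

    def __init__(self):
        self.edges = {}   # txn -> set of txns that must come after it

    def _reachable(self, start, target):
        # Depth-first search: is target reachable from start?
        stack, seen = [start], set()
        while stack:
            node = stack.pop()
            if node == target:
                return True
            if node in seen:
                continue
            seen.add(node)
            stack.extend(self.edges.get(node, ()))
        return False

    def add_edge(self, before, after):
        """Add before -> after. Return False if that would create a cycle,
        in which case the caller should abort one of the transactions."""
        if self._reachable(after, before):
            return False
        self.edges.setdefault(before, set()).add(after)
        return True
```

With edges A -> B and B -> C already present, add_edge("C", "A") returns False, because C would then have to come both after and before A.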

[1] e.g. The Principle of Commitment Ordering, or Guaranteeing Serializability in a Heterogeneous Environment of Multiple Autonomous Resource-Managers, Yoav Raz, 1990 [PDF].
[2] Serializable Isolation for Snapshot Databases, Michael J. Cahill, Uwe Röhm, Alan D. Fekete, 2006 [PDF].

Monday, July 05, 2010

Concurrent Development

PostgreSQL 9.0 beta 3 will be wrapped in the next few days, and at the same time, we'll be branching the tree to begin 9.1 development. This is a new thing for us. In the past, we've waited until the previous release was shipped before opening the tree to new development. However, at the PGCon 2010 development meeting, we decided to try something different this time.

I believe that the primary motivation for this change was that, as we get closer to release, there are fewer and fewer issues to work on, and fewer and fewer people who can be involved in fixing them. So, waiting until release to branch the tree leaves a substantial portion of the developer community sitting idle. A second advantage is that it shortens the time between releases - our tentative plan is to use the same release schedule for 9.1 that we did for 9.0. The first CommitFest for 9.0 began on July 15, 2009, and the first CommitFest for 9.1 will begin on July 15, 2010; the last CommitFest for 9.0 began on January 15, 2010, and the last CommitFest for 9.1 will begin on January 15, 2011. Of course, the actual release date will almost certainly be different, but the plan is for feature freeze to happen about the same time next year that it did this year, so that we can continue to have releases about a year apart.

Of course, the danger of concurrent development is that the work people are doing for 9.1 may distract us from finishing 9.0. Hopefully that won't happen, because I think there is a lot to like about the new process.

Thursday, June 24, 2010

PostgreSQL as an In-Memory Only Database

There's been some recent, interesting discussion on the pgsql-performance mailing list on using PostgreSQL as an in-memory only database. In other words, you basically want to use it as a cache, similar to the way that you would use memcached or a NoSQL solution, but with a lot more features.

If you're interested in doing this, you'll need to configure the system so that you have a convenient, automatic way to erase the database cluster and reinitialize it (using initdb) after an operating system crash. Per discussion on the mailing list, for best performance, it seems best to set up the data directory on a tmpfs and configure the following parameters in postgresql.conf:

fsync = off
synchronous_commit = off
full_page_writes = off
bgwriter_lru_maxpages = 0

With fsync=off, and most likely also with full_page_writes=off, your database will not be crash-safe - but you don't care, because you're planning to start from scratch after a crash anyway. If you're familiar with postgresql.conf parameters, setting synchronous_commit=off might seem redundant if you've already set fsync=off, but testing reveals that it still boosts performance. Turning off full_page_writes and bgwriter_lru_maxpages eliminates I/O that isn't needed for this use case.

On a related note, Gavin Roy gave a talk at PGCon comparing the performance of PostgreSQL with fsync=off with a number of NoSQL databases. The results were pretty good, but there might even be room for improvement with some additional tuning.

If you end up testing a configuration along these lines, please post a comment here or on the pgsql-performance mailing list with your experiences.

Wednesday, June 23, 2010

Working Toward PostgreSQL 9.0 Beta3

We are gradually creeping toward the release of PostgreSQL 9.0, but there's still a ways to go. We're continuing to get several bug reports per week about problems in 9.0 beta 2 - many thanks to all those who have tested and reported bugs. Here are some of the issues we've resolved in CVS since beta2:

- Fix a problem that could cause checkpoints to happen too infrequently when using streaming replication, with certain combinations of settings (Fujii Masao).
- Fix quoting problems in EXPLAIN (FORMAT YAML) output (Dean Rasheed).
- Fix a problem that could cause a "cache lookup failed for type %d" error when using PL/python (Tom Lane).
- Change pg_last_xlog_receive_location() and pg_last_xlog_replay_location() to return NULL instead of 0/0 when they do not apply (Heikki Linnakangas).
- Rename restartpoint_command to archive_cleanup_command, to clarify what it's for (Itagaki Takahiro).
- Allow replication connections to use a "replication" entry in .pgpass (Fujii Masao).
- Fix the newly added vacuumdb -Z option (Josh Berkus).
- Have pg_upgrade create its output files in a less surprising location (Bruce Momjian).
- Fix ALTER LARGE OBJECT and GRANT ... ON LARGE OBJECT to not break when an OID too large to be represented by a signed integer is used (Robert Haas).
- Avoid entering a tight loop when streaming replication hits a corrupt WAL record (Heikki Linnakangas).
- New contrib module for use as an archive_cleanup_command (Simon Riggs).
- Adjust GUC categories to match the docs (Itagaki Takahiro).
- Deprecate the use of => as an operator name, and remove or rename the new => operators in 9.0, so we can eventually use this for the purpose the SQL standards committee has in mind (Robert Haas).
- Revert apparently-useless code to add symbol table entries for PL/perl functions (Andrew Dunstan).
- Avoid calling malloc(0) in pg_upgrade (Bruce Momjian).
- Don't allow WAL to be streamed to the standby until it has been fsync'd on the master - otherwise, a master crash can effectively corrupt the standby database (Fujii Masao).
- Fix pg_upgrade problems on Win32 (Bruce Momjian).
- Various documentation improvements.

If you haven't yet beta-tested PostgreSQL 9.0, there's no time like the present! Ideally, load up your production data, run your production application against it, and let us know whether anything breaks. Or, pull the plug a few times and see if anything goes wrong; or try to break Streaming Replication or Hot Standby. This is shaping up to be a great release - but it's not done yet.

Sunday, June 13, 2010

Making PostgreSQL Faster

Although we've made great progress in speeding up PostgreSQL over the last few years, there's always more to be done. Performance, with PostgreSQL as with any other database, is largely determined by the availability of three resources: CPU, memory, and disk. What could we do to use each of these resources most efficiently?

PostgreSQL is already pretty efficient at using the CPU. For high-concurrency databases, I don't anticipate that things will get much better than they already are. For low-concurrency databases, we need parallel query - that is, the ability to use more than one CPU to process the same query.

Memory is a little bit more of a problem. We do a good job keeping our memory footprint small, but we don't manage it terribly well. work_mem limits the maximum size of a sort or hash, but takes no account of current conditions: if the system is swapping due to memory pressure, you get the same plan as if the system has 40GB of free memory. And all the memory allocated to shared_buffers remains allocated even when it isn't truly needed.
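As a rough illustration of how static this limit is, the sketch below changes work_mem for one session and reruns the same sort. The values and the generate_series query are arbitrary choices for demonstration, not tuning advice; the point is only that the limit applies regardless of how much memory the machine has free at that moment.

```sql
-- Show the limit currently in force for this session.
SHOW work_mem;

-- A deliberately small limit: the sort below is likely to spill to disk.
SET work_mem = '1MB';
EXPLAIN ANALYZE
SELECT g FROM generate_series(1, 1000000) AS g ORDER BY g DESC;

-- A larger limit: the same sort may now fit in memory.
SET work_mem = '256MB';
EXPLAIN ANALYZE
SELECT g FROM generate_series(1, 1000000) AS g ORDER BY g DESC;

-- Return to the configured default.
RESET work_mem;
```

Comparing the "Sort Method" line in the two EXPLAIN ANALYZE outputs should show whether the sort stayed in memory or went to disk.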

I/O is perhaps the biggest problem. I don't think this problem is unique to PostgreSQL - I believe all databases probably share this pain point to some degree. Disks are slow. With respect to PostgreSQL specifically, there are a number of things we need to do to reduce the I/O bandwidth we consume, including index-only scans and further improvements to VACUUM. Partial vacuum (implemented in 8.4) is a pretty big deal, but there's more that needs to be done.

We also need to put more effort into reducing the size of our on-disk format and our WAL volume. The actual disk space is cheap, but the time needed to read and write a larger volume of data hurts performance.

Tuesday, June 08, 2010

Why Join Removal Is Cool

When people talk to me about the (limited implementation of) join removal that will be part of PostgreSQL 9.0, the conversation usually goes one of two ways. Some people ask how the feature works and then say something like "oh, I guess that could be useful every once in a while". Other people already know exactly how the feature works and usually say some variant of "this is an amazingly wonderful feature that I am looking forward to with great enthusiasm".

The difference between these two groups of people (I think) is not so much their level of technical knowledge or how closely they've been following pgsql-hackers, but their use case. If your database is primarily a data warehouse, my guess is that you won't have many occasions to benefit from join removal. Where this feature really comes in handy is in OLTP workloads with highly normalized data, in situations where users are generating queries against views (perhaps through some sort of reporting interface) and expecting to get results back immediately.

Let's take an example. Suppose you're writing a bug-tracking system. Each bug has a number of properties associated with it: who reported it, who's working on it, current status, priority, date opened, release for which it's slated to be fixed, date of last status change, date resolved, people who want to get an email when it's updated, comments, etc. If, like me, you're a big fan of database normalization, you won't want to store all of these as text fields. So you might end up with a table like this:

CREATE TABLE bug (
    id serial,
    reporter_id integer not null references person (id),
    assigned_to_id integer references person (id),
    status_id integer not null references bug_status (id),
    priority_id integer not null references priority (id),
    target_release_id integer references release (id),
    open_date date not null,
    primary key (id)
);

You'll probably also end up with some supplementary tables for the items that can exist multiple times, like bug_comment and bug_watchers. Now, to make reporting easier, you'll probably want to define a view over the bug table that joins to all the other tables, so that it's easy to get the text values for the reporter, status, etc.

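-- The "name" columns are an assumption: each lookup table is taken to have
-- a text column holding the value to display.
CREATE VIEW bug_view AS
SELECT
    b.id,
    b.reporter_id, r.name AS reporter,
    b.assigned_to_id, at.name AS assigned_to,
    b.status_id, s.name AS status,
    b.priority_id, p.name AS priority,
    b.target_release_id, tr.name AS target_release,
    b.open_date
FROM
    bug b
    JOIN person r ON b.reporter_id = r.id
    JOIN bug_status s ON b.status_id = s.id
    JOIN priority p ON b.priority_id = p.id
    LEFT JOIN person at ON b.assigned_to_id = at.id
    LEFT JOIN release tr ON b.target_release_id = tr.id;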

And now you can pretty easily write an engine that will let users select the columns they'd like to see from bug_view and the filter conditions they'd like to apply (only open bugs, only bugs slated to be resolved in release X, etc.) via a spiffy web interface. Note that the reporter, bug status, and priority fields can't be null, so we can use a plain old JOIN, but the bug might be assigned to no one or have no target release, so we use LEFT JOIN in those cases. (Otherwise, rows where those fields were NULL would not appear in the output.)

Over time, you'll tend to add more fields. Scalar fields like open_date don't cause much heartache, but as you add more fields that require joins, your view will tend to slow down. Some people might say that the answer is simply to denormalize - use natural keys in the bug table, and don't join. While that solution may be appropriate for some people, it is not without its downsides: database normalization was invented for a reason. The good news is that PostgreSQL is fast and has an excellent query planner, so even fairly complex queries run quite quickly. The bad news is that every query against the view is going to hit every table that's part of the view definition, so if you add enough of them, it's eventually going to be slow.

And, realistically, most of the time, users aren't going to want all the columns anyway. In a web application, 8-12 columns of output in an HTML table is typically about as much as you can squeeze in without starting to have a lot of line wrapping issues. This leads pretty naturally to the following question: if you don't need all of the columns, can you skip some of those joins and speed up the query?

Yes. In PostgreSQL 9.0, we can drop a join against a base table if (1) it's a left join, (2) there is a unique index on all or a subset of the join columns, and (3) none of the attributes from the nullable side of the join are used elsewhere in the query. So, in the above example, we could skip the joins to person at or release tr if the assigned_to or target_release columns, respectively, are not selected, assuming those tables have unique indexes on their id columns (if they don't, the join might change the number of rows in the output, so we must perform it).
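One way to watch the three conditions at work is to compare plans with EXPLAIN. The sketch below uses a cut-down, hypothetical schema (the demo_ names are invented for illustration) so that it can be run on its own against a 9.0 or later server. Plan output varies with statistics and settings, so none is shown here.

```sql
-- Hypothetical, cut-down schema: one mandatory and one optional lookup.
CREATE TABLE demo_person (id serial PRIMARY KEY, name text NOT NULL);
CREATE TABLE demo_bug (
    id serial PRIMARY KEY,
    reporter_id integer NOT NULL REFERENCES demo_person (id),
    assigned_to_id integer REFERENCES demo_person (id)
);

CREATE VIEW demo_bug_view AS
SELECT b.id,
       r.name AS reporter,
       a.name AS assigned_to
FROM demo_bug b
JOIN demo_person r ON b.reporter_id = r.id
LEFT JOIN demo_person a ON b.assigned_to_id = a.id;

-- assigned_to is not selected, "a" is joined with LEFT JOIN, and
-- demo_person.id has a unique index (its primary key), so the join to "a"
-- can be removed. The plan should scan demo_person only for the inner join.
EXPLAIN SELECT id, reporter FROM demo_bug_view;

-- assigned_to is selected, so both joins must be performed.
EXPLAIN SELECT id, reporter, assigned_to FROM demo_bug_view;
```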

We can't skip joining to any of the other tables, because those are inner joins. That's an implementation restriction which I hope will be lifted in PostgreSQL 9.1, but some more logic is needed to make that safe. In the meantime, a useful workaround may be to write those joins as LEFT JOINs rather than INNER JOINs, in cases where either join type will produce the same results.
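A minimal sketch of that workaround, again with invented table names (lj_ prefix). It relies on the join column being NOT NULL and covered by a foreign key, which is what makes the two join types equivalent here.

```sql
-- Hypothetical tables; names are for illustration only.
CREATE TABLE lj_priority (id serial PRIMARY KEY, name text NOT NULL);
CREATE TABLE lj_bug (
    id serial PRIMARY KEY,
    priority_id integer NOT NULL REFERENCES lj_priority (id)
);

-- priority_id is NOT NULL and the foreign key guarantees a matching row,
-- so this LEFT JOIN returns exactly the rows an inner JOIN would.
CREATE VIEW lj_bug_view AS
SELECT b.id, b.priority_id, p.name AS priority
FROM lj_bug b
LEFT JOIN lj_priority p ON b.priority_id = p.id;

-- priority is not selected, so the join to lj_priority becomes removable.
EXPLAIN SELECT id, priority_id FROM lj_bug_view;
```

If the column were nullable or lacked the foreign key, switching to LEFT JOIN could add rows to the result, so the rewrite would no longer be safe.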

Monday, May 24, 2010

An Excellent Developer Meeting

I'm really pretty fired up about the results of our PGCon 2010 Developer Meeting. Of course, the list of what everyone plans to work on is pretty interesting, and if we got even a fraction of those features we'd have a pretty awesome release. But that's not really what got me fired up. What I'm excited about is some of the new and innovative thinking on replication and clustering - or, at any rate, it was new to me.

Two concepts in particular stand out for me. First, we discussed the ability to give replication solutions a crash-recoverable view into transaction commit order, a point which Jan Wieck has since expanded into a full-blown design proposal. Jan believes that this infrastructure will be useful not only for his own project, Slony, but also for other replication solutions such as Londiste which also operate by replaying transactions in commit order. As I understand it, one of the major advantages of this approach is that it eliminates the need for a global shared counter to track the order of writes (which becomes a bottleneck). Instead, they can be tagged with their order within the top-level transaction, and then the transactions as a whole can be ordered using the transaction commit ordering information.

Second, I was very interested in our discussion of a global transaction manager, for which I unfortunately do not have a good link for further reading. One possible way of avoiding cross-node serialization anomalies in a distributed database environment is to have a single node which knows about all in-flight transactions and hands out snapshots that are coherent across the entire cluster. Postgres-XC takes this approach, but there might be value in integrating something like this into core PostgreSQL. We might imagine allowing one PostgreSQL instance to be configured as a "snapshot provider" and another instance to subscribe to it. Right now, it's not clear that there's enough benefit to core PostgreSQL from accepting a patch along these lines, but there are several ways that might change as our distributed computing capabilities improve. For example, if we had a significant SQL/MED implementation, we'd need to think about how to do serialization correctly across multiple nodes; there might also be applications as we work to expand the capabilities of Hot Standby.

If your eyes are glazing over at this point, you're probably not alone. These features are fairly esoteric. Still, I think the fact that we're starting to seriously talk about these topics and consider integrating some of them into core shows that we're starting to understand better what the real needs are for replication and clustering. As our understanding of those needs continues to improve, I expect to see more capabilities in core PostgreSQL, but perhaps even more importantly, an even stronger set of tools around core PostgreSQL that will make it progressively easier to scale horizontally. I don't expect this to happen overnight, but I feel like we're moving in the right direction.

Thursday, May 20, 2010

Global Temporary and Unlogged Tables

From a technical standpoint, temporary tables in PostgreSQL have three properties that distinguish them from ordinary tables:

1. They're stored in a special schema, so that they are normally visible only to the creating backend.
2. They are managed by the local buffer manager rather than the shared buffer manager.
3. They are not WAL-logged.
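The first property can be observed directly from SQL. The sketch below creates a throwaway temporary table (the name scratch is hypothetical) and looks up the schema it landed in.

```sql
-- Hypothetical scratch table; it disappears when the session ends.
CREATE TEMP TABLE scratch (id integer, note text);

-- The table lives in this backend's private temporary schema
-- (named something like pg_temp_N), not in public.
SELECT n.nspname AS schema_name, c.relname AS table_name
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE c.relname = 'scratch'
  AND c.relnamespace = pg_my_temp_schema();
```

Running the same SELECT from a second session should return no rows, since each backend has its own temporary schema.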

It makes sense to think about removing these properties one by one, in the order listed above. Removing just the first property, without doing anything else, doesn't quite make sense, because a table which is managed by the local buffer manager can't be simultaneously accessed by multiple backends. We could work around this by having each backend access a separate set of files. This would give us a global temporary table - that is, a table which is visible to everyone, but each backend sees its own contents. (There is some debate about whether this is the right name, or what the right name for this concept might be - but that's what I'm calling it for now.)

Removing both of the first two properties also makes sense. It gives us an unlogged table - that is, a basically ordinary table for which no WAL is written. (Again, the naming is debatable.) Such tables are not crash-safe: an unexpected system crash could leave the table hopelessly corrupted. The only obvious workaround for this problem is to truncate the table on every system restart.

Why might someone want these new table types? Global temporary tables are appealing for users who need temporary tables with a relatively fixed structure, and don't want to recreate them in every new session. In addition to administrative convenience, this avoids the overhead of repeatedly creating and vacuuming the system catalog entries associated with the temporary tables, which may be a performance benefit for some users.

Unlogged tables are appealing for data that needs to be shared across backends, but which we're willing to lose in the case of a server restart. For example, consider a web application maintaining a table of active user sessions. If the server restarts, we may be willing to lose this data. Everyone will need to log in again, but considering that database crashes are rare, that may not be such a big deal. Unlogged tables also won't be replicated to standby servers, since replication relies on WAL. But, on the plus side, skipping WAL-logging should hopefully yield a significant performance benefit.
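A sketch of the session-table case. The UNLOGGED keyword is the spelling accepted by PostgreSQL 9.1 and later, so this will not run on 9.0; the table and column names are hypothetical.

```sql
-- Hypothetical session table for a web application.
CREATE UNLOGGED TABLE user_session (
    session_id text PRIMARY KEY,
    user_id integer NOT NULL,
    last_seen timestamptz NOT NULL DEFAULT now()
);

-- Visible to every backend, like an ordinary table.
INSERT INTO user_session (session_id, user_id) VALUES ('abc123', 42);

-- Unlike an ordinary table, its contents are emptied after a crash,
-- and it is not replicated to standby servers.
SELECT session_id, user_id, last_seen FROM user_session;
```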

I'm going to be working on implementing both of these table types for PostgreSQL 9.1. In each case, the hardest part seems to be making sure that we clean up properly after a crash or server restart.

Monday, May 10, 2010

Lots and Lots of PostgreSQL Feature Requests

I was surprised and pleased to see that my last blog post, concerning possible directions for future PostgreSQL development, got about five times as many page views as my previous five posts put together, and a total of 70 comments (to date). This may be partly because it got syndicated on LWN, where a few more comments were also posted. I've gone through the comments posted on the blog itself and on the LWN article and counted up the number of times each feature was mentioned. Of course, this is totally unscientific, but it matches up fairly well to the results of previous surveys and gives me an excuse to talk about a bunch of interesting features.

1. Materialized Views (12). See my previous post on Materialized Views in PostgreSQL. Long story short, we may quite possibly get a simple version of this in PostgreSQL 9.1, but I suspect a lot more work will be needed to meet some of the use cases people have in mind.

2. Multi-master replication (6). This is more or less the holy grail of database geeks; it's a pretty hard problem. As it turns out, there is a project in the works called Postgres-XC which does just this. I hope to learn more about this project next week at PGCon. My understanding is that it currently supports only a subset of the SQL query types supported by PostgreSQL, but that work is underway to remove these limitations. Currently, none of the Postgres-XC code can be considered for inclusion in core PostgreSQL because it uses a different license (LGPL), but it's still very interesting as an independent project.

3. Index-organized tables and/or index-only scans and/or automatic maintenance of CLUSTER order (6). I've grouped all of these features together because they're really driving toward the same underlying goal: reducing the I/O cost of an index scan. PostgreSQL will most likely not implement index-organized tables in the sense that Oracle has them, wherein, as I understand it, the table data is stored in the leaf pages of an index. However, we probably will implement index-only scans, which will allow us to gain some of the same performance benefits. Automatic maintenance of CLUSTER order would help, too, but I am not aware that anyone is currently working on that project.

4. MERGE (6). There is a (very ambitious) Google Summer of Code project to implement this feature. Stay tuned. If you're curious about what this feature actually does, I think Simon Riggs has written the best description of MERGE I've seen so far, together with some discussion of the implementation issues. A few weeks later he followed up with some further notes on the design of the feature.

5. Partitioning syntax (5). Itagaki Takahiro proposed a patch to implement this for PostgreSQL 9.0, but we simply ran out of time. I am hopeful that this will come back to life and be improved and eventually committed for PostgreSQL 9.1. The patch as proposed would have allowed the use of dedicated syntax to specify partitions and would have automatically created appropriate CHECK constraints for each partition, but would not have automatically routed INSERTs to the proper partition, which I think is necessary to make this really useful. Of course, others may have different opinions. :-)

6. Parallel query execution (5). This is a very good idea and yet also a very hard problem; I am not aware that anyone has even proposed a possible design for this, let alone attempted to implement it. If we implement better SQL/MED support, it might be possible to get some of the benefits of this feature by spreading out data across multiple nodes but making it all appear local by creating remote tables. Or, it might be possible to leverage some of the I/O bandwidth of remote nodes by adding a feature to support non-local tablespaces (with some kind of specialized daemon process reading and writing remote pages on request). But neither of these is spot-on: what we really want is the ability to parallelize a query on a single node.

7. Procedures that can BEGIN, COMMIT, or ABORT transactions (4). This is another feature that would be great to have, but I am not aware that anyone is currently working on it.

8. Granular collation support (4). There are really two halves to this project. The SQL standard specifies a set of complex rules for determining which collation should be used for a particular comparison or ORDER BY operation. In PostgreSQL, you could imagine setting the collation for a particular session using the "SET" command; associating a collation with a particular column; or overriding the collation for a particular instance of ORDER BY or a particular use of the < operator. So, one half of this problem is simply being able to recognize which collation applies in a particular context and doing the sort or comparison under that collation. The other half of the problem is extending our indexing system to handle multiple collations - either the ability to create an index with a particular collation (which can then be used to satisfy queries that pertain to that collation) or even the ability to create a single index which can somehow answer queries pertaining to multiple collations.

9. Better drivers (4). JDBC was mentioned twice, and asynchronous drivers were mentioned twice. It was also suggested that we should have an "official" Python driver. As a project, we've generally been wary about endorsing other projects, perhaps to our detriment. But good things seem to be happening with the psycopg2 project, especially the recent license change.

10. Graph database features (3). At least one person observed that you can do some of what is wanted here using common table expressions, also known as WITH queries, which are supported beginning in PostgreSQL 8.4. But several people seem to feel that we should have more graph support; one poster mentioned algorithms such as all-pairs shortest paths. I don't have a clear idea of what is needed here, but I suspect that some of the things people are looking for here could be implemented as PostgreSQL extensions. It would be interesting to see some more detailed requirements.

11. Documentation (3). I was surprised to see several requests for documentation among the comments, since I have generally found the PostgreSQL documentation to be superb and one of the great strengths of the project. But one poster did hit on an issue which I think is entirely legitimate: if you Google something like "PostgreSQL documentation", you get a link to our main documentation. But if you Google "PostgreSQL ALTER TABLE", you get a link to the documentation for ALTER TABLE in PostgreSQL 8.1, whereas you might hope to get a link to the 8.4 version of the documentation for that command. And if you Google "PostgreSQL setting", well, let's just say you don't get a link to a page that tells you how to change PostgreSQL settings. If you actually go to the documentation page and navigate through it manually, it's quite easy to find what you're looking for, but there must be something about our site that makes Google fail to grok it properly.

Still another poster was looking for documentation in Chinese. Currently, I believe that we maintain documentation only in English due to the rather large translation effort that would be involved in keeping documentation up to date in multiple languages. In fact, we currently don't even ship Chinese translations of our error messages, due to the fact that our existing set of translations is too incomplete. If you would like to help localize PostgreSQL for your native language, please see our wiki page on NLS. Volunteers are needed!
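Two of the items in the list above can already be approximated with features in released versions, and a short sketch may make the gaps more concrete. The first part shows inheritance-based partitioning with hand-written CHECK constraints and a trigger that routes INSERTs, which is the manual work dedicated partitioning syntax would aim to replace (item 5). The second shows a recursive WITH query walking a small graph (item 10). All table, function, and trigger names are hypothetical, and PL/pgSQL is assumed to be installed in the database.

```sql
-- Part 1: manual partitioning via inheritance (item 5).
CREATE TABLE measurement (
    logdate date NOT NULL,
    reading integer
);

CREATE TABLE measurement_2010_05 (
    CHECK (logdate >= DATE '2010-05-01' AND logdate < DATE '2010-06-01')
) INHERITS (measurement);

CREATE TABLE measurement_2010_06 (
    CHECK (logdate >= DATE '2010-06-01' AND logdate < DATE '2010-07-01')
) INHERITS (measurement);

-- Routing is not automatic: a trigger on the parent redirects each row.
CREATE FUNCTION measurement_insert_router() RETURNS trigger AS $$
BEGIN
    IF NEW.logdate >= DATE '2010-05-01' AND NEW.logdate < DATE '2010-06-01' THEN
        INSERT INTO measurement_2010_05 VALUES (NEW.*);
    ELSIF NEW.logdate >= DATE '2010-06-01' AND NEW.logdate < DATE '2010-07-01' THEN
        INSERT INTO measurement_2010_06 VALUES (NEW.*);
    ELSE
        RAISE EXCEPTION 'no partition for date %', NEW.logdate;
    END IF;
    RETURN NULL;  -- the row was stored in a child table; skip the parent
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER measurement_insert
    BEFORE INSERT ON measurement
    FOR EACH ROW EXECUTE PROCEDURE measurement_insert_router();

INSERT INTO measurement VALUES (DATE '2010-05-20', 42);
SELECT count(*) FROM ONLY measurement;     -- the parent itself stays empty
SELECT count(*) FROM measurement_2010_05;  -- the row landed in this child

-- Part 2: graph reachability with a recursive WITH query (item 10).
CREATE TABLE edge (src integer NOT NULL, dst integer NOT NULL);
INSERT INTO edge VALUES (1, 2), (2, 3), (3, 4), (5, 6);

-- Every node reachable from node 1. UNION (rather than UNION ALL) discards
-- duplicates, which also keeps the recursion from looping on cyclic graphs.
WITH RECURSIVE reachable(node) AS (
    SELECT 1
  UNION
    SELECT e.dst
    FROM edge e
    JOIN reachable r ON e.src = r.node
)
SELECT node FROM reachable ORDER BY node;
```

With the sample edges, the final query returns nodes 1 through 4; nodes 5 and 6 are not reachable from node 1. Algorithms such as all-pairs shortest paths are much harder to express this way, which may be part of what the commenters were asking for.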