Robert Haas

Tuesday, August 24, 2010

Version Numbering

Over the last few days, there's been a debate raging on pgsql-hackers on the subject of version numbering. There are many thoughtful (and some less-thoughtful) opinions on the thread that you may wish to read, but I thought the most interesting was a link posted by Thom Brown to a blog post called The Golden Rules of Version Naming. If you haven't seen it, it's definitely worth a read.

Monday, August 16, 2010

Why We're Conservative With PostgreSQL Minor Releases

Last week, a PostgreSQL user filed bug #5611, complaining about a performance regression in PostgreSQL 8.4 as compared with PostgreSQL 8.2. The regression occurred because PostgreSQL 8.4 is capable of inlining SQL functions, while PostgreSQL 8.2 is not. The bug report was surprising to me, because in my experience, inlining SQL functions has always improved performance, often dramatically. But this user was unlucky enough to hit a case where the opposite is true: inlining caused a function which had previously been evaluated just once to be evaluated multiple times. Fortunately, there is an easy workaround: writing the function using the "plpgsql" language rather than the "sql" language defeats inlining.
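As a rough sketch of that workaround (the function names, the 1.08 multiplier, and the scratch file are invented for illustration and are not taken from the bug report), the snippet below writes the same one-line function twice: once in the "sql" language, where the planner may inline it, and once in "plpgsql", where it will not. Actually loading the file requires a running server.

```shell
# Write two hypothetical, equivalent functions to a scratch file.
sqlfile=$(mktemp)
cat > "$sqlfile" <<'EOF'
-- Candidate for inlining: a single SELECT in a LANGUAGE sql function.
CREATE FUNCTION add_tax_sql(numeric) RETURNS numeric AS $$
    SELECT $1 * 1.08
$$ LANGUAGE sql IMMUTABLE;

-- Same logic in plpgsql, which is never inlined into the calling query.
-- (Assumption: plpgsql is already installed in this database; on 8.4
-- you may need to run CREATE LANGUAGE plpgsql first.)
CREATE FUNCTION add_tax_plpgsql(amount numeric) RETURNS numeric AS $$
BEGIN
    RETURN amount * 1.08;
END;
$$ LANGUAGE plpgsql IMMUTABLE;
EOF

# To try it against a scratch database (assumed here to be named "test"):
#   psql -d test -f "$sqlfile"
echo "wrote $sqlfile"
```

Comparing EXPLAIN output for a query that calls each function is one way to see whether the call was replaced by its body.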

Although the bug itself is interesting (let's face it, I'm a geek), what I found even more interesting was that I totally failed to appreciate the possibility that inlining an SQL function could ever fail to be a performance win. Prior to last week, if someone had asked me whether that was possible, I would have said that I didn't think so, but you never know...

And that is why the PostgreSQL project maintains stable branches for each of our major releases for about five years. Stable branches don't get new features; they don't get performance enhancements; they don't even get tweaks for things we wish we'd done differently or corrections to behavior of doubtful utility. What they do get is fixes for bugs (like: without this fix, your data might get corrupted; or, without this fix, the database might crash), security issues, and a smattering of documentation and translation updates. Once we ship a new major release (or, really, about six months before we ship it), development on that release is over. Any further changes go into the next release.

On the other hand, we don't abandon our releases once they're out the door, either. We are just now in the process of ceasing to support PostgreSQL 7.4, which was released in November 2003. For nearly seven years, any serious bugs or security vulnerabilities which we have discovered either in that version or any newer version have been addressed by releasing a new version of PostgreSQL 7.4; the current release is 7.4.29. Absent a change in project policy, 7.4.30 will be the last 7.4.x release.

If you're running PostgreSQL 8.3 or older, and particularly if you're running PostgreSQL 8.2 or older, you should consider an upgrade, especially once PostgreSQL 9.0 comes out. Each release of PostgreSQL includes many exciting new features: new SQL constructions, sometimes new data types or built-in functions, and performance and manageability enhancements. Of course, before you upgrade to PostgreSQL 8.4 (or 9.0), you should carefully test your application to make sure that everything still works as you expect. For the most part, things tend to go pretty smoothly, but as bug #5611 demonstrates, not always.

Of course, this upgrade path is not for everyone. Application retesting can be difficult and time-consuming, especially for large installations. There is nothing wrong with staying on the major release of PostgreSQL that you are currently using. But it is very wise to upgrade regularly to the latest minor version available for that release. The upgrade process is generally as simple as installing the new binaries and restarting the server (but see the release notes for your version for details), and the PostgreSQL community is firmly committed to making sure that each of these releases represents an improvement to performance and stability rather than a step backwards.

Friday, August 13, 2010

How I Hack on PostgreSQL

Today's post by Dimitri Fontaine gave me the idea of writing a blog posting about the tools I use for PostgreSQL development. I'm not saying that what I do is the best way of doing it (and it's certainly not the only way of doing it), but it's one way of doing it, and I've had good luck with it.

What commands do I use? The following list shows the ten commands that occur most frequently in my shell history.

[rhaas pgsql]$ history  | awk '{print $2}' | sort | uniq -c | sort -rn | head
 250 git
  57 vi
  31 %%
  25 cd
  24 less
  20 up
  18 make
  13 pg_ctl
  10 ls
   8 psql

Wow, that's a lot of git. I didn't realize that approximately half of all the commands I type are git commands. Let's see some more details.

[rhaas pgsql]$ history  | awk '$2 == "git" { print $3}' | sort | uniq -c | sort -rn | head
  93 diff
  91 grep
  15 log
  10 commit
   8 checkout
   7 add
   6 reset
   5 clean
   4 pull
   3 branch

As you can see, I use git diff and git grep far more often than any other commands. The most common things I do with git diff are just plain git diff, which displays the unstaged changes in my working tree (so I can see what I've changed, or what a patch I've just applied has changed) and git diff master (which shows all the differences between my working tree and the master branch; this is because I frequently use git branches to hack on a patch I'm working on). A great deal of the work of writing a good patch - or reviewing one - consists in looking at the code over and over again and thinking about whether every change can be justified and proven correct.
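Here is a small sketch of the difference between those two invocations, run in a throwaway repository. The file name, branch name, and commit identity are all made up for the example; --numstat is used only so the result fits on one line (added lines, deleted lines, file).

```shell
# Throwaway repository; nothing here touches a real checkout.
repo=$(mktemp -d)
cd "$repo"
git init -q .
git symbolic-ref HEAD refs/heads/master   # make sure the first branch is "master"

# Helper that supplies a dummy identity so commits work on any machine.
g() { git -c user.name=demo -c user.email=demo@example.com -c commit.gpgsign=false "$@"; }

printf 'one\ntwo\nthree\n' > notes.txt
git add notes.txt
g commit -q -m "initial version"

# Hack on a patch in its own branch and commit part of the work.
git checkout -q -b my-patch
printf 'one\nTWO\nthree\n' > notes.txt
g commit -q -a -m "work in progress"

# One more edit, left unstaged in the working tree.
printf 'one\nTWO\nthree\nfour\n' > notes.txt

# Plain "git diff": only the unstaged edit.
unstaged=$(git --no-pager diff --numstat)
# "git diff master": everything that differs from master, committed or not.
vs_master=$(git --no-pager diff --numstat master)

echo "git diff:        $unstaged"
echo "git diff master: $vs_master"
```

Dropping --numstat from either command shows the full patch text, which is what you actually want to read when reviewing.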

git grep does a recursive grep starting at the current directory, but only examines files checked into git (not build products, for example). I use this as a way to find where a certain function is defined (by grepping for the name of the function at the start of a line) and as a way to find all occurrences of an identifier in the code (which is an absolutely essential step in verifying the correctness of your own patch, or someone else's).
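Both uses can be sketched in a scratch repository. The file and function names below are invented; the only thing borrowed from PostgreSQL is the coding convention of putting a function's name at the start of a line in its definition.

```shell
# Throwaway repository with one tracked file and one untracked file.
repo=$(mktemp -d)
cd "$repo"
git init -q .

cat > demo.c <<'EOF'
static int
add_one(int x)
{
	return x + 1;
}

int
caller(void)
{
	return add_one(41);
}
EOF
git add demo.c   # tracked, so git grep will search it

# Stand-in for a build product: mentions the identifier but is never added.
echo "add_one(int x)" > build-output.txt

# Where is the function defined? Anchor the name at the start of a line.
definition=$(git grep -n '^add_one')
# How many lines mention the identifier at all, per tracked file?
occurrences=$(git grep -c 'add_one')

echo "$definition"
echo "$occurrences"
```

Note that build-output.txt never shows up in either result, because it was never added to git.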

As you can also see, my preferred editor is vi (really vim). This might not be the best choice for everyone, but I've been using it for close to 20 years, so it's probably too late to learn something else now. I think Dimitri Fontaine said it well in the post linked above: the best editor you can find is the one you master. Having said that, if you do even a small amount of programming, you're likely to spend a lot of time in whatever editor you pick, so it's probably worth the time it takes to learn a reasonably powerful one.

Saturday, August 07, 2010

Git is Coming to PostgreSQL

As discussed at the PGCon 2010 Developer Meeting, PostgreSQL is scheduled to adopt git as its version control system some time in the next few weeks. Andrew Dunstan, who maintains the PostgreSQL build farm, has adapted the build farm code to work with either CVS or git; meanwhile, Magnus Hagander has done a trial conversion so that we can all see what the new repository will look like. My small contribution was to write some documentation for the PostgreSQL committers, which has subsequently been further edited by Heikki Linnakangas (the link here is to his personal web page, whose one complete sentence is one of the funnier things I've read on the Internet).

I don't think the move to git is going to be a radical change; indeed, we're taking some pains to make sure that it isn't. But it will make my life easier in several small ways. First, the existing git clone of the PostgreSQL CVS repository is flaky and unreliable. The back-branches have had severe problems in this area for some time (some don't build), and the master branch (aka CVS HEAD) has periodic issues as well. At present, for example, the regression tests for contrib/dblink fail on a build from git, but pass on a build from CVS. While we might be able to fix (or minimize) these issues by fixing bugs in the conversion code, switching to git should eliminate them. Also, since I do my day-to-day PostgreSQL work using git, it will be nice to be able to commit that way also - it should be both faster (CVS is very slow by comparison) and less error-prone (no cutting and pasting the commit message, no forgetting to add a file in CVS that you already added in git).

Thursday, July 29, 2010

Multi-Tenancy and Virtualization

In a recent blog post on Gigaom, Simeon Simeonov argues that virtualization is on the way out, and discusses VMware's move toward platform-as-a-service computing. In a nutshell, his argument is that virtualization is inefficient, and is essentially a last resort when legacy applications can't play nicely together in the same sandbox. In other words, the real goal for IT shops and service providers is not virtualization per se, but multi-tenancy, cost-effective use of hardware, and high availability. Find any two servers in the average corporate data center and ask why they're not running on the same machine. It's a good bet you'll get one of the following four answers: (1) machine A is running a piece of software that misbehaves if run on the same machine as some piece of software running on machine B, (2) a single server couldn't handle the load, (3) one of those servers provides redundancy for the other, or (4) no particular reason, but we haven't gotten around to consolidating them yet. In my experience, the first answer is probably the most common. But as Simeonov points out, the ideal solution is not virtualization, but better software - specifically, platforms that can transparently service multiple customers.

PostgreSQL is very strong in this area. Hosting providers such as hub.org provision databases for multiple customers onto a single PostgreSQL instance; and here at EnterpriseDB, we support several customers who do much the same thing. Databases in PostgreSQL provide a high degree of isolation: many configuration parameters can be set on a per-database basis, extensions can be installed into a single database without affecting other databases that are part of the same instance, and each database can in turn contain multiple schemas. The ability to have multiple databases, each containing multiple schemas, makes the PostgreSQL model more flexible than Oracle or MySQL, which have only a single-tier system. In the upcoming PostgreSQL 9.0 release, the new GRANT ... ON ALL ... IN SCHEMA and ALTER DEFAULT PRIVILEGES features will further simplify user administration in multi-user and multi-tenant environments. Behind the scenes, a PostgreSQL instance uses a single buffer pool which can be efficiently shared among any number of databases without excessive lock contention. This is critical. Fragmenting memory into many small buffer pools prevents databases from scaling up (using more memory) when under heavy load, and at the same time prevents databases from scaling down (using less memory) when not in use. By managing all databases out of a single pool, PostgreSQL can allow a single database to use every block in the buffer pool - if no other databases are in use - or no blocks at all - if the database is completely idle.
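To make the per-database and per-schema pieces a little more concrete, here is a minimal sketch of what provisioning one tenant might look like on 9.0. The database, schema, and role names, and the work_mem value, are all hypothetical; the snippet only writes the SQL to a scratch file, since running it needs a server and a superuser.

```shell
# Write a hypothetical tenant-provisioning script to a scratch file.
sqlfile=$(mktemp)
cat > "$sqlfile" <<'EOF'
-- One database per tenant, with its own configuration override.
CREATE DATABASE tenant_a;
ALTER DATABASE tenant_a SET work_mem = '16MB';

-- Everything below runs inside the tenant's database.
-- (Roles are shared across the instance; schemas are per database.)
\connect tenant_a
CREATE ROLE tenant_a_reader NOLOGIN;
CREATE SCHEMA reporting;

-- New in 9.0: grant on everything that already exists in the schema...
GRANT USAGE ON SCHEMA reporting TO tenant_a_reader;
GRANT SELECT ON ALL TABLES IN SCHEMA reporting TO tenant_a_reader;

-- ...and on tables the current role creates there later.
ALTER DEFAULT PRIVILEGES IN SCHEMA reporting
    GRANT SELECT ON TABLES TO tenant_a_reader;
EOF

# To run it (assumes a superuser connection to the "postgres" database):
#   psql -d postgres -f "$sqlfile"
echo "wrote $sqlfile"
```

Before 9.0, the last two statements had to be approximated with a GRANT per table, repeated every time a table was added.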

Simeonov seems to feel that virtualization has already nearly run its course, and predicts that the market will hit its peak within three years. That doesn't seem likely to me. I think there is an awful lot of crufty hardware and software out there that could benefit from virtualization, but it's working right now, so no one is eager to make changes that might break something. As the physical equipment starts to fail, IT administrators will think about virtualization, but hardware that isn't touched can sometimes run for a surprisingly long time, so I don't expect server consolidation projects to disappear any time soon. More importantly, Simeonov seems to assume that all new applications will be developed using platform-as-a-service architectures such as Google App Engine, Bungee, Engine Yard, and Heroku. While some certainly will be, it seems unlikely that the traditional model of application development, using a dedicated web server and a dedicated database running on a physical or virtual machine, will disappear overnight. For one thing, choosing one of those vendors means being locked into that vendor's API - and choice of programming language. Bungee and Heroku are Ruby environments, for example, while Google App Engine offers Java and Python. Good luck making the switch!

So, if plain old virtual machines are going to be around for a while, how does PostgreSQL stack up in that environment? Not too bad. Of course, write-intensive workloads will suffer from the generalized slowness of virtualized I/O. But PostgreSQL is designed to run well even in a very small memory footprint, to take good advantage of the OS buffer cache and process scheduler, and to be portable across a wide variety of platforms. If your database is small enough to fit in memory, performance should be good. And if your database isn't small enough to fit in memory, there's not much point in virtualizing it: you're going to need a dedicated machine either way.

Sunday, July 25, 2010

Google and our Documentation

As I mentioned in a previous blog post, trying to find pages in the PostgreSQL documentation using Google doesn't work very well: most often, one gets links to older versions.

A recent thread on pgsql-performance (somewhat off-topic for that mailing list, but that's where it was) suggested that perhaps we could use Google's canonical URL feature to work around this problem.

Another suggestion was that we ask people who link to our docs to link to http://postgresql.org/docs/current/ (or some sub-page) rather than linking to a specific version (e.g. the same URL with 8.4 in place of current). That way, as new versions come out, everyone's links will still be pointing at the latest version of the docs, helping the new versions accumulate "Google karma" more quickly than they would otherwise. Or at least, that's the idea: I have no idea whether it would actually work.
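If you maintain pages that already link to a specific version, the rewrite is mechanical. The following is only a sketch: the sample sentence and the index.html sub-page are made up, and the pattern assumes version directories that look like 8.4 or 9.0.

```shell
# Rewrite versioned documentation links so they point at "current".
to_current() {
    sed -E 's#(postgresql\.org/docs/)[0-9]+(\.[0-9]+)*/#\1current/#g'
}

# Hypothetical input line containing a version-specific link.
rewritten=$(echo 'See http://postgresql.org/docs/8.4/index.html for details.' | to_current)
echo "$rewritten"
```

Links that already use current, or that point somewhere other than the docs tree, pass through unchanged.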

Thursday, July 22, 2010

Best Patches of 9.1CF1

Although PostgreSQL 9.0 isn't out yet, we began the first CommitFest for PostgreSQL 9.1 development on July 15, 2010. Our goal is to review every patch submitted by then before August 15. While we're only a week into the CommitFest, I already have some favorite patches. None of them are committed yet, so they might die, get withdrawn, or be changed. But here are my top picks.

1. Simon Riggs wrote a very nice patch to reduce the lock level required for various DDL statements. We haven't yet come up with clearly workable ideas for allowing multiple DDL statements to execute on the same table at the same time, but what this patch will do is allow certain DDL commands to run in parallel with DML. Some versions of ALTER TABLE will lock out everything (as they all do, presently), some will lock out INSERT/UPDATE/DELETE/VACUUM statements but allow SELECT to run in parallel, and some will only lock out concurrent DDL and VACUUM operations (like ALTER TABLE ... SET WITHOUT CLUSTER). This should be really nice for anyone administering a busy database.

2. My employer, EnterpriseDB, asked me to write a patch that reduces the size of most numeric values on disk. This was based on a design proposal from Tom Lane a few years ago, and turned out to be pretty simple to code up. Currently, our numeric data type always has a 4-byte header specifying the weight of the first digit and display scale. For the values people typically do store, that's overkill. This patch allows a 2-byte header to be used opportunistically, when we can cram everything in; but the old format can still be understood, so it doesn't break pg_upgrade. It'll be interesting to see whether people can see a noticeable change in the size of their disk footprint when this patch is used. And maybe we could even get by with a 1-byte header sometimes... but that's a thought for another patch.

3. Kevin Grittner posted a patch to implement true serializability. I haven't studied the code in detail, and I'm not sure how soon we can hope to see this committed, but it's pretty cool. Our current serialization techniques are pretty good, but this should be a whole lot better for anyone whose application logic relies heavily on the absence of serialization anomalies.