Thursday, July 29, 2010
In a recent blog post on Gigaom, Simeon Simeonov argues that virtualization is on the way out, and discusses VMware's move toward platform-as-a-service computing. In a nutshell, his argument is that virtualization is inefficient, and is essentially a last resort when legacy applications can't play nicely together in the same sandbox. In other words, the real goal for IT shops and service providers is not virtualization per se, but multi-tenancy, cost-effective use of hardware, and high availability. Find any two servers in the average corporate data center and ask why they're not running on the same machine. It's a good bet you'll get one of the following four answers: (1) machine A is running a piece of software that misbehaves if run on the same machine as some piece of software running on machine B, (2) a single server couldn't handle the load, (3) one of those servers provides redundancy for the other, or (4) no particular reason, but we haven't gotten around to consolidating them yet. In my experience, the first answer is probably the most common. But as Simeonov points out, the ideal solution is not virtualization, but better software - specifically, platforms that can transparently service multiple customers.
PostgreSQL is very strong in this area. Hosting providers such as hub.org provision databases for multiple customers onto a single PostgreSQL instance, and here at EnterpriseDB, we support several customers who do much the same thing. Databases in PostgreSQL provide a high degree of isolation: many configuration parameters can be set on a per-database basis, extensions can be installed into a single database without affecting other databases that are part of the same instance, and each database can in turn contain multiple schemas. The ability to have multiple databases, each containing multiple schemas, makes the PostgreSQL model more flexible than Oracle or MySQL, which offer only a single tier of separation. In the upcoming PostgreSQL 9.0 release, the new GRANT ... ON ALL ... IN SCHEMA and ALTER DEFAULT PRIVILEGES features will further simplify user administration in multi-user and multi-tenant environments. Behind the scenes, a PostgreSQL instance uses a single buffer pool which can be efficiently shared among any number of databases without excessive lock contention. This is critical. Fragmenting memory into many small buffer pools prevents databases from scaling up (using more memory) when under heavy load, and at the same time prevents databases from scaling down (using less memory) when not in use. By managing all databases out of a single pool, PostgreSQL can allow a single database to use every block in the buffer pool - if no other databases are in use - or no blocks at all - if the database is completely idle.
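For instance, provisioning a new tenant might look something like the following sketch. All names here are hypothetical, and the last two statements use the new 9.0 features mentioned above:

    -- Create an isolated database for the new tenant.
    CREATE DATABASE tenant_acme;
    -- Per-database settings don't affect other tenants in the instance.
    ALTER DATABASE tenant_acme SET work_mem = '16MB';

    -- Then, connected to the tenant's database:
    CREATE ROLE acme_app LOGIN PASSWORD 'secret';
    CREATE SCHEMA acme AUTHORIZATION acme_app;

    -- New in 9.0: grant on every existing table in the schema at once...
    GRANT SELECT, INSERT, UPDATE, DELETE ON ALL TABLES IN SCHEMA acme TO acme_app;
    -- ...and have tables the granting role creates later in this schema
    -- pick up the same privileges automatically.
    ALTER DEFAULT PRIVILEGES IN SCHEMA acme
        GRANT SELECT, INSERT, UPDATE, DELETE ON TABLES TO acme_app;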
Simeonov seems to feel that virtualization has already nearly run its course, and predicts that the market will hit its peak within three years. That doesn't seem likely to me. I think there is an awful lot of crufty hardware and software out there that could benefit from virtualization, but it's working right now, so no one is eager to make changes that might break something. As the physical equipment starts to fail, IT administrators will think about virtualization, but hardware that isn't touched can sometimes run for a surprisingly long time, so I don't expect server consolidation projects to disappear any time soon. More importantly, Simeonov seems to assume that all new applications will be developed using platform-as-a-service offerings such as Google App Engine, Bungee, Engine Yard, and Heroku. While some certainly will be, it seems unlikely that the traditional model of application development, using a dedicated web server and a dedicated database running on a physical or virtual machine, will disappear overnight. For one thing, choosing one of those vendors means being locked into that vendor's API - and choice of programming language. Engine Yard and Heroku are Ruby environments, for example, while Google App Engine offers Java and Python. Good luck making the switch!
So, if plain old virtual machines are going to be around for a while, how does PostgreSQL stack up in that environment? Not too bad. Of course, write-intensive workloads will suffer from the generalized slowness of virtualized I/O. But PostgreSQL is designed to run well even in a very small memory footprint, to take good advantage of the OS buffer cache and process scheduler, and to be portable across a wide variety of platforms. If your database is small enough to fit in memory, performance should be good. And if your database isn't small enough to fit in memory, there's not much point in virtualizing it: you're going to need a dedicated machine either way.
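For what it's worth, a small virtualized instance can be configured to stay out of the way and let the OS cache do the heavy lifting. The numbers below are illustrative assumptions for a VM with around 1GB of RAM, not tuning advice:

    # postgresql.conf sketch for a small VM (values are assumptions)
    shared_buffers = 128MB          # keep PostgreSQL's own cache modest
    effective_cache_size = 768MB    # planner hint: OS cache does most of the work
    work_mem = 4MB                  # per-sort/hash memory; stays low per backend
    maintenance_work_mem = 64MB     # for VACUUM, CREATE INDEX, and the like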
Sunday, July 25, 2010
Google and our Documentation
As I mentioned in a previous blog post, trying to find pages in the PostgreSQL documentation using Google doesn't work very well: most often, one gets links to older versions.
A recent thread on pgsql-performance (somewhat off-topic for that mailing list, but that's where it was) suggested that perhaps we could use Google's canonical URL feature to work around this problem.
Another suggestion was that we ask people who link to our docs to link to http://postgresql.org/docs/current/ (or some sub-page) rather than linking to a specific version (e.g. the same URL with 8.4 in place of current). That way, as new versions come out, everyone's links will still be pointing at the latest version of the docs, helping the new versions accumulate "Google karma" more quickly than they would otherwise. Or at least, that's the idea: I have no idea whether it would actually work.
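For reference, Google's canonical URL feature is just a tag in each page's HTML header; a hypothetical sketch for one docs page might look like this (the exact URL is illustrative):

    <!-- Emitted on every versioned copy of the page, telling search
         engines to credit the "current" URL instead. -->
    <link rel="canonical"
          href="http://www.postgresql.org/docs/current/static/sql-select.html" />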
Thursday, July 22, 2010
Best Patches of 9.1CF1
Although PostgreSQL 9.0 isn't out yet, we began the first CommitFest for PostgreSQL 9.1 development on July 15, 2010. Our goal is to review every patch submitted by that date before August 15. While we're only a week into the CommitFest, I already have some favorite patches. None of them has been committed yet, so they might still die, be withdrawn, or be changed, but here are my top picks.
1. Simon Riggs wrote a very nice patch to reduce the lock level required for various DDL statements. We haven't yet come up with clearly workable ideas for allowing multiple DDL statements to execute on the same table at the same time, but what this patch will do is allow certain DDL commands to run in parallel with DML. Some versions of ALTER TABLE will continue to lock out everything (as they all do presently), some will lock out INSERT/UPDATE/DELETE/VACUUM statements but allow SELECT to run in parallel, and some (like ALTER TABLE ... SET WITHOUT CLUSTER) will lock out only concurrent DDL and VACUUM operations. This should be really nice for anyone administering a busy database; see the first sketch after this list.
2. My employer, EnterpriseDB, asked me to write a patch that reduces the on-disk size of most numeric values. This was based on a design proposal from Tom Lane a few years ago, and turned out to be pretty simple to code up. Currently, our numeric data type always has a 4-byte header specifying the weight of the first digit and the display scale. For the values people typically store, that's overkill. This patch allows a 2-byte header to be used opportunistically, whenever we can cram everything in; the old format is still understood, so it doesn't break pg_upgrade. It'll be interesting to see whether people notice the change in their disk footprint when this patch is in use; see the second sketch below. And maybe we could even get by with a 1-byte header sometimes... but that's a thought for another patch.
3. Kevin Grittner posted a patch to implement true serializability. I haven't studied the code in detail, and I'm not sure how soon we can hope to see this committed, but it's pretty cool. Our current serialization techniques are pretty good, but this should be a whole lot better for applications whose logic relies heavily on the absence of serialization anomalies; the third sketch below shows the kind of anomaly at stake.
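To make the first item concrete, here's a hypothetical two-session sketch (the table name is invented, and exactly which ALTER TABLE variants get which lock levels is still subject to review):

    -- Session 1: a long-running report holds only a read lock.
    BEGIN;
    SELECT count(*) FROM accounts;

    -- Session 2: today this waits for session 1's transaction to end,
    -- because every ALTER TABLE takes an ACCESS EXCLUSIVE lock. With
    -- the patch, a variant like this needs only a lock that conflicts
    -- with other DDL and VACUUM, so it proceeds immediately.
    ALTER TABLE accounts SET WITHOUT CLUSTER;

    -- Session 1: carries on, none the wiser.
    SELECT count(*) FROM accounts;
    COMMIT;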
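For the second item, pg_column_size() reports a value's on-disk size, so the patch's effect is easy to check for yourself. The comment states the expected direction of change, not measured numbers:

    SELECT pg_column_size('0'::numeric)                      AS tiny,
           pg_column_size('12345.67'::numeric)               AS typical,
           pg_column_size('3.14159265358979323846'::numeric) AS wide;
    -- With the patch applied, values whose weight and display scale fit
    -- the compact format should come out two bytes smaller than today.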
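And for the third item, here's the classic "write skew" anomaly that our current SERIALIZABLE level (which is really snapshot isolation) permits; the table and data are hypothetical:

    -- Invariant the application wants: at least one doctor stays on call.
    -- Session 1:
    BEGIN ISOLATION LEVEL SERIALIZABLE;
    SELECT count(*) FROM doctors WHERE on_call;   -- sees 2
    -- Session 2:
    BEGIN ISOLATION LEVEL SERIALIZABLE;
    SELECT count(*) FROM doctors WHERE on_call;   -- also sees 2
    UPDATE doctors SET on_call = false WHERE name = 'bob';
    COMMIT;
    -- Session 1 again (its snapshot still shows bob on call):
    UPDATE doctors SET on_call = false WHERE name = 'alice';
    COMMIT;
    -- Both transactions commit, and now no one is on call. True
    -- serializability would have aborted one of them.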
Wednesday, July 07, 2010
Distributed Serialization Anomalies
One of the more difficult responsibilities of a database is to provide you with the illusion that transactions on the system are executed sequentially, one after another, while in fact allowing as much parallelism as possible. PostgreSQL's MVCC implementation does this using "snapshots": each statement (or, if you choose the serializable isolation level, each transaction), upon first access to the database, records which transactions have committed as of that moment, and everything it does afterwards will see the effect of those transactions, but not any transactions committed later. (There are some exceptions to this rule when using READ COMMITTED mode with INSERT, UPDATE, or DELETE statements.)
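Here's a quick way to watch a snapshot at work across two sessions (the table is hypothetical; remember that in current releases SERIALIZABLE means the snapshot is taken once and kept for the whole transaction):

    -- Session 1: the first statement takes the transaction's snapshot.
    BEGIN ISOLATION LEVEL SERIALIZABLE;
    SELECT count(*) FROM events;   -- suppose this returns 100

    -- Session 2: inserts and commits a new row.
    INSERT INTO events (payload) VALUES ('hello');

    -- Session 1: still returns 100; the new row postdates our snapshot.
    SELECT count(*) FROM events;
    COMMIT;

    -- A statement run after the COMMIT takes a fresh snapshot: 101.
    SELECT count(*) FROM events;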
This produces, more or less, the illusion that SQL statements execute sequentially, with each one completing its work before the next one begins. This illusion is extremely important and valuable in many real-world applications. For example, if you transfer money from your checking account to your savings account, a banking application might insert two new rows into the "banking_transactions" table: one to show the debit from checking, and another to show the credit to savings. It wouldn't be good if some other query saw just one of these two new rows: it would look as if the money disappeared from checking but did not appear in savings, or perhaps as if it had appeared in savings without disappearing from checking. You'd be unhappy about the first scenario, and the bank would be unhappy about the second one. This type of scenario is called a serialization anomaly, and databases are responsible for preventing them. In this case, it's pretty easy to make sure this problem can't happen: just do both inserts within a single transaction, and then commit it.
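Concretely, assuming a simple two-column table, the safe version is just:

    BEGIN;
    INSERT INTO banking_transactions (account, amount) VALUES ('checking', -500.00);
    INSERT INTO banking_transactions (account, amount) VALUES ('savings',   500.00);
    COMMIT;
    -- Every snapshot sees both rows or neither; no reader can catch
    -- the money "in flight".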
Things get a little trickier when there's more than one database involved. Suppose that I'm moving money from my account (at one branch) to my friend Magnus's account (at a different branch of the same bank). As before, we must make two transaction entries: one showing the debit to my account, and the other showing the credit to his account. We can start transactions on both nodes and do the inserts, but it's not possible to commit both transactions at the very same instant: there could always be a crash after one transaction commits, but before the other one commits.
We can work around this problem to some extent using a protocol called two-phase commit: we'll issue a "PREPARE TRANSACTION" command in both transactions, which should be enough to guarantee that a subsequent "COMMIT PREPARED" command, even after an intervening crash, has no chance of failure. So, we start a transaction on each database, do an insert on each database, prepare both transactions, and then commit both transactions. If there's a crash (or loss of connectivity) after either transaction is prepared but before both transactions are committed, we can still get things back to a consistent state once things are back up again. How? We look to see whether either transaction committed; if so, we commit the other one. If not, we see whether both transactions were successfully prepared; if so, we can commit or abort both; if not, we must abort both.
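In PostgreSQL syntax, the happy path looks roughly like this ('xfer-42' is an arbitrary global transaction identifier, the two blocks run against different servers, and max_prepared_transactions must be set above zero on both):

    -- On the first branch's database:
    BEGIN;
    INSERT INTO banking_transactions (account, amount) VALUES ('robert', -500.00);
    PREPARE TRANSACTION 'xfer-42';

    -- On the second branch's database:
    BEGIN;
    INSERT INTO banking_transactions (account, amount) VALUES ('magnus', 500.00);
    PREPARE TRANSACTION 'xfer-42';

    -- Only once both PREPAREs succeed, finish each side:
    COMMIT PREPARED 'xfer-42';   -- run on each server

    -- After a crash, in-doubt transactions show up in pg_prepared_xacts
    -- and can be resolved with COMMIT PREPARED or ROLLBACK PREPARED:
    SELECT gid, prepared, database FROM pg_prepared_xacts;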
This solves the problem of making sure that no money can be permanently lost (or created), but there will still be a period of time during which we can see inconsistent views of the system as a whole. Imagine that the bank auditor comes along and runs a report across all bank branches adding up the bank's assets and liabilities. It's possible that he'll query one of the two databases involved in our hypothetical funds transfer before the transaction commits on that node, but by the time he visits the other one, it's committed - therefore he'll see the transferred funds either in both accounts, or in neither one, depending on the order in which he hits the different branches. This is a distributed serialization anomaly.
Distributed serialization anomalies are much harder to avoid than regular serialization anomalies (which are a hard problem all by themselves). One method - which is used by Postgres-XC - is to have a single authority (which Postgres-XC calls a global transaction manager) which hands out snapshots and transaction IDs across all nodes in the cluster; regrettably, there is a potential for this to become a bottleneck, or a single point of failure (see Postgres-XC_Write-Scalable_Cluster.pptx, slides 10 and following).
Unfortunately, there may not be many good alternatives. There is a technology called commitment ordering which seems to have a long paper trail[1] in the academic literature, and which has been studied in relation to MVCC. The good news is that commitment ordering does not require a global coordinator of any kind; each node operates independently and does not even need to know the identities of the other nodes, or even how many exist. It requires no additional communication of any kind. The bad news is that it operates by aborting potentially problematic transactions, and it might end up aborting quite a lot of them. The rule is simply that the serialization order must match the commit order; so if transaction A reads x and writes y, transaction B reads y, and then transaction A commits, the system will abort B (because there could be a read-write dependency cycle between A and B involving another database).
Another alternative is to build up a commit-order dependency graph that spans all the databases involved in the transaction. That is, we imagine a graph with each unaborted transaction as a vertex. If A reads or updates a row and B subsequently updates it, we add an edge from A to B. If A updates a row and B subsequently reads the updated version (or a later version), we also add an edge from A to B. If, at any time, adding an edge to the graph would create a cycle, we abort one of the constituent transactions. Kevin Grittner and Emmanuel Cecchet pointed out a paper by Michael Cahill on this topic[2]; one of the advantages of this approach is that it can prevent all serialization anomalies, which our current approach cannot. Kevin and Dan Ports have proposed a patch for 9.1 which would implement true serializability for a single PostgreSQL database, but it's not clear that this would scale well to a distributed system.
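As a toy illustration of the bookkeeping involved, here's a sketch that records edges in an ordinary table and uses a recursive query to ask whether a proposed edge would close a cycle. A real implementation would live inside the server and span nodes; everything here is hypothetical.

    CREATE TABLE txn_edges (from_txn int, to_txn int);

    -- Suppose B (txn 2) updated a row that A (txn 1) had read: edge A -> B.
    INSERT INTO txn_edges VALUES (1, 2);

    -- Before recording a new edge B -> A, test whether B is already
    -- reachable from A. If it is, the new edge would close a cycle,
    -- and one of the transactions must be aborted.
    WITH RECURSIVE reachable(txn) AS (
        SELECT to_txn FROM txn_edges WHERE from_txn = 1
      UNION
        SELECT e.to_txn
        FROM txn_edges e JOIN reachable r ON e.from_txn = r.txn
    )
    SELECT EXISTS (SELECT 1 FROM reachable WHERE txn = 2) AS would_cycle;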
[1] e.g. The Principle of Commitment Ordering, or Guaranteeing Serializability in a Heterogeneous Environment of Multiple Autonomous Resource-Managers, Yoav Raz, 1990 [PDF].
[2] Serializable Isolation for Snapshot Databases, Michael J. Cahill, Uwe Röhm, Alan D. Fekete, SIGMOD 2008 [PDF].
Monday, July 05, 2010
Concurrent Development
PostgreSQL 9.0 beta 3 will be wrapped in the next few days, and at the same time, we'll be branching the tree to begin 9.1 development. This is a new thing for us. In the past, we've waited until the previous release was shipped before opening the tree to new development. However, at the PGCon 2010 development meeting, we decided to try something different this time.
I believe that the primary motivation for this change was that, as we get closer to release, there are fewer and fewer issues to work on, and fewer and fewer people who can be involved in fixing them. So, waiting until release to branch the tree leaves a substantial portion of the developer community sitting idle. A second advantage is that it shortens the time between releases - our tentative plan is to use the same release schedule for 9.1 that we did for 9.0. The first CommitFest for 9.0 began on July 15, 2009, and the first CommitFest for 9.1 will begin on July 15, 2010; the last CommitFest for 9.0 began on January 15, 2010, and the last CommitFest for 9.1 will begin on January 15, 2011. Of course, the actual release date will almost certainly be different, but the plan is for feature freeze to happen about the same time next year that it did this year, so that we can continue to have releases about a year apart.
Of course, the danger of concurrent development is that the work people are doing for 9.1 may distract us from finishing 9.0. Hopefully that won't happen, because I think there is a lot to like about the new process.