Monday, December 18, 2017

MVCC and VACUUM

Experienced PostgreSQL users and developers rattle off the terms “MVCC” and “VACUUM” as if everyone should know what they are and how they work, but in fact many people don’t. This blog post is my attempt to explain what MVCC is and why PostgreSQL uses it, what VACUUM is and how it works, and why we need VACUUM to implement MVCC. In addition, I’ll include a few useful links for further reading for those who may want to know more.

PostgreSQL is a relational database and therefore attempts to provide transactions which have the “ACID” properties: atomicity, consistency, isolation, and durability. Let’s ignore consistency and durability for the moment. Atomicity means that transactions are “all or nothing”; either the transaction commits and all of its changes take effect, or it aborts and none of the changes take effect. Isolation means that each transaction should be unaware that there are other transactions executing concurrently. For example, if transaction A updates two rows with a = 1 to have a = 2 and concurrent transaction B deletes all rows with a = 2, it should never happen that one of the two rows updated by A gets deleted and the other does not. Either A “happens first” and thus both rows are deleted, or B “happens first” and thus neither row is deleted. Taken together, this means that the database would like to give the user the impression that no concurrent changes are taking place while their transaction is in progress, and that when their transaction commits, all of the changes it made become instantaneously and simultaneously visible to everyone.

Relational databases in general - not PostgreSQL specifically - have settled on two ways of doing this. The first is two-phase locking: a transaction takes a shared lock on every bit of data it reads and an exclusive lock on every bit of data that it writes. When it commits, it releases those locks; if it aborts, it reverses the changes that it made and then releases locks. Because any modified data is protected by an exclusive lock, no concurrent transaction will see that change until it is committed. Meanwhile, readers can coexist, since their locks are all shared. Unfortunately, when data is both read and written, this approach has poor concurrency: readers and writers block each other, because the shared locks taken by readers conflict with the exclusive locks taken by writers. Locks are bad for concurrency, and two-phase locking requires exclusive locks (and shared locks, for full correctness) to be held for the entire duration of the transaction.

The second approach to providing transactions with atomicity and isolation is multi-version concurrency control (MVCC). The basic idea is simple: instead of locking a row that we want to update, let’s just create a new version of it which, initially, is visible only to the transaction which created it. Once the updating transaction commits, we’ll make the new row visible to all new transactions that start after that point, while existing transactions continue to see the old row. That way, every transaction sees what is in essence a consistent snapshot of the database. Finally, when there are no transactions remaining that can see the old version of the row, we can discard it. This approach has much better concurrency than two-phase locking, which is why it is used by many modern database systems. However, we’ve exchanged our concurrency problem for a space management problem. With two-phase locking, a row that gets updated can be updated in place; because the updating transaction holds an exclusive lock, nobody else can see the modifications until the lock is released. With MVCC, this won’t work, since we need to keep both versions at least for a short while (or perhaps for a long while, if there are long-running concurrent transactions). Either the old row version must be saved elsewhere, and the new version put in its place, or the old row version must be left untouched, and the new version stored elsewhere. This is an area where different database systems have adopted different approaches.

PostgreSQL’s approach to this problem is to store new row versions created by UPDATE in basically the same way that it would store a completely new row generated by an INSERT. Once an UPDATE commits, future SQL statements (or future transactions, depending on your choice of isolation level) will see the new row versions, and the old ones will be visible only to statements (or transactions) which are already running. Eventually, there will no longer be anything running which can see the old row versions, at which point they are “dead”. Similarly, when a DELETE commits, there will eventually be nothing running which can see the old row versions, at which point they are likewise “dead”. Similarly, if an INSERT aborts, the new row versions which it inserted go from being visible only to the inserting transaction to being visible to nobody at all, and are thus “dead”, and likewise when an UPDATE aborts, any new row versions which it created are “dead”. The Heroku Dev Center has a well-written article explaining some details of how this works under the hood.

And that brings us to the need for VACUUM. Any system which uses MVCC to implement transaction isolation must have some way of getting rid of dead row versions. If it did not, the number of dead row versions would grow without bound, and therefore the database size would grow without bound, which would be bad. At some point, we must get rid of the dead row versions so that the space can be reused. In PostgreSQL, this is VACUUM’s job. Some other systems use different cleanup mechanisms for different kinds of dead row versions; for example, dead row versions created by aborted transactions may be handled differently from dead row versions created by committed transactions. In PostgreSQL, we handle of these in the same way. It bears mentioning that in certain cases, dead row versions can be cleaned up using a page-at-a-time mechanism called HOT pruning (for technical details, see README.HOT). These cases are actually quite common; however, HOT pruning is a “best effort” strategy, and VACUUM is the backstop that makes sure the work gets done and completes whatever cleanup can’t be done by HOT pruning. (This brief explanation omits many interesting details, but this post is already getting quite long.)

So, here’s how VACUUM works. First, it scans every page in the table (also known as the “heap”) that might possibly contain dead row versions. A data structure called the visibility map keeps track of which pages have been modified since the last VACUUM; those that have not been may be skipped. As it does, it removes dead row versions from those pages and makes the space occupied by those tuples available for reuse. During this process, dead row versions are replaced with a sort of tombstone - in technical terms, the line pointer that pointed to the row version which was removed is marked LP_DEAD. If no dead tuples are found during this stage, VACUUM stops. Otherwise, it scans every index on the table and removes all index entries which point to the dead line pointers identified during the first phase. Once this phase is complete, it goes back to the table and removes the tombstones - in technical terms, the LP_DEAD line pointers are marked LP_UNUSED. Once this is done, those line pointers can be reused for new tuples. If those line pointers were reused before the index entries were removed, we would potentially end up with index entries that don’t match the row versions to which they point, which could cause queries to return wrong answers.

The previous paragraph used the term “line pointer”, but did not define it. Each page begins with an array of line pointers which identify the starting position of each tuple on the page, and the length of that tuple. There are a few special kinds of line pointers, including LP_UNUSED (which means that the array slot is available to be used for a new tuple), LP_DEAD (which, as noted above, means that the tuple has been removed from the page but the line pointer slot isn’t available to be reused yet), and LP_REDIRECT (which is related to HOT, but beyond the scope of this blog post).

In short, VACUUM scans the parts of the table that have changed since the last VACUUM, then scans the indexes in their entirety, then scans the changed parts of the table again. By doing so, it ensures that dead row versions and the index entries which referenced them are removed, and that the line pointers which referenced those dead row versions are made available for reuse.

Some final notes:

1. An ordinary VACUUM is very different from VACUUM FULL, which actually rewrites the whole table.

2. In addition to its function in removing dead tuples, VACUUM also prevents transaction ID wraparound, updates statistics used by the query planner, and updates the visibility map. Updating the visibility map is important not only to make future VACUUM operations more efficient, but also to make index-only scans work better. This is all covered in the documentation.

3. Generally, you don’t need to run VACUUM by hand, because the autovacuum daemon will arrange to run it automatically. However, there are definitely times when it’s useful to run VACUUM by hand. This can be a good idea, for example, after a bulk load, or during a low-usage time of day, to reduce the need for later vacuuming during high-usage periods.

12 comments:

  1. what the MVCC supporters never seem to discuss: does the protocol actually improve concurrency? if A reads row X, then B reads row X, and B updates row X before A does, what happens to A's session? does it abort, since it's trying to update an old row? does it commit, since "last writer wins", even though A had temporal priority?

    classic locking, with intelligent transaction scope, ensures that "last writer wins" is always really the last writer.

    ReplyDelete
    Replies
    1. The answer to the question "does the protocol actually improve concurrency?" is clearly yes. In your example, if A never tries to write X, A and B do not conflict and can run at the same time, whereas with two-phase locking that would not be possible.

      In answer to your question about what happens if A tries to update X after B has already done so, please note that it is not the case that these issues are never discussed. In fact, these precise issues are discussed in detail in the documentation about isolation levels to which I linked above: https://www.postgresql.org/docs/current/static/transaction-iso.html

      The short answer is that, if A tries to update row X after B has already done so, then, if the transaction isolation level is REPEATABLE READ or higher, A will fail with a serialization error. At READ COMMITTED, we do something complicated: we re-check whether the updated version of row X still matches the selection criteria of the query, and if so, we apply the updates to the new version of the tuple. This behavior is a little bit magical but works well in simple cases, again as explained the documentation linked above. And again, it's optional and stricter behavior is possible just by raising the isolation level.

      All that having been said, I imagine that it is possible to find a workload where (a) the READ COMMITTED behavior of trying to apply the update to the modified row is problematic but (b) increasing the isolation level produces too many retries such that the final result is worse than just waiting for a lock. I doubt that it is typical, because otherwise it seems unlikely that not only PostgreSQL but also Oracle and MySQL/InnoDB would support it as the only mode, or that Microsoft would have added it as an optional to SQL Server releases.

      https://en.wikipedia.org/wiki/List_of_databases_using_MVCC is an impressive list.

      Delete
    2. MySQL does something absolutely ridiculous here. https://blog.pythian.com/understanding-mysql-isolation-levels-repeatable-read/

      Delete
  2. typo:
    "even though B had temporal priority"

    ReplyDelete
  3. thanx. although my question was not PG specific, which you were kind enough to describe, but whether the MVCC protocol itself defined behavior in the case of multiple writers. I don't quibble with the case of multi-read, single-write behavior. it just seems, at least to me, that classic row-locking protocol behaves much the same in any engine, while MVCC implementations are free to (or, have to) define their own protocol.

    my ultimate fantasy got-to-have is a sort of column level locking; viz. updates to non-key (and, by definition in xNF no non-key dependencies) columns are never blocked. any write to such columns are never controlled; they are, after all, "just data".

    ReplyDelete
    Replies
    1. I think you're somewhat right. MVCC is a general approach and there are variations in how that approach is implemented. That's probably also true of 2PL to some degree, but maybe not as much.

      I agree with you that column-level locking would be really nice. Since PostgreSQL 9.3, we distinguish between updates to "key" columns and "non-key" columns, but that only lets you lock all columns that are considered non-key, or else everything. It would be nice to have locking granularity all the way down to the individual column level, or maybe even more fine-grained in the case of array, JSON, or XML values.

      Delete
  4. Does Postgres need to go and rewrite pages for all tables (event ones that were not changed) every 200M transactions or so? Or this was fixed in recent version?

    ReplyDelete
  5. It never needed to rewrite unchanged pages more than once, but it does need to rewrite them once. It used to scan them every 200M transactions even though they were not rewritten, but that's fixed as of PostgreSQL 9.6. Now, if they haven't been modified since the last scan, they won't be scanned again.

    Also note that the default value of autovacuum_freeze_max_age, by default 200M, can be raised significantly. The main problem is that if the scan -- which is triggered when that limit is hit -- doesn't start and complete successfully before we use 2B transaction IDs since the last scan, the cluster will shut down. However, 200M -> 2B leaves a lot of headroom; that limit can be doubled, tripled, or even quadrupled without much problem, especially if accompanied by a good monitoring regimen.

    ReplyDelete
    Replies
    1. Perfect - thank you for confirming!
      We hit this issue on poorly written application ran lot's of TPS while database was also storing some very large historical tables and vacuum couldn't keep up (due to slow storage hardware). Would be great to see read-only or trasportable tablespaces in some next release!

      Delete
    2. I agree, but in the meantime, the PostgreSQL 9.6 changes should help a lot for that kind of use case, since as of that version you won't need to keep rescanning the historical tables.

      Delete
  6. Hi Bob,

    My question is how does Postgres handle the old versions of UPDATE. For example, there are two versions for a given tuple (before and after UPDATE). How could VACCUM know the old version is still being used by other txns? Like txn-A reads this tuple 2 times. The first time txn-A read a version, then this tuple is UPDATED by txn-B, then txn-A read this tuple again. VACCUM should not free the old tuple before txn-A completes. But how could VACCUM know that?

    ReplyDelete
    Replies
    1. Every SELECT, UPDATE, etc. operates according to what we call a snapshot of the database. We keep track of the oldest snapshots that exist anywhere in the system. So, in your example, once txn-B is known to have committed longer ago than any snapshot in the system was taken, we know that the old row version is no longer visible to anyone and can be removed.

      Delete