Robert Haas: MVCC and VACUUM

Monday, December 18, 2017

MVCC and VACUUM

Experienced PostgreSQL users and developers rattle off the terms “MVCC” and “VACUUM” as if everyone should know what they are and how they work, but in fact many people don’t. This blog post is my attempt to explain what MVCC is and why PostgreSQL uses it, what VACUUM is and how it works, and why we need VACUUM to implement MVCC. In addition, I’ll include a few useful links for further reading for those who may want to know more.

PostgreSQL is a relational database and therefore attempts to provide transactions which have the “ACID” properties: atomicity, consistency, isolation, and durability. Let’s ignore consistency and durability for the moment. Atomicity means that transactions are “all or nothing”; either the transaction commits and all of its changes take effect, or it aborts and none of the changes take effect. Isolation means that each transaction should be unaware that there are other transactions executing concurrently. For example, if transaction A updates two rows with a = 1 to have a = 2 and concurrent transaction B deletes all rows with a = 2, it should never happen that one of the two rows updated by A gets deleted and the other does not. Either A “happens first” and thus both rows are deleted, or B “happens first” and thus neither row is deleted. Taken together, this means that the database would like to give the user the impression that no concurrent changes are taking place while their transaction is in progress, and that when their transaction commits, all of the changes it made become instantaneously and simultaneously visible to everyone.

Relational databases in general - not PostgreSQL specifically - have settled on two ways of doing this. The first is two-phase locking: a transaction takes a shared lock on every bit of data it reads and an exclusive lock on every bit of data that it writes. When it commits, it releases those locks; if it aborts, it reverses the changes that it made and then releases locks. Because any modified data is protected by an exclusive lock, no concurrent transaction will see that change until it is committed. Meanwhile, readers can coexist, since their locks are all shared. Unfortunately, when data is both read and written, this approach has poor concurrency: readers and writers block each other, because the shared locks taken by readers conflict with the exclusive locks taken by writers. Locks are bad for concurrency, and two-phase locking requires exclusive locks (and shared locks, for full correctness) to be held for the entire duration of the transaction.

The second approach to providing transactions with atomicity and isolation is multi-version concurrency control (MVCC). The basic idea is simple: instead of locking a row that we want to update, let’s just create a new version of it which, initially, is visible only to the transaction which created it. Once the updating transaction commits, we’ll make the new row visible to all new transactions that start after that point, while existing transactions continue to see the old row. That way, every transaction sees what is in essence a consistent snapshot of the database. Finally, when there are no transactions remaining that can see the old version of the row, we can discard it. This approach has much better concurrency than two-phase locking, which is why it is used by many modern database systems. However, we’ve exchanged our concurrency problem for a space management problem. With two-phase locking, a row that gets updated can be updated in place; because the updating transaction holds an exclusive lock, nobody else can see the modifications until the lock is released. With MVCC, this won’t work, since we need to keep both versions at least for a short while (or perhaps for a long while, if there are long-running concurrent transactions). Either the old row version must be saved elsewhere, and the new version put in its place, or the old row version must be left untouched, and the new version stored elsewhere. This is an area where different database systems have adopted different approaches.

PostgreSQL’s approach to this problem is to store new row versions created by UPDATE in basically the same way that it would store a completely new row generated by an INSERT. Once an UPDATE commits, future SQL statements (or future transactions, depending on your choice of isolation level) will see the new row versions, and the old ones will be visible only to statements (or transactions) which are already running. Eventually, there will no longer be anything running which can see the old row versions, at which point they are “dead”. Similarly, when a DELETE commits, there will eventually be nothing running which can see the old row versions, at which point they are likewise “dead”. Similarly, if an INSERT aborts, the new row versions which it inserted go from being visible only to the inserting transaction to being visible to nobody at all, and are thus “dead”, and likewise when an UPDATE aborts, any new row versions which it created are “dead”. The Heroku Dev Center has a well-written article explaining some details of how this works under the hood.

And that brings us to the need for VACUUM. Any system which uses MVCC to implement transaction isolation must have some way of getting rid of dead row versions. If it did not, the number of dead row versions would grow without bound, and therefore the database size would grow without bound, which would be bad. At some point, we must get rid of the dead row versions so that the space can be reused. In PostgreSQL, this is VACUUM’s job. Some other systems use different cleanup mechanisms for different kinds of dead row versions; for example, dead row versions created by aborted transactions may be handled differently from dead row versions created by committed transactions. In PostgreSQL, we handle of these in the same way. It bears mentioning that in certain cases, dead row versions can be cleaned up using a page-at-a-time mechanism called HOT pruning (for technical details, see README.HOT). These cases are actually quite common; however, HOT pruning is a “best effort” strategy, and VACUUM is the backstop that makes sure the work gets done and completes whatever cleanup can’t be done by HOT pruning. (This brief explanation omits many interesting details, but this post is already getting quite long.)

So, here’s how VACUUM works. First, it scans every page in the table (also known as the “heap”) that might possibly contain dead row versions. A data structure called the visibility map keeps track of which pages have been modified since the last VACUUM; those that have not been may be skipped. As it does, it removes dead row versions from those pages and makes the space occupied by those tuples available for reuse. During this process, dead row versions are replaced with a sort of tombstone - in technical terms, the line pointer that pointed to the row version which was removed is marked LP_DEAD. If no dead tuples are found during this stage, VACUUM stops. Otherwise, it scans every index on the table and removes all index entries which point to the dead line pointers identified during the first phase. Once this phase is complete, it goes back to the table and removes the tombstones - in technical terms, the LP_DEAD line pointers are marked LP_UNUSED. Once this is done, those line pointers can be reused for new tuples. If those line pointers were reused before the index entries were removed, we would potentially end up with index entries that don’t match the row versions to which they point, which could cause queries to return wrong answers.

The previous paragraph used the term “line pointer”, but did not define it. Each page begins with an array of line pointers which identify the starting position of each tuple on the page, and the length of that tuple. There are a few special kinds of line pointers, including LP_UNUSED (which means that the array slot is available to be used for a new tuple), LP_DEAD (which, as noted above, means that the tuple has been removed from the page but the line pointer slot isn’t available to be reused yet), and LP_REDIRECT (which is related to HOT, but beyond the scope of this blog post).

In short, VACUUM scans the parts of the table that have changed since the last VACUUM, then scans the indexes in their entirety, then scans the changed parts of the table again. By doing so, it ensures that dead row versions and the index entries which referenced them are removed, and that the line pointers which referenced those dead row versions are made available for reuse.

Some final notes:

1. An ordinary VACUUM is very different from VACUUM FULL, which actually rewrites the whole table.

2. In addition to its function in removing dead tuples, VACUUM also prevents transaction ID wraparound, updates statistics used by the query planner, and updates the visibility map. Updating the visibility map is important not only to make future VACUUM operations more efficient, but also to make index-only scans work better. This is all covered in the documentation.

3. Generally, you don’t need to run VACUUM by hand, because the autovacuum daemon will arrange to run it automatically. However, there are definitely times when it’s useful to run VACUUM by hand. This can be a good idea, for example, after a bulk load, or during a low-usage time of day, to reduce the need for later vacuuming during high-usage periods.

15 comments:

AnonymousDecember 18, 2017 1:41 PM
what the MVCC supporters never seem to discuss: does the protocol actually improve concurrency? if A reads row X, then B reads row X, and B updates row X before A does, what happens to A's session? does it abort, since it's trying to update an old row? does it commit, since "last writer wins", even though A had temporal priority?

classic locking, with intelligent transaction scope, ensures that "last writer wins" is always really the last writer.
ReplyDelete
Replies
AnonymousDecember 18, 2017 1:42 PM
typo:
"even though B had temporal priority"
ReplyDelete
Replies
AnonymousDecember 18, 2017 2:47 PM
thanx. although my question was not PG specific, which you were kind enough to describe, but whether the MVCC protocol itself defined behavior in the case of multiple writers. I don't quibble with the case of multi-read, single-write behavior. it just seems, at least to me, that classic row-locking protocol behaves much the same in any engine, while MVCC implementations are free to (or, have to) define their own protocol.

my ultimate fantasy got-to-have is a sort of column level locking; viz. updates to non-key (and, by definition in xNF no non-key dependencies) columns are never blocked. any write to such columns are never controlled; they are, after all, "just data".
ReplyDelete
Replies
AnonymousDecember 18, 2017 8:30 PM
Does Postgres need to go and rewrite pages for all tables (event ones that were not changed) every 200M transactions or so? Or this was fixed in recent version?
ReplyDelete
Replies
Robert HaasDecember 19, 2017 11:51 AM
It never needed to rewrite unchanged pages more than once, but it does need to rewrite them once. It used to scan them every 200M transactions even though they were not rewritten, but that's fixed as of PostgreSQL 9.6. Now, if they haven't been modified since the last scan, they won't be scanned again.

Also note that the default value of autovacuum_freeze_max_age, by default 200M, can be raised significantly. The main problem is that if the scan -- which is triggered when that limit is hit -- doesn't start and complete successfully before we use 2B transaction IDs since the last scan, the cluster will shut down. However, 200M -> 2B leaves a lot of headroom; that limit can be doubled, tripled, or even quadrupled without much problem, especially if accompanied by a good monitoring regimen.
ReplyDelete
Replies
UnknownDecember 30, 2017 1:14 AM
Hi Bob,

My question is how does Postgres handle the old versions of UPDATE. For example, there are two versions for a given tuple (before and after UPDATE). How could VACCUM know the old version is still being used by other txns? Like txn-A reads this tuple 2 times. The first time txn-A read a version, then this tuple is UPDATED by txn-B, then txn-A read this tuple again. VACCUM should not free the old tuple before txn-A completes. But how could VACCUM know that?
ReplyDelete
Replies
Kirttle KirttleJanuary 10, 2018 12:30 PM
Great Post, Great Questions, And Great Answers.
Thanks
ReplyDelete
Replies
UnknownJanuary 11, 2018 2:27 PM
Can a long running query on a table block the autovacuum worker on the table and thus cause data loss due to transaction ID wraparound?
ReplyDelete
Replies

Add comment