What if PostgreSQL didn’t need VACUUM at all? This seems hard to imagine. After all, PostgreSQL uses multi-version concurrency control (MVCC), and if you create multiple versions of rows, you have to eventually get rid of the row versions somehow. In PostgreSQL, VACUUM is in charge of making sure that happens, and the autovacuum process is in charge of making sure that happens soon enough. Yet, other schemes are possible, as shown by the fact that not all relational databases handle MVCC in the same way, and there are reasons to believe that PostgreSQL could benefit significantly from adopting a new approach. In fact, many of my colleagues at EnterpriseDB are busy implementing a new approach, and today I’d like to tell you a little bit about what we’re doing and why we’re doing it.
While it’s certainly true that VACUUM has significantly improved over the years, there are some problems that are very difficult to solve in the current system structure. Because old row versions and new row versions are stored in the same place - the table, also known as the heap - updating a large number of rows must, at least temporarily, make the heap bigger. Depending on the pattern of updates, it may be impossible to easily shrink the heap again afterwards. For example, imagine loading a large number of rows into a table and then updating half of the rows in each block. The table size must grow by 50% to accommodate the new row versions. When VACUUM removes the old versions of those rows, the original table blocks are now all 50% full. That space is available for new row versions, but there is no easy way to move the rows from the new newly-added blocks back to the old half-full blocks: you can use VACUUM FULL or you can use third-party tools like pg_repack, but either way you end up rewriting the whole table. Proposals have been made to try to relocate rows on the fly, but it’s hard to do correctly and risks bloating the indexes, since each row moved requires a new entry in each index to point to the new location of the row.
If heap bloat is caused by a single gigantic update, it can often be avoided by breaking down the large update into a series of smaller updates, running VACUUM in between or giving autovacuum a chance to do so. In this way, old row versions from the first update are reclaimed and become free space that can be reused for the second update, and so bloat is reduced. However, there are some access patterns where the table becomes bloated not because of one big update but from many small updates occurring over a long period of time. A simple example is to open a transaction which does a single-row UPDATE and then remains idle for a long time; meanwhile, other transactions continue to do writes to the database, large or small. Whichever tables are being frequently modified will bloat, and once again, there’s no easy way to shrink them again afterwards. To be clear, I’m not saying that it’s a particularly good idea to open a write transaction and then leave the session idle for a long time; typically, when this happens, it’s the result of a poorly written client application that forgot it had a transaction open. Moreover, any relational database is going to suffer under such a workload. In my view, the problem isn’t so much that PostgreSQL can’t cope with such a situation gracefully - that would be too much to expect - but that it’s potentially quite painful to recover afterward. Long-running reporting queries can create similar problems.
To put this another way, it is in general true that PostgreSQL’s VACUUM implementation has gotten progressively better at reclaiming space occupied by dead tuples more quickly and with less expenditure of effort. And that’s really good, because the faster you reclaim space, the less new space you end up allocating, which keeps tables small and performance high. However, the examples above show that VACUUM isn’t the whole problem. In these examples, even if VACUUM ran at the earliest instant when it could reclaim the space occupied by dead tuples and ran infinitely fast, the table would still become bloated. In the case where the bloat is caused by many short queries run while one long-running transaction remains open, we could, with smarter snapshot management, limit the worst-case bloat to approximately a factor of two -- that is, we’d keep the version of the tuple visible to the old snapshot and the current version, and discard the intermediate versions, a trick PostgreSQL currently can’t manage. However, even a factor of two is a lot, and what if there are multiple distinct open snapshots? Further, in the case where the bloat is created by a SQL statement that induces scattered updates throughout the table, no improvement to VACUUM can possibly help. By the time that SQL statement finishes, the damage is already done.
In some sense, therefore, blaming bloat on deficiencies of VACUUM is like blaming your local garbage collector for the fact that your house is constantly surrounded by dozens of completely full trash barrels. In such a situation, it might be true that the garbage collector should come around a bit more frequently or work a little faster or harder, but maybe part of the problem is YOU. Unless your trash service is exceptionally bad, to have such a large amount of garbage around all the time, you must be generating it at an enormous rate. If you stop throwing so much stuff away, it won’t matter as much how often the garbage collector comes around. It might also help if you put all of the trash in one giant dumpster instead of many separate barrels strewn hither and yon.
The problems in this area stem largely from the fact that PostgreSQL is unable to do a true, transactional UPDATE in place. Under the hood, UPDATE looks a lot like a DELETE plus an INSERT. If your table contains no free space and you execute a transaction that updates one tuple, then there are two things that are definitely true, regardless of whether the UPDATE transaction commits or aborts. The first is that the table must grow in order to accommodate the new row version, and the second is that we end up with a dead row version - either the old version ends up dead, if the transaction commits, or the new version becomes dead immediately, if the transaction aborts. Either way, the table is now a tiny bit bloated, and either way, there’s now work for VACUUM to do.
This system is very much symmetric. A transaction that commits generates pretty much the same amount of work as a transaction that aborts. This is very elegant, but it’s not best in practice, because very few people run workloads where anywhere close to 50% of transactions abort. (Anyone who has such a workload will, I suspect, find that PostgreSQL handily outperforms the competition.) It would be better to have a system where we try to make commits cheaper, and aborts more expensive.
That brings me to the design which EnterpriseDB is proposing. We are working to build a new table storage format for PostgreSQL, which we’re calling zheap. In a zheap, whenever possible, we handle an UPDATE by moving the old row version to an undo log, and putting the new row version in the place previously occupied by the old one. If the transaction aborts, we retrieve the old row version from undo and put it back in the original location; if a concurrent transaction needs to see the old row version, it can find it in undo. Of course, this doesn’t work when the block is full and the row is getting wider, and there are some other problem cases as well, but it covers many useful cases. In the typical case, therefore, even bulk updates do not force a zheap to grow. Instead, the undo grows. When a transaction commits, all row versions that will become dead are in the undo, not the zheap.
This means that there is no need for VACUUM, or any similar process, to scan the table looking for dead rows. The only time the table contains dead rows is when a transaction aborts, and in that case we immediately use the undo to go put the old, still-living row versions back where they were. That process is targeted and does not involve needing to scan the entire table. When a transaction commits, we can eventually discard the undo it generated; this is a bulk operation and can be done very quickly.
Handling indexes is more complex, but we believe that we can also eliminate the need for index vacuuming using the undo infrastructure. That topic, however, would make this blog post far longer than it already is, so I will leave it for another time. There is also a great deal more detail about the design of zheap which I would like to write about, but that, too, will need to wait for another time. This post is intended to explain why we have undertaken this work, not exactly what we are doing.
I realize that there is likely to be a good deal of skepticism among experienced PostgreSQL hackers about the feasibility of making this approach work. I do not claim that we have solved all of the problems, nor that success is assured. A huge amount of work remains to be done. Even if all of that work is successfully completed and even if all of the remaining problems are solved, there will probably still be cases where the existing heap outperforms zheap. That having been said, we have built enough of this new system that parts of it can be tested against the current heap implementation, and those test results look promising. In the near future, we will release the code we have so far under the PostgreSQL license so that the whole PostgreSQL community can look at it, form their own opinions, and run their own tests.
Although I did much of the basic design work, the development lead for this project is Amit Kapila, who has been assisted by Dilip Kumar, Kuntal Ghosh, Mithun CY, Ashutosh Sharma, Rafia Sabih, and Beena Emerson. Thomas Munro wrote the undo storage system. Marc Linster has provided unfailing management support, and Andres Freund has provided some design input (and criticism). Thanks to all of them.