What if PostgreSQL didn’t need VACUUM at all? This seems hard to imagine. After all, PostgreSQL uses multi-version concurrency control (MVCC), and if you create multiple versions of rows, you have to eventually get rid of the row versions somehow. In PostgreSQL, VACUUM is in charge of making sure that happens, and the autovacuum process is in charge of making sure that happens soon enough. Yet, other schemes are possible, as shown by the fact that not all relational databases handle MVCC in the same way, and there are reasons to believe that PostgreSQL could benefit significantly from adopting a new approach. In fact, many of my colleagues at EnterpriseDB are busy implementing a new approach, and today I’d like to tell you a little bit about what we’re doing and why we’re doing it.
While it’s certainly true that VACUUM has significantly improved over the years, there are some problems that are very difficult to solve within the current system structure. Because old row versions and new row versions are stored in the same place - the table, also known as the heap - updating a large number of rows must, at least temporarily, make the heap bigger. Depending on the pattern of updates, it may be impossible to easily shrink the heap again afterwards. For example, imagine loading a large number of rows into a table and then updating half of the rows in each block. The table size must grow by 50% to accommodate the new row versions. When VACUUM removes the old versions of those rows, the original table blocks are all left 50% full. That space is available for new row versions, but there is no easy way to move the rows from the newly-added blocks back to the old half-full blocks: you can use VACUUM FULL or a third-party tool like pg_repack, but either way you end up rewriting the whole table. Proposals have been made to relocate rows on the fly, but doing so correctly is hard and risks bloating the indexes, since each row moved requires a new entry in each index to point to the row’s new location.
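The arithmetic in this example can be sketched with a toy model of the heap (block and row counts are made up for illustration; this is not PostgreSQL's actual page layout):

```python
# Toy model: a heap as a list of blocks, each a list of row versions.
ROWS_PER_BLOCK = 4
NUM_BLOCKS = 100

# Load the table: every block is completely full of live rows.
heap = [["live"] * ROWS_PER_BLOCK for _ in range(NUM_BLOCKS)]

# UPDATE half the rows in each block: each old version becomes dead in
# place, and each new version needs free space.  Every existing block
# is full, so new versions land in newly-added blocks at the end.
new_versions = []
for block in heap:
    for i in range(ROWS_PER_BLOCK // 2):
        block[i] = "dead"
        new_versions.append("live")
while new_versions:
    heap.append([new_versions.pop() for _ in range(ROWS_PER_BLOCK)])
print(len(heap))  # 150 -- the table grew by 50%

# VACUUM removes the dead versions.  The original blocks are now half
# full, but the table is still 150 blocks: nothing moves rows out of
# the tail blocks back into that free space.
for block in heap:
    block[:] = [r for r in block if r == "live"]
print(len(heap))  # still 150
```

Only a full rewrite - VACUUM FULL or pg_repack - would bring the table back down to 100 blocks.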
If heap bloat is caused by a single gigantic update, it can often be avoided by breaking down the large update into a series of smaller updates, running VACUUM in between or giving autovacuum a chance to do so. In this way, old row versions from the first update are reclaimed and become free space that can be reused for the second update, and so bloat is reduced. However, there are some access patterns where the table becomes bloated not because of one big update but from many small updates occurring over a long period of time. A simple example is to open a transaction which does a single-row UPDATE and then remains idle for a long time; meanwhile, other transactions continue to do writes to the database, large or small. Whichever tables are being frequently modified will bloat, and once again, there’s no easy way to shrink them again afterwards. To be clear, I’m not saying that it’s a particularly good idea to open a write transaction and then leave the session idle for a long time; typically, when this happens, it’s the result of a poorly written client application that forgot it had a transaction open. Moreover, any relational database is going to suffer under such a workload. In my view, the problem isn’t so much that PostgreSQL can’t cope with such a situation gracefully - that would be too much to expect - but that it’s potentially quite painful to recover afterward. Long-running reporting queries can create similar problems.
To put this another way, it is in general true that PostgreSQL’s VACUUM implementation has gotten progressively better at reclaiming space occupied by dead tuples more quickly and with less expenditure of effort. And that’s really good, because the faster you reclaim space, the less new space you end up allocating, which keeps tables small and performance high. However, the examples above show that VACUUM isn’t the whole problem. In these examples, even if VACUUM ran at the earliest instant when it could reclaim the space occupied by dead tuples and ran infinitely fast, the table would still become bloated. In the case where the bloat is caused by many short queries run while one long-running transaction remains open, we could, with smarter snapshot management, limit the worst-case bloat to approximately a factor of two -- that is, we’d keep the version of the tuple visible to the old snapshot and the current version, and discard the intermediate versions, a trick PostgreSQL currently can’t manage. However, even a factor of two is a lot, and what if there are multiple distinct open snapshots? Further, in the case where the bloat is created by a SQL statement that induces scattered updates throughout the table, no improvement to VACUUM can possibly help. By the time that SQL statement finishes, the damage is already done.
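The "factor of two" idea can be sketched as follows, using an illustrative visibility model rather than PostgreSQL's actual snapshot machinery: with one old snapshot open, only the version that snapshot sees and the current version are needed, and everything in between is invisible to everyone.

```python
def prune(chain, snapshots):
    """chain: a row's versions ordered oldest to newest, identified by
    creation time; snapshots: timestamps of open snapshots.  Keep only
    the versions some snapshot (or the current state) can still see."""
    keep = {chain[-1]}                 # the current version
    for snap in snapshots:
        visible = [v for v in chain if v <= snap]
        if visible:
            keep.add(visible[-1])      # newest version this snapshot sees
    return [v for v in chain if v in keep]

# A row updated at t=1 through t=9 while a snapshot from t=1 stays open:
chain = list(range(1, 10))
print(prune(chain, snapshots=[1]))  # [1, 9] -- two versions, not nine
```

With several distinct open snapshots the bound grows accordingly: `prune(chain, snapshots=[1, 5])` must keep three versions.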
In some sense, therefore, blaming bloat on deficiencies of VACUUM is like blaming your local garbage collector for the fact that your house is constantly surrounded by dozens of completely full trash barrels. In such a situation, it might be true that the garbage collector should come around a bit more frequently or work a little faster or harder, but maybe part of the problem is YOU. Unless your trash service is exceptionally bad, to have such a large amount of garbage around all the time, you must be generating it at an enormous rate. If you stop throwing so much stuff away, it won’t matter as much how often the garbage collector comes around. It might also help if you put all of the trash in one giant dumpster instead of many separate barrels strewn hither and yon.
The problems in this area stem largely from the fact that PostgreSQL is unable to do a true, transactional UPDATE in place. Under the hood, UPDATE looks a lot like a DELETE plus an INSERT. If your table contains no free space and you execute a transaction that updates one tuple, then there are two things that are definitely true, regardless of whether the UPDATE transaction commits or aborts. The first is that the table must grow in order to accommodate the new row version, and the second is that we end up with a dead row version - either the old version ends up dead, if the transaction commits, or the new version becomes dead immediately, if the transaction aborts. Either way, the table is now a tiny bit bloated, and either way, there’s now work for VACUUM to do.
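The symmetry is easy to see in a toy version-chain model (illustrative only; real tuples carry xmin/xmax headers rather than a "dead" flag):

```python
def update(table, row_id, new_value, commits):
    """Model UPDATE as DELETE plus INSERT: append a new version, then
    mark one version dead depending on the transaction's outcome."""
    versions = table[row_id]
    versions.append({"value": new_value, "dead": False})  # table grows
    old, new = versions[-2], versions[-1]
    if commits:
        old["dead"] = True   # commit: the old version is now dead
    else:
        new["dead"] = True   # abort: the new version dies immediately
    return versions

committed = {1: [{"value": "a", "dead": False}]}
update(committed, 1, "b", commits=True)
aborted = {1: [{"value": "a", "dead": False}]}
update(aborted, 1, "b", commits=False)

# Either way: two versions where there was one, and one of them dead.
for t in (committed, aborted):
    print(len(t[1]), sum(v["dead"] for v in t[1]))  # 2 1
```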
This system is very much symmetric: a transaction that commits generates pretty much the same amount of work as a transaction that aborts. This is elegant, but it’s not ideal in practice, because very few people run workloads where anywhere close to 50% of transactions abort. (Anyone who has such a workload will, I suspect, find that PostgreSQL handily outperforms the competition.) It would be better to have a system that makes commits cheaper and aborts more expensive.
That brings me to the design which EnterpriseDB is proposing. We are working to build a new table storage format for PostgreSQL, which we’re calling zheap. In a zheap, whenever possible, we handle an UPDATE by moving the old row version to an undo log, and putting the new row version in the place previously occupied by the old one. If the transaction aborts, we retrieve the old row version from undo and put it back in the original location; if a concurrent transaction needs to see the old row version, it can find it in undo. Of course, this doesn’t work when the block is full and the row is getting wider, and there are some other problem cases as well, but it covers many useful cases. In the typical case, therefore, even bulk updates do not force a zheap to grow. Instead, the undo grows. When a transaction commits, all row versions that will become dead are in the undo, not the zheap.
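A minimal sketch of the idea (class and method names here are invented for illustration; the real zheap implementation is far more involved):

```python
class ZPage:
    """Toy model of a zheap page plus its undo log."""

    def __init__(self, rows):
        self.rows = dict(rows)   # slot -> current row value
        self.undo = []           # undo records for the open transaction

    def update(self, slot, new_value):
        # Move the old version to undo and overwrite in place: the page
        # does not grow, even for a bulk update.
        self.undo.append((slot, self.rows[slot]))
        self.rows[slot] = new_value

    def abort(self):
        # Replay undo newest-first, putting old versions back exactly
        # where they were -- targeted, no table scan required.
        while self.undo:
            slot, old_value = self.undo.pop()
            self.rows[slot] = old_value

    def commit(self):
        # After commit (once no snapshot needs the old versions), the
        # undo can be discarded in bulk.
        self.undo.clear()

page = ZPage({0: "a", 1: "b"})
page.update(0, "a2")
page.update(1, "b2")
print(page.rows)  # {0: 'a2', 1: 'b2'} -- same size as before the update
page.abort()
print(page.rows)  # {0: 'a', 1: 'b'} -- restored from undo
```

A concurrent reader that still needs an old version would look it up in `self.undo` rather than in the page itself.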
This means that there is no need for VACUUM, or any similar process, to scan the table looking for dead rows. The only time the table contains dead rows is when a transaction aborts, and in that case we immediately use the undo to put the old, still-living row versions back where they were. That process is targeted and does not require scanning the entire table. When a transaction commits, we can eventually discard the undo it generated; this is a bulk operation and can be done very quickly.
Handling indexes is more complex, but we believe that we can also eliminate the need for index vacuuming using the undo infrastructure. That topic, however, would make this blog post far longer than it already is, so I will leave it for another time. There is also a great deal more detail about the design of zheap which I would like to write about, but that, too, will need to wait for another time. This post is intended to explain why we have undertaken this work, not exactly what we are doing.
I realize that there is likely to be a good deal of skepticism among experienced PostgreSQL hackers about the feasibility of making this approach work. I do not claim that we have solved all of the problems, nor that success is assured. A huge amount of work remains to be done. Even if all of that work is successfully completed and even if all of the remaining problems are solved, there will probably still be cases where the existing heap outperforms zheap. That having been said, we have built enough of this new system that parts of it can be tested against the current heap implementation, and those test results look promising. In the near future, we will release the code we have so far under the PostgreSQL license so that the whole PostgreSQL community can look at it, form their own opinions, and run their own tests.
Although I did much of the basic design work, the development lead for this project is Amit Kapila, who has been assisted by Dilip Kumar, Kuntal Ghosh, Mithun CY, Ashutosh Sharma, Rafia Sabih, and Beena Emerson. Thomas Munro wrote the undo storage system. Marc Linster has provided unfailing management support, and Andres Freund has provided some design input (and criticism). Thanks to all of them.
Good to know someone is trying to improve it. That's how most other RDBMSes do their MVCC, isn't it? So more data in the shared buffers and in the WAL as a result?
See Amit's blog post (also linked above) for some analysis on this topic: http://amitkapila16.blogspot.com/2015/03/different-approaches-for-mvcc-used-in.html
This sounds much like Oracle rollback segments to me. As a veteran Oracle user I know the problems with their design (contention and overflow are two of them); anyway, I definitely think that moving the garbage outside the heap is a good thing. If you manage to do in-place updates you can avoid index maintenance and the write amplification problem, and foreign key constraint validation can be made much more efficient (today they are all checked during an update even when the update did not touch the foreign key columns!). There is a lot of potential gain from this change.
That is exactly what I was thinking.
Foreign keys in PostgreSQL are triggers so the MVCC model does not matter to them. PostgreSQL's foreign keys could use some optimizations, but they only check when necessary.
Very good news, looking forward to seeing the follow-up articles.
When a row gets shorter by N bytes, do you hold a space allocation lock of N bytes on the page, preventing other transactions from allocating the freed-up space? Otherwise rollbacks could fail.
- Dan Wood
Yes, that's how it needs to work. I'm not sure we've got all of the implementation details sorted out there just yet.
Have a look at this recent paper that describes different approaches to MVCC for in-memory databases. Some of the issues described in the paper might be applicable to PG as well:
This sounds like a very welcome improvement! Is any of this discussion online anywhere? I'm curious about some of the implementation details.
Can the undo log be located on a separate physical disk to avoid some of the performance penalty? Can this buffer handle very large transactions, and if it fills up will it rollback gracefully without downtime?
Is this going to be rolled out alongside the old storage type or is it replacing it?
There is an email with a design proposal somewhere in the pgsql-hackers archives, but not too much else just yet. That will change.
We plan to support allowing the undo to be in a separate tablespace, which could be located on a separate disk.
The buffer will grow as necessary; we don't plan to cap it at some fixed size. So it can handle large transactions as long as you have enough disk space. If you don't, your transactions will start failing.
We plan to propose this as a new, non-default option alongside the old storage type. Even in the unlikely event that this system were better than the old one in all cases and contained no bugs, users running existing releases would still want to upgrade via pg_upgrade.
Would/could you use zheap and heap when creating a tablespace (or even table), like innodb and myisam are chosen when creating a table?
Am I correct that vacuum is still necessary to reclaim free space, as when a row grows shorter (Dan Wood's question above) or is deleted?
No. If a row grows shorter, then there will be some free space in the block, but VACUUM's job isn't to reclaim space that's already free. VACUUM's job is to create free space by removing dead tuples completely, thus freeing the space they occupied.
It may be advantageous to group transactions in the new structures in a way to make them simpler to clean up than Oracle's. Oracle's method of writing full data blocks and expiring based on time or space usage causes many issues. If logs could be expired as transactions are complete, then the space can be reclaimed ad hoc, rather than en masse, and many of the issues with that design can be avoided. Overall, I find the PostgreSQL way easier to manage and better performing under normal workloads (a few dead rows scattered about converted to free space and reused). Since so much of that block is already in file system cache, using that space rather than writing to somewhere else performs very well, and is space efficient except in the case of mass updates.
In our system, all the undo data for a transaction is written consecutively into a single transaction log, so it's all in one place. There are multiple logs, but each one is ordered from oldest transaction to newest transaction. We can throw away the oldest transaction in a log if (a) the transaction committed and is now all-visible or (b) the transaction aborted and all associated undo actions have been performed.
That having been said, I'm not sure that the way we're reclaiming space in the current code is optimal, and that may need more work.
"all the undo data for a transaction is written consecutively into a single transaction log"
This reminds me of the RUJ (Recovery Unit Journals) in Rdb/VMS.
Please excuse my ignorance in the terminology that follows.
This is more exciting than pthreads!
To clarify, zheap will be automatic for the appropriate use cases, and when it isn't appropriate it falls back to business as usual, meaning you still need VACUUM? Or are you saying zheap will replace everything and there will be no more vacuum for any scenario ever?
Also, are there any theoretical performance improvements (or degradations) with this approach? Greater write performance, or the same write performance but no degradation, etc.?
Does this enable other use cases/developments more easily in future like IoT, timescale/logs, real-time materialised views, etc?
Even when it does a non-in-place update, undo will still handle the cleanup. You don't avoid bloat in those cases, but you still avoid vacuum.
There are certainly cases that could be worse -- like an in-place update that changes many indexed columns, or a bulk delete. We're just betting that more things will get better than will get worse.
I don't expect a direct benefit to things like timescale DBs, but there's probably a pluggable storage API coming, so that might be useful for such things.
I'm with the group of skeptics. The current sauce works awesomely :-) Surely edge use cases exist or you wouldn't be working on this, but you are asking to change something that has worked and is working beautifully for most use cases.
Having said that, I'd be OK with this if you made the table format an "opt-in" feature at the table, database and cluster level.
Would zheap speed up counts? I would think so; if that's true, then it'd make sense to use it for DSS loads.
Yes, this will definitely be "opt in". The current sauce does work well for a lot of people, and we don't have any intention of trying to take that sauce away.
I wouldn't expect counts to get particularly faster. The target use case is workloads where bloat is an issue, or where vacuum/freezing are issues.
I'd like to translate the article http://rhaas.blogspot.jp/2018/01/do-or-undo-there-is-no-vacuum.html into Japanese and publish it on our tech blog https://techracho.bpsinc.jp/ to share it. Is that OK with you?
In that case, I will make sure to include a link to the original, the title, and the author's name.
Please go ahead.
Here is the Japanese translation; our writer translated it and our SE checked it for technical accuracy. I hope this will increase the excitement in Japan too.
Thank you for the permission!