In a recent blog post, I talked about why every system that implements MVCC needs some scheme for removing old row versions, and how VACUUM meets that need for PostgreSQL. In this post, I’d like to examine the history of VACUUM improvements in recent years, the state of VACUUM as it exists in PostgreSQL today, and how it might be improved in the future.
When I first began using PostgreSQL, autovacuum did not exist, and I was not aware of the need for manual VACUUM. After a few months (!), I wondered why my database was so slow. Putting a vacuumdb command in cron, scheduled to run every 6 hours, was sufficient for my needs at the time, but it only worked because my database was small and handled a limited amount of traffic. In many environments, UPDATE and DELETE operations will target some tables much more often than others, and therefore some tables will accumulate dead row versions much more quickly than others, and therefore the interval between one VACUUM and the next should also vary. If a user with this kind of environment were to run a full-database VACUUM frequently enough to meet the needs of their most heavily-updated tables, they would VACUUM lightly-updated tables far more often than necessary, wasting effort. If they were to reduce the frequency of the full-database VACUUM to avoid wasting effort, heavily updated tables wouldn’t get vacuumed often enough and their on-disk size would grow as they filled up with dead row versions, otherwise known as “bloat”.
PostgreSQL 8.3 was the first release to feature a reasonably modern autovacuum. For the first time, autovacuum was enabled by default. It also featured a multi-process architecture, which meant that more than one table could be automatically vacuumed at the same time. And, of course, autovacuum takes care of figuring out which tables are due to be vacuumed, largely (though not completely) eliminating the need to schedule vacuums manually. This was a huge step forward; for the first time, not every user of PostgreSQL needed to worry about configuring VACUUM when they configured PostgreSQL. On a good day, it worked perfectly out of the box.
PostgreSQL 8.4 featured two massive further improvements. In older releases, there was a fixed-size free space map which could be resized only by changing a configuration parameter and restarting the server. If the number of pages with free space in the database exceeded the configured size of the free space map, the server would lose track of some pages containing free space, typically causing runaway database bloat. PostgreSQL 8.4 introduced a new, dynamically sized free space map which never lost track of free space. It also added a visibility map, which allowed VACUUM to avoid scanning unchanged portions of the table all over again during each new VACUUM. It was, however, still necessary to scan the entirety of every index every time, and it was periodically necessary to scan the whole table in order to prevent transaction ID wraparound. VACUUM for the purpose of preventing transaction ID wraparound is typically rare compared to ordinary VACUUM; the former is triggered, by default, every 200 million write transactions, while the latter is triggered, by default, when the number of dead tuples in the table exceeds 500 + 20% of the estimated table number of tuples in the table. In each case, these values can be changed by adjusting configuration parameters.
Several of the improvements since PostgreSQL 8.4 have focused on trying to prevent VACUUM from getting stuck. The basic problem, which exists in multiple variants, is that if an autovacuum process waits for some lock which doesn’t become available for a long time, then (1) the table that it is trying to VACUUM doesn’t get vacuumed by that process or any other, since only one process can vacuum a table at a time and (2) there is one fewer autovacuum worker available to run VACUUM on any other table in the system, which may mean that other tables don’t get promptly vacuumed either. Both problems can lead to bloat. PostgreSQL 9.1 improved things a bit by teaching autovacuum workers to skip tables that were due for vacuuming but on which they couldn’t immediately get a relation lock; since autovacuum retries every minute, this doesn’t hurt anything. PostgreSQL 9.2 improved things so that the system skipped individual table blocks on which it could not immediately obtain a cleanup lock, except for cases where the block contained tuples that needed to be removed or frozen and the vacuum was for the purpose of preventing transaction ID wraparound. PostgreSQL 9.5 reduced the number of cases in which btree index scans retain a pin on the last-accessed index page, which eliminates most cases of VACUUM getting stuck waiting for an index scan. As this list of improvements makes clear, the problem of index scans getting stuck has not been entirely solved, but we have steadily whittled away at the list of cases where it happens.
There have also been a few improvements that aim to reduce the number of times that VACUUM scans each heap page. Prior to the PostgreSQL 9.3, a table page that was not modified between one VACUUM and the next would be marked all-visible by the second VACUUM, and skipped by future VACUUMs, except those triggered to prevent transaction ID wraparound. In PostgreSQL 9.6, it became possible to skip pages even during a VACUUM triggered to prevent transaction ID wraparound. To do this, the visibility map was further improved to keep track not only of whether pages were all-visible -- that is, known not to contain any dead row versions -- but also whether they were all-frozen -- that is, known not to contain any tuples at risk of eventual transaction ID wraparound. Pages in the latter category can be skipped unconditionally; they cannot be interesting to VACUUM for any purpose.
In addition to what is mentioned above, there has been a steady stream of logging improvements and a few other efficiency improvements over the years.
What is left to be done? The PostgreSQL development community has made great progress in reducing the degree to which VACUUM performs unnecessary scans of table pages, but basically no progress at all in avoiding unnecessary scanning of index pages. For instance, even a VACUUM which finds no dead row versions will still scan btree indexes to recycle empty pages. Efforts are underway to improve this, but they have stumbled over some details of how btree indexes interact with transaction ID wraparound. It would also be desirable to improve the behavior in the case where there are a few, but not very many, dead row versions. If a 1TB table contains 10 dead row versions, especially if they are all on the same page, scanning the entirety of every index in order to remove index pointers for those row versions likely doesn’t make sense. This problem is, however, mitigated by the fact that autovacuum generally won’t trigger in the first place in cases where there are only a few dead row versions.
VACUUM remains a bulk operation. In some ways, this is good, because batching large numbers of related operations together into a single bulk operation tends to be more efficient than performing them one by one. However, it also means that when VACUUM engages, especially after a long period of time during which it has not run, the resource utilization can come as a surprise, and on large tables, it can last for a long time. It seems worth thinking about whether the VACUUM of a large table could be split into a series of smaller steps.
The default settings for cost-based VACUUM delay are too small for large databases. By default, autovacuum runs with a 20ms cost-delay, and the default settings of vacuum_cost_limit, vacuum_cost_page_dirty, and vacuum_cost_page_miss limit VACUUM to about 28.8 GB/hour when scanning table or index pages that do not need cleaning and about 9.6 GB/hour when scanning table or index pages all of which need cleaning. For tables that are hundreds of gigabytes or terabytes in size, this can make VACUUM take too long; if the table is due to be vacuumed again before VACUUM completes, it will bloat.
Also, VACUUM has no ability to tailor its activity based on system busy and idle periods, because it has no knowledge of when those periods occur. Ideally, a system that is otherwise idle would perform any vacuuming that is in progress, or even due soon, at maximum speed, and then proceed more slowly when the system is under a foreground load. However, there’s no scheduling system to make such a thing happen; if you want this type of behavior, you must configure it manually.
If I were to summarize all of the above in one sentence, it might be this: VACUUM has come a long way, but it still needs monitoring and management. Small installations - like the first ones I used - running a current version of PostgreSQL may never need to do anything about VACUUM at all, and even larger installations will typically experience fewer problems today than they would have with earlier releases. At the same time, for large installations requiring consistently good performance, VACUUM typically requires at least some monitoring, management, and tuning. Time will tell what improvements future releases will bring.
Or maybe, Postgres can instead store row versions outside the database like SQL Server which does not need VACUUM at all. Row versions are just append to a log like structure in a temporary database. The log is bulk trimmed from behind based on the oldest active transaction. The main drawback here is that long running transactions prevent shrinking the log. This effect is negligible in practice.ReplyDelete
Patch by Yury Sokolov in thread you mentioned fixed the btree vacuum scan issue for our use case. It would be cool if it gets some review and gets committed. :)
Hi Robert, Thank you that helps in one place to see the improvements. One area I see lagging attention is tables which are not touched at all (no insert.update/delete) performed still need to be perform vacuum just for age reasons. Understand with improvements its fast, but still seems unnecessary work for system specially for analytical ones. If someway can eliminate the need just for age reasons to vacuum the table would be greatly helpful.ReplyDelete
I tried to think a lot for solution to it but couldn't find the place where table level dirty or not compared to previous run of vacuum can be maintained to eliminate the need.
This is one of the things we're hoping to fix via the zheap project.Delete