Tuesday, March 04, 2014

VACUUM FULL doesn't mean "VACUUM, but better"

There's a persistent belief among some users of PostgreSQL that VACUUM and VACUUM FULL do the same thing, but that VACUUM FULL does it better.  If VACUUM is the moral equivalent of running the Dust Buster across the room a few times, VACUUM FULL must be the equivalent of hiring a professional cleaning crew to shampoo the carpets, and maybe repaint the walls as well.  Unfortunately, this mental model is not accurate.

Here's a better one.  VACUUM is the equivalent of going into each room of your house, finding all of the stuff in there that you don't need any more, and throwing it out.  This is important because, if you don't do it, your house will eventually fill up with stuff, much of which will be garbage.  As you continue to acquire new things without getting rid of any of the old ones, you're eventually going to need a larger house, or an annex.  VACUUM FULL, on the other hand, is the equivalent of taking all of the stuff from your house that you still want, loading it into a moving van, buying the smallest house that suffices to contain the contents of the moving van, and then burning down the old house.  If there's very little that you actually want to keep, and a vast amount of junk that you want to get rid of, the VACUUM FULL procedure may actually be more efficient.  But in most cases, VACUUM is the way to go: it's usually less work, and the house can still be used while it's being cleaned.  If you find yourself using the VACUUM FULL procedure with any regularity, something is wrong.

In slightly more technical terms, whenever a transaction performs an update, we add a new copy of the row.  If that transaction goes on to commit, the old row should be removed; if it aborts, the new row should be removed.  Similarly, if a transaction inserts a row but goes on to abort, the row must be removed.  If it deletes a row and goes on to commit, the row is only marked for removal, and must afterwards be removed for real.  VACUUM does all of these things.  It also shrinks the relation on disk if there happens to be free space at the end of the file, but in general if it removes a large number of dead tuples, there may be significant internal free space left behind in the file.  That's not all a bad thing, because future updates and inserts are going to need a place to put their data, and using already-allocated space is more efficient than getting new space.  But if there's an excessive amount of preallocated space, and you need to return some of it to the operating system to improve sequential scan speed, or make room for other relations to grow, then you need VACUUM FULL, which will scan the relation for tuples that are NOT dead, write them into a new file, and remove the old file.  If you're doing this regularly, then you should try to make your system's autovacuum settings more aggressive so that the problem doesn't arise; that is, you need to throw away garbage more quickly so that the house doesn't fill up and require expansion.


There's one situation in which it's particularly critical to use VACUUM rather than VACUUM FULL.  In addition to its routine garbage collection duties, VACUUM is also responsible for preventing a phenomenon called transaction ID wraparound, which is discussed in the PostgreSQL documentation under the heading Preventing Transaction ID Wraparound Failures.  Beginning in 9.3, it must also prevent the related phenomenon of multixact ID wraparound.  Normally, this is all pretty transparent, but sometimes (particularly with a poor selection of vacuum-related settings) your system will begin to emit warnings or even shut down because data loss is imminent and vacuum must be completed without further delay.  In current releases, this problem is only addressed by VACUUM without the FULL option.  VACUUM FULL will not forestall transaction ID wraparound and should not be used in circumstances where transaction ID wraparound is a risk.

PostgreSQL 9.4 will make some improvements in this area.  Beginning in 9.4, VACUUM FULL (and CLUSTER, which uses much of the same code) will forestall transaction ID wraparound.  This is a significant improvement, because right now, you might use VACUUM FULL or CLUSTER to rewrite a relation for other reasons (e.g. to remove dead tuples, or in the case of CLUSTER to improve query performance) and then discover that you've got to run a regular VACUUM on the table shortly thereafter for wraparound purposes, rewriting all the data again.  These changes will eliminate that possibility.  However, the recommended recovery strategy for severe cases of transaction ID wraparound will remain VACUUM rather than VACUUM FULL.  This is not only for reasons of efficiency, but also because VACUUM FULL consumes a transaction ID, while plain VACUUM does not.  When only a few transaction IDs remain before a wraparound event, it is best to economize, especially if the database in question contains many tables; when VACUUM FULL is applied to multiple tables, one transaction ID will be used for each.

(Historical Note: Prior to PostgreSQL 9.0, VACUUM FULL worked very differently; the information in this blog post does not apply to older versions.)

14 comments:

  1. Also, a VACUUM FULL destroys any existing free space map and/or visibility map related to the relation. You need to follow the VACUUM FULL with a regular VACUUM if you want the benefit of those -- like index-only scans. It is quite possible for access to the table to be slower after VACUUM FULL until autovacuum gets around to fixing things up.

    ReplyDelete
  2. After a large data conversion where every row in the database had several changes, or maybe most were deleted, vacuum full is nice, but for normal operations, not only is regular vacuum quicker, but since the space is already allocated, insert performance is also much better.

    ReplyDelete
  3. This should be mandatory reading for many of our support customers. VACUUM continues to be a cause of much confusion.

    ReplyDelete
  4. Great post!
    Quick question - from my tests it seems like VACUUM FULL doesn't upgrade dead_rows count, while simple VACUUM does. How come?

    ReplyDelete
  5. I might still be confused. Is it ever dangerous or problematic to use vacuum full? I assumed it was up to the consumer to decide whether or not they wanted to use it, but your explanation makes it sound a lot more intensive.

    Jenn | http://www.langleyvacandsew.com/en/

    ReplyDelete
    Replies
    1. It's not dangerous or problematic; it's just not always the best tool for the job.

      Delete
  6. After running VACUUM FULL we've got an extra 30GB of available disk space, database was using 45GB and is now 15GB! Using VACUUM on it's own made no difference.

    ReplyDelete
    Replies
    1. That's sort of the point he was trying to make, I think. If you have a system that has already not been vacuumed often enough (maybe autovacuum isn't keeping up, or is off), "dead" space will continue building up over time, and eventually you'll have to use a "vacuum full" to clear it. If vacuum was clearing the dead row space fast enough, you likely wouldn't have gotten anywhere near 30 GB of dead space, and wouldn't have needed vacuum full. So a regular vacuum on a "regular" basis is much more stable, or just tune your autovacuum settings.

      Delete
  7. Is it possible if database rebooted during VACUUM FULL operation will leave lot of orphaned data files and hence effectively increasing the database foot print in terms of disk space?

    ReplyDelete
    Replies
    1. Yes, database crashes can leave behind orphaned files. My colleague Thomas Munro is working on that problem: https://commitfest.postgresql.org/21/1863/

      Delete
  8. I am currently researching ways to ensure, data is not only removed from the database but also from the file system. The goal is to comply with the right to be forgotten (GDPR).

    As stated in various publications (e.g. https://www.infosun.fim.uni-passau.de/publications/docs/GSK13.pdf), this is neither possible to ensure by a simple DELETE nor a VACUUM statement within PostgreSQL.

    After reading your (very well written) blog post I now think, that this might indeed be possible by a VACUUM FULL. If it writes a completely new file, leaving out unused data, there should be no more previously deleted data in the new file. Is that correct?

    Could this be a case where running VACUUM FULL on a regular basis might make sense?

    ReplyDelete
    Replies
    1. Tuples that are not visible to anyone will be omitted from the new file, but recently deleted tuples might still be visible to transactions that began before the DELETE committed. Such tuples will be copied to the new file.

      Delete