Thursday, November 10, 2011

Unsticking VACUUM

Every PostgreSQL release adds new features, but sometimes the key to a release has less to do with what you add than with what you take away.  PostgreSQL 8.4, for example, removed the settings max_fsm_pages and max_fsm_relations, and replaced them with a per-relation free space map that no longer requires manual sizing.  Those parameters are now gone, and more importantly, something that you previously needed to understand and manage was replaced with something that just works.   People who are still running PostgreSQL 8.3, or older versions, want to understand exactly how the free space map works; people who are running PostgreSQL 8.4, or newer, don't care.  It's enough to know that it does work.

Now, about eight months ago, I wrote a blog entry on troubleshooting stuck vacuums.  I would not say that this is an everyday problem, but in ten years of working with PostgreSQL, I've seen it a few times, and it's very unpleasant.  It's easy to miss the fact that you have a problem at all, because in most cases, nothing immediately breaks.  Instead, system performance just slowly degrades, gradually enough that you may not realize what the problem is until things have gotten pretty bad and you need to CLUSTER or VACUUM FULL to recover.

In PostgreSQL 9.1, I improved the case where an autovacuum worker gets stuck waiting for a table lock.  VACUUM can run concurrently with selects, inserts, updates, and deletes, but not with other VACUUM operations or with DDL.  Manually locking the table, using the LOCK TABLE command, can also prevent VACUUM from processing the table until the transaction commits and the lock is released.  So, starting in PostgreSQL 9.1, unless it's performing an anti-wraparound vacuum, an autovacuum worker won't wait for a table lock: it will just skip the table and retry later.  That way, the worker process can go vacuum some other table, instead of sitting there doing nothing.  This prevents so-called "vacuum starvation", where there aren't enough autovacuum workers to process all the tables that need it.

Simon Riggs wrote a patch, which barring any unexpected difficulties will be in PostgreSQL 9.2, that should improve things even more.  At least in my experience, the most common thing that makes autovacuum get stuck is not so much trying to get the table lock (although that can happen) but waiting for a cleanup lock on an individual data block.  If, for example, someone leaves a cursor open, the data block to which the cursor refers is busy (the technical term is "pinned") and can't be vacuumed until the cursor is closed or moved to a different block.  The patch does two things to make this problem less likely.  First, if we're not doing an anti-wraparound vacuum, and the cleanup lock isn't immediately available, we just skip the block.  No big deal: the next vacuum will fix it.  Second, even if we ARE doing an anti-wraparound vacuum, but we can see that the block doesn't contain any tuples that need anti-wraparound processing, we can still skip the block.

While this isn't a totally bullet-proof solution, I think it's going to be a significant improvement over the status quo ante.  This is already a fairly uncommon problem, and anti-wraparound vacuums generally happen quite a bit less frequently than regular vacuums, and even an anti-wraparound vacuum has a chance of dodging the problem if the busy block doesn't happen to require anti-wraparound processing.  Of course, the usual advice to avoid leaving cursors (or, more generally, transactions) open is still good advice; but anything we can do to make following that advice less important is a step forward.  Perhaps in the future we'll be able to completely eliminate this problem, but this is a good start.

No comments:

Post a Comment