Thursday, February 13, 2020

Useless Vacuuming

In previous blog posts that I've written about VACUUM, and I seem to be accumulating an uncomfortable number of those, I've talked about various things that can go wrong with vacuum, but one that I haven't really covered is when autovacuum seems to be running totally normally but you still have a VACUUM problem. In this blog post, I'd like to talk about how to recognize that situation, how to figure out what has caused it, how to avoid it via good monitoring, and how to recover if it happens.

Recognizing The Problem

In my experience, problems of this type are typically reported in one of two ways. The worst case is  when a user reports something like:

ERROR: database is not accepting commands to avoid wraparound data loss in database "ouch"

Or alternatively this warning:

WARNING: database "unfortunate" must be vacuumed within 999237 transactions

If you have these symptoms, you definitely have a VACUUM problem of some kind. It might be that you've configured the system in some way that causes vacuum to run too infrequently or too slowly, or it might be that vacuum is getting stuck or failing with some kind of error. The thing to do is run VACUUM - not VACUUM FULL - on the affected database. I recommend that you do not use FREEZE in this situation, but that you do use VERBOSE:


If this completes successfully but fails to fix the problem, then you've got this problem. (If the problem moves to another database, then you probably don't have this problem; you just need to repeat the above step in the other database.) Using VERBOSE is a good idea for a couple of reasons. First, it will print something out after processing each table, so you can tell how quickly you are making progress. Second, it will tell you something about whether it's actually removing any tuples; if it's consistently finding 0 removable tuples even in tables you know have had updates and deletes, that's likely a sign of this issue. Third, even if you can't really interpret the output yourself, it's worth saving it in case you decide to consult an expert.

Sometimes people figure out that they have a problem before things get bad enough to cause warnings or errors. Typically, someone observes that a table has become bloated -- that is, larger than it ought to be given the amount of real data it contains -- and that the situation is not remedied by VACUUM or even VACUUM FULL. If this happens to you, it strongly suggests that you have this problem, although it's not the only possible cause. The common element, in both cases, is that VACUUM is being run successfully but doesn't seem to be accomplishing anything.

Understanding the Cause

Every write transaction executed by PostgreSQL obtains a transaction ID, or XID. In most cases, transaction IDs are stored as 32-bit integers, which means that there are only about 4 billion possible XIDs. (Storing transaction IDs as 64-bit integers in all cases would solve a lot of problems, but also make every tuple 8 bytes larger, so that's not a very palatable solution; there are more clever things that could be done, but they are complicated and out of the scope of this blog post.) Quite a while ago, someone realized that a user might want to execute more than 4 billion write transactions over the lifetime of a PostgreSQL cluster; as hardware has gotten faster and PostgreSQL has not only gotten faster but been adopted more and more widely for mission-critical situations, this has gone from a hypothetical possibility to a routine occurrence.

The PostgreSQL developers decided to handle this problem by allowing XIDs to be reused. It would cause massive confusion if the same XID could refer to multiple transactions at the same time; PostgreSQL skirts that problem by making sure that all references to an XID are removed before it gets reused for a new transaction. Therefore, at any given time, any particular XID can be understood to refer to exactly one transaction (except for 3 special XIDs which have special meanings and are not used for regular transactions). Because of this, when things are operating as intended, we never run out of XIDs: VACUUM removes references to old XIDs quickly enough that we can reuse those same numeric values as new XIDs without confusion and without running out of XIDs.

The difficulty here is that VACUUM cannot remove references to an XID while the XID is still in use. In fact, it's worse: XIDs have a numerical ordering, and if a particular XID is still in use, then not only can VACUUM not remove references to that XID, but VACUUM also cannot remove references to any XID that was assigned after that XID. Therefore, the oldest XID that is considered to still be in use acts as a brake on VACUUM. If the oldest XID that is still in use doesn't change for a prolonged period of time, then the same tables may get vacuumed repeatedly without freeing up any XIDs for reuse. Such vacuuming may still be somewhat valuable for other reasons, like removing tuples created by transactions that went on to abort, and pruning HOT chains, but it will be much less effective than would otherwise be the case, and in some cases it may do nothing at all.

In order to understand precisely how this problem can occur, it is necessary to understand precisely what I mean by "the oldest XID still in use." For this purpose, there are four ways that an XID can be in use:

1. The transaction with that XID is still running. Since it has neither committed nor aborted, it might later create more tuples, each of which would contain a reference to that XID. Therefore, we cannot free up that XID for reuse by another transaction.

2. Another transaction that cares about the status of that XID is still running. This needs to be a transaction that began running before that XID committed or aborted. Such a transaction shouldn't see any database modifications made by the transaction with that XID, which means that we cannot break the link between that XID and the tuples it created or deleted.

3. The transaction with that XID has been prepared using PREPARE TRANSACTION, but neither COMMIT PREPARED nor ROLLBACK PREPARED has been executed yet. In this case, no new tuples with this XID can be created, but until the status of the transaction is resolved, we can't break the link between the tuples created or deleted by the transaction and the corresponding XID.

4. There is a replication slot whose xmin or catalog_xmin is that XID. The replication slot represents a promise to other servers replicating from this machine that we will not remove any tuples with that XID or any newer XID. (In the case of catalog_xmin, the promise only applies to system catalogs, but that detail doesn't matter for our purposes here.)

So, there are three ways that this kind of problem can occur: a very long-running transaction, a prepared transaction which goes unresolved for a very long time, and a replication slot whose xmin fails to advance for a very long time. Fortunately, it's pretty easy to check on each of these things, because the relevant XID values are exposed via system views, and the built-in age() function will tell you how old the XIDs are. It's useful to know that we often use the shorthand "xmin" to refer to the oldest XID that something in PostgreSQL cares about. That said, here are some example queries which you can customize to fit your needs:

-- Check #1 and #2
select age(backend_xid), age(backend_xmin) from pg_stat_activity;
-- Check #3
select age(transaction) from pg_prepared_xacts;
-- Check #4
select age(xmin), age(catalog_xmin) from pg_replication_slots;

If you have this problem, chances are very good that you're going to see a big number in the output of one of those queries. In severe cases, the number will be just under 2 billion. Generally, people don't actually keep a transaction open long enough for 2 billion more transactions to begin afterwards, so the big number is probably going to show up when you query pg_prepared_xacts or pg_replication_slots. If it shows up when you query pg_prepared_xacts, it means that someone prepared a transaction and forgot to COMMIT PREPARED or ROLLBACK PREPARED. If it shows up when you query pg_replication_slots, it probably means that somebody set up a replication slot for a standby and then the standby died or was nuked, and nobody ever cleaned up the replication slot.


Whether or not you have this problem today, I recommend that you set up your monitoring software to run some version of each of the above queries regularly and notify you if you start to see unusually large values. Even though the theoretical limit is around 2 billion, it is a very good idea to catch problems as early as possible. If you typically commit prepared transactions quickly enough that the second query returns values less than a million, then perhaps you should alert if the value goes above 5 million:

select gid, age(transaction) from pg_prepared_xacts where age(transaction) > 5000000;

That way, if something goes wrong and a prepared transaction slips through the cracks, you'll know right away. It's much better to find out about a problem like this after 5 million transactions than after 1.999 billion transactions.

It's also a good idea to keep an eye on the database logs, because they can alert you to a wide variety of problems, including this one. You may need to filter out errors that are expected to result from your application's usage of the database in order to hone in on unexpected errors and warnings, but that's effort well spent.

How To Recover

Recovery from this type of issue is typically a three-step process. Step one is do whatever needs to be done in order to allow the system to reuse some more XIDs. If you have problem #1 or #2 from the list above, use pg_terminate_backend() to kill off the sessions with a high XID age. If you have problem #3 from the list above, use COMMIT PREPARED or ROLLBACK PREPARED to resolve the prepared transactions with a high XID age. If you have problem #4 from the list above, use pg_drop_replication_slot() to drop the replication slots with a high XID age. If you've got more than one of these problems, try to address them all before going further, but in particular, make sure to address the ones where the relevant XID age is highest.

Step two is to VACUUM the affected tables. As discussed in the first section of this blog post, I strongly recommend the use of the VERBOSE option so that you get detailed output, but avoid using FREEZE for the moment, both because it is a lot slower and because it will consume one of your precious few remaining XIDs. If you have done step one correctly, VACUUM should now be able to remove lots of dead tuples that it couldn't remove before, and it should freeze tuples as necessary. You may still have some database bloat, but you shouldn't be getting warnings or errors any more, and the worst of the performance consequences should be mitigated once this step is completed. You may find it helpful to run this query in the affected database:

select oid::regclass, age(relfrozenxid) from pg_class where relfrozenxid != '0';

If, after completing step one and running VACUUM afterwards, you still see values here larger than autovacuum_freeze_max_age -- which by default is 200 million -- then you have probably made a mistake either in completing step one or in running VACUUM afterwards. If not, something very strange is happening, and you may want to seek expert help.

Step three is to address residual table and index bloat. This step is entirely optional. If, after completing steps one and two, you are happy with the performance of your database and its size on disk, then there is no need to do anything further. If not, then you can consider judicious use of CLUSTER, VACUUM FULL, and/or REINDEX to improve the situation. check_postgres includes a query that will give you an approximate idea of how bloated your tables and indexes are, but for various reasons, it doesn't always give accurate results, so you have to take that information with a grain of salt. You can get more reliable information using the pgstattuple extension, by using the pgstattuple_approx() or pgstattuple() functions, or one of the pgstat...index() functions, as appropriate. Note that these methods, although they are more reliable, are also much slower, as they actually read all or part of the table or index, and not just its statistics.

Unlike regular VACUUM, VACUUM FULL, CLUSTER, or REINDEX will block access to your tables, so it's best not to use them indiscriminately. Some amount of bloat is normal, and if you get rid of absolutely all of it, it will just come back as you continue to use your database. Focus on cases that are extreme or that are demonstrably causing performance problems. Also, don't assume that just because you have a performance problem subsequent to a "useless vacuuming" issue, the two are necessarily related. It's not unlikely, but keep an open mind.

Hope this helps!

No comments:

Post a Comment