Monday, May 11, 2020

Don't Manually Modify The PostgreSQL Data Directory!

I was lucky enough to get a chance to give my talk Avoiding, Detecting, and Recovering From Data Corruption at PGCONF.IN in February, before everything got shut down. The conference organizers did an amazing job with the video, which shows both me speaking and the slides I was presenting side by side. That's the first time a PostgreSQL conference has done the video that way, and I love it. One of the points that I raised in that talk was that you should not manually modify the contents of the PostgreSQL data directory in any way. To my surprise, the most frequent question that I was asked after giving the talk was "Really? What if I do XYZ?"

In 100% of cases, my answer was "that's not safe." However, it was hard to avoid that conclusion that this topic is not well-understood, so I'd like to write a little more about it. The important thing to understand is that, with the exception of a few text configuration files like postgresql.conf and pg_hba.conf, PostgresSQL manages all the files in the data directory. It does not expect you to modify them, and it may break if you do. Let's examine a couple of specific scenarios.

Removing postmaster.pid

If you remove the postmaster.pid file, you may end up with two copies of PostgreSQL reading from and writing to the data directory at the same time, each without the knowledge of the other. Your database will probably get really badly corrupted, even if one of the copies is only used for reads. This situation must be avoided at any cost.

Typically, the reason users are tempted to do this is that PostgreSQL thinks that there is another copy of the server running, and therefore won't start. It is easy to come to the conclusion that PostgreSQL is mistaken because there cannot possibly be another copy of the database running, but the way that PostgreSQL performs these checks is pretty smart, and it's usually right. I think that, in the 20 or so years that I've been regularly using PostgreSQL, I've seen it make this mistake once. That was many years ago, and the logic has been improved since then. So, I recommend great caution in coming to the conclusion that PostgreSQL is wrong and you are right.

A real mistake on the server's part using a recent release of PostgreSQL is not impossible, but it should be vanishingly rare in practice. If you see it happening, something's wrong, and you should try to understand what is happening, not just blast through it by nuking the PID file.

Removing WAL Files

Sometimes, users see a large amount of WAL accumulating inside the pg_wal directory (in older versions, pg_xlog) and this becomes a problem. To solve the problem, they try removing the older files. What is sometimes overlooked is that PostgreSQL also has logic for removing older WAL files; it is designed not to retain them longer than necessary. If you remove them before it thinks that they are unnecessary, bad things can happen.

For instance, if you remove the WAL file that contains the most recent checkpoint record, or any newer one, and then either PostgreSQL or the operating system crashes, crash recovery will fail. At that point, you will have little choice but to use pg_resetxlog, resulting in permanent database corruption.

If you remove a WAL file that a standby is going to download using streaming replication or which another server is going to decode using logical replication, replication is going to break. You'll have to rebuild the standby, or resynchronize the tables and restart logical replication, in order to recover.

If you remove a WAL file that was being preserved because it has not yet been archived, it will never get archived. Any point-in-time recovery that needs that WAL file from the archive, or any standby that needs to restore that file from the archive, will fail.

Instead of removing files manually from pg_wal, consider investigating why the system is preserving so many files there in the first place. Perhaps you have configured an archive_command that is failing repeatedly; fix it, and also, monitor it so you notice if it starts failing again in the future. Perhaps it is succeeding but not keeping up with the rate of WAL generation; if so, make it faster. Perhaps you have an old replication slot that is preventing the removal of old WAL segments; if so, maybe you should drop it, or ensure that the machine for which the slot was created connects and catches up. Either will allow the automatic removal of WAL files. Perhaps you have max_wal_size or wal_keep_segments set to an inordinately high value; consider reducing it.

The advantage of these approaches - aside from being safer - is that they fix the problem at its root. If your system is configured in some way that causes pg_wal to become very large and you remove files from pg_wal to address that issue, the problem is just going to come back.  If, for example, archive_command is failing, PostgreSQL will never remove any old WAL files. Even if you manage to remove some of them yourself without breaking anything, you're going to have to keep doing it, because the files will keep accumulating. You will be far better off fixing your archive_command, or disabling archiving. Then, PostgreSQL will take care of it for you, as it is designed to do.

Removing files from pg_xact (in older versions, pg_clog)

These files contain critical data. They keep track of which transactions committed and which transactions aborted. As with pg_wal, files that aren't needed any more are automatically removed, so the fact that a file has not been automatically removed means that the system still thinks that it needs it. In theory, it could also mean that the removal process is failing for some reason, like a permissions problem on the data directory, but I think that is pretty rare.

The gating factor for removal of files from these directories is the oldest transaction ID in the system that is still referenced from one of your tables. Suppose one of your tables contains a tuple that was created by transaction 123456. Suppose that you then remove the pg_clog file that contains the information about whether transaction 123456 committed or aborted. When you later try to access the tuple in question, for example by selecting from the table, it's going to encounter that tuple, see the associated transaction ID, and try to look it up. It needs to look it up because the tuple should be visible to your SELECT statement if the creating transaction committed, but not if it aborted. If the pg_clog file is missing, the system cannot determine whether the tuple should be visible, and you will get an error when you try to access your data. That's not good.

The space consumption of this directory should be fairly minimal, so I don't really understand the temptation to remove these files, but if you want to reduce the number of files the system is storing, the thing to do is run VACUUM FREEZE on all of your databases, to remove references to old transaction IDs. Long-running transactions, prepared transactions, and replication slots can prevent this from being effective, so make sure to check for those.

Removing files from pg_multixact

This is one I haven't actually heard about anybody doing, but the issues are just the same as for pg_xact. The only difference is that here we're talking about MultiXactIds rather than transaction IDs. But, as before, VACUUM FREEZE can be used to create clean-up opportunities, and as before, manual removal of files is likely to create data that provokes an error when you try to access it.

Single-Page Restores

One way that you might think about trying to recover from a corrupted database page is by going to a recent backup, finding a copy of the page that is not corrupted, and copying it into the current version of the database at the correct location. This is not totally unreasonable as a corruption recovery strategy, but there are several pitfalls which may not be obvious:

- If you do this while the database is running, and the database already has the page cached, it may not see the restored page image, and may even overwrite it if the cached copy is modified.

- Your single-page restore may create a state where the table is inconsistent with its indexes, or where the table is inconsistent with its TOAST table. In the former case, you may get different answers to the same query depending on whether or not it uses an index. In the latter case, you may get errors when trying to access your data.

- Suppose that the table has been vacuumed since the backup was taken. In that case, it's possible the system's notion of the oldest relevant transaction ID or the oldest relevant MultiXactId present in the table have advanced. If so, pg_class.relfrozenxid and/or pg_class.relminmxid for the target table will have changed. It is also possible that the restored page image contains transaction IDs or MultiXactIds that are older than the new horizons. If so, restoring that page image into the cluster may create rows that can't be accessed without an error. Worse still, things may look OK when the page is first restored but start failing later - possibly weeks or months or even years later - when the system performs a clean-out of pg_xact or pg_multixact. In addition, in newer major and minor releases, VACUUM itself may fail internal sanity checks, resulting in a variety of problems, including bloat and transaction ID wraparound. In older releases that don't have these sanity checks, VACUUM might instead silently corrupt the page.

All the caveats from this section apply equally to restores of multiple data pages, like copying an entire relation file from a backup into the current version of the database. (Copying an entire relation file from one database to another is, as a general rule, even crazier than what is described here and likely to end very badly, but I will omit a detailed discussion of the issues here to avoid making this post even longer.)

Is Manual Modification of the Data Directory Ever Justified?

Sadly, I can't answer "no" to this question. There are circumstances under which there is no reasonable alternative.

For example, if PostgreSQL crashes during a transaction which performed a CREATE TABLE statement, the files created by that statement will not be removed upon restart, except in the special case where the table is temporary. If you need to reclaim the disk space used by those files, there's little choice other than to manually remove them. Hopefully, we'll come up with a better solution to this problem at some point.

Also, in scenarios where your database is already corrupted, manual modification of data files is sometimes the only reasonable way to try to recover. In this sense, it is like surgery: most human beings prefer not to be cut open if there is any reasonable alternative, and there is always some associated risk and unpleasantness, but sometimes it is the best option anyway, because the alternatives are worse.  It's a good idea, though, to make sure that your surgeon is well-qualified.

Summary

What I hope you will take away from this relatively brief discussion of the subject is:

1. As far as possible, you should try to accomplish your goals by changing the database configuration or by executing SQL commands within the database, rather than by changing things at the filesystem level. This is how the system is designed to be used, and is therefore safer. It often has other advantages as well.

2. If you are considering manual modifications, you should be aware that there may be complex relationships between the file you're thinking of modifying and other files that are managed by PostgreSQL. Unless you know all of the invariants that the system expects will hold true, and can guarantee that the proposed modification will not break any of them, you are playing with fire.

3. Just because you modified the data directory manually and nothing broke immediately, it doesn't mean that your modification was safe. Problems may take days, weeks, months, or years to surface, and may not be obviously connected to the modification you made.

4 comments:

  1. How about `pg_internal.init` files ?

    We have had to delete this file once in a while (similarly to this thread https://www.postgresql.org/message-id/flat/15665.1527284857%40sss.pgh.pa.us#a82041501b82c3e2fce488331cbf67f7)

    ReplyDelete
    Replies
    1. You should not need to manually remove that file "once in a while." The system is supposed to take care of that. If it's not, we have a bug.

      Unlike some other files, this one is intended to be automatically regenerated by the system; it's a cache. So if you do this while the system is not running, for example, I don't think you'll break anything. Even if the system is running, I think it probably won't break anything. It'll just get rebuilt.

      But if you are doing it regularly I'd say that's not good. If it seems to be necessary for some reason then I'd say something is wrong, either with your setup or in the source code.

      Delete
  2. I understand, the Single page restore will not work while database is up and running but how about restoring the non corrupt relation file and then restart the instance for the recovery using WAL archives to bring the cluster to latest consistent state. Here I am trying to correlate with tablespace recovery in Oracle.

    ReplyDelete
    Replies
    1. Of the three issues I mentioned, the first one does not apply if the cluster is shut down, but the other two certainly do.

      I don't believe that there's any safe procedure that involves shutting the system down, replacing a page, and then doing a recovery to somehow fix things up. Point-in-time recovery is supported, but only on a cluster-wide basis and following the directions in the documentation.

      Oracle may have some features that allow for recovery on a finer granularity than the whole database cluster -- I don't know, I'm not an Oracle expert -- but PostgreSQL does not have any such features and the only safe way to perform a recovery is on a full-cluster basis.

      Delete