Thursday, May 25, 2023

Do I Really Need That backup_label File?

I'm sure you already know what I'm going to tell you: "Of course you need that backup_label file. How could you even think that you don't need that backup_label file?" Well, you're right. That is what I'm going to say. But do you know why you need that backup_label file? If you were to remove that backup_label file (or fail to create it in the first place, in cases where creating it is your responsibility), what exactly is the bad thing that would happen to you?

A backup_label file is generated for you automatically when you run pg_basebackup. If you use pg_backup_start() and pg_backup_stop() to perform a backup, then pg_backup_stop() will return the information that needs to be written to the backup_label file, and you're responsible for actually creating the file and putting those contents into it. The documentation cautions that this file is "vital to the backup working and must be written byte for byte without modification," which certainly sounds pretty alarming.
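
Just to make that concrete, the low-level version looks roughly like this. This is only a sketch: the signatures shown are the ones from PostgreSQL 15 and later, the label is arbitrary, and the copy step is whatever file-copying tool you prefer.

-- both calls must come from the same connection, and that connection
-- has to stay open for the entire duration of the backup
SELECT pg_backup_start('blog example', true);  -- true = immediate checkpoint, like -cfast

-- while that session stays open, copy the data directory with your tool
-- of choice (tar, rsync, a storage snapshot, ...)

-- labelfile is the text you must write, byte for byte, into a file named
-- backup_label at the top level of the copied data directory
SELECT labelfile FROM pg_backup_stop();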

Let's experiment and find out. I spun up a database cluster, took a backup using pg_basebackup -cfast -Dbackup -R, made a copy of the backup, and removed the backup_label file from the copy. Now let's try starting up the postmaster on the original backup (where the backup_label file is present) and on the copy (where it is absent) and see what happens. Here's what it looks like with the backup_label:

LOG:  starting PostgreSQL 14.8 on x86_64-apple-darwin21.6.0, compiled by Apple clang version 14.0.0 (clang-1400.0.29.202), 64-bit
LOG:  listening on IPv6 address "::1", port 5414
LOG:  listening on IPv4 address "127.0.0.1", port 5414
LOG:  listening on Unix socket "/tmp/.s.PGSQL.5414"
LOG:  database system was interrupted; last known up at 2023-05-25 12:56:12 EDT
LOG:  entering standby mode
LOG:  redo starts at 0/7000028
LOG:  consistent recovery state reached at 0/7000100
LOG:  database system is ready to accept read-only connections

And here's what it looks like without the backup_label:

LOG:  starting PostgreSQL 14.8 on x86_64-apple-darwin21.6.0, compiled by Apple clang version 14.0.0 (clang-1400.0.29.202), 64-bit
LOG:  listening on IPv6 address "::1", port 5414
LOG:  listening on IPv4 address "127.0.0.1", port 5414
LOG:  listening on Unix socket "/tmp/.s.PGSQL.5414"
LOG:  database system was interrupted; last known up at 2023-05-25 12:56:12 EDT
LOG:  entering standby mode
LOG:  database system was not properly shut down; automatic recovery in progress
LOG:  redo starts at 0/7000028
LOG:  consistent recovery state reached at 0/8000000
LOG:  database system is ready to accept read-only connections
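
If you want to reproduce the experiment, it boils down to something like the following. Treat this as a sketch: the directory and log file names are made up, I've left out pg_basebackup's connection options, and the two copies should be started one at a time, since they are configured identically.

# take a backup, then make a second copy of it with backup_label removed
# (the copy needs to keep the 0700 permissions that postgres insists on)
pg_basebackup -cfast -Dbackup -R
cp -R backup backup-no-label
rm backup-no-label/backup_label

# start the copy that still has backup_label, look at the log, stop it
pg_ctl -D backup -l with_label.log start
pg_ctl -D backup stop

# then do the same with the copy that doesn't have it
pg_ctl -D backup-no-label -l without_label.log start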

There are two differences between the first bit of output and the second. First, the second case has an extra line, "database system was not properly shut down; automatic recovery in progress," that the first case does not. That doesn't really matter. Second, the first case says that a consistent recovery state was reached at 0/7000100, while the second case says that a consistent recovery state was reached at 0/8000000. It turns out this doesn't really matter, either. Let's look at the WAL. In relevant part:

rmgr: XLOG        len (rec/tot):     34/    34, tx:          0, lsn: 0/070000D8, prev 0/07000060, desc: BACKUP_END 0/7000028
rmgr: XLOG        len (rec/tot):     24/    24, tx:          0, lsn: 0/07000100, prev 0/070000D8, desc: SWITCH 
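
That output comes from pg_waldump, run against the WAL that pg_basebackup stored in the backup. The command is roughly the following, though the exact segment file name depends on your timeline and WAL position:

pg_waldump backup/pg_wal/000000010000000000000007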

In the case where the backup_label file was present, we reached consistency after replaying the BACKUP_END record. When it wasn't present, we reached consistency after replaying the SWITCH record. But these records are only one apart in the log file -- the SWITCH record takes up everything from 0/07000100 through 0/8000000 -- and, crucially, the SWITCH record is after the BACKUP_END record. What's really important here is that all the WAL generated during the backup gets replayed before reaching consistency, and that is happening in both cases. Redo starts, in each case, at 0/7000028, and when I do the recovery properly, I reach consistency at 0/07000100, meaning that all of the WAL up to that point has been replayed. If I instead reach consistency at 0/8000000, I've replayed all that same WAL and more, but replaying extra WAL is not going to break my database. What would break my database is if I were to replay too little WAL, and that's not happening here.

I'm not doing a very good job convincing you that this file is important, am I? I think that the kind of scenario that I've shown above is exactly why so many people erroneously conclude that this file isn't really important. After all, I've taken a backup in a perfectly correct and natural way, restored in a way that the documentation clearly says is wrong, and my database is 100% fine. There is no lingering corruption anywhere and I have lost no data. Apparently this whole "vital to the backup working" narrative is pure fiction, right?

Well, actually, no. What happened is that I got lucky. Let's look at an example where things don't go as well. I took a new backup in almost the same way as the previous one, and here's what the log looks like this time, with the restore done properly, i.e. with the backup_label file present:

LOG:  starting PostgreSQL 14.8 on x86_64-apple-darwin21.6.0, compiled by Apple clang version 14.0.0 (clang-1400.0.29.202), 64-bit
LOG:  listening on IPv6 address "::1", port 5414
LOG:  listening on IPv4 address "127.0.0.1", port 5414
LOG:  listening on Unix socket "/tmp/.s.PGSQL.5414"
LOG:  database system was interrupted; last known up at 2023-05-25 14:51:46 EDT
LOG:  entering standby mode
LOG:  redo starts at 9/28152B78
LOG:  consistent recovery state reached at 9/86ACE058
LOG:  database system is ready to accept read-only connections

And here's what it looks like if I start postgres on a copy of the same data directory after first removing the backup_label file:

LOG:  starting PostgreSQL 14.8 on x86_64-apple-darwin21.6.0, compiled by Apple clang version 14.0.0 (clang-1400.0.29.202), 64-bit
LOG:  listening on IPv6 address "::1", port 5414
LOG:  listening on IPv4 address "127.0.0.1", port 5414
LOG:  listening on Unix socket "/tmp/.s.PGSQL.5414"
LOG:  database system was interrupted; last known up at 2023-05-25 14:51:46 EDT
LOG:  entering standby mode
LOG:  database system was not properly shut down; automatic recovery in progress
LOG:  could not signal for checkpoint: checkpointer is not running
LOG:  redo starts at 9/490028E8
FATAL:  could not access status of transaction 6388919
DETAIL:  Could not read from file "pg_xact/0006" at offset 16384: read too few bytes.
CONTEXT:  WAL redo at 9/4900D0B8 for Transaction/COMMIT: 2023-05-25 14:51:12.145629-04
LOG:  startup process (PID 96689) exited with exit code 1
LOG:  terminating any other active server processes
LOG:  shutting down due to startup process failure
LOG:  database system is shut down

Something has gone terribly wrong in the second case, because the database server emitted a fatal error and shut down instead of starting up properly. That's actually lucky, too: in my experience, it's quite common for the database server to start up just fine in this scenario, with corruption only being discovered much later. Here, the database server itself has realized that something has gone terribly wrong and has been good enough to inform us that we have fouled something up somehow. But what exactly has gone wrong?

The key thing that you should notice here is that in the first restore, we got "redo starts at 9/28152B78" whereas in the second one, we got "redo starts at 9/490028E8". Remember how I said before that we'll have problems if we replay too little WAL? Well, that's exactly what has happened here. The most important thing that the backup_label file does is tell the server where to start WAL replay when recovering from the backup. In this case, the backup_label file says that WAL replay should begin at 9/28152B78. Without the backup_label file to tell the server what to do, it uses the information in the control file to figure out where to start WAL replay, and the control file says 9/490028E8. So removing the backup_label file causes redo to begin at a later point in the WAL stream than it really should, and then we have problems.
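
In case you've never looked inside one, backup_label is just a small text file at the top level of the data directory. The details vary a little from version to version, but it looks something like this; the START WAL LOCATION is the one from my second test, and the other values, including the file name and checkpoint location, are merely illustrative:

START WAL LOCATION: 9/28152B78 (file 000000010000000900000028)
CHECKPOINT LOCATION: 9/28152BB0
BACKUP METHOD: streamed
BACKUP FROM: primary
START TIME: 2023-05-25 14:51:46 EDT
LABEL: pg_basebackup base backup
START TIMELINE: 1

That first line is exactly where "redo starts at" pointed in the good restore.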

The reason why our first test worked OK is that, in that case, the backup_label file and the control file happened to have the same idea of where WAL replay should start. Therefore, it didn't matter too much which one we used. But when the values are different, it's critical that the value from the backup_label file be used, rather than the value from the control file. PostgreSQL continues to write to the data files while a backup is in progress, and that means that the only way to make the cluster consistent is to replay all of the WAL that the system generated during the backup. The backup_label file knows the point in the WAL stream at which the backup started, and so it indicates that replay should begin at that point. The control file, on the other hand, only knows when the most recent checkpoint happened, and that might be later than the start of the backup.
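
You can see what the control file thinks by running pg_controldata on the data directory; the line of interest is the latest checkpoint's REDO location. Run against the copy with backup_label removed (using the directory name from the sketch above), it shows the value that recovery actually used:

pg_controldata backup-no-label | grep 'REDO location'
Latest checkpoint's REDO location:    9/490028E8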

And that brings us to the question of how these two tests were different. In the first test, the system was completely idle and the database contained very little data. A checkpoint always occurs at the start of the backup, and because the backup completed very quickly and the system was idle, it finished before the next checkpoint happened. In the second case, I added a bunch more data to the database, ran pgbench in the background to trigger frequent checkpoints, and slowed down pg_basebackup using the --max-rate option. By doing so, I was able to get a backup that had a checkpoint in the middle of it. Now, the control file said that replay should start from the extra checkpoint that happened in the middle of the backup, but the backup_label said that replay should start from the checkpoint that happened at the beginning of the backup. The former would be the correct thing to do if the primary had crashed, but the latter is the right thing to do if you're trying to recover the backup.
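
The setup for that second test was roughly along these lines. The scale factor, client count, run time, and rate limit are only illustrative; the point is just to make the backup slow enough, and the write traffic heavy enough, that at least one checkpoint lands in the middle of the backup.

# load a decent amount of data, then generate write traffic in the
# background so that checkpoints keep happening during the backup
pgbench -i -s 500
pgbench -c 4 -T 600 &

# throttle the backup so that it takes long enough to span a checkpoint
pg_basebackup -cfast --max-rate=4096k -Dbackup2 -R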

Even if the backup_label file and the control file agree on where to start replay, there are other, subtler ways that you can hose yourself by removing the backup_label file, so please don't do that even if you are certain that your backups are entirely checkpoint-free. But I believe that starting redo at the wrong WAL location is probably the most common way that things go wrong, and if that happens, you have a corrupted database. The only question is whether it's obviously corrupted (as here) or whether the damage is more subtle (and you, perhaps, find out about it only much later).

2 comments:

  1. I would say that you were still lucky with the failed restore attempt.
    If you were really unlucky, the commit log would contain all transactions, the restore attempt without "backup_label" would succeed, and you'd end up with a corrupted database.
