Five days before Christmas I committed my patch to add incremental backup to PostgreSQL. Actually, I've been committing preparatory patches for some months now, but December 20 saw the two main patches land. Since then, there have been a bunch of bug-fix commits, and there are still a few pending items that need to be addressed, but the core of the feature is now committed. If you want a quick overview of the feature, Lukas Fittl has a great video about it. Here, I'd like to talk about the architecture of the feature itself in a little more detail, specifically how we decide which data to copy.
Most people who are likely to read this blog are probably already familiar with the core idea of an incremental backup: instead of copying the whole database instance, just copy the data that has changed. That's faster, and takes up less space on disk. But, to work properly, you have to be able to quickly and reliably identify which data has, in fact, changed. There's more than one way to do that.
General file-copy tools like rsync use file sizes, modification times, and checksums to identify and skip unchanged files, but I didn't really like those kinds of approaches for this project. First, they work most naturally with whole files, and I wanted block-level incremental backup. That is, if there's a large file, and some blocks have changed but most haven't, then I wanted to be able to identify and send only the modified blocks. File sizes and modification times don't help with that problem. Checksums can, if they're computed per-block rather than for a whole file, but I didn't want to assume that we had the entire previous backup available at the time of taking the incremental backup. Second, I didn't really want to depend on the system administrator not setting the system clock backward and thus creating a misleading set of modification times. I know such occurrences are very rare in practice, but the possibility makes me nervous, particularly because, in case of trouble, there's no sure way of knowing whether it happened or not. Third, the granularity of file timestamps differs by platform, and I hate sorting out platform-dependent problems, so I preferred to aim for an implementation that had no platform dependencies. All of that said, I know that pgbackrest has had success with these kinds of techniques and I'm not here to say that you shouldn't rely on them; I'm just explaining my own thought process.
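To make the contrast concrete, here's a minimal sketch of per-block checksumming, in C. This is emphatically not pgbackrest's code, just the shape of the idea: hash each 8kB block of a file and compare the result against checksums saved with a previous backup, which is exactly the dependency on prior-backup state I wanted to avoid.

```c
/*
 * A minimal sketch of per-block checksumming, assuming 8kB blocks.
 * NOT pgbackrest's code; it just illustrates the shape of the idea.
 */
#include <stdint.h>
#include <stdio.h>

#define BLCKSZ 8192

/* FNV-1a over one block; any per-block checksum would serve. */
static uint64_t
block_hash(const unsigned char *buf, size_t len)
{
    uint64_t h = 1469598103934665603ULL;

    for (size_t i = 0; i < len; i++)
    {
        h ^= buf[i];
        h *= 1099511628211ULL;
    }
    return h;
}

int
main(int argc, char **argv)
{
    unsigned char buf[BLCKSZ];
    size_t  nread;
    long    blkno = 0;
    FILE   *fp;

    if (argc != 2 || (fp = fopen(argv[1], "rb")) == NULL)
    {
        fprintf(stderr, "usage: %s relation-file\n", argv[0]);
        return 1;
    }

    /*
     * A real tool would compare each hash against the one saved with the
     * previous backup and re-send only the blocks whose hashes differ.
     */
    while ((nread = fread(buf, 1, BLCKSZ, fp)) > 0)
        printf("block %ld: %016llx\n", blkno++,
               (unsigned long long) block_hash(buf, nread));

    fclose(fp);
    return 0;
}
```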
A second possible way of identifying the changed data is to have the server keep a bitmap showing which blocks have been changed, as ptrack does. As with the modification-time-based approaches, I completely believe that it's possible to make this work, but again it wasn't my preferred approach. I was worried about code complexity, storage space requirements, crash safety, and performance. I also didn't want to assume that there could only be one possible "reference backup" at any given point in time. If someone takes a full backup on Monday and another full backup on Wednesday, then I wanted them to be able to take an incremental backup on Friday relative to either Monday's backup or Wednesday's backup, as they wished. I think there are ways that you could use a bitmap-based approach and still satisfy that requirement, but I had what I thought was a better idea, which is to use PostgreSQL's write-ahead log to identify the modified blocks.
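For illustration, here's a toy version of the bitmap idea, with made-up names and no persistence or crash safety (so not ptrack's implementation): every write sets a bit, and the backup tool copies exactly the blocks whose bits are set. Notice that a single bitmap naturally encodes changes relative to just one point in time, which is the reference-backup limitation mentioned above.

```c
/*
 * A toy block-modification bitmap, in the spirit of (but NOT the same
 * as) what ptrack maintains: one bit per block, set whenever the block
 * is written. Crash safety and persistence are ignored here.
 */
#include <stdint.h>
#include <stdio.h>

#define MAX_BLOCKS 65536

static uint8_t modified_map[MAX_BLOCKS / 8];

static void
mark_modified(uint32_t blkno)
{
    modified_map[blkno / 8] |= (uint8_t) (1 << (blkno % 8));
}

static int
is_modified(uint32_t blkno)
{
    return (modified_map[blkno / 8] >> (blkno % 8)) & 1;
}

int
main(void)
{
    /* Pretend blocks 3 and 100 were written since the reference backup. */
    mark_modified(3);
    mark_modified(100);

    /* An incremental backup copies exactly the blocks with set bits. */
    for (uint32_t b = 0; b < 128; b++)
        if (is_modified(b))
            printf("copy block %u\n", (unsigned) b);
    return 0;
}
```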
It wouldn't work to have the incremental backup itself read all of the write-ahead log records generated since the reference backup, because the write-ahead log is big. You can't ask users to keep around all of the write-ahead log files between the start of the reference backup and the start of the following incremental backup, because they would fill up their disk. Moreover, even if they somehow had that kind of storage space available, the incremental backup process itself would be incredibly slow if it had to read all of that data. Fortunately, we don't need all of the information from the write-ahead log to perform a correct incremental backup. We only need information about which blocks were modified, and which relations were truncated. We do not need to know specifically what modifications were made to each block, and we don't need to know in what order the blocks were modified, and we don't need to know how many times each block was modified. That means that the overwhelming majority of the information present in the WAL can be discarded, allowing us to distill a possibly massive amount of WAL down to a concentrated extract of useful information that will suffice as a driver for block-level incremental backup.
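Here's an abstract sketch of that distillation, using hypothetical types rather than the real summarizer's data structures: all we keep per relation is a deduplicated set of modified block numbers, plus the lowest point to which the relation was truncated.

```c
/*
 * Hypothetical types, not the committed summarizer's data structures:
 * per relation, keep only a set of modified block numbers and a "limit
 * block", the lowest length to which the relation was truncated.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define MAX_BLOCKS 1024

typedef struct
{
    bool     modified[MAX_BLOCKS]; /* which blocks were touched at all */
    uint32_t limit_block;          /* lowest truncation point seen, or
                                    * UINT32_MAX if never truncated */
} RelSummary;

static void
note_block_modified(RelSummary *s, uint32_t blkno)
{
    /* Idempotent: a block modified a thousand times costs one bit. */
    if (blkno < MAX_BLOCKS)
        s->modified[blkno] = true;
}

static void
note_truncation(RelSummary *s, uint32_t new_length)
{
    if (new_length < s->limit_block)
        s->limit_block = new_length;
}

int
main(void)
{
    RelSummary s = { .limit_block = UINT32_MAX };

    note_block_modified(&s, 0);
    note_block_modified(&s, 0);    /* duplicate: costs nothing extra */
    note_block_modified(&s, 7);
    note_truncation(&s, 4);        /* relation truncated to 4 blocks */

    for (uint32_t b = 0; b < MAX_BLOCKS; b++)
        if (s.modified[b])
            printf("block %u modified\n", (unsigned) b);
    printf("limit block: %u\n", (unsigned) s.limit_block);
    return 0;
}
```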
In the committed patch, there's a new WAL summarizer background process which is enabled if you choose to configure summarize_wal = on. When enabled, it writes WAL summary files into your data directory, under pg_wal/summaries. You get one file per checkpoint cycle, and they're really, really small, even if the amount of WAL that you generated in that checkpoint cycle was really, really large. This happens partly because there's so much information that we can throw out, as described in the previous paragraph, and partly because the file format is extremely efficient. (Many thanks to Dilip Kumar, who helped me figure out how to do this in a really compact way.) If you can construct a test case where the summary files are annoyingly large, I'd be interested in seeing it.
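I won't reproduce the real file format here, but the following toy encoding (my own illustration, not the committed format) shows why such summaries can stay tiny: sort and deduplicate the modified block numbers, then store the deltas between consecutive entries as variable-length integers, so dense runs of modified blocks cost roughly a byte per block reference.

```c
/*
 * An illustration of why summaries can be tiny; this is a toy encoding,
 * NOT the committed file format. Sorted modified-block numbers are
 * stored as deltas encoded as base-128 varints, so dense runs of
 * modified blocks cost about one byte per block reference.
 */
#include <stdint.h>
#include <stdio.h>

/* Encode v as a base-128 varint; returns the number of bytes written. */
static int
encode_varint(uint32_t v, uint8_t *out)
{
    int n = 0;

    while (v >= 0x80)
    {
        out[n++] = (uint8_t) (v | 0x80);
        v >>= 7;
    }
    out[n++] = (uint8_t) v;
    return n;
}

int
main(void)
{
    /* Sorted, deduplicated block numbers from one checkpoint cycle. */
    uint32_t blocks[] = {10, 11, 12, 500, 501, 90000};
    uint8_t  buf[8];
    uint32_t prev = 0;
    int      total = 0;

    for (int i = 0; i < 6; i++)
    {
        total += encode_varint(blocks[i] - prev, buf);
        prev = blocks[i];
    }
    printf("6 block references in %d bytes\n", total);
    return 0;
}
```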
When you request an incremental backup, you need to provide the backup manifest from the previous backup that you wish to use as a reference backup. You don't need to have the whole previous backup available, just the manifest. That manifest is sent to the server, which checks to see whether it has WAL summary files for all of the checkpoint cycles between the start of the previous backup and the start of the current one. If any are missing, we error out. Otherwise, we combine them all and then use the result to perform the incremental backup.
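In outline, that server-side check looks something like this (hypothetical types again; the real code works in terms of checkpoint cycles and timelines, which I'm glossing over): the summaries on hand must chain contiguously across the whole range between the two backups' start points, or the incremental backup must be refused.

```c
/*
 * Hypothetical sketch of the coverage check; the real server-side code
 * works in terms of checkpoint cycles and timelines, which this
 * ignores. Summaries must chain contiguously from the reference
 * backup's start LSN to the new backup's start LSN.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint64_t XLogRecPtr;    /* an LSN, as in PostgreSQL */

typedef struct
{
    XLogRecPtr start_lsn;
    XLogRecPtr end_lsn;
} SummaryFile;

/* Assumes the summaries are sorted by start_lsn. */
static bool
summaries_cover(const SummaryFile *s, int n, XLogRecPtr from, XLogRecPtr to)
{
    XLogRecPtr covered = from;

    for (int i = 0; i < n && covered < to; i++)
    {
        if (s[i].start_lsn > covered)
            return false;       /* a gap: some summary file is missing */
        if (s[i].end_lsn > covered)
            covered = s[i].end_lsn;
    }
    return covered >= to;
}

int
main(void)
{
    SummaryFile s[] = {
        {UINT64_C(0x1000000), UINT64_C(0x2000000)},
        {UINT64_C(0x2000000), UINT64_C(0x3500000)},
    };

    if (summaries_cover(s, 2, UINT64_C(0x1200000), UINT64_C(0x3000000)))
        printf("ok: incremental backup can proceed\n");
    else
        printf("ERROR: WAL summaries missing for part of the range\n");
    return 0;
}
```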
The main thing that's complicated here is handling the case where a relation is truncated, or where it's dropped entirely and a new one is created that happens to use the same set of filenames on disk. In such cases, we have to make sure to properly distinguish between the versions of blocks that existed before the truncation or drop/recreate, and versions that existed afterward. Hence, the WAL summarizer has specific awareness of relation truncation events, and of events that drop and recreate relations. It can happen, for instance, that a relation is truncated to zero length and then extended back to a length of 1 before the new backup starts, but the 1 new block that exists in the relation is not yet initialized. In that case, there will be no mention in the WAL that block 0 has ever been modified, but we still can't use the version from the prior backup, because that old version is gone. Because we know that the relation was truncated to a length of 0, we know that any blocks that are present on disk but not in the WAL must be zero-filled, rather than having contents identical to what they had at the time of the reference backup. This was one of the trickiest parts of the patch to get right -- hopefully, it now is. Testing appreciated.
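The resulting per-block decision can be sketched like this, again with made-up types rather than the committed code: a block gets copied if the combined summaries say it was modified, or if it lies at or beyond the relation's lowest truncation point, because in the latter case the reference backup's copy no longer describes anything that exists.

```c
/*
 * Hypothetical sketch of the per-block decision, using the toy
 * RelSummary from earlier: copy a block if WAL says it changed, or if
 * it sits at or beyond the truncation point, because the reference
 * backup's copy of such a block no longer describes anything real.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef struct
{
    bool     modified[16];  /* from the combined WAL summaries */
    uint32_t limit_block;   /* lowest truncation point since the
                             * reference backup; UINT32_MAX if none */
} RelSummary;

static bool
must_copy_block(const RelSummary *s, uint32_t blkno)
{
    if (blkno >= s->limit_block)
        return true;            /* the old copy is gone; treat as new */
    return s->modified[blkno];  /* otherwise, copy iff WAL says so */
}

int
main(void)
{
    /*
     * The tricky case from the text: relation truncated to 0 blocks,
     * then extended back to one never-written block. WAL never mentions
     * block 0, yet it must be copied: on disk it is zero-filled, not
     * whatever the reference backup contains.
     */
    RelSummary s = { .limit_block = 0 };

    printf("copy block 0? %s\n", must_copy_block(&s, 0) ? "yes" : "no");
    return 0;
}
```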
As always, this feature is not guaranteed to appear in the PostgreSQL 17 release until it actually does; there is always a possibility that a problem might be found which makes it necessary to revert the patch. Indeed, if the patch does have serious bugs, it would be far better to find them now rather than later, so please test. The patch has passed all the tests I've thrown at it so far, and Jakub Wartak has done an incredible amount of really great testing as well, but if you're in a position to do so, please build from the master branch and test it out on any scenario that seems potentially interesting or problematic to you. I've done my best to make this feature as good as I am able to make it, but this is open source software and that means you -- yes, you! -- have a part to play in making it even better.
Merry Christmas if you celebrate it, happy holidays otherwise, and happy new year as well. Thanks for reading.
Thank you very much for this, and to all others involved.
ptrack (since version 2) allows taking an incremental backup from any LSN, because it stores LSNs in its "bitmap" instead of bits. Certainly, that makes the "bitmap" 64x larger (or less capacious at the same size). It has good crash safety, since it writes the map out at each checkpoint and map updates are tied to the write path (smgrwrite and company). So it is fast and reliable even for non-main forks. Still, compared to parsing WAL, ptrack suffers from false positives in its "bitmap": it will claim some portion of unmodified data as modified, and the size of that portion depends on the amount of modified data.
Thanks for the details!