Tuesday, October 15, 2024

Is pg_dump a Backup Tool?

Recently, I've been hearing a lot of experienced PostgreSQL users reiterate this line: "pg_dump is not a backup tool." In fact, the documentation has recently been updated to avoid saying that it is a backup tool, to widespread relief. Experienced PostgreSQL users and developers have been publicly called out for having the temerity to assert that pg_dump is, in fact, a backup tool. I find this narrative deeply frustrating, for two reasons.

The first reason, and the less important of the two, is that it's factually wrong. Prior to July 31, 2024, the documentation had described pg_dump as a backup tool for a very long time. The wording in question was added on April 17, 2003. Now, to be fair, the fact that the documentation has said a certain thing for 20+ years does not automatically make it true. But the fact is that if you use pg_dump and pg_dumpall properly, you can save all the data in your database and load it into a new database at a later time. And, prior to becoming a PostgreSQL hacker, I did exactly that for years and years, and it worked great for me. If by chance my database went down and I lost the contents of my data directory, I could restore from one of my saved dumps and get the system BACK UP.

You know what I call a system that lets you recover your data and get back up when disaster strikes? That's right: I call it a backup tool.

That's not to say that I particularly recommend this approach for the typical user. At the time, I was working with very small PostgreSQL databases with only a small number of users and a small amount of data, even by the standards of that day. Backing up large databases with pg_dump can be very slow, and restoring them can be very slow, too. Because pg_dump holds an MVCC snapshot while it's running, the lengthy backup can be a problem not only in terms of resource utilization, but also because it could result in database bloat that you subsequently need to clean up with VACUUM or, if things get real bad, maybe even with VACUUM FULL. That's not fun at all.

Furthermore, I was (at least in my own opinion) a highly sophisticated user who was prepared to fix any problems by hand, and sometimes I had to do that. For instance, if you use pg_dump to dump a database and later try to restore that dump on a newer version of PostgreSQL, you may get errors. If you take the dump using the version of pg_dump that matches the server version where you intend to restore, this problem shouldn't happen, but if you try to restore an existing, older dump on a new version, it can. Sometimes I ended up doing that, and having to manually adjust the dump to get it to restore.
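To make that advice concrete, here's a minimal sketch of dumping with the target version's client tools rather than restoring an old dump file as-is. The binary paths follow common Debian/Ubuntu packaging and the host and database names are invented; your layout will differ:

    # Dump the old server using the pg_dump that ships with the new
    # (target) PostgreSQL release, so the dump matches what the new
    # server expects.
    /usr/lib/postgresql/17/bin/pg_dump \
        --host=old-server --dbname=mydb \
        --format=custom --file=mydb.dump

    # Restore on the new server (assumes the target database exists).
    /usr/lib/postgresql/17/bin/pg_restore \
        --host=new-server --dbname=mydb mydb.dump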

Also, if you use pg_dump, you need to remember to run it on each database individually, plus run pg_dumpall -g to save all of your globals, plus back up any configuration files like pg_hba.conf or postgresql.conf that aren't saved by either pg_dump or pg_dumpall. Unless you really know what you're doing, this is all a bit error-prone.
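For concreteness, here's roughly the shape such a routine might take. This is just a sketch with invented paths and a hard-coded configuration directory, and a real setup would also need error handling and monitoring:

    #!/bin/sh
    # Hypothetical nightly dump routine.
    BACKUP_DIR=/backups/$(date +%Y-%m-%d)
    mkdir -p "$BACKUP_DIR"

    # Roles, tablespaces, and other globals.
    pg_dumpall -g > "$BACKUP_DIR/globals.sql"

    # Each database individually, in custom format.
    for db in $(psql -d postgres -At -c "SELECT datname FROM pg_database WHERE NOT datistemplate"); do
        pg_dump --format=custom --file="$BACKUP_DIR/$db.dump" "$db"
    done

    # Configuration files, which neither pg_dump nor pg_dumpall saves.
    cp /etc/postgresql/17/main/postgresql.conf \
       /etc/postgresql/17/main/pg_hba.conf "$BACKUP_DIR/"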

This gets at another weakness of the pg_dump approach, which is that pg_dump, like pg_basebackup, will give you no help at all in orchestrating your backups. If you want to keep the last N backups, for example, pg_dump will not provide you with any assistance. When you need to restore your global objects (from pg_dumpall -g) and then all of your individual database objects (from pg_dump on each individual database), you'll get no help doing all of that in the right order or with the right amount of parallelism, just as you'll also get no help deciding whether to take the dump in tar or custom format in the first place. That's all stuff you have to figure out.
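To give a flavor of what that orchestration involves, here's a hedged sketch of a restore that pairs with a dump layout like the one above; again, the paths, parallelism, and retention policy are made up:

    # 1. Globals first, so the roles and tablespaces that own objects exist.
    psql -f /backups/2024-10-15/globals.sql postgres

    # 2. Then each database, recreated from its custom-format dump;
    #    -j runs several restore workers in parallel.
    for dump in /backups/2024-10-15/*.dump; do
        pg_restore --create --dbname=postgres -j 4 "$dump"
    done

    # 3. Retention is also your problem; e.g. keep only about a week of dumps.
    find /backups -mindepth 1 -maxdepth 1 -type d -mtime +7 -exec rm -rf {} +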

For all of these reasons, and probably some others that I'm forgetting, it's usually going to be better to use a professionally written, out-of-core backup tool like barman or backrest. Those tools will arrange to take a physical copy of the database files plus the associated WAL files necessary to make the database consistent. Aside from avoiding some of the problems already mentioned, that lets you do point-in-time recovery. If you use pg_dump to save the contents of your database once a day -- and let's just suppose that your database is small enough that this is a feasible approach -- then when disaster strikes you will have to go back to the last daily backup, and all changes since then are gone. If, on the other hand, you use a proper backup tool that is correctly configured, you should be able to arrange to replay any WAL that was generated after the last backup and successfully archived or streamed to another server, which should mean that you have only seconds of data loss, or none at all, rather than hours of data loss. For a lot of people, that's huge.
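The machinery underneath those tools is WAL archiving plus point-in-time recovery, which in a stripped-down, do-it-yourself form looks something like the following postgresql.conf settings. The archive path and target time are invented, and a real deployment would use the tool's own archiver rather than cp:

    # On the primary: continuously archive completed WAL segments.
    archive_mode = on
    archive_command = 'cp %p /mnt/wal-archive/%f'

    # On a restored copy of a base backup: replay archived WAL up to a target
    # time (and create recovery.signal to put the server into recovery).
    restore_command = 'cp /mnt/wal-archive/%f %p'
    recovery_target_time = '2024-10-15 03:27:00'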

But none of that means that pg_dump isn't a backup tool. What it means is that pg_dump isn't the right backup tool for most people. Using a well-maintained and perhaps professionally supported tool has tons of advantages and is almost always going to be better than a home-grown script, regardless of whether your home-grown script uses pg_basebackup or pg_dump or some other mechanism. But if what pg_dump does is more suitable for your particular needs, then absolutely use it and don't let anyone make you feel guilty about it. I did it for a long time, and one of the things that I loved about it (as a relative PostgreSQL novice) is that I had all of my data in what was basically a text file. If I had a copy of the data directory on disk, then I needed to rely on PostgreSQL to make sense of that, and if it couldn't, I was in trouble. But I was pretty confident that no matter how much went wrong, I was always going to be able to find some way to recover my data from the output of pg_dump, because the text file was something that I could read, understand, and get data out of somehow or other, even if it involved breaking out awk and sed.

Even today, I think it's completely reasonable to want to have a logical copy of your database somewhere, rather than just a physical copy. If I were doing it these days, and assuming the database was of more than trivial size, I might think of doing that with logical replication (which did not exist at that time) rather than with pg_dump. One big disadvantage of relying entirely on physical backups is that if some bytes get flipped and make some of your data corrupt or unreadable, that damage may also be replicated. That's less likely to happen with a logical copy -- once the source data becomes unreadable, any changes to that data are likely to fail, or at least cause logical replication to cease, which can prevent the corruption from spreading to all of your copies. Had my nightly pg_dump run failed back in the day because some of the source data was unreadable, I would have noticed the next day, or even sooner, because I actively monitored the success and failure of my pg_dump-based backups, rather than just assuming everything was working. When things went wrong, I could immediately take steps to address the problem before things got worse.
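As a rough sketch of that alternative, a logical copy can be kept current with a publication on the source and a subscription on the copy. The names and connection string here are invented, and the tables must already exist on the subscriber (for example, restored from a schema-only pg_dump):

    # On the source database: publish every table.
    psql -d mydb -c "CREATE PUBLICATION logical_copy FOR ALL TABLES;"

    # On the copy: subscribe and start streaming changes.
    psql -d mydb -c "CREATE SUBSCRIPTION logical_copy_sub \
        CONNECTION 'host=primary.example.com dbname=mydb user=replicator' \
        PUBLICATION logical_copy;"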

The second, and more serious, problem with the "pg_dump is not a backup tool" narrative is that it's invalidating. Without offering any specific fact or theory or argument, it asserts that if you think that pg_dump is a backup tool, you're foolish. You're so foolish, in fact, that the speaker need not justify the outlandish claim that pg_dump does not do what the documentation has clearly said that it does for more than two decades. If you haven't heard somebody else say it already, then allow me to be the first: you are not foolish for thinking that pg_dump can be used to back up your PostgreSQL instance. It totally can. It is often not the best approach, but sometimes it may be just perfect. The case mentioned above, where you want to have the contents of your database in text format that you can read with non-PostgreSQL tools, is just one example of where that may be so. Others exist, too.

More generally, you're not foolish for wanting a clear, understandable reason for any best practice that someone says you should follow, whether PostgreSQL-related or otherwise. Best practices are the fruit of human experience. People try a bunch of things and find out which ones work well and what goes wrong when they do something else. It's the stuff that goes wrong when the best practice is not followed that makes it a best practice, and it's the ability to explain what the hazards are that makes someone an expert. "Don't use pg_dump," like any other rule, is as valid as the argument used to justify it.

10 comments:

  1. Except pg_dump/pg_restore doesn’t restore to an identical database in all cases:

    https://pgsql-bugs.postgresql.narkive.com/o34K18tC/bug-9555-pg-dump-for-tables-with-inheritance-recreates-the-table-with-the-wrong-order-of-columns

    Replies
    1. That link doesn't load for me. I do totally believe that pg_dump has bugs and limitations, but so does most software. There's no guarantee that WAL recovery is perfect either, although I have found it and pg_dump to be pretty reliable in most cases. In any case, what a piece of software is designed to do and how many bugs it has are two separate questions.

  2. Well said!

    Restoring binary data requires the same Postgres version, OS, CPU architecture, and packaging / build options.

    Restoring from pg_dump works almost universally with the same Postgres version, and we can manually fix problems caused by version differences as you noted you have done in the past, and as many of us have done.

    Of course pg_dump isn't the best backup tool for big databases, for everyone, or for all circumstances, but it certainly is a useful, time-tested backup tool.

    Replies
    1. Long time, no see, Jon! Hope we can catch up sometime.

  3. Thank you for this testimony and clarification.
    I have used pg_dump since PostgreSQL 9.0.
    1- It is easy to use at client sites with pgAgent.
    2- I did not find any Windows tools dedicated to backup/restore.
    (translated by Google Translate)

  4. I agree pg_dump/pg_restore is a backup solution.

    Actually: Is there any other backup solution which allows one to selectively back up only some of the databases in a Postgres instance? The ability to do that is rather important, because a Postgres instance may host databases with very different backup needs: Some databases may be completely derived from other sources, and it may be quicker to restore them by reloading their data. Other databases may be pre-production related, and it's easier to make a fresh copy from production than to restore from backup.

    Sure, one could in principle make a complete Postgres instance per database, but that would be very wasteful of memory.

    Replies
    1. That's what I love about pg_dump, the ability to selectively back up some databases and even schemas. For example, I have a schema called staging which I use just for preloading data from files. I don't need that staging schema backed up, and in fact it can be huge. With pg_dump, I get a much smaller backup of only the critical stuff.

    2. Well, pgBackRest allows for selective restoration, at least. However, I agree it is much simpler to perform selective backups with pg_dump.

  5. Even with replication, and ZFS snapshots and sending of my datadir, I still kept `pg_dump`ing data to disk. It was part of my process for restoring data that was deleted and then later discovered to be data that should not have been deleted. `pg_restore -l` and `pg_restore -L ... | awk ...` made this process very simple. Binary backups mean I have to restore the backup to a new server / VM so I can merge old data back into the live database.

    It's all layers of defense.

  6. Sounds to me like asking whether dump(1) is a backup tool or not. Quite frankly, I always present pg_dump as a "base backup tool", explaining that users will need to add some more logic for a complete and reliable backup. I would like to know if Oracle's exp is a backup tool...
