Monday, March 26, 2012

Why Is My Database Corrupted?

Working in the support department (as I do) is a good way to hear lots of horror stories.  From time to time, we get reports of what can only be described as database corruption: errors reading tables (or even trying to connect to the database) due to missing files, corrupted data blocks or tuples, or rows that show up in the table but not its indexes.  Attempting to decode a corrupted tuple can cause PostgreSQL to attempt to allocate a gigantic amount of memory (and fail with an ERROR) or even to die with a segmentation fault, aborting all in-progress transactions and restarting the server.   Such occurrences are painful, to say the least, and users are often left wondering how it happens.  How exactly does the database get corrupted?

One obvious possibility is bad hardware.  If your hardware is defective, then the data we read back from the disk may be different from the data we previously wrote to the disk, and at that point, all bets are off.  More rarely, it's the memory rather than the disk that goes bad; typically, this causes actual database crashes rather than subtler symptoms of corruption, but not always.  Kernel or filesystem bugs can masquerade as hardware corruption; surprisingly, there really are kernels and filesystems out there that don't always return the data that PostgreSQL last wrote when it's read back, often due to locking bugs that only manifest at high levels of concurrency.

Operating system crashes (such as a reboot due to unexpected power loss) can cause database corruption, if the server is misconfigured or if the kernel and storage hardware report success before they've actually written the data to stable storage.  PostgreSQL relies on write-ahead logging to provide crash recovery; to operate properly, PostgreSQL must be able to guarantee that write-ahead log entries reach stable storage before the corresponding data file changes.  There are a variety of ways that this can fail.  First, if the configuration parameter fsync is set to off, then PostgreSQL will not even attempt to flush data to disk at the appropriate times.  This can results in a large performance gains, but the price is that crash recovery is no longer reliable.  Do not configure fsync=off unless you are willing to recreate your entire database from scratch (or restore from backup) after an operating system crash.  Second, on some systems, the operating system will report that fsync() has succeeded even though the data hasn't truly been written to stable storage.  On Windows and MacOS X, this can be avoided by configuring wal_sync_method=fsync_writethrough.  On Linux, it's sometimes necessary to disable write-caching at various levels, change mount options or filesystem selection, or simply run with a newer Linux distribution where the code is less buggy.  New versions of ext4 seem to be smarter about this than older ext3 code.  Greg Smith has written extensively on this topic; see

Mistakes during backup and restore are a surprisingly frequent cause of database corruption.  On the backup side of the equation, people sometimes neglect to use pg_start_backup() and pg_stop_backup when making a base backup, or do so incorrectly.  If running your backups from a script, it is important to include error checks in the script, or at the very least to save the output to a log file.  Otherwise, it is difficult to know, later on, whether an unexpected error might have occurred, resulting in an incomplete or corrupt backup.  Especially on Windows, it's important to make sure that the program used to do the actual file copy can copy files even if they are open and being concurrently modified by other processes.

Errors on the restore side are even more common, and there are two that are very easy to make: failing to create recovery.conf, or removing the backup_label file.  A hot backup taken using pg_start_backup and pg_stop_backup may be inconsistent, since changes to the data files can occur while the backup is in progress.  To ensure a consistent backup, the write-ahead log entries generated during the backup must be replayed in their entirety.  To make sure this happens, you must create recovery.conf before starting the server on your backup copy of the files, as described in step 7 of the PostgreSQL documentation for Recovering Using a Continuous Archive Backup.  If you do not, the server will believe it's doing crash recovery and may not replay the write-ahead log as far as necessary to reach consistency, resulting in a corrupt database.  Similarly, you should never remove the backup_label file when recovering from a hot backup: otherwise, the server may begin recovery at a checkpoint that happened some time in the middle of the hot backup, rather than (as would be correct) at the checkpoint that occurred when you ran pg_start_backup().  The server will sometimes issue a hint that says "If you are not restoring from backup, try removing the file backup_label"; and it is important to realize that it is only acceptable to remove this file if the master crashes during a hot backup and you are attempting to get the master back up and running.  If you remove the file on the backup machine, you may end up with a corrupted database.

In lieu of the hot backup procedure using pg_start_backup and pg_stop_backup, some users opt to use a SAN or some other storage technology to make an instantaneous snapshot of the state of the database.  When PostgreSQL is started on the backup directory, it will appear to the server that the system crashed at the instant the snapshot was taken, and crash recovery will return the system to a working state, without corruption.  The PostgreSQL documentation on File System Level Backup describes this as taking a "consistent snapshot".  The trick is that it's only reliable if the snapshot is truly instantaneous.  If it's not - you guessed it - you may end up with a corrupted database.  This can be particularly tricky to manage when you have multiple tablespaces spread across several different file systems.

You can also end up with a corrupted database after using pg_resetxlog.  If you're unlucky enough to have to use pg_resetxlog, you should use pg_dump to backup and restore your database.  Any error that's serious enough to require pg_resetxlog is probably also serious enough to have chewed up your data; it is unwise to rely on the state of the database not being corrupted in any other way.

If you're pretty sure that your hardware is good, your writes are reliable (or you haven't had any crashes), your backup procedures are impeccable (or you've never restored from backup), and you've never run pg_resetxlog, then you might have found a server bug (example, patch).  Detailed bug reports are always appreciated, so that we can try to fix the problem.


  1. Another source of corruption I've witnessed is someone putting their database cluster on shared storage, and pointing 2 separate instances of PostgreSQL on separate servers at the same data directory, with one supposedly being a standby. This *could* be okay if there were processes which monitored both servers and could guarantee that only one could run at a time. But unfortunately the failover mechanism was completely ignorant of the other server, and after the primary server had crashed, the instance of PostgreSQL on the standby was manually started. Later, the original primary was rebooted as part of maintenance, and of course PostgreSQL started on that during boot and tried to use the already-in-use data directory... producing a series of horrific error messages.

  2. I continue to believe that if we had page-level checksums we would find a lot of cases of corruption that have been hitherto undetected. Every time I add sanity checks to a system, they start tripping and asserting on bugs that had been silently causing problems. I mentioned this a while ago too:

    1. Checksums are available beginning in PostgreSQL 9.3.

      To use them, initialize your database cluster with 'initdb -k'.

  3. Great writeup, Robert.

    I'll be really pleased when postgresql gets page checksums too. I remember a time when innodb having not one but two copies of the checksum (one at the start and one at the end of each page) was enough to show we almost certainly had a faulty raid card that had corrupted the page write - I would have spent weeks trying out different versions and checking for software bugs! (And indeed swapping machine at the colo did solve the problem that had plagued us.)

  4. Hi,
    My name is Tatiana, and I have a problem with a PostgreSQL database. When I try to access the program and type the password I receive an error message:

    Server doesn`t listen
    The server doesn`t accept connections: the connection library reports
    Could not connect to server: Connection refused (0x0000274D/10061) Is the server running on host “” and accepting TCP/IP connections on port 5432?
    If you encounter this message, please check if the server you`re trying to contact is actually running PostgreSQL on the given port. Test if you have network connectivity from your client to the server host using ping or equivalent tools. Is your network / VPN/SSH tunnel / firewall configured correctly?
    For security reasons; PostgreSQL does not listen on all available IP address on the server machine initially. In order to access the server over the network, you need to enable listening on the address first.
    For PostgreSQL servers starting with version 8.0, this is controlled using the “listen _address” parameter in the postgresql.conf file. Here, you cant enter a list of IP addresses the server should listed on, or simply use ‘*’ to listen on all available IP addresses. For earlier server (Version 7.3 or 7.4), you`ll need to set the “tcpip_socket” parameter to ‘true’.
    You can use the postgresql.conf editor that is built into pgAdmin III to edit the postgresql.conf configuration file. After changing this file, you need to restart the server process to make the setting effective.
    If you double-checked your configuration but still get this error message, it`s still unlikely that you encounter a fatal PostgreSQL misbehavior. You probably have some low level network connectivity problems (e.g. firewall configuration). Please check this thoroughly before reporting a bug to the PostgreSQL community.

    I checked the postgreconf.file, and it is configurated with ‘*’ for “listen_addresses”. The other thing that I checked was the firewall, which said “Allowed the conection” to port 5432.

    The logs file says:
    2012-04-03 14:48:17 VET LOG: database system was interrupted while in recovery at 2012-04-03 14:39:01 VET
    2012-04-03 14:48:17 VET HINT: This probably means that some data is corrupted and you will have to use the last backup for recovery.
    2012-04-03 14:48:17 VET LOG: database system was not properly shut down; automatic recovery in progress
    2012-04-03 14:48:17 VET LOG: redo starts at 1D/2D0689F0
    2012-04-03 14:48:17 VET LOG: loaded library "$libdir/plugins/plugin_debugger.dll"
    2012-04-03 14:48:17 VET FATAL: the database system is starting up
    2012-04-03 14:48:18 VET LOG: could not open file "pg_xlog/000000010000001D00000031" (log file 29, segment 49): No such file or directory
    2012-04-03 14:48:18 VET LOG: loaded library "$libdir/plugins/plugin_debugger.dll"
    2012-04-03 14:48:18 VET FATAL: the database system is starting up
    2012-04-03 14:48:19 VET LOG: startup process (PID 2580) exited with exit code 255
    2012-04-03 14:48:19 VET LOG: aborting startup due to startup process failure

    When I went to “Services” to start the service, I received this message:
    The PostgreSQL Database Server 8.3 service on Local Computer started and then stopped. Some services stop automatically if the are not in use by other services or programs.
    Thank you for your time and assistance.

    1. Use pg_resetxlog $PGDATA.

  5. Great write-up. Thanks Robert.

    I have a small query. Lets say that due to some corruption, automatic recovery is triggered by postgres. This results in "redo start at 0/9A3F58" as I can in the database logs. As part of the recovery, I suppose it would try to insert the records for a table. Does it cause database triggers for that table to be executed as well.

    1. No. Redo will replay whatever the triggers did before, but they won't actually be executed again.

  6. Virtually every case of database corruption that I've ever run into was on a developer database server where the developers rarely look to see whether the tablespace they're about to sling a 50 gigabyte database onto actually has sufficient room for that database. Invariably we get the situation where Postgres thinks it extended a file but didn't because the filesystem was out of space, but records that it stored data to the place it couldn't store data because (duh) the filesystem was full, and then we end up having to do a drop of every database in the tablespace, destroy it, and recreate it from scratch in order to deal with the fact that Postgres apparently doesn't notice that, well, it ran out of space. It's annoying as heck and I've tried to train the developers "don't do that!" but you know how it goes, it's like herding puppies... but man, it sure would be nice if Postgres could handle that without corrupting the whole database. SIGH.

    1. This is supposed to work properly without corrupting the database. If you're sure it doesn't, you might want to post a reproducible test case to pgsql-bugs.