Working in the support department (as I do) is a good way to hear lots of horror stories. From time to time, we get reports of what can only be described as database corruption: errors reading tables (or even trying to connect to the database) due to missing files, corrupted data blocks or tuples, or rows that show up in the table but not its indexes. Attempting to decode a corrupted tuple can cause PostgreSQL to attempt to allocate a gigantic amount of memory (and fail with an ERROR) or even to die with a segmentation fault, aborting all in-progress transactions and restarting the server. Such occurrences are painful, to say the least, and users are often left wondering how it happens. How exactly does the database get corrupted?
One obvious possibility is bad hardware. If your hardware is defective, then the data we read back from the disk may be different from the data we previously wrote to the disk, and at that point, all bets are off. More rarely, it's the memory rather than the disk that goes bad; typically, this causes actual database crashes rather than subtler symptoms of corruption, but not always. Kernel or filesystem bugs can masquerade as hardware corruption; surprisingly, there really are kernels and filesystems out there that don't always return the data that PostgreSQL last wrote when it's read back, often due to locking bugs that only manifest at high levels of concurrency.
Operating system crashes (such as a reboot due to unexpected power loss) can cause database corruption, if the server is misconfigured or if the kernel and storage hardware report success before they've actually written the data to stable storage. PostgreSQL relies on write-ahead logging to provide crash recovery; to operate properly, PostgreSQL must be able to guarantee that write-ahead log entries reach stable storage before the corresponding data file changes. There are a variety of ways that this can fail. First, if the configuration parameter fsync is set to off, then PostgreSQL will not even attempt to flush data to disk at the appropriate times. This can results in a large performance gains, but the price is that crash recovery is no longer reliable. Do not configure fsync=off unless you are willing to recreate your entire database from scratch (or restore from backup) after an operating system crash. Second, on some systems, the operating system will report that fsync() has succeeded even though the data hasn't truly been written to stable storage. On Windows and MacOS X, this can be avoided by configuring wal_sync_method=fsync_writethrough. On Linux, it's sometimes necessary to disable write-caching at various levels, change mount options or filesystem selection, or simply run with a newer Linux distribution where the code is less buggy. New versions of ext4 seem to be smarter about this than older ext3 code. Greg Smith has written extensively on this topic; see http://wiki.postgresql.org/wiki/Reliable_Writes.
Mistakes during backup and restore are a surprisingly frequent cause of database corruption. On the backup side of the equation, people sometimes neglect to use pg_start_backup() and pg_stop_backup when making a base backup, or do so incorrectly. If running your backups from a script, it is important to include error checks in the script, or at the very least to save the output to a log file. Otherwise, it is difficult to know, later on, whether an unexpected error might have occurred, resulting in an incomplete or corrupt backup. Especially on Windows, it's important to make sure that the program used to do the actual file copy can copy files even if they are open and being concurrently modified by other processes.
Errors on the restore side are even more common, and there are two that are very easy to make: failing to create recovery.conf, or removing the backup_label file. A hot backup taken using pg_start_backup and pg_stop_backup may be inconsistent, since changes to the data files can occur while the backup is in progress. To ensure a consistent backup, the write-ahead log entries generated during the backup must be replayed in their entirety. To make sure this happens, you must create recovery.conf before starting the server on your backup copy of the files, as described in step 7 of the PostgreSQL documentation for Recovering Using a Continuous Archive Backup. If you do not, the server will believe it's doing crash recovery and may not replay the write-ahead log as far as necessary to reach consistency, resulting in a corrupt database. Similarly, you should never remove the backup_label file when recovering from a hot backup: otherwise, the server may begin recovery at a checkpoint that happened some time in the middle of the hot backup, rather than (as would be correct) at the checkpoint that occurred when you ran pg_start_backup(). The server will sometimes issue a hint that says "If you are not restoring from backup, try removing the file backup_label"; and it is important to realize that it is only acceptable to remove this file if the master crashes during a hot backup and you are attempting to get the master back up and running. If you remove the file on the backup machine, you may end up with a corrupted database.
In lieu of the hot backup procedure using pg_start_backup and pg_stop_backup, some users opt to use a SAN or some other storage technology to make an instantaneous snapshot of the state of the database. When PostgreSQL is started on the backup directory, it will appear to the server that the system crashed at the instant the snapshot was taken, and crash recovery will return the system to a working state, without corruption. The PostgreSQL documentation on File System Level Backup describes this as taking a "consistent snapshot". The trick is that it's only reliable if the snapshot is truly instantaneous. If it's not - you guessed it - you may end up with a corrupted database. This can be particularly tricky to manage when you have multiple tablespaces spread across several different file systems.
You can also end up with a corrupted database after using pg_resetxlog. If you're unlucky enough to have to use pg_resetxlog, you should use pg_dump to backup and restore your database. Any error that's serious enough to require pg_resetxlog is probably also serious enough to have chewed up your data; it is unwise to rely on the state of the database not being corrupted in any other way.
If you're pretty sure that your hardware is good, your writes are reliable (or you haven't had any crashes), your backup procedures are impeccable (or you've never restored from backup), and you've never run pg_resetxlog, then you might have found a server bug (example, patch). Detailed bug reports are always appreciated, so that we can try to fix the problem.