As I mentioned in a previous blog post, I spend some of my time working in and with EnterpriseDB's support department. And what that means is that every customer I talk to has a problem, typically a fairly serious problem, and they want me to help them fix it. Of course, to fix it, you first have to be able to identify the problem, and sometimes that's not so simple. Database crashes can be among the more difficult cases to debug.
Of course, PostgreSQL is a pretty reliable database - that's one of the major reasons why I have been using it for more than 10 years now, and have been involved in the development process for about 3 years. But there are still bugs, especially in betas or newly released major versions. Now, some of those bugs involve a user who tried something new and different and, boom, the database crashed. Those are typically not difficult to debug, because the user has a reproducible test case. A developer can simply do the same thing the user did on their own copy of PostgreSQL, but with a debugger attached, and in most cases it quickly becomes obvious what the problem is.
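Just as a sketch of what that looks like in practice - the PID here is invented, but pg_backend_pid() and gdb's -p option are the standard tools:

    -- in psql, on your own copy of PostgreSQL, find your backend's PID
    SELECT pg_backend_pid();

    # in another terminal, attach a debugger to that backend; then go
    # back to psql and run whatever the user's test case says to run
    gdb -p 16385
    (gdb) continue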
The really annoying crashes to debug are random crashes that can only be reproduced under load. The user doesn't know what SQL statement is causing the crash (or even whether it's being caused by an SQL statement or some external event, like bad RAM or an uncooperative operating system - I'm looking at you, Microsoft Windows). In current versions of PostgreSQL, the postmaster reports the PID of the process that crashed, and the signal or exit code reported by that process, which really doesn't tell you all that much. What you typically want to know is what that process was doing at the time that it kicked the bucket.
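For a segmentation fault, the postmaster's report in the server log looks something like this (PID invented for illustration):

    LOG:  server process (PID 16385) was terminated by signal 11
    LOG:  terminating any other active server processes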
Now, the good news is that there are a couple of possible methods for obtaining that information. One trick is to set log_line_prefix to some string containing %p, and to set log_min_duration_statement=0. This includes the PID in each message written to the log file, and logs all queries. That way, when the crash happens, you can look at the PID in the crash message and read backwards through the log file until you find the query that caused the problem. It can also be helpful to set log_min_messages=DEBUG2 (or higher); sometimes that provides some additional detail on where the process was in its execution when it crashed. The downside of this approach is that it generates a lot of log chatter, which can be problematic in some workloads. If you're really unlucky (and I sometimes have been), cranking up the log level slows the system down just enough to prevent the race condition that's causing the crash from happening in the first place.
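In postgresql.conf, that boils down to something like this (the exact prefix string is a matter of taste, as long as it includes %p):

    log_line_prefix = '%p '           # tag each log line with the backend's PID
    log_min_duration_statement = 0    # log every statement, regardless of duration
    #log_min_messages = debug2        # optional: more detail, but more chatter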
Another approach is to enable core dump collection, and then extract a stack backtrace (or print out the value of the variable debug_query_string) once you have a core dump. On Linux, this can be done by running ulimit -c unlimited before starting the server, or by using the -c option to pg_ctl start. On Windows, if you're running PostgreSQL 9.1, you can create a "crashdumps" subdirectory inside the data directory. On earlier versions, it's harder.
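Roughly, the Linux procedure looks like this - the paths here are made up and will depend on your installation, and where the core file actually lands is governed by the kernel's core_pattern setting (by default it ends up in the data directory, since that's the backend's working directory):

    # allow core files, then start the server from the same shell
    ulimit -c unlimited
    pg_ctl -D /path/to/data start     # or use pg_ctl's -c option instead of ulimit

    # after a crash, point gdb at the postgres binary and the core file
    gdb /path/to/bin/postgres /path/to/data/core
    (gdb) bt                          # where was the process when it died?
    (gdb) print debug_query_string    # what query was it running?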
A major disadvantage of all of these approaches is that they require advance planning. So, when a customer has an unexpected crash, we often have to ask them to do some of the work described above, and then wait for the crash to recur. This lengthens the debugging cycle. That's why I'm pretty excited about a new feature just committed to PostgreSQL 9.2, which Marti Raudsepp wrote, I committed, and Tom Lane further improved: when a backend crashes, it tries to dump out the query string (or at least the first 1024 bytes of it). We'll need to get some real-world experience with this before we know for sure how helpful it will turn out to be, but I'm hoping it's going to greatly simplify debugging those rare but very tricky cases where the database crashes at unpredictable intervals.
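If all goes well, the crash report in the server log should now carry an extra detail line, something like this (PID and query invented for illustration):

    LOG:  server process (PID 16385) was terminated by signal 11
    DETAIL:  Failed process was running: SELECT ...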