Monday, October 24, 2011

PostgreSQL Crash Debugging

As I mentioned in a previous blog post, I spend some of my time working in and with EnterpriseDB's support department.  And what that means is that every customer I talk to has a problem, typically a fairly serious problem, and they want  me to help them fix it.  Of course, to fix it, you first have to be able to identify the problem, and sometimes that's not so simple.  Database crashes can be among the more difficult cases to debug.

Of course, PostgreSQL is a pretty reliable database - that's one of the major reasons why I have been using it for more than 10 years now, and involved in the development process for about 3 years.  But there are still bugs, especially in betas or newly released major versions.  Now, some of those bugs involve a user who tried something new and different and, boom, the database crashed.  Those are typically not difficult to debug, because the user has a reproducible test case.  A developer can simply do the same thing the user did on their own copy of PostgreSQL, but with a debugger attached, and in most cases it quickly becomes obvious what the problem is.

The really annoying crashes to debug are random crashes that can only be reproduced under load.  The user doesn't know what SQL statement is causing the crash (or even whether it's being caused by an SQL statement or some external event, like bad RAM or an uncooperative operating system - I'm looking at you, Microsoft Windows).  In current versions of PostgreSQL, the postmaster reports the PID of the process that crashed, and the signal or exit code reported by that process, which really doesn't tell you all that much.  What you typically want to know is what that process was doing at the time that it kicked the bucket.

Now, the good news is that there are a couple of possible methods for obtaining that information.  One trick is to set log_line_prefix to some string which contains %p and log_min_duration_statement=0.  This includes the PID in each message written to the log file, and logs all queries.  That way, when the crash happens, you can look at the PID in the crash message and read backwards through the log file until you find the query that caused the problem.  It can also be helpful to set log_min_messages=DEBUG2 (or higher); sometimes that provides some additional detail on where the process was in its execution when it crashed.  The downside of this approach is that generates a lot of log chatter, which can be problematic in some workloads.  If you're really unlucky (and I sometimes have been), cranking up the log level slows the system down just enough to prevent the race condition that's causing the crash from happening in the first place.

Another approach is to enable core dump collection, and then extract a stack backtrace (or print out the value of the variable debug_query_string) once you have a core dump.  On Linux, this can be done by running ulimit -c unlimited before starting the server, or by using the -c option to pg_ctl start.  On Windows, if you're running PostgreSQL 9.1, you can create a "crashdumps" subdirectory inside the data directory.  On earlier versions, it's harder.

A major disadvantage of all of these approaches is that they require advance planning.  So, when a customer has an unexpected crash, we often have to ask them to do some of the work described above, and then wait for the crash to recur.  This lengthens the debugging cycle.  That's why I'm pretty excited about a new feature just committed PostgreSQL 9.2, which Marti Raudsepp wrote, I committed, and Tom Lane further improved: when a backend crashes, it tries to dump out the query string (or the first 1024 bytes of the query string).  We'll need to get some real-world experience with this before we know for sure how helpful it will turn out to be, but I'm hoping it's going to greatly simplify debugging those rare but very tricky cases where the database crashes at unpredictable intervals.

7 comments:

  1. Come on! You mean there are guys out there using a production PostgreSQL database on a windows machine??? I'm interested in the reason. Why?

    ReplyDelete
  2. I'm happy to say, in ages that I've been using postgresql on live systems (5+ years), never crashed. Even tho we used gist indexes on 8.1, with our own data types, etc. Only things that have crashed were our own functions, during development/testing cycle - but that is to be expected.

    ReplyDelete
  3. @Greg Jaskiewicz: Note that this feature will also be useful for debugging crashes the sorts of crashes you inflict on yourself by mistakes in custom C functions. I agree that crashes are not common, but if you have enough people running enough copies of PostgreSQL on enough systems, they will happen.

    @maxim: Yeah, there are quite a few of them, actually. My personal preference would be Linux, but many people are more comfortable working in a Windows environment, and having a Windows port available for PostgreSQL has, I think, done very good things for PostgreSQL adoption. So it's something that we need to continue to support. Keep in mind that not all production deployments of PostgreSQL involve a dedicated machine; PostgreSQL is not infrequently bundled with other software as part of a larger solution.

    ReplyDelete
  4. Robert, I have about 20 customers. Each one of them running from 3 to 50 boxes. Say 10 on average, that gives me 200 installations of postgresql. Same database schemas, different usage patterns.
    So far I've been very happy (8.3.x). Moving to 9.1 in the future, will see how that stands the test :)

    Now, I would love a feature where postgresql would produce backtrace on its own before crashing. Using one of many available libs that do it. I suppose, the alternative is just to core.

    Very happy so far.

    ReplyDelete
  5. We are using postgres 8.4.x for some time already and I must say it's been pretty reliable database. Probably better than many commercial DBs. I only have seen few times when the DB would terminate abnormally. One was back in 8.3.older_release when if you terminate vacuum analyze command - it would take down the server. This was fixed long time ago ;)
    The other was a one of these random cases, which may be hardware related: under really heavy DW-type load (monthly aggregates) around 15 "group by" queries running at the same time with work_mem tweaked to between 1 and 4GB (server has 72G RAM). Sometimes Postgres would start run out of physical and then virtual memory and then will be terminated by linux. When I rerun the same queries (with same parallelism and same memory settings) - they would work. I have fixed this by lowering work_mem in individual queries.

    However I was never been able to get to the bottom of this issue, so any debugging/logging info will be useful. Also it would be useful if it can be turned on/of from psql..
    Thanks.

    ReplyDelete
  6. @maxim
    I know your comment was only half-serious, because Linux and UNIX are truly superior platforms for an SQL DBMS, but it concerns me when some people takes questions like that too seriously. Since I've been in IT, I've seen both Windows and Linux zealots try to force their favorite software down other people's throats rather than picking the best tool for the job. I really detest Windows, but it's the best OS for my company (for now) because we are in industrial automation where protocols like OPC tie us to Windows, and the customer often has an embedded PC here or there rather than centralized IT. If you have only one computer on an isolated network and it has to run Windows, wouldn't you be happier for PostgreSQL to be running on that Windows box rather than MS SQL Server? :-)

    ReplyDelete
  7. Windoze has been great for PostgreSQL and it is still presently the right tool for many jobs. :-)

    ReplyDelete