Monday, October 17, 2011

Deadlocks

Last week, someone pinged me on instant messenger to ask about the following message, which their PostgreSQL instance had just produced:

DETAIL: Process 22986 waits for ShareLock on transaction 939; blocked by process 22959.
Process 22959 waits for ShareLock on transaction 940; blocked by process 22986.

This message is a complaining about a deadlock.  But unless you've seen and debugged these a few times before, it might not be entirely obvious to you what's actually going on here.  What, exactly, did the offending processes do that caused the problem?

Of course, any deadlock is always going to involve at least two processes.  Process A is holding lock X and waiting for lock Y, while process B holds lock Y and is waiting for lock X.  Neither process can make forward progress, because each is waiting for a lock.  And neither will ever be able to obtain the lock they need, because in each case the other process holds it.  In this situation, the deadlock detector kicks in, issues a message like the above, and rolls back one of the two transactions so that the other can make progress.  Not exactly wonderful, but better than having both processes sitting there forever doing nothing.

Of course, rollbacks are not ideal, so you'd prefer to design your application in a way that avoids creating deadlocks in the first place.  For example, if you're locking tables, you want to always do it in the same order.  If one session does LOCK TABLE A and then LOCK TABLE B, and another session does LOCK TABLE B and then LOCK TABLE A, then you might be unlucky enough to end up with each of them holding one lock and waiting for the other.  Then you'll get something like this:

DETAIL:  Process 21535 waits for AccessExclusiveLock on relation 342640 of database 41454; blocked by process 21506.
Process 21506 waits for AccessExclusiveLock on relation 342637 of database 41454; blocked by process 21535.

If you look carefully at this message, you'll see some differences from the one at the top of the email, most importantly that here each process is waiting for a lock on a relation, which is a PostgreSQL term that refers generically to a table, view, index, sequence, foreign table, composite type, or similar internal object.  And, at least if you know the trick, it's not that hard to translate from the numbers given in the error message to actual table names:

rhaas=# SELECT 342640::regclass, 342637::regclass;
 regclass | regclass
----------+----------
 b        | a
(1 row)

Once you know what "relation" means and how to decode the numbers in the error message, it's fairly easy to see what's going on here.  But the message I started this blog post with is trickier.  There, it's not table locks that are causing the problem, but transaction locks.  Each process is waiting for a lock on some transaction which, apparently, the other process has locked.  What does that even mean?

The short answer is that it means that each process is trying to update a tuple which the other process has already updated.  When a process updates a tuple, it locks it, and that lock is not released until the transaction completes (either by committing or rolling back).  So, if one transaction tries to update tuple X and then tuple Y, while another transaction which has already updated tuple Y tries to update tuple X, you're going to end up with a deadlock.  But you might expect, then, that the error message would complain about each process waiting for a tuple lock held by the other process, whereas the actual message complains about a transaction lock held by the other process.

In fact, any time you see a complaint about a transaction lock involved in a deadlock, it's a good bet that a tuple lock is really to blame.  PostgreSQL's deadlock detector works entirely based on data structures stored in shared memory - but it's impractical to store tuple locks persistently in shared memory, because a single transaction could update many millions of tuples before committing, and we'd run out of memory.   To work around that problem, tuple locks do a clever little dance that minimizes the number of locks that must be tracked in shared memory.

Here's how it works.  Each tuple on disk has space for a 4-byte xmax - the transaction ID (or XID) of the process that has updated or locked it (an update is effectively a lock, and you can also lock the tuple without updating it, using SELECT FOR UPDATE).  To lock the tuple, a transaction simply sets xmax to its own transaction ID.  But, there's a problem: what happens if xmax has already been set to some other transaction ID?  If that transaction ID is no longer running, then there's no problem: the lock is effectively released, and we can just grab it.  But if that transaction is still running, we need to wait for it to commit or roll back.  And we can't just wait by, say, going to sleep, because that might lead to undetected deadlock.  The deadlock detector only knows about locks that are recorded in shared memory, so we can't just go to sleep indefinitely using some other mechanism, or it won't be able to notice when there is a problem.  As annoying as deadlocks can be, they're still a lot better than undetected deadlocks.  So we need to wait by acquiring a lock that the deadlock detector knows about.

What happens is this: every transaction takes an exclusive lock on its own XID that is released only at transaction end.  Other processes that want to wait for a transaction to complete take a share lock on its XID and, once they get it, immediately release it.  That way, when a process is waiting for a transaction to complete, the deadlock detector knows about it, and can do the right thing.

There's actually one further wrinkle: we want to make sure that if a bunch of processes queue up waiting for the same tuple lock, they get the lock more or less in sequence.  To make that work, we add one more wrinkle: the tuple lock is taken temporarily in shared memory until it can be recorded in the xmax field of the tuple itself (the lock manager will normally granted any given lock in FIFO order, though it can rearrange things when necessary to minimize deadlocks).  So the overall process looks something like this:

- take tuple lock in shared memory (waiting if necessary)
- take transaction lock on current xmax value (waiting if necessary)
- release transaction lock (we now know that that transaction has ended)
- record our XID in the tuple's xmax field
- release tuple lock in shared memory (but we still effectively have the lock, because xmax = our XID)

The way that this typically shakes out is that, when there's a conflict over a tuple lock, the process which is first in line for the tuple lock appears to be waiting on a lock on the transaction ID of the process which locked the tuple.  Any other waiters appear to be waiting for the tuple lock itself.  Consequently, the number of locks that must be tracked in shared memory is approximately equal to the number of running transactions plus the number of tuples that are in the process of being locked - which can't be very many, since each session can only be trying to acquire at most one lock at a time. Once a tuple lock is obtained, the information about that lock is contained exclusively within the tuple itself, making it possible for a single transaction to update a basically unlimited number of rows without exhausting the memory space available for the lock table.

Most of the time, of course, this is all invisible and happens under the covers, but it creeps out where you can see it when a deadlock occurs - or when you look at the system view pg_locks.

1 comment:

  1. Nice post. I came across this "Waiting on transaction" once in slightly different circumstances: As I traced it, I noticed that the actual wait came from nbtinsert.c:

    /* Have to wait for the other guy ... */
    _bt_relbuf(rel, buf);
    XactLockTableWait(xwait);

    I had one process that updated two tables with only INSERTs in two different transactions. One would think that doing inserts only could not result in a deadlock...

    One table however had a foreign key to the other table, but I would only insert data to the key table that was not yet there and here was nobody else adding data, so still you would assume that there can not possibly be a deadlock.

    Now what happened IIRC was that the txn1 added rows to the table that referenced the other table and this took share locks on the keys (to make sure they wouldn't go away).

    Then txn2 did inserts into the key table which would then result in locking from the location mentioned above.

    I think that this way both INSERT-only transactions accumulated locks and that finally gave the deadlock. My solution was to change the FK constraint to DEFERABLE so that it only checks at transaction end, grabs all the locks at once and then releases them.

    If that makes sense and you'd like to post about this, I'd be happy to read your analysis, I've tried to reproduce it once, but it actually didn't work out (i.e. it did not block).

    ReplyDelete