Have you ever had one of those annoying problems where a query, or some kind of maintenance task such as vacuum, seems to hang, without making any discernible forward progress, basically forever? If it's a foreground task, you typically think to yourself "oh, that's taking longer than normal" - but then you start to become suspicious as to whether anything's happening at all. You fire up top, or strace, or gdb, or iostat, or some combination of those tools, and eventually decide that, indeed, nothing is happening.
It's probably the case that whatever you're trying to do is blocked waiting for some kind of lock. PostgreSQL supports several different kinds of locks. The one you are most likely to have a problem with is what's called a "heavyweight" lock. For example:
rhaas=# begin;
BEGIN
rhaas=# select sum(1) from s;
   sum
---------
 1000000
(1 row)
Then, in another window:
rhaas=# truncate s;
The truncate will hang until the transaction that selected from that table commits. This is because the SELECT statement has obtained a lock on table s. It's a very lightweight sort of lock - it doesn't block any other process from reading data from s, updating s, creating indexes on s, vacuuming s, or most other things that they might want to do. But it will block operations like TRUNCATE or DROP TABLE that would disrupt read-only queries against s. If you want to know more about which operations acquire which types of locks, and how they might conflict, see the documentation on explicit locking.
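Concretely, while the example above is stuck, a third psql session can see both sides of the conflict in the pg_locks view - the SELECT's granted AccessShareLock and the TRUNCATE's ungranted AccessExclusiveLock. The query below is just an illustration against the example table s; your output will differ:

-- Run from a third session; 's' is the example table from above.
SELECT locktype, relation::regclass AS relation, mode, granted, pid
FROM pg_locks
WHERE relation = 's'::regclass;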
Fortunately, this kind of problem is relatively easy to diagnose. Fire up psql, and check whether there are any rows in pg_locks where granted = false. If so, those processes are waiting on locks. You need to find out who has the conflicting lock and take appropriate steps to get that lock released (committing or rolling back an in-progress transaction is often the answer; or sometimes you can use the nuclear option of just terminating the server process that holds the locks). For most lock types, you can get this information from pg_locks, too: look for locks on the same object that have granted = true. For tuple locks, it's a bit harder: ungranted tuple locks show up in pg_locks, but they disappear once granted. (PostgreSQL would run out of lock table space on even a medium-sized SELECT FOR UPDATE query if we didn't do this.) So, in that case, you have to look at who is accessing the table and speculate about which one of those people might be holding the conflicting tuple lock. You can look at the current query information in pg_stat_activity, or you can look at who holds a lock on the target relation other than AccessShareLock.
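Here's a rough sketch of that sort of diagnostic query - a self-join on pg_locks that pairs each waiting process with the sessions holding a granted lock on the same relation. The exact columns available vary somewhat between PostgreSQL versions, and this only covers relation locks, so treat it as a starting point rather than a complete answer:

-- Who is waiting, and who holds a granted lock on the same relation?
-- Does not catch conflicts on other lock types (transactionid, tuple, etc.).
SELECT w.pid                AS waiting_pid,
       w.mode               AS wanted_mode,
       h.pid                AS holding_pid,
       h.mode               AS held_mode,
       w.relation::regclass AS relation
FROM pg_locks w
JOIN pg_locks h ON h.relation = w.relation
               AND h.database IS NOT DISTINCT FROM w.database
               AND h.granted
               AND h.pid <> w.pid
WHERE NOT w.granted
  AND w.relation IS NOT NULL;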
If heavyweight locks are not the culprit (as they usually are), then things get more difficult. In addition to regular or heavyweight locks, PostgreSQL has several other locking mechanisms, principally lightweight locks and spinlocks. Unlike heavyweight locks, these types of locks do not show up in pg_locks, and they also do not participate in deadlock detection. While heavyweight locks have 8 different modes (see the documentation linked above), lightweight locks have only two modes (shared lock and exclusive lock) and spinlocks have only one (you either locked it, or you didn't). These simplifications make acquiring lightweight locks and spinlocks much faster than acquiring a heavyweight lock, and since we're careful about how these more restricted types of locks are used, they typically don't cause many problems. Spinlocks, in particular, are typically only held for a handful of machine instructions, so a PostgreSQL process stuck on a spinlock would almost certainly indicate an outright bug. But because the code that gets executed while holding a spinlock is typically extremely simple, such bugs rarely occur in practice.
Lightweight locks can be a little more problematic. If you see a process that's definitely stopped (it's not consuming any CPU or I/O resources) and not waiting for a heavyweight lock, chances are good it's waiting for a lightweight lock. Unfortunately, PostgreSQL does not have great tools for diagnosing such problems; we probably need to work a little harder in this area. The most typical problem with lightweight locks, however, is not that a process acquires one and sits on it, blocking everyone else, but that the lock is heavily contended and there are a series of lockers who slow each other down. Forward progress continues, but throughput is reduced. Troubleshooting these situations is also difficult. One common case that is often easily fixed is contention on WALWriteLock, which can occur if the value of wal_buffers is too small, and can be remedied just by boosting the value.
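For instance, a line along these lines in postgresql.conf can help; the value here is purely illustrative, the right number depends on your workload, and newer releases can size this automatically with the default of -1:

# postgresql.conf -- illustrative value only; a server restart is required for it to take effect
wal_buffers = 16MB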
Many problems in this area can also be fixed by inserting a connection pool such as pgpool-II or pgbouncer in front of PostgreSQL. Lightweight lock contention is often caused by too much overall concurrency. Rather than allowing 100 concurrent requests, configure your connection pooler to allow just 20, and make the rest wait until one of those 20 has finished. This doesn't necessarily degrade the user experience - if the requests only run for a short period of time - because the queries may run much more quickly once lock contention has been reduced to an acceptable level.
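As a minimal sketch of what that might look like with pgbouncer - the database name, paths, and sizes below are placeholders, not recommendations:

; pgbouncer.ini -- placeholder values
[databases]
mydb = host=127.0.0.1 port=5432 dbname=mydb

[pgbouncer]
listen_addr = 127.0.0.1
listen_port = 6432
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
pool_mode = transaction
default_pool_size = 20    ; at most 20 server connections per database/user pair
max_client_conn = 100     ; additional clients wait in the pooler instead of hitting the server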
Future versions of PostgreSQL will hopefully continue to make improvements to scalability and concurrency, just as past releases have done. There have not been major changes to this machinery in the last few releases because previous changes mitigated most of the really serious problems to a degree that made them non-issues. However, as hardware capabilities continue to increase, some of those fixes are going to come unstuck, and some new problems will emerge. I expect to see this topic revisited over the next few years, as people continue to deploy PostgreSQL on increasingly powerful systems.
"This is because the SELECT statement has obtained a lock on table s. It's a very lightweight sort of of lock..."
I think that's a confusing choice of words. Although the lock strength is fairly weak (few conflicts), the lock level is called "heavyweight".
To clarify for other readers, I am trying to distinguish between the *level* of the lock (e.g. heavyweight lock, like those visible in pg_locks; versus LWLocks, etc.) and the strength of the lock (shared, exclusive, etc.).
I have the following scenario. One of the transactions seems to have gotten stuck (it shows as idle) for about 5 days now. This has blocked vacuum from running, and the vacuum job is waiting on a ShareUpdateExclusiveLock.
How do I go about recovering from the problem?
@Anonymous: I think you need to kill the idle transaction using pg_terminate_backend(). You might want to post to one of the mailing lists (pgsql-general or pgsql-admin, perhaps) for more help.
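On reasonably recent releases, something along these lines is one way to find and then terminate such a session; column names differ on older versions, which show '<IDLE> in transaction' in a current_query column rather than a separate state column, and 12345 below is just a placeholder pid:

-- Find sessions sitting idle inside an open transaction.
SELECT pid, usename, xact_start, state, query
FROM pg_stat_activity
WHERE state = 'idle in transaction'
ORDER BY xact_start;

-- Then, as a superuser, terminate the offending backend (placeholder pid).
SELECT pg_terminate_backend(12345);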
Improving performance is great, but what I'd like to see more of is visibility into what's happening. A way to expose more details on the lightweight locks such as WALWriteLock would enable better debugging and tuning. As someone who has switched from using Oracle to PostgreSQL, I still feel partially blind as to what is happening in my databases :)
I wonder whether 'delete from' would work where 'truncate' hangs when a select statement produces an exclusive lock on the table?
Yes, SELECT, INSERT, UPDATE, and DELETE all take mutually non-conflicting locks. But other operations like VACUUM and TRUNCATE take stronger locks.
One interesting case is when you issue an ALTER TABLE statement when an open transaction is holding the target table in say ACCESS SHARE mode. The ALTER TABLE statement has to wait until the ACCESS SHARE lock is released, so the ACCESS EXCLUSIVE lock is not granted yet, but subsequent accesses to the target table are blocked as if the ACCESS EXCLUSIVE lock were granted. How do we explain this behaviour?