Monday, July 25, 2011

More On Read Scaling

In my previous blog post on read scaling out to 32 cores, I wrote about a patch I had recently committed to improve read scalability in PostgreSQL.  It turns out, however, that there's a downside to that patch, which was uncovered in testing by Stefan Kaltenbrunner.  (Fortunately, I have a fix.)  And there's an opportunity for further performance improvement by applying a similar technique to an additional type of lock.  For full details, read on.

Stefan's discovery was that, on a system with many cores, with the fast relation lock patch I discussed in my previous blog post applied, it's possible to observe performance dropping off quite rapidly as the number of clients exceeds the number of cores.  For example, in one test on a 32-core machine, increasing the number of clients from 32 to 80 caused SELECT-only throughput to drop from 178,000 tps to 132,000 tps.  Performance has always dropped off at high client counts, but the fast relation lock patch made the drop-off sharper.

After a fair amount of testing and experimentation, I figured out that the problem was with something called the shared invalidation queue.  For performance reasons, PostgreSQL caches a variety of data in each individual backend, to avoid the overhead of repeatedly reading it from the system tables.  Most importantly, it caches table definitions.  But this requires some mechanism for knowing when these caches are no longer valid, and that mechanism is the shared invalidation queue.  Each backend must check for new shared invalidation messages at various times, and in particular whenever it locks a table.  In a workload with many short transactions, this means that backends are checking for shared invalidation messages almost constantly.  These checks aren't terribly expensive on machines with only a few CPUs, but the price increases as the number of CPUs goes up and the relevant spinlocks become heavily contended.  Adjusting the code to avoid taking locks in the common case where no new invalidation messages are present both improves peak performance and, at least in the cases I've tested so far, eliminates the drop-off at high client counts.
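To make the idea concrete, here is a minimal sketch of the "check without locking first" pattern, written against plain pthreads and C11 atomics rather than PostgreSQL's actual shared memory and spinlock machinery; the names (SharedInvalState, maybe_process_invalidations, and so on) are invented for illustration.  Each backend remembers the last counter value it saw, and only takes the lock when an unlocked read of the shared counter shows that something new has actually been queued.

    /* Minimal sketch of the "check without locking first" idea; this is
     * NOT the actual PostgreSQL sinval code, just an illustration. */
    #include <pthread.h>
    #include <stdatomic.h>

    typedef struct SharedInvalState
    {
        atomic_long     msg_counter;    /* bumped whenever a message is queued */
        pthread_mutex_t queue_lock;     /* protects the message queue itself */
        /* ... the message queue itself would live here ... */
    } SharedInvalState;

    static SharedInvalState shared = { .queue_lock = PTHREAD_MUTEX_INITIALIZER };
    static _Thread_local long last_seen = 0;   /* per-backend private state */

    /* Sender: queue an invalidation message and advertise it. */
    static void
    queue_invalidation_message(void)
    {
        pthread_mutex_lock(&shared.queue_lock);
        /* ... append the message to the queue ... */
        atomic_fetch_add(&shared.msg_counter, 1);
        pthread_mutex_unlock(&shared.queue_lock);
    }

    /* Receiver: called very frequently, e.g. whenever a table is locked. */
    static void
    maybe_process_invalidations(void)
    {
        /* Fast path: no lock taken when nothing new has been queued. */
        long current = atomic_load(&shared.msg_counter);

        if (current == last_seen)
            return;

        /* Slow path: something changed, so take the lock and catch up. */
        pthread_mutex_lock(&shared.queue_lock);
        /* ... read and apply any pending invalidation messages ... */
        last_seen = atomic_load(&shared.msg_counter);
        pthread_mutex_unlock(&shared.queue_lock);
    }

In this shape, the common case (nothing new) costs a single atomic read, which scales much better on many CPUs than repeatedly acquiring a heavily contended lock just to discover that there is nothing to do.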

I've got one more patch in the works that improves performance still further.  While I was working on the "fast relation locks" patch, I noticed that there was quite a bit of other lock manager traffic, and I wondered what it all was.  As it turns out, each transaction was taking an exclusive lock on its own "virtual transaction ID".  PostgreSQL only assigns a transaction ID to transactions which write data, but every transaction has a virtual transaction ID.  The owner locks this virtual transaction ID in exclusive mode, and other backends can take a share lock on the virtual transaction ID if they need to wait for that transaction to complete.  But why would you ever need to do that?  It's easy to understand that you might need to wait if, for example, you wanted to update a tuple that some other transaction was already in the midst of updating.  But that case is handled using the regular transaction ID, which only writing transactions have, so you might wonder why the virtual transaction ID lock is needed at all.
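To see the lock-and-wait protocol in miniature, here is a hypothetical sketch that uses a POSIX read-write lock to stand in for the heavyweight lock manager; the names (VirtualXact, vxact_begin, and so on) are invented for the example.  The owner holds the lock in exclusive mode for the life of the transaction, and anyone who needs to wait for the transaction to complete simply acquires and releases it in share mode.

    /* Hypothetical illustration of the virtual-transaction-ID lock protocol,
     * using a pthread rwlock instead of PostgreSQL's lock manager. */
    #include <pthread.h>

    typedef struct VirtualXact
    {
        pthread_rwlock_t lock;      /* stands in for the heavyweight VXID lock */
    } VirtualXact;

    /* One slot per backend; initialized once, reused across transactions. */
    static VirtualXact my_vxact = { .lock = PTHREAD_RWLOCK_INITIALIZER };

    /* Owner: take the lock exclusively at transaction start ... */
    static void
    vxact_begin(VirtualXact *vxact)
    {
        pthread_rwlock_wrlock(&vxact->lock);
    }

    /* ... and release it at commit or abort. */
    static void
    vxact_end(VirtualXact *vxact)
    {
        pthread_rwlock_unlock(&vxact->lock);
    }

    /* Waiter: a share-mode acquisition blocks until the owner has finished. */
    static void
    vxact_wait_for(VirtualXact *vxact)
    {
        pthread_rwlock_rdlock(&vxact->lock);
        pthread_rwlock_unlock(&vxact->lock);
    }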

As it turns out, it's used in just two places.  First, CREATE INDEX CONCURRENTLY uses it.  At two different points during the concurrent index build, it waits for all currently running transactions (including read-only ones) to complete.  Second, Hot Standby uses it to kill off transactions that are preventing recovery from progressing (for example, by holding open a snapshot that can see a tuple which recovery wants to remove).  However, the vast majority of transactions will lock and unlock their virtual transaction ID without anyone ever waiting on it.  So, I've written a patch which causes the lock a transaction takes on its own virtual transaction ID to spring into existence only when someone wants to wait for that transaction to complete.  The rest of the time, the lock never needs to be created (or cleaned up).  The benefit of this patch isn't nearly as large as that of the main "fast relation locks" patch, but it's still a nice shot in the arm on a 32-core system - about an 11% improvement on pgbench -n -S -c 32 -j 32, on one test.
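Here's a rough, hypothetical sketch of that "spring into existence on demand" idea, again using pthreads rather than the real lock manager; VXactSlot, waiter_present, and the function names are all invented.  The point it illustrates is just that the wait machinery (the condition variable) is only touched if a waiter actually shows up, so the common case of a transaction that nobody waits for stays cheap.

    /* Hypothetical sketch: the wait machinery is only exercised when a
     * waiter appears; this is not the actual PostgreSQL implementation. */
    #include <pthread.h>
    #include <stdbool.h>

    typedef struct VXactSlot
    {
        pthread_mutex_t mutex;
        pthread_cond_t  done;
        bool            running;        /* advertised by the owner */
        bool            waiter_present; /* set only if someone decides to wait */
    } VXactSlot;

    static VXactSlot my_slot = {
        .mutex = PTHREAD_MUTEX_INITIALIZER,
        .done  = PTHREAD_COND_INITIALIZER,
    };

    /* Owner at transaction start: just advertise that we're running. */
    static void
    vxact_slot_begin(VXactSlot *slot)
    {
        pthread_mutex_lock(&slot->mutex);
        slot->running = true;
        slot->waiter_present = false;
        pthread_mutex_unlock(&slot->mutex);
    }

    /* Owner at commit or abort: only signal if a waiter actually appeared. */
    static void
    vxact_slot_end(VXactSlot *slot)
    {
        pthread_mutex_lock(&slot->mutex);
        slot->running = false;
        if (slot->waiter_present)
            pthread_cond_broadcast(&slot->done);
        pthread_mutex_unlock(&slot->mutex);
    }

    /* A CREATE INDEX CONCURRENTLY / Hot Standby style waiter. */
    static void
    vxact_slot_wait(VXactSlot *slot)
    {
        pthread_mutex_lock(&slot->mutex);
        while (slot->running)
        {
            slot->waiter_present = true;    /* materialize the wait on demand */
            pthread_cond_wait(&slot->done, &slot->mutex);
        }
        pthread_mutex_unlock(&slot->mutex);
    }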

I'm going to be continuing to work heavily on performance over the next several months.  While there's probably more work that could be done on read scalability, I feel that with the patch I mentioned in my last blog post and the two discussed here, the problem is well enough solved that it makes sense for me to turn my attention to other workloads.  I'll blog about that work as it progresses.

4 comments:

  1. Hey Robert,

    I've had almost-daily dbt5 (TPC-E derivative) tests against the 9.1 and 9.2 branches limping along for almost a month now:

    http://207.173.203.223/~markwkm/community6/dbt5/dailyv01/dbt5.png

    My understanding is that we could get twice the transaction rate, at the current scale factor, if the efficiency of the lock manager is improved.

    If there's any interest for you here, I'll do whatever I can to get you the data you need.

    Mark

  2. It's hard to accept that we have to wait for PG 9.2 for these improvements.

  3. I think new patches also need to be reviewed with performance in mind. Like you mentioned, some locks are used by new features, and that makes it harder to change them.

  4. Actually, this is the perfect time to be making changes to this area. Some of the subtle bits Robert has identified here, like the potential issues with CREATE INDEX CONCURRENTLY and Hot Standby, make tweaking this code a scary thing to do. I feel comfortable about this happening now, with >75% of the release cycle still left for unanticipated changes to fall out of testing. This is not code I'd want to see show up at the end of a release cycle and ship without a long testing period first.
