Thursday, December 15, 2011

Write Scalability

Time flies when you're benchmarking.  I noticed today that it's been over a month since my last blog post, so it's past time for an update.  One of the great things about the PostgreSQL community is that it is full of smart people.  One of them is my colleague Pavan Deolasee, who came up with a great idea for reducing contention on one of PostgreSQL's most heavily-trafficked locks: ProcArrayLock.  Heikki Linnakangas (another really smart guy, who is also a colleague of mine) did some more work on the patch, and then I cleaned it up further and committed it.

I'll come clean and admit that I didn't think Pavan's approach would work at all.  This is mostly because, when I was working on read scalability over the summer, I found that trying to reduce the time for which a contended lock was held didn't amount to much.  Avoiding the lock contention was the only way to win the game.  In this case, though, it worked: really well.  On one set of tests, with 32 clients, I measured a 38% speedup on pgbench at scale factor 100; on unlogged tables, I measured a 66% speedup on the same test.  Those are very good numbers.

The ultimate goal here is to get linear scalability on write workloads.  How close are we?  Since the test machine I used in this case had 32 cores, we can get a pretty good picture of that by dividing the 32-client throughput by the single-client throughput.  If we get 32 exactly, that's linear scalability.  If we get more, that's better-than-linear scalability.  If we get less, that's less-than-linear scalability.  The lower the value, the worse we scale.

On the tests mentioned above, with the patch included, this ratio comes out to 22.7 for permanent tables and 26.2 for unlogged tables.  Before the patch, the corresponding values were 16.7 and 16.1.  So, we're clearly not all the way there yet.  On the other hand, for the price of a little reorganization of shared memory, we've clearly improved things quite a lot.  (The patch achieves its remarkable speedup just by packing the hot portions of a large data structure into the minimal number of CPU cache lines.)

I think I was lucky, when working on read scalability, to find that there was basically only one bottleneck.  In the area of write scalability, there are three: ProcArrayLock, WALInsertLock, and CLogControlLock.  All of these affect each other.  Anything that reduces the pressure on one lock (and thereby speeds up the system) increases pressure on the other two (and thereby slows down the system).  This has made it much harder to measure the effectiveness of small optimizations.


  1. Great stuff.
    But most projects would be worried when all of a sudden things are so much faster.
    This definitely needs more testing.

  2. Wouldn't WALWriteLock also be another contention point for write scalability in the case of synchronous commit?

  3. @Amit Kapila: Yes, that's very possible. I haven't done much investigation of that as yet; throughput is often limited more by the fact that disks are slow than by the behavior of the lock itself. I think.

  4. Ideally, what you are saying should be right. However, today after your comment, when I again looked into the code, I observed that in XLogFlush(), after taking WALWriteLock, it conditionally acquires WALInsertLock. This can lead to a more severe bottleneck on WALInsertLock.

  5. I think disk latency is the most important factor for scaling writes?

  6. Is there anyone not running an RDBMS off a battery-backed RAID controller with 1GB of memory in write-back mode nowadays?

    Write latency is a problem of the past, I think.