Thursday, March 13, 2014

write scalability for UPDATE operations

Yesterday, Heikki Linnakangas committed this patch:

commit a3115f0d9ec1ac93b82156535dc00b10172a4fe7
Author: Heikki Linnakangas
Date:   Wed Mar 12 22:46:04 2014 +0200

    Only WAL-log the modified portion in an UPDATE, if possible.
    When a row is updated, and the new tuple version is put on the same page as
    the old one, only WAL-log the part of the new tuple that's not identical to
    the old. This saves significantly on the amount of WAL that needs to be
    written, in the common case that most fields are not modified.
    Amit Kapila, with a lot of back and forth with me, Robert Haas, and others.

This patch is the result of a lot of work, and a lot of testing, principally by Amit Kapila, but as the commit message says, also by Heikki, myself, and others.  So, how much does it help?

I have access a PostgreSQL community server, an IBM POWER box with 16 cores, 64 hardware threads, and about 64GB of memory.  The disk subsystem is not very good.  I've run pgbench tests on this system before: both write tests, and also read tests.  I reran the tests using the same configuration I've run in the past: Fedora 16, Linux 3.2.6-3.fc16.ppc64, shared_buffers = 8GB, maintenance_work_mem = 4GB, synchronous_commit = off, checkpoint_segments = 300, checkpoint_timeout = 15min, checkpoint_completion_target = 0.9, log_line_prefix = '%t [%p] '.  I ran each test three times, and the results below are the median of the three runs.  All tests here were done with scale factor 300.  Here's what I got:

1 client, pre-patch: tps = 1358.396212 (including connections establishing)
1 client, post-patch: tps = 1369.884076 (including connections establishing)
8 clients, pre-patch: tps = 9156.554118 (including connections establishing)
8 clients, post-patch: tps = 9410.429057 (including connections establishing)
32 clients, pre-patch: tps = 14315.968649 (including connections establishing)
32 clients, post-patch: tps = 15413.517388 (including connections establishing)

That's an 0.8% speedup at 1 client, a 2.7% speedup at 8 clients, and a 7.7% speedup at 32 clients - not bad.  It's interesting that the speedup is apparently positive even at one client, but seems to increase as the number of clients goes up.  I suspect that's indicative that we've still got lock contention around write-ahead logging, such that you come out a little ahead even at low client counts just because you do less I/O, but as the client count increases the reduced pressure on the relevant locks becomes the dominant effect.  However, I don't have conclusive evidence for this theory at this point.  Regardless, it's a nice performance improvement which should benefit many real-world use cases.

No comments:

Post a Comment