As some of you probably already know from following the traffic on pgsql-hackers, I've been continuing to beat away at the scalability issues around PostgreSQL. Interestingly, the last two problems I found turned out, somewhat unexpectedly, not to be internal bottlenecks in PostgreSQL. Instead, they were bottlenecks with other software with which PostgreSQL was interacting during the test runs.
I ran some SELECT-only pgbench tests on a 64-core server and noticed something odd: pgbench was consuming very large amounts of system time. The problem turned out to be that pgbench calls random(). Since random() takes no arguments, it has to keep its state in global memory, and is therefore not inherently thread-safe. glibc handles this by wrapping a mutex around it - on Linux, a futex - so that calls to random() are serialized across all threads. This works, but it slows things down significantly even on a 32-core machine, and on a 64-core machine the slowdown is worse still. I doubt this effect was visible on earlier releases of PostgreSQL, because the bottlenecks on the server side limited throughput far more severely than anything pgbench was doing. But the effect is now visible in a SELECT-only pgbench test, provided you have enough CPUs.
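To make the shape of the problem concrete, here's a rough sketch - simplified, and not glibc's actual code - of what a global-state RNG guarded by a mutex looks like. Every thread has to queue up on the same lock just to advance the generator, so adding cores adds contention rather than throughput:

```c
/*
 * Simplified sketch (not glibc's implementation) of why a global-state RNG
 * forces serialization: every caller must take the same lock before touching
 * the shared state, so N threads make little more progress than one.
 */
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t rng_lock = PTHREAD_MUTEX_INITIALIZER;
static unsigned long rng_state = 1;		/* shared state, as with random() */

static long
locked_random(void)
{
	long		result;

	pthread_mutex_lock(&rng_lock);		/* a futex on Linux when contended */
	rng_state = rng_state * 6364136223846793005UL + 1442695040888963407UL;
	result = (long) (rng_state >> 33);
	pthread_mutex_unlock(&rng_lock);
	return result;
}

static void *
worker(void *arg)
{
	long		sum = 0;

	(void) arg;
	for (int i = 0; i < 10 * 1000 * 1000; i++)
		sum += locked_random();		/* every iteration fights for rng_lock */
	return (void *) sum;
}

int
main(void)
{
	pthread_t	threads[8];

	for (int i = 0; i < 8; i++)
		pthread_create(&threads[i], NULL, worker, NULL);
	for (int i = 0; i < 8; i++)
		pthread_join(threads[i], NULL);
	printf("done\n");
	return 0;
}
```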
Tom Lane and I had a somewhat protracted discussion about what to do about this, and eventually decided to forget about using any OS-supplied random number generator and instead use our own implementation of erand48(), now renamed pg_erand48(). It takes the random state as an argument and is therefore both thread-safe and lock-free.
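For anyone who hasn't used the erand48() family, the pattern looks roughly like this - a sketch of how a pgbench-style thread might use it, not the actual pgbench code. Because each thread owns its state and passes it explicitly, there's nothing shared and nothing to lock:

```c
/*
 * Sketch of the thread-safe pattern: each thread owns its RNG state and
 * passes it explicitly, so no lock is needed. POSIX erand48() works this
 * way; pg_erand48() uses the same interface.
 */
#define _XOPEN_SOURCE 700
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct
{
	unsigned short xseed[3];	/* per-thread random state */
	long		hits;
} thread_state;

static void *
worker(void *arg)
{
	thread_state *ts = (thread_state *) arg;

	for (int i = 0; i < 1000000; i++)
	{
		/* erand48 touches only ts->xseed: no shared state, no lock */
		double		r = erand48(ts->xseed);

		if (r < 0.5)
			ts->hits++;
	}
	return NULL;
}

int
main(void)
{
	pthread_t	threads[4];
	thread_state states[4];

	for (int i = 0; i < 4; i++)
	{
		states[i].xseed[0] = (unsigned short) i;
		states[i].xseed[1] = 0x330e;
		states[i].xseed[2] = (unsigned short) (i * 7919);
		states[i].hits = 0;
		pthread_create(&threads[i], NULL, worker, &states[i]);
	}
	for (int i = 0; i < 4; i++)
		pthread_join(threads[i], NULL);
	for (int i = 0; i < 4; i++)
		printf("thread %d: %ld hits\n", i, states[i].hits);
	return 0;
}
```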
With that problem out of the way, the bottleneck shifted to the server side. pgbench was humming along without a problem, but PostgreSQL itself was using enormous amounts of CPU time. It took a while to track this down, but the bottleneck turned out to be the Linux kernel's implementation of lseek. Linux protects each inode with a mutex, and PostgreSQL uses lseek - which takes that mutex - to find the length of the file for query planning purposes. With enough clients, the mutex can become badly contended. This effect is, I believe, measurable even on a 32-core box, but it's not severe there. On the 64-core server I tested, however, it led to a complete collapse in performance beyond 40 clients. When I ran pgbench with the "-M prepared" option, which avoids replanning the query and therefore doesn't repeatedly invoke lseek, the performance collapse vanished. There was still some degradation due to other contention problems, but nothing nearly as bad.
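For context, the call at issue is nothing exotic. It's roughly the pattern below - a simplified sketch of what the storage layer does when the planner wants a table's size in blocks, not the actual PostgreSQL code - and the trouble is simply that every backend planning a query against the same table ends up making this call against the same inode:

```c
/*
 * Simplified sketch of finding a file's length via lseek(SEEK_END), roughly
 * the pattern PostgreSQL uses when the planner needs a relation's size in
 * blocks. Each such call takes the kernel's per-inode mutex, so many
 * backends planning against the same table all contend on one lock.
 * file_nblocks() is an illustrative helper, not a PostgreSQL function.
 */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

#define BLCKSZ 8192				/* PostgreSQL's default block size */

static long
file_nblocks(int fd)
{
	off_t		len = lseek(fd, 0, SEEK_END);	/* serializes on the inode mutex */

	if (len < 0)
		return -1;
	return (long) (len / BLCKSZ);
}

int
main(void)
{
	int			fd = open("/tmp/somefile", O_RDONLY);	/* stand-in for a relation segment */
	long		nblocks;

	if (fd < 0)
	{
		perror("open");
		return 1;
	}
	nblocks = file_nblocks(fd);
	printf("file is %ld blocks long\n", nblocks);
	close(fd);
	return 0;
}
```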
As it turns out, I'm not the first person to run into this problem: the MOSBENCH guys at MIT, hacking on their modified version of PostgreSQL, ran into it last fall. They, too, described the performance as "collapsing" due to lseek contention. I thought they were exaggerating, but if anything they were understating the extent of the problem. On this 64-core server, going from 40 clients to 56 clients led to more than a sevenfold drop in aggregate throughput. This problem was largely masked in earlier releases of PostgreSQL by the bottlenecks in our lock manager; but, as I blogged about before, those problems are now fixed. So, in PostgreSQL 9.2devel, it's pretty easy to hit this problem. You just need enough CPUs.
It's not yet clear to me exactly how we're going to solve or work around this problem. It would be nice to see it fixed in the Linux kernel, because surely this is an issue that could also affect other applications. On the other hand, it would also be nice to see it fixed in PostgreSQL, because it doesn't seem inconceivable that it could affect other kernels. Fixing it in PostgreSQL would presumably mean interposing some type of cache, designed in such a way as to avoid having the cache - or any single cache entry - protected by a single spinlock that 40+ CPUs can go nuts fighting over.
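Purely to illustrate the kind of structure I mean - this is a sketch under my own assumptions, not a patch, and none of these names exist in PostgreSQL - one direction would be to make the read path of such a cache lock-free entirely, assuming a slightly stale size is tolerable for planning:

```c
/*
 * Hypothetical sketch only: a per-relation size cache whose read path is a
 * single atomic load, so there is no lock for 40+ CPUs to fight over. The
 * assumption here (mine, not from the post) is that a slightly stale size
 * is acceptable for planning, with whoever extends or truncates the file
 * refreshing the cached value.
 */
#include <stdatomic.h>
#include <stdio.h>

typedef struct
{
	_Atomic long nblocks;		/* cached file length in blocks; -1 = unknown */
} rel_size_cache;

/* Read path: one atomic load, no lock acquisition, no shared lock word
 * bouncing between CPUs. */
static long
cached_nblocks(rel_size_cache *c)
{
	return atomic_load_explicit(&c->nblocks, memory_order_acquire);
}

/* Write path: only the backend that actually changes the file's length
 * updates the cache; readers never write, so they never contend. */
static void
set_cached_nblocks(rel_size_cache *c, long new_nblocks)
{
	atomic_store_explicit(&c->nblocks, new_nblocks, memory_order_release);
}

int
main(void)
{
	rel_size_cache cache = {.nblocks = -1};

	set_cached_nblocks(&cache, 128);
	printf("cached size: %ld blocks\n", cached_nblocks(&cache));
	return 0;
}
```

How such a cache gets invalidated when the file actually grows or shrinks is, of course, the hard part, and that's exactly the design question that remains open.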