Heikki Linnakangas was doing some benchmarking last week and discovered something surprising: in some circumstances, unlogged tables were actually slower than permanent tables. Upon examination, he discovered that the problem was caused by CLOG contention, due to hint bits not being set soon enough. This leads to a few questions:
1. What is CLOG?
2. What are hint bits?
3. How does setting hint bits prevent CLOG contention?
4. Why weren't hint bits being set sooner?
Let's take those in order.
PostgreSQL uses a data structure called CLOG to keep track of which transaction have committed and which transactions have aborted. For example, if a transaction inserts a large number of tuples and then aborts, we don't immediately get rid of those tuples; VACUUM will do it the next time it processes the table, which might not happen for some time, if there is little other write activity on that table. In the meantime, anyone who looks at those tuples needs to realize that they were inserted by an aborted transaction and are therefore not visible. The canonical source for this information is the CLOG, which stores two bits for each transaction in what is effectively a giant array, indicating whether the transaction is committed, aborted, or apparently still in progress. (The fourth state is called "sub-committed", and is used transiently when committing transactions which contain subtransactions.)
Unfortunately, access to CLOG is relatively expensive. A limited number of CLOG pages can be buffered in memory, but in the worst case accessing CLOG requires a read from disk. Furthermore, because every process in the system potentially needs to know the commit status of transactions which may well all have their status on the same page, lock contention can become a serious problem. To mitigate this problem, once a transaction is known to be committed or aborted, we start setting "hint bits" on the tuples touched by that transaction. For example, in the above-mentioned case of a transaction that inserts some tuples and aborts, once the transaction has aborted, the next processes that access those tuples will look up the transaction ID, see that it has aborted, and set hint bits on the tuples indicating that they were inserted by an aborted transaction. The next process to come along and examine those same tuples will know immediately that they aren't visible, without needing to consult CLOG. Similarly, if the transaction had committed, subsequent processes which examined those same tuples could set hint bits indicating that the insertion transaction committed, which would again save future visitors to the page the trouble of consulting CLOG.
However, hint bits require some special handling when used in combination with another interesting PostgreSQL feature, asynchronous commit. For some applications, such as anything involving money, it's absolutely necessary that when a transaction commits, the commit is durably on disk and can't be lost even if a crash happens immediately afterward. But for other applications, such as monitoring data, it's often OK to lose a few seconds worth of data in the event of a system crash; after all, had the crash happened slightly sooner, you would have lost that data anyway. In such cases, it's often possible to get a big performance boost by setting synchronous_commit=off. Instead of waiting for the commit record to be physically written to disk, we kick off the write in the background (making sure that it will complete within a few hundred milliseconds or so). If the system crashes during that time, the transaction will be lost, but otherwise it will be durably committed.
This design, although very clever and extremely useful in many real-world situations, is problematic for hint bits. Suppose we start setting hint bits for a transaction, saying that it was committed, and then the system crashes. When the system comes back up, the transaction is now considered aborted - but we've got hint bits out there claiming it's committed, so some parts of the transaction may appear committed while other parts appear aborted, or worse, some parts of the system may think the transaction is committed while other parts think it's aborted. That's unacceptable. So in the case of transactions that commit asynchronously, we have to wait until the commit record is durably on disk before we can start setting hint bits. That way, we know that we won't need to reverse course after the fact. In the meantime, that is to say, in the perhaps three-quarters of a second that we're waiting for the commit record to make it to disk, we'll have to forebear setting hint bits; each process that accesses those tuples will need to do its own CLOG lookup to check the transaction status.
That shouldn't be too expensive, right? Three quarters of a second is pretty short.
Well... as it turns out, not always. If those tuples are being updated sufficiently frequently by enough simultaneous processes, it gets pretty expensive. I found that in pgbench tests with scale factor 15, using unlogged tables, I could double performance with at 32 concurrent clients by setting the hint bits immediately rather than waiting for the commit record to hit the disk. And, fortunately, for unlogged and temporary tables, that is safe, since the entire contents of the table (including the bogus hint bits) will be lost anyway on a crash, so my patch to do that is now committed, and (barring yet-to-be-discovered bugs) will appear in PostgreSQL 9.2. If you're hitting this issue on PostgreSQL 9.1, a workaround is to reduce the value of wal_writer_delay, since that will reduce the amount of time it takes for the commit record to make it out to disk.
This problem is particularly acute for unlogged tables because, beginning in PostgreSQL 9.1, any transactions that touch only temporary or unlogged tables always use asynchronous commit, so you can hit this case there even if you have configured synchronous_commit=on. However, if you've configured synchronous_commit=off, you can also encounter this problem with permanent tables. In many cases, the performance benefit you get from not waiting for the commit record to be flushed all the way out to disk will vastly outweigh the downsides of waiting slightly longer to set hint bits - but it would be nice to get the best of both worlds.