In addition to talking about PostgreSQL at LSF/MM and Collab, I also learned a few things about the Linux kernel that I had not known before, some of which could have implications for PostgreSQL performance. These are issues which I haven't heard discussed before in the PostgreSQL community, and they are somewhat subtle, so I thought it would be worth writing about them.
1. Regardless of how you set vm.zone_reclaim_mode, the kernel page cache always prefers to use pages from the local NUMA node. Many PostgreSQL community members have already determined that vm.zone_reclaim_mode = 1 is bad for PostgreSQL workloads, because the system will essentially always allocate from the local node even if large amounts of memory are available on other nodes. What I learned last week is that even with vm.zone_reclaim_mode = 0, we're not out of the woods: the kernel will overflow to another node when no pages are available locally, but it will also wake up kswapd to reclaim more pages on the local node. This can result in the reclaim rate being much higher on the local node than on other nodes, so that relatively hot pages on the local node are evicted in preference to relatively cold pages on other nodes. This probably won't cause big problems for pgbench-type workloads, where we run lots of short queries concurrently, but it could cause problems when a single process, or several processes on the same node, touch a large number of pages, some of them more than once. Also, it might not be obvious that the problem traces back to suboptimal page eviction decisions on the part of the kernel. I fear that there could be a significant number of users suffering from this issue who aren't able to diagnose it as the cause.
2. If you perform a buffered (that is, non-O_DIRECT) write of a complete block, and that block is not in the page cache, there is some suggestion that Linux will read the block back from disk before overwriting it. Two different kernel developers actually gave us opposite answers about whether Linux is smart enough to optimize away the reads, but testing by Andres Freund seemed to suggest that, whether for this reason or some other, writing data that is not in cache can lead to a large volume of read I/O, equal to the write I/O. Some reads might be expected, since metadata might need to be brought into cache, but if only metadata is being read, the read volume should be small compared to the volume of writes. I'm still on the fence about whether this is a real problem, but if it is, it will hurt people who set large values of shared_buffers in the hopes of making their entire workload stay within PostgreSQL's cache: unless there's still enough memory remaining at the OS level to double-buffer everything that lives in our cache, such users will incur additional (unnecessary) read I/O.
3. Not that it's particularly bad for PostgreSQL, but as a point of possible future interest, I found out that the Linux kernel developers do not (and most likely will not in the future) recommend mixing O_DIRECT and non-O_DIRECT I/O on a single file. Doing this is expected to result in very poor performance, because the use of O_DIRECT causes page cache invalidations. I haven't been able to think of a scenario in which this will actually hurt PostgreSQL users today, because we only use O_DIRECT for WAL, and then only if wal_sync_method = open_sync or open_datasync, and then only if neither archiving nor streaming replication is in use. However, it's certainly worth keeping in mind if we ever consider expanding the use of O_DIRECT.
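For anyone who wants to poke at item 2 on their own system, here is a rough sketch of one way to look for the effect. It is not the test Andres ran; it is just an illustration under some assumptions: that /proc/self/io is available (it needs task I/O accounting enabled in the kernel), that the file lives on a real block-backed filesystem rather than tmpfs, and that posix_fadvise(POSIX_FADV_DONTNEED) actually succeeds in evicting the file from the page cache, which is only advisory. The function names and the 8192-byte block size are my own choices for the example.

```python
import os

BLOCK_SIZE = 8192  # assumption: PostgreSQL's default block size


def parse_proc_io(text):
    """Parse the 'key: value' lines of /proc/<pid>/io into a dict of ints."""
    counters = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            counters[key.strip()] = int(value.strip())
    return counters


def read_io_counters():
    """Return this process's I/O counters (Linux only)."""
    with open("/proc/self/io") as f:
        return parse_proc_io(f.read())


def create_and_evict(path, nblocks):
    """Create a file of nblocks full blocks, flush it to disk, and ask the
    kernel to drop it from the page cache (advisory, may not fully succeed)."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        block = b"a" * BLOCK_SIZE
        for i in range(nblocks):
            os.pwrite(fd, block, i * BLOCK_SIZE)
        os.fsync(fd)  # pages must be clean before they can be dropped
        if hasattr(os, "posix_fadvise"):
            os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
    finally:
        os.close(fd)


def overwrite_full_blocks(path, nblocks):
    """Overwrite every block completely using ordinary buffered writes."""
    fd = os.open(path, os.O_WRONLY)
    try:
        block = b"b" * BLOCK_SIZE
        for i in range(nblocks):
            os.pwrite(fd, block, i * BLOCK_SIZE)
        os.fsync(fd)
    finally:
        os.close(fd)


def measure_overwrite(path, nblocks):
    """Return the read_bytes and write_bytes deltas caused by the overwrite."""
    create_and_evict(path, nblocks)
    before = read_io_counters()
    overwrite_full_blocks(path, nblocks)
    after = read_io_counters()
    return {k: after[k] - before[k] for k in ("read_bytes", "write_bytes")}
```

Calling measure_overwrite() with a path on the filesystem you care about and a block count large enough to matter returns the two deltas. If read_bytes comes out in the same ballpark as write_bytes, that would be consistent with the kernel reading blocks back in before overwriting them; if it is tiny by comparison, only metadata is probably being read. Either way, treat a single run as a hint rather than proof, since other activity and filesystem choice can affect the numbers.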
Again, thanks to all of the kernel folks who spent time discussing these and other issues with us last week.