Wednesday, April 02, 2014

Subtly Bad Things Linux May Be Doing To PostgreSQL

In addition to talking about PostgreSQL at LSF/MM and Collab, I also learned a few things about the Linux kernel that I had not known before, some of which could have implications for PostgreSQL performance.  These are issues which I haven't heard discussed before in the PostgreSQL community, and they are somewhat subtle, so I thought it would be worth writing about them.

1. Regardless of how you set vm.zone_reclaim_mode, the kernel page cache always prefers to use pages from the local NUMA node.  Many PostgreSQL community members have already determined that vm.zone_reclaim_mode = 1 is bad for PostgreSQL workloads, because the system will essentially always allocate from the local node even if large amounts of memory are available on other nodes.  What I learned last week is that even with vm.zone_reclaim_mode = 0, we're not out of the woods: the kernel will overflow to another node when no pages are available, but it will also wake up kswapd to reclaim more pages on the local node.  This can result in the reclaim rate being much higher on the local node than on other nodes, so that relatively hot pages on the local node are evicted in preference to relatively cold pages on other nodes.  This probably won't cause big problems for pgbench-type workloads, where we run lots of short queries concurrently, but it could cause problems when a single process, or several processes on the same node, touch a large number of pages, some of them more than once.  Also, it might not be obvious that the problem traces back to suboptimal page eviction decisions on the part of the kernel.  I fear that there could be a significant number of users suffering from this issue who aren't able to diagnose this as the cause.
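One place to start looking for this kind of imbalance is the per-node allocation counters Linux exposes in /sys/devices/system/node/node*/numastat. Here's an illustrative sketch of my own (the helper names are mine, and nothing here is a diagnostic the kernel folks endorsed) that parses those counters and computes how much of each node's allocation traffic stayed local:

```python
import glob

def parse_numastat(text):
    """Turn the 'name value' lines of a numastat file into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        name, value = line.split()
        stats[name] = int(value)
    return stats

def local_hit_ratio(stats):
    """Fraction of this node's allocations that were satisfied locally."""
    total = stats["local_node"] + stats["other_node"]
    return stats["local_node"] / total if total else 1.0

def snapshot():
    """Read numastat for every NUMA node on a live Linux system."""
    nodes = {}
    for path in sorted(glob.glob("/sys/devices/system/node/node*/numastat")):
        node = path.split("/")[-2]  # e.g. "node0"
        with open(path) as f:
            nodes[node] = parse_numastat(f.read())
    return nodes
```

Comparing two snapshots taken a few minutes apart would show whether one node's counters are moving much faster than the others' — which is at least suggestive, though as I said above, I don't yet know how to tie such an imbalance to a specific PostgreSQL slowdown.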

2. If you perform a buffered (that is, no O_DIRECT) write of a complete block, and that block is not in the page cache, there is some suggestion that Linux will read the block back from disk before overwriting it.   Actually, two different kernel developers gave us opposite answers about whether or not Linux is smart enough to optimize away the reads, but testing by Andres Freund seemed to suggest that, whether for this reason or some other, writing data not in cache can lead to a large volume of read I/O, equal to the write I/O.  Some reads might be expected, since metadata might need to be brought into cache, but if only metadata is being read, this should be small compared to the volume of writes.  I'm still on the fence about whether this is a real problem, but if it is, it will hurt people who set large values of shared_buffers in the hopes of making their entire workload stay within PostgreSQL's cache: unless there's still enough memory remaining at the OS-level to double-buffer everything that lives in our cache, such users will incur additional (unnecessary) read I/O.
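If you want to check this on your own system, one crude experiment is to watch the read_bytes counter in /proc/self/io while overwriting a full block that has been evicted from cache. The sketch below is mine, not something from the discussion; the fadvise-based eviction and the temp file are just scaffolding, and the result will vary by kernel and filesystem — which is exactly the open question:

```python
import os, tempfile

BLOCK = 8192

def read_bytes():
    """Bytes this process has caused to be read from the storage layer."""
    with open("/proc/self/io") as f:
        for line in f:
            if line.startswith("read_bytes:"):
                return int(line.split()[1])

def overwrite_read_delta(path):
    fd = os.open(path, os.O_RDWR)
    try:
        os.fsync(fd)  # make sure dirty pages are on disk first
        # Ask the kernel to drop this file's pages from the page cache.
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
        before = read_bytes()
        os.pwrite(fd, b"y" * BLOCK, 0)  # aligned, full-block overwrite
        os.fsync(fd)
        return read_bytes() - before
    finally:
        os.close(fd)

with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"x" * BLOCK * 4)
    path = f.name
delta = overwrite_read_delta(path)
print("read I/O during full-block overwrite:", delta, "bytes")
os.unlink(path)
```

On a system where the read is optimized away, the delta should be zero or merely metadata-sized; a delta comparable to the write volume would reproduce the behavior Andres observed.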

3. Not that it's particularly bad for PostgreSQL, but as a point of possible future interest, I found out that the Linux kernel developers do not recommend (and most likely never will recommend) mixing O_DIRECT and non-O_DIRECT I/O on a single file.  Doing this is expected to result in very poor performance, because each O_DIRECT operation invalidates the affected pages in the page cache.  I haven't been able to think of a scenario in which this will actually hurt PostgreSQL users today, because we only use O_DIRECT for WAL, and then only if wal_sync_method = open_sync or open_datasync, and then only if neither archiving nor streaming replication is in use.  However, it's certainly worth keeping in mind if we ever consider expanding the use of O_DIRECT.
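For the curious, here's a sketch of what using O_DIRECT actually involves on Linux — illustrative code of my own, not how PostgreSQL does it. The user buffer, file offset, and transfer size all have to be suitably aligned, and some filesystems (tmpfs, for example) refuse O_DIRECT entirely, so the sketch reports that case rather than failing:

```python
import errno, mmap, os

def write_direct(path, block=b"x" * 8192):
    """Try to write one aligned block with O_DIRECT; report success."""
    buf = mmap.mmap(-1, len(block))  # anonymous mapping: page-aligned buffer
    buf.write(block)
    try:
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_DIRECT, 0o600)
    except OSError as e:
        if e.errno in (errno.EINVAL, errno.ENOTSUP):
            return False  # this filesystem refuses O_DIRECT
        raise
    try:
        os.write(fd, buf)  # bypasses the page cache entirely
    finally:
        os.close(fd)
    return True
```

Every such write invalidates any cached copies of the affected pages, which is why interleaving it with ordinary buffered I/O on the same file is expected to perform so badly.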

Again, thanks to all of the kernel folks who spent time discussing these and other issues with us last week.

15 comments:

  1. So with regards to number 1, what are the kernel devs doing about it, and/or what are the workarounds?

    Replies
    1. Mel Gorman told me that there is a way to detect the problem using information from /proc, but he couldn't rattle off the details offhand, so I don't have them. There was some discussion of possibly rejiggering things so that the kernel notices when the reclaim rate for one zone is much higher than any other zone and behaves differently in that case, but it didn't appear there was broad consensus on what the behavior should be. I think a useful first step would be to try to find out whether and to what degree this problem is actually affecting PostgreSQL workloads, but I don't (yet!) know how to do that.

  2. I'd love to see a FreeBSD kernel hacker chime in here.

    ZFS enables some interesting things for pgsql:

    * http://open-zfs.org/wiki/Performance_tuning#PostgreSQL - it seems like the primarycache setting prevents the double buffering problem that Linux's page cache has
    * http://citusdata.com/blog/64-zfs-compression

  3. Replies
    1. sorry, let me be more verbose:

      hell no.

    2. Why "hell no"? What's the problem with SmartOS in this context?

  4. Robert, thanks for the writeup!

    What is the exact scenario that triggers #2?

    You've got a page in shared_buffers, you want to flush it to disk, Kernel notices that page is _NOT_ in page cache _BUT_ in shared buffers and then triggers a superfluous read before writing the page? Or is this unrelated to shared_buffers?

    Replies
    1. More or less what you said, except that the kernel has no idea whether or not the page is in shared_buffers, and it doesn't matter either way. As it turns out, most cases where we're overwriting a page in an existing relation are going to be coming from shared_buffers, but there are other similar cases, e.g. it could be in temp_buffers, or it could be a page from an SLRU, or we could repeatedly overwrite the current WAL page (though it'd be pretty unlucky if the kernel evicted something that hot). At any rate the kernel doesn't care: the issue is simply whether an 8kB-aligned write from swap-backed memory into a file on disk will fault the underlying filesystem blocks in from disk if they're not already in the kernel's page cache, or whether it will determine that the read can be optimized away since the entire page is slated for overwrite.

    2. Gotcha - just did a quick test [1] with dd, which seemed fine for 8kB writes; it would be good to get some confirmation of whether, and on which systems, this is an issue.

      [1] http://nopaste.narf.at/show/3075/

    3. This comment has been removed by the author.

    4. I would guess dd opens the output file with O_WRONLY and this could potentially not trigger the issue.
      I assume PostgreSQL opens it with O_RDWR. Is this correct, Robert?

    5. dd also opens with O_RDWR in my example: http://nopaste.narf.at/show/3077/, so that shouldn't be a differentiator.

  5. THP compaction should be mentioned. I have strong circumstantial evidence (but not proof) that it is causing issues with high memory systems.

  6. The fsync() issue sounds like a much bigger performance problem for a database app than NUMA scheduling or cache misses.

    I wonder if one could put the WAL files on one disk and the actual db files on another. That way queries would not block when committing.

    Though I believe this is only an issue for databases that are both read- and write-intensive. Most db setups would be one or the other, imho.
