Last week, I attended the Linux Storage, Filesystems, and Memory Management summit (LSF/MM) on Monday and Tuesday, and the Linux Collaboration Summit (aka Collab) from Wednesday through Friday. Both events were held at the Meritage Resort in Napa, CA. This was by invitation of some Linux developers who wanted to find out more about what PostgreSQL needs from the Linux kernel. Andres Freund and I attended on behalf of the PostgreSQL community; Josh Berkus was present for part of the time as well.
My overall impression is that it was a good week, except that by Thursday the combination of 14-hour days and jet lag was catching up with me in a big way. However, from the point of view of the PostgreSQL project, I think it was very positive. On Monday, Andres and I had an hour-and-a-half slot; we used about an hour and fifteen minutes of that time. Our big complaint was with the Linux kernel's fsync behavior, but we talked about some other issues as well, including double buffering, transparent huge pages, and zone reclaim mode.
Since I've already written a whole blog post (linked above) about the fsync issues, I won't dwell further on that here, except to say that our explanation prompted some good discussion and I think that the developers in the room understood the problem we were complaining about and felt that it was a real problem which deserved to be addressed. The discussion of double-buffering was somewhat less satisfying; I don't think it's very clear what the best way forward is there. One possible solution is to have a way for PostgreSQL to evict pages from its cache back into the kernel cache without marking them dirty, but this is quite understandably scary from the kernel's point of view, and I'm not very sure that the performance would be good anyway.
On the topic of transparent huge pages (THP), we made the point, already well-known to many PostgreSQL users, that they destroy performance on some PostgreSQL workloads. When we see users with excessive system time usage, we simply recommend that they shut THP off. This problem was familiar to many; apparently, the code for transparent huge pages is quite complex and is known to cause problems for some other workloads as well. It has been improved in more recent kernels, so the problem cases may now be fewer, but it is far from clear that all of the bugs have been swatted. One interesting idea that was floated was to add some sort of mutex so that only one process will attempt THP compaction at a time, to prevent the time spent compacting from ballooning out of control on machines with many processors. If you are about to do THP compaction but the mutex is held by another process, don't wait for the mutex, but just skip compaction.
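To make that idea concrete, here is a rough userspace sketch of the skip-if-busy pattern using pthread_mutex_trylock(); the actual change would of course live in the kernel's compaction code and use kernel locking primitives, and all of the names below are made up purely for illustration.

    /* Illustrative sketch only: "skip THP compaction if someone else is
     * already compacting," expressed with POSIX threads.  The real fix would
     * be inside the kernel, using kernel locking primitives. */
    #include <pthread.h>
    #include <stdbool.h>

    static pthread_mutex_t compaction_lock = PTHREAD_MUTEX_INITIALIZER;

    /* Hypothetical stand-in for the actual memory-compaction work. */
    static void do_compaction(void) { /* ... */ }

    static bool try_compact(void)
    {
        /* If another process or thread already holds the lock, don't queue
         * up behind it; skip compaction and let the caller fall back to
         * allocating ordinary 4K pages instead of a huge page. */
        if (pthread_mutex_trylock(&compaction_lock) != 0)
            return false;

        do_compaction();
        pthread_mutex_unlock(&compaction_lock);
        return true;
    }

The appeal of the trylock is that it changes the worst case: instead of every allocating process piling up behind compaction, at most one process pays the compaction cost at any given time and everyone else proceeds immediately.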
On the topic of zone reclaim mode, nearly everyone seemed to agree that the current kernel behavior of setting vm.zone_reclaim_mode to 1 on some systems hurts more people than it helps. No one objected to the idea of changing the kernel so that 0 is always the default. A setting of 1 can improve things for certain workloads where the whole working set fits within a single memory node, but most people (and certainly all the database people in the room) seemed to feel that was a relatively uncommon scenario.
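For anyone who wants to check their own systems, the setting is exposed as vm.zone_reclaim_mode via sysctl and as /proc/sys/vm/zone_reclaim_mode in procfs. The little sketch below just reads the current value; turning it off persistently is the usual sysctl -w / sysctl.conf affair.

    /* Minimal sketch: read the current value of vm.zone_reclaim_mode from
     * procfs.  A value of 0 means zone reclaim is off, which is generally
     * what you want for database workloads whose working set does not fit
     * in a single NUMA node. */
    #include <stdio.h>

    int main(void)
    {
        int mode;
        FILE *f = fopen("/proc/sys/vm/zone_reclaim_mode", "r");

        if (f == NULL || fscanf(f, "%d", &mode) != 1) {
            perror("reading /proc/sys/vm/zone_reclaim_mode");
            return 1;
        }
        fclose(f);
        printf("vm.zone_reclaim_mode = %d\n", mode);
        return 0;
    }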
Before and after our Monday session, I got a chance to hear about some other kernel efforts that were underway. There was discussion of whether 32-bit systems needed to be able to handle disk drives with more than 2^32 4K pages (i.e. >16TB), with the conclusion being that it might make sense to support accessing files on such filesystems with O_DIRECT, but reworking the kernel page cache to support it more fully was probably not sensible. Among other problems, a 32-bit system won't have enough memory to fsck such a volume. Persistent memory, which does not lose state when the system loses power, was also discussed. I learned about shingled magnetic recording (SMR), a technique created to work around the fact that drive write heads can't be made much smaller and still write readable data. Such drives will have a write head larger than the read head, and each track will partially overwrite the previous track. The drive is therefore divided into zones, each of which can be written in append-only fashion, or the whole zone can be erased. This presents new challenges for filesystem developers (and will doubtless work terribly for database workloads!). Dave Jones talked about a tool called Trinity, which makes random Linux system calls in an attempt to crash the kernel. He's been very successful at crashing the kernel this way; many bugs have been found and fixed.
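Returning to SMR for a moment: to make the write model a bit more concrete, here is a toy data structure that captures the constraint as it was described. Within a zone you can only append at the current write pointer, and the only way to reclaim space is to reset the entire zone. This mirrors no real drive or kernel interface; it's just my illustration of the constraint.

    /* Toy model of an SMR zone, purely to illustrate the constraint described
     * above: writes within a zone must be sequential, and space is reclaimed
     * only by resetting the whole zone.  Not a real drive or kernel API. */
    #include <stdbool.h>
    #include <stddef.h>
    #include <string.h>

    #define ZONE_SIZE (1u << 20)   /* toy size; real SMR zones are much larger */

    struct smr_zone {
        size_t write_pointer;      /* next writable offset within the zone */
        unsigned char data[ZONE_SIZE];
    };

    /* Append-only write: rejected unless it starts exactly at the write
     * pointer and fits within the zone. */
    static bool zone_write(struct smr_zone *z, size_t off,
                           const void *buf, size_t len)
    {
        if (off != z->write_pointer || len > ZONE_SIZE - off)
            return false;
        memcpy(z->data + off, buf, len);
        z->write_pointer += len;
        return true;
    }

    /* The only way to make space reusable again: wipe the entire zone. */
    static void zone_reset(struct smr_zone *z)
    {
        z->write_pointer = 0;
    }

It's easy to see why random in-place writes, which are exactly what a database's data files receive, map poorly onto this model.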
On Tuesday, there was nothing specific to PostgreSQL, but there were discussions of transparent huge pages, the memcg (Linux memory controller group) interface, NUMA-related issues, and more discussion of topics from Monday. On Wednesday, we went from 80 or so people for LSF/MM to maybe 400 for the Linux Collaboration summit - those numbers might be off; I'm just guessing, but there were certainly a lot more people there. Most of the day was taken up with keynotes, including corporate attempts to promote open source, an interesting-sounding project called AllJoyn, a talk on container virtualization, and more. I was interested by the fact that a significant number of attendees were not technical; for example, there was an entire legal track, about trademarks, licensing, and so on.
On Thursday, Andres and I had another opportunity to talk about PostgreSQL. This was cast as a broader discussion that would include not only PostgreSQL developers but also MySQL, MariaDB, and MongoDB developers, as well as the LSF/MM kernel developers. It seemed to me that, for the most part, we're all struggling with the same set of issues, although in slightly different ways. The MongoDB developer explained that MongoDB uses mmap() with the MAP_PRIVATE flag to create their private cache, equivalent to shared_buffers; to minimize double buffering, they occasionally unmap and remap entire files. They use a second set of memory mappings, this one with MAP_SHARED, to copy changes back to disk, mirroring our checkpoint process. They weren't quite sure whether the Linux kernel was to blame for the performance problems they were seeing while performing that operation, but their description of the problem matched what we've seen with PostgreSQL quite closely. Several developers from forks of MySQL were also present and reported similar problems in their environments. The databases vary somewhat in how they interact with the kernel: MySQL uses direct I/O, we use buffered I/O via read() and write(), and MongoDB uses mmap(). Despite that, there was some unity of concerns.
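As I understood the description, the shape of that approach is roughly what the sketch below shows: the same data file is mapped twice, once with MAP_PRIVATE to serve as the copy-on-write "cache" and once with MAP_SHARED to push changes back to disk. This is my reconstruction for illustration, not MongoDB's actual code, and the file name is made up.

    /* Rough reconstruction of the dual-mapping approach described above.
     * Assumes "datafile" is a hypothetical data file at least one page long. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        const size_t len = 4096;
        int fd = open("datafile", O_RDWR);
        if (fd < 0) { perror("open"); return 1; }

        /* Private mapping: modifications are copy-on-write and never reach
         * the file on disk.  This plays the role of the private cache. */
        char *cache = mmap(NULL, len, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE, fd, 0);

        /* Shared mapping of the same file: writes here do go back to disk,
         * roughly analogous to a checkpointer writing out dirty buffers. */
        char *shared = mmap(NULL, len, PROT_READ | PROT_WRITE,
                            MAP_SHARED, fd, 0);

        if (cache == MAP_FAILED || shared == MAP_FAILED) {
            perror("mmap");
            return 1;
        }

        cache[0] = 'x';                 /* dirty the private "cache" copy */
        memcpy(shared, cache, len);     /* "checkpoint": copy changes back */
        msync(shared, len, MS_SYNC);    /* force them to stable storage */

        munmap(cache, len);
        munmap(shared, len);
        close(fd);
        return 0;
    }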
Aside from the database-related stuff, the most interesting session I attended on Thursday was one by Steve Rostedt, regarding a kernel facility called ftrace of which I had not previously been aware. I'm not sure how useful this will be for debugging PostgreSQL problems just because of the volume of output it creates; I think that for many purposes perf will remain my tool of choice. Nevertheless, ftrace has some very interesting capabilities that I think could be used to uncover details that can't be extracted via perf. I'm intrigued by the possibility of using it to get better latency measurements than what I've been able to get out of perf, and there may be other applications as well once I think about it more.
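For the curious, ftrace is driven entirely through files under tracefs, typically mounted at /sys/kernel/tracing (or /sys/kernel/debug/tracing on older systems). The sketch below, which has to run as root, just switches on the function tracer briefly and dumps whatever landed in the trace buffer; it's the programmatic equivalent of a couple of echo commands, not a serious tool.

    /* Rough illustration of driving ftrace from a program: everything happens
     * by writing to and reading from files under tracefs.  Requires root and
     * a kernel with the function tracer built in. */
    #include <stdio.h>
    #include <unistd.h>

    #define TRACEFS "/sys/kernel/tracing"

    static int write_str(const char *path, const char *val)
    {
        FILE *f = fopen(path, "w");
        if (f == NULL)
            return -1;
        fputs(val, f);
        fclose(f);
        return 0;
    }

    int main(void)
    {
        char line[512];
        FILE *trace;

        /* Select the function tracer and switch tracing on. */
        if (write_str(TRACEFS "/current_tracer", "function") != 0 ||
            write_str(TRACEFS "/tracing_on", "1") != 0) {
            perror("enabling ftrace (root? is tracefs mounted here?)");
            return 1;
        }

        sleep(1);                                /* let some events accumulate */
        write_str(TRACEFS "/tracing_on", "0");   /* stop tracing */

        /* Dump whatever landed in the trace buffer. */
        trace = fopen(TRACEFS "/trace", "r");
        if (trace == NULL) { perror("opening trace buffer"); return 1; }
        while (fgets(line, sizeof(line), trace) != NULL)
            fputs(line, stdout);
        fclose(trace);

        /* Restore the default (no-op) tracer. */
        write_str(TRACEFS "/current_tracer", "nop");
        return 0;
    }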
There wasn't much left to talk about on Friday; quite a few people left Thursday night or Friday morning, I think. But I spent some time listening to Roland McGrath talk about ongoing work on glibc, and there were some more talks about NUMA issues that I found interesting.
There's more to say, but this blog post is too long already, so I'd better stop writing before everyone stops reading. In closing, I'd like to thank Mel Gorman very much for inviting us and for his very positive attitude about addressing some of the problems that matter to PostgreSQL and other databases, and I'd also like to thank Dave Chinner, James Bottomley, Jan Kara, Rik van Riel, and everyone else whose names I am unfortunately forgetting for their interest in this topic. Thanks!
THP compaction was killing our performance. THP was the trigger, but our session management is the real issue. Many of our connections are unpooled, so when compaction occurred, our sessions went through the roof, causing massive lightweight lock contention.