Robert Haas: Linux's fsync() woes are getting some attention

Monday, March 10, 2014

In two weeks, I'm headed to LSF/MM and the Linux Collaboration Summit, by invitation of some Linux kernel hackers, to discuss how the Linux kernel can better interoperate with PostgreSQL. This is good news for PostgreSQL, and hopefully for Linux as well. A post from Mel Gorman indicates that this topic is attracting a lot of interest, and that MariaDB and MySQL developers have now been invited to participate as well. His summary of the discussion so far quotes some blunt words from one of my posts:

 IMHO, the problem is simpler than that: no single process should
 be allowed to completely screw over every other process on the
 system.  When the checkpointer process starts calling fsync(), the
 system begins writing out the data that needs to be fsync()'d so
 aggressively that service times for I/O requests from other process
 go through the roof.  It's difficult for me to imagine that any
 application on any I/O scheduler is ever happy with that behavior.
 We shouldn't need to sprinkle of fsync() calls with special magic
 juju sauce that says "hey, when you do this, could you try to avoid
 causing the rest of the system to COMPLETELY GRIND TO A HALT?".
 That should be the *default* behavior, if not the *only* behavior.

Long-time PostgreSQL users will probably be familiar with the pain in this area. If the kernel doesn't write back pages aggressively enough between and during checkpoints, then at the end of the checkpoint, when the fsync() requests start arriving in quick succession, system throughput goes down the tubes as every available write cache fills up and overflows. I/O service times become very long, and overall performance tanks, sometimes for minutes. A couple of things have been tried over the years to ameliorate this problem. Beginning in PostgreSQL 8.3 (2008), the checkpoint writes are spread out over a long period of time to give the kernel more time to write them back, but whether it actually does is up to the kernel. More recently, attempts have been made to spread out the fsync() calls as well as the writes, but I'm not aware that any of these attempts have had fully satisfying results, and no changes along these lines have been judged sufficiently promising to justify committing them. In one sense, what PostgreSQL really wants to know is whether starting the next fsync() now is going to cause the I/O subsystem to become overloaded, and there's no easy way to get that information; in fact, it's not clear that even the kernel has access to that information. (If it did, we'd also hope that it would take care of throttling the fsyncs a little better even in the absence of specific guidance from PostgreSQL.)

The other major innovation that I think has been broadly useful to PostgreSQL users, dirty_background_bytes, dates to 2009. This was an improvement over the older dirty_background_ratio because the latter couldn't be set low enough to keep the dirty portion of the kernel's write cache as small as PostgreSQL needed it to be. But it's unclear to what extent any further progress has been made since then. Mel Gorman points out in his post that the behavior in this area was changed significantly in Linux 3.2, but many PostgreSQL users are still running older kernels that don't include these changes. RHEL 6 ships with the 2.6.32 kernel, which does not incorporate these changes, and RHEL 7, which is apparently slated to ship with 3.10, is still in beta.
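For readers who want to check which of the two thresholds a given machine is actually using, here is a minimal sketch in C. It assumes the usual /proc/sys/vm paths; the function names read_tunable and report_background_threshold are made up for illustration. The kernel lets only one of the two knobs be active at a time, and the inactive one reads back as 0.

```c
#include <stdio.h>

/*
 * Read a single unsigned integer from a file such as
 * /proc/sys/vm/dirty_background_bytes.
 * Returns 0 on success, -1 if the file cannot be opened or parsed.
 */
int read_tunable(const char *path, unsigned long long *value)
{
    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1;

    int ok = (fscanf(f, "%llu", value) == 1);
    fclose(f);
    return ok ? 0 : -1;
}

/*
 * Print which background-writeback threshold is in effect.
 * A nonzero dirty_background_bytes means the byte limit is active;
 * otherwise the percentage in dirty_background_ratio applies.
 * Returns 0 on success, -1 if either file could not be read.
 */
int report_background_threshold(void)
{
    unsigned long long bytes, ratio;

    if (read_tunable("/proc/sys/vm/dirty_background_bytes", &bytes) != 0 ||
        read_tunable("/proc/sys/vm/dirty_background_ratio", &ratio) != 0)
        return -1;

    if (bytes != 0)
        printf("background writeback starts at %llu dirty bytes\n", bytes);
    else
        printf("background writeback starts at %llu%% of available memory\n",
               ratio);
    return 0;
}
```

Calling report_background_threshold() from main() prints one line describing the active setting. Changing either value requires root and is normally done with sysctl rather than from application code.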

It seems clear based on recent discussions that the Linux developer community is willing to consider changes that would make things better for PostgreSQL, but their ability to do so may be hindered by a lack of good information. It seems unlikely that all of the problems in this area have been fixed in newer releases, but more and better information is needed on the extent to which they have or have not. Perhaps someone's already done this research and I'm simply not aware of it; pointers are appreciated.

15 comments:

Platypus, March 10, 2014 5:49 PM
Good luck. I've gotten conflicting answers to similar questions. In private email, one of the XFS developers assured me that fsync entanglement is an immutable fact of life, that it's impossible for them to provide any information other than "everything in the universe has been flushed", and anyone who asks for more is an idiot. On the other hand, at the Linux filesystem developer summit after FAST, both XFS and ext4 developers seemed to think that providing information about the current journal commit number would be easy. Getting information about *how much* data is tied to the current journal commit is probably somewhere in between those two extremes, but maybe if enough people approach the problem enough different ways something useful will happen.
Theodore Tso, March 10, 2014 5:49 PM
It should be noted that he followed your quote with the following caveat:

"It is important to keep this in mind although sometimes the ordering
requirements of the filesystem may make it impossible to achieve."

If you are writing to blocks which have been pre-allocated and pre-initialized, doing this via fdatasync(2) is actually pretty easy. If however you are writing to newly allocated blocks, to a newly created file (this is how SQLite works; a checkpoint operation copies the whole d*mned database to a newly created file as part of its checkpoint operation), it's going to be pretty hard to make fsync(2) not trash other processes' I/O latencies.
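A minimal sketch of the easy case, in C, under stated assumptions: the file is sized and zero-filled once up front, so later writes only overwrite existing blocks and fdatasync(2) has no allocation metadata to flush. The function names open_preinitialized and overwrite_durably are hypothetical, and error handling is reduced to the bare minimum.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Create or open `path`, reserve `size` bytes (size must be > 0) and
 * fill them with zeros so that every block is both allocated and
 * initialized, then fsync() once so the file's metadata is on disk.
 * Returns an open file descriptor, or -1 on error.
 */
int open_preinitialized(const char *path, off_t size)
{
    static const char zeros[65536];   /* zero-filled by definition */

    int fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0)
        return -1;

    /* posix_fallocate() returns an error number, not -1. */
    if (posix_fallocate(fd, 0, size) != 0)
        goto fail;

    for (off_t off = 0; off < size; ) {
        size_t want = (size - off) < (off_t)sizeof zeros
                          ? (size_t)(size - off) : sizeof zeros;
        ssize_t n = pwrite(fd, zeros, want, off);
        if (n <= 0)
            goto fail;
        off += n;
    }

    if (fsync(fd) != 0)
        goto fail;
    return fd;

fail:
    close(fd);
    return -1;
}

/*
 * Overwrite `len` bytes in place at offset `off`, then flush the data
 * with fdatasync(). The range is assumed to lie inside the region that
 * open_preinitialized() set up. Returns 0 on success, -1 on error.
 */
int overwrite_durably(int fd, const void *buf, size_t len, off_t off)
{
    const char *p = buf;

    while (len > 0) {
        ssize_t n = pwrite(fd, p, len, off);
        if (n <= 0)
            return -1;
        p += n;
        off += n;
        len -= (size_t)n;
    }
    return fdatasync(fd);
}
```

The one-time cost is paid in open_preinitialized(); after that, each overwrite_durably() call touches only data blocks that already exist. Writing past the preinitialized region would bring back the allocation work that makes the hard case hard.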
Theodore Tso, March 11, 2014 9:45 AM
One other thing to note --- a lot of what you are talking about vis-a-vis writeback behaviours is something where we can certainly do a better job. In particular the fact that the writeback daemons don't start pushing blocks to disk until a percentage of available memory contains dirty blocks (which might have worked well when systems had hundreds of megabytes, but not now when systems with hundreds of gigabytes are not unheard of), is definitely a problem.

Part of the reason why it's been unsolved for such a long time is that writeback strategies are kind of in a no-man's land. Half the code is in fs/fs-writeback.c, and the other half is in mm/page-writeback.c, and so it doesn't get enough love, since the file system people don't really consider writeback to be a fs topic, and the mm people don't consider it part of the core mm responsibilities --- which is why certain problems, like automatically tuning when dirty pages should be staged to the HDD, don't really get that much love.

This is completely different from what most people complain about when they talk about what has sometimes been colloquially called the "O_PONIES" problems, which at the end of the day is very much wrapped up with the fact that the file system can't give you efficient notification of when writes might be on stable store, because we don't get that notification from the disk unless we do a CACHE FLUSH operation. (And this is where handing back the commit ID is something we can do, if people are willing to live with getting a notification after the 5 second ext3/4 commit timeout has happened, such that we've sent a CACHE FLUSH command for other purposes.)

So when you say "fsync" problem, it may be misleading to some people. The question of tuning writeback, so that a lot of the work is done before the fsync() is issued, is a bit different from other people's complaints about fsync(2) being too slow, so they want to skip the fsync() call, and yet they still complain when they lose data (hence the quip that what the application programmers really want is the O_PONIES open flag).
Theodore Tso, March 11, 2014 9:35 PM
Something that might be useful for this use case is the Linux system call sync_file_range(2). It's Linux specific, and it would require the userspace application to understand how many pages to submit at a time to avoid swamping the I/O subsystem. But one thing that would not require any new kernel code, and that might help make things better, is to use ionice(2) in the database writeback thread, and then to use sync_file_range() in a loop to issue writeback in chunks of a few megabytes at a time, using SYNC_FILE_RANGE_WAIT_BEFORE to throttle the calls to sync_file_range().

Yes, it's manual, and arguably the kernel should be able to do a better job automatically. But if you want to work around kernel behaviour for currently released distribution kernels, it might be something that's worth exploring.
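A rough sketch of the chunked writeback loop described above, in C. It is one reading of the suggestion rather than tested guidance: the function name flush_in_chunks and the 4 MB CHUNK_BYTES value are made up for illustration, the I/O priority step is left out, and a final fdatasync() is added because sync_file_range() does not write out file metadata and is not a durability guarantee on its own.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical chunk size; the comment only says "a few megabytes". */
#define CHUNK_BYTES ((off_t)4 * 1024 * 1024)

/*
 * Push the dirty pages of `fd` in [0, length) to disk one chunk at a
 * time, then call fdatasync() to make the result durable.
 * Returns 0 on success, -1 on error with errno set.
 */
int flush_in_chunks(int fd, off_t length)
{
    for (off_t off = 0; off < length; off += CHUNK_BYTES) {
        off_t n = (length - off < CHUNK_BYTES) ? length - off : CHUNK_BYTES;

        /* Start asynchronous writeback of this chunk's dirty pages. */
        if (sync_file_range(fd, off, n, SYNC_FILE_RANGE_WRITE) != 0)
            return -1;

        /*
         * Throttle: block until the writeback just submitted for this
         * range has completed, so that at most one chunk is in flight
         * before the next one is issued.
         */
        if (sync_file_range(fd, off, n, SYNC_FILE_RANGE_WAIT_BEFORE) != 0)
            return -1;
    }

    /* By now most of the data is already on its way to disk, so this
     * final flush should have comparatively little left to do. */
    return fdatasync(fd);
}
```

A checkpointer-like process would call flush_in_chunks(fd, file_size) where it would otherwise call fsync(fd) directly. Whether 4 MB is a sensible chunk size depends on the storage, which is exactly the tuning burden the comment above points out.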
Anonymous, April 02, 2014 9:53 PM
I am a senior performance engineer for Nokia. This is an issue that has been driving me nuts for a long time. I hope it is addressed soon. We write a lot of log data on our systems, and at times the I/O wait times can bring a web server or proxy browser to its knees, adversely affecting a lot of our customers, of which there are about 100 million... I'm dealing with just such a problem right now, but unless I want to start hacking the kernel (our operations people would not be happy with that), I can only provide band-aids - no real solutions.
Anonymous, May 27, 2014 2:17 PM
THP compaction on Linux 6+ causing this?

http://feed.askmaclean.com/archives/linux-6-transparent-huge-pages-and-hadoop-workloads.html
Unknown, August 23, 2014 10:27 PM
Hi, in your blog you mentioned that
"More recently, attempts have been made to spread out the fsync() calls as well as the writes, but I'm not aware that any of these attempts have had fully satisfying results, and no changes along these lines have been judged sufficiently promising to justify committing them."
I am wondering if you could point me to some reference about this?

Thanks!