Monday, March 10, 2014

Linux's fsync() woes are getting some attention

In two weeks, I'm headed to LSF/MM and the Linux Collaboration Summit, by invitation of some Linux kernel hackers, to discuss how the Linux kernel can better interoperate with PostgreSQL.  This is good news for PostgreSQL, and hopefully for Linux as well.  A post from Mel Gorman indicates that this topic is attracting a lot of interest, and that MariaDB and MySQL developers have now been invited to participate as well.  His summary of the discussion so far quotes some blunt words from one of my posts:

 IMHO, the problem is simpler than that: no single process should
 be allowed to completely screw over every other process on the
 system.  When the checkpointer process starts calling fsync(), the
 system begins writing out the data that needs to be fsync()'d so
 aggressively that service times for I/O requests from other process
 go through the roof.  It's difficult for me to imagine that any
 application on any I/O scheduler is ever happy with that behavior.
 We shouldn't need to sprinkle our fsync() calls with special magic
 juju sauce that says "hey, when you do this, could you try to avoid
 causing the rest of the system to COMPLETELY GRIND TO A HALT?".
 That should be the *default* behavior, if not the *only* behavior. 

Long-time PostgreSQL users will probably be familiar with the pain in this area.  If the kernel doesn't write back pages aggressively enough between and during checkpoints, then at the end of the checkpoint, when the fsync() requests start arriving in quick succession, system throughput goes down the tubes as every available write cache fills up and overflows.  I/O service times become very long, and overall performance tanks, sometimes for minutes.  A couple of things have been tried over the years to ameliorate this problem.  Beginning in PostgreSQL 8.3 (2008), the checkpoint writes are spread out over a long period of time to give the kernel more time to write them back, but whether it actually does is up to the kernel.  More recently, attempts have been made to spread out the fsync() calls as well as the writes, but I'm not aware that any of these attempts have had fully satisfying results, and no changes along these lines have been judged sufficiently promising to justify committing them.  In one sense, what PostgreSQL really wants to know is whether starting the next fsync() now is going to cause the I/O subsystem to become overloaded, and there's no easy way to get that information; in fact, it's not clear that even the kernel has access to that information.   (If it did, we'd also hope that it would take care of throttling the fsyncs a little better even in the absence of specific guidance from PostgreSQL.)

The other major innovation that I think has been broadly useful to PostgreSQL users, dirty_background_bytes, dates to 2009.  This was an improvement over the older dirty_background_ratio because the latter couldn't be set low enough to keep the dirty portion of the kernel's write cache as small as PostgreSQL needed it to be.  But it's unclear to what extent any further progress has been made since then. Mel Gorman points out in his post that the behavior in this area was changed significantly in Linux 3.2, but many PostgreSQL users are still running older kernels that don't include these changes.  RHEL 6 ships with the 2.6.32 kernel, which does not incorporate these changes, and RHEL 7, which is apparently slated to ship with 3.10, is still in beta.

It seems clear based on recent discussions that the Linux developer community is willing to consider changes that would make things better for PostgreSQL, but their ability to do so may be hindered by a lack of good information.  It seems unlikely that all of the problems in this area have been fixed in newer releases, but more and better information is needed on the extent to which they have or have not.  Perhaps someone's already done this research and I'm simply not aware of it; pointers are appreciated.


  1. Good luck. I've gotten conflicting answers to similar questions. In private email, one of the XFS developers assured me that fsync entanglement is an immutable fact of life, that it's impossible for them to provide any information other than "everything in the universe has been flushed", and anyone who asks for more is an idiot. On the other hand, at the Linux filesystem developer summit after FAST, both XFS and ext4 developers seemed to think that providing information about the current journal commit number would be easy. Getting information about *how much* data is tied to the current journal commit is probably somewhere in between those two extremes, but maybe if enough people approach the problem enough different ways something useful will happen.

  2. It should be noted that he followed your quote with the following caveat:

    "It is important to keep this in mind although sometimes the ordering
    requirements of the filesystem may make it impossible to achieve."

    If you are writing to blocks which have been pre-allocated and pre-initialized, doing this via fdatasync(2) is actually pretty easy. If, however, you are writing to newly allocated blocks, or to a newly created file (this is how SQLite works; a checkpoint operation copies the whole d*mned database to a newly created file as part of its checkpoint operation), it's going to be pretty hard to make fsync(2) not trash other processes' I/O latencies.

    1. There is no need to malign SQLite with incorrect claims.

      SQLite does not copy everything to a new file as part of checkpointing. In older versions (pre-WAL), data that is about to be changed is copied to a journal and the main database updated. Once a commit happens, the journal can be truncated/deleted. It only contains as much data as was changed, and the database by definition has to have both the before and after data around so it can do a rollback or commit.

      In WAL (write ahead log) changes and additions are written to a separate WAL file. When a checkpoint happens the data from the WAL is merged into the main database file and the WAL truncated. Again both the before and after data is available and the amount of writing is based on the amount added and changed.

      The one time SQLite does copy the whole database to a newly created file is when VACUUM is run, and that is precisely what it is documented to do. You use VACUUM to recover space marked as deleted/free as well as defragment the content (SQLite operates in pages). It is possible to use auto vacuum (see SQLite doc) but even that won't result in the whole database being copied frequently.

      In some cases users do get very aggressive about this because they find that vacuuming a database the browser has used for quite a while does improve performance. Some developers drink that kool aid too. Again, SQLite is doing exactly what it was asked to do, and that behavior is copiously documented.

    2. Fair enough, I apologize for being too general in maligning SQLite.

      The frustrating thing for me is that when end users use an application which does Stupid Stuff, whether it is using VACUUM far too frequently, or putting SQLite commits in the GUI's event manager thread, the end users tend to pile all of the blame on the file system developers. :-(

    3. I did read what Mel wrote about it perhaps not being achievable, but I don't buy it. Just as a thought experiment, suppose that fsync() is coded to write one block that was dirty prior to the call down to the platter each *year*. It is pretty clear that any non-trivial fsync() operation will take unreasonably long to complete, but the impact on the system's foreground work will also be essentially nil. Unless there's something very special that the hard disk must do when writing an fsync()'d block that it does not need to do for an ordinary write, this is a straightforward latency vs. throughput trade-off: the faster you push blocks to disk, the quicker you'll get done with the fsync(), but the slower everything else will be in the meantime.

    4. Another important point to remember is that the database doesn't control the workload any more than the kernel does. If the user inserts 10GB of data, the database must allocate 10GB from the FS; if the user updates 10GB of data, we've got 10GB of dirty data that must be flushed to disk at checkpoint time. Those aren't unreasonable things for users to want to do, and the question of why the filesystem can't cope with them without choking isn't unreasonable either.

      To be sure, there may be some tricky engineering problems here. If, for example, there are caching layers between the kernel and the disk that even the kernel has no knowledge of, it may well run into some of the same problems that PostgreSQL does: namely, inability to detect that we're creating a write glut until after we're well past the point of no return. But it seems to me that those problems should be less severe at the kernel level than they are in user space, because the biggest cache by far is the OS buffer cache, and the kernel DOES have control over that.

      Regardless of engineering complexity, the problems need to be solved because the demands aren't fundamentally unreasonable. Users want to be able to push the highest possible transaction rate through PostgreSQL on Linux with the lowest possible latencies, and we should support that, not brand it as an unreasonable requirement, because it isn't. Of course, not all the fault here is on the side of the kernel developers; PostgreSQL has no shortage of internal issues that need fixing also. The point here isn't to condemn, but to say: hey, we don't see a way to solve this particular problem from the PostgreSQL side without changes to kernel behavior. Thanks.

    5. Well, first of all, there is something special a hard drive must do when writing a fsync()'d block, and that's the CACHE FLUSH command. When we send a block to the disk, the HDD can legally hold on to it for days, weeks, or years before writing said block to stable store. In practice, it's rarely more than a second, but we don't know when it's safely on stable store, and the HDD doesn't give us any notification when it is.

      The other thing to note is that there are many applications which assume that fsync() won't take years. There is the infamous example of Firefox putting a SQLite commit call in its event manager thread, and that's one where we had users calling for file system developers' blood because fsync() was taking too long. It may be that for PostgreSQL you don't care about the commit thread taking decades, but I'm not sure I buy that. Do you mean to tell me the commit thread never has to take any other locks, such that if the commit thread were to take a long, long time, other foreground threads wouldn't eventually come to a crashing halt?

      And in any case, there are plenty of legitimate examples (unlike the infamous Firefox debacle) where the fsync() call does have to be done as part of the foreground operation. Now consider the entangled write problem, where you have some applications that consider fsync() to be "foreground work" and other applications running on the same file system which consider fsync() to be "background work" at the same time.

      Even if we did put in a more sophisticated logging engine, much more like what an RDBMS might have, with both write-ahead and rollback logs, to help mitigate the entangled write problem, then in the best case the file system would have performance roughly comparable to what OracleFS had, back when Oracle tried pushing the concept of a file system backed by an OracleDB back end. And I'll remind you that traditional file system designs, which do not make these guarantees, could be far more performant than OracleFS.

      So I'm sure there are some things we can do better, but at the end of the day, There's No Such Thing As A Free Lunch.

  3. One other thing to note --- a lot of what you are talking about vis-a-vis writeback behaviours is something where we can certainly do a better job. In particular, the fact that the writeback daemons don't start pushing blocks to disk until a percentage of available memory contains dirty blocks (which might have worked well when systems had hundreds of megabytes of memory, but not now, when systems with hundreds of gigabytes are not unheard of) is definitely a problem.

    Part of the reason why it's been unsolved for such a long time is that writeback strategies are kind of in a no-man's land. Half the code is in fs/fs-writeback.c, and the other half is in mm/page-writeback.c, so it doesn't get enough love: the file system people don't really consider writeback to be a fs topic, and the mm people don't consider it part of the core mm responsibilities --- which is why problems like automatically tuning when dirty pages should be staged to the HDD don't really get much attention.

    This is completely different from what most people complain about when they talk about what has sometimes been colloquially called the "O_PONIES" problems, which at the end of the day is very much wrapped up with the fact that the file system can't give you efficient notification of when writes might be on stable store, because we don't get that notification from the disk unless we do a CACHE FLUSH operation. (And this is where handing back the commit ID is something we can do, if people are willing to live with getting a notification after the 5 second ext3/4 commit timeout has happened, such that we've sent a CACHE FLUSH command for other purposes.)

    So when you say the "fsync" problem, it may be misleading to some people. The question of tuning writeback, so that a lot of the work is done before the fsync() is issued, is a bit different from other people's complaints that fsync(2) is too slow, so they want to skip the fsync() call, and yet they still complain when they lose data (hence the quip that what the application programmers really want is the O_PONIES open flag).

    1. I am not (and I do not think anyone on the PostgreSQL side is) complaining about fsync being too slow. Obviously, it would be nice if it were faster, but durability is much, much more important. What we're talking about here is the effect on the *rest* of the system during the time fsync() is doing its thing.

      And, as you say, that clearly ties into the writeback behavior, because if the kernel does a good job getting most of the data out to disk before the end-of-checkpoint fsync() calls arrive, then obviously the impact on the rest of the system when they do arrive will be reduced. Now the flip side of that is that we don't really care that much how long those calls take as long as the rest of the system doesn't starve for I/O in the meanwhile. So if tuning the writeback behavior is the way to get there, fine, but any other solution that achieves the same objective is probably OK, too.

      What I think is missing here is the idea that size matters. When PostgreSQL calls fsync() on the write-ahead log, we're flushing data into a preallocated file, generally 8kB at a time (or however many sequential 8kB chunks have filled since the last fsync() was done). That generally works as we would expect: it's fast if your I/O subsystem is fast, and it's slow if your I/O subsystem is slow, and if you don't like it, buy a faster disk (or BBWC/FBWC). What causes problems are the fsync() calls on the data files, which can be up to 1GB each and can have dirty blocks distributed arbitrarily throughout the file, and which we fsync() in quick succession. Short of an O_PONIES flag, this is *never* going to be fast. And we don't need it to be fast. We just need to be able to do other things while it's happening. An fsync_slowly() call would be just fine for this purpose, or maybe that should just be the default behavior whenever a process does an fsync() that requires writeback of more than N blocks.

  4. Something that might be useful for this use case is the Linux system call sync_file_range(2). It's Linux specific, and it would require the userspace application to understand how many pages to submit at a time to avoid swamping the I/O subsystem. But one thing that would not require any new kernel code, and might help make things better, is to use ionice(2) in the database writeback thread, and then use sync_file_range() in a loop to issue writeback in chunks of a few megabytes at a time, using SYNC_FILE_RANGE_WAIT_BEFORE to throttle the calls to sync_file_range().

    Yes, it's manual, and arguably the kernel should be able to do a better job automatically. But if you want to work around kernel behaviour for currently released distribution kernels, it might be something that's worth exploring.

    1. Sounds like an interesting idea. Some of the PostgreSQL developers (including me) have played with sync_file_range() in the past without notable success, but I'm pretty sure that nobody's tried anything very similar to what you're proposing here. I'm not sure I understand the idea in detail; maybe we can discuss more via email or at LSF/MM or Collab.

      One thing that's proven to be tricky in the past is that PostgreSQL relies on the kernel to reorder writes. If we were to issue fsync() calls more frequently, we'd reduce worst-case latency, but also deprive the system's I/O scheduler of the freedom to do those writes in the most efficient order, harming throughput. Ideally we'd like to be able to dump all the I/O on the kernel at once and let it sort out the optimal write ordering, rather than having to submit it in small batches. Still, there might be a win here with something like what you're suggesting; we'd need to flesh it out into a patch and then test. Thanks for the suggestion.

    2. I think one problem with the range-sync idea is that while we can assume adjacent blocks in the same file are probably adjacent on disk, we know nothing about blocks in different files. Only the kernel knows how they map to blocks on the drive. Now, drives present a virtual block mapping to the kernel which might not even match physical characteristics, and with virtualization and SAN/NAS, the distinction might not be important anymore. Also, only the kernel knows what other I/O might be happening on each device.

  5. I am a senior performance engineer for Nokia. This is an issue that has been driving me nuts for a long time. I hope it is addressed soon. We write a lot of log data on our systems, and at times the I/O wait times can bring a web server or proxy browser to its knees, adversely affecting a lot of our customers, of which there are about 100 million... I'm dealing with just such a problem right now, but unless I want to start hacking the kernel (our operations people would not be happy with that), I can only provide band-aids - no real solutions.

  6. THP compaction on Linux 6+ causing this?

  7. Hi, in your blog you mentioned that
    "More recently, attempts have been made to spread out the fsync() calls as well as the writes, but I'm not aware that any of these attempts have had fully satisfying results, and no changes along these lines have been judged sufficiently promising to justify committing them. "
    I am wondering if you could point me to some reference about this?