Tuesday, June 10, 2014

Linux disables vm.zone_reclaim_mode by default

Last week, Linus Torvalds merged a Linux kernel commit from Mel Gorman disabling vm.zone_reclaim_mode by default.  I mentioned that this change might be in the works when I blogged about attending LSF/MM and again when I blogged about how the page cache may not behave quite the way we want even with vm.zone_reclaim_mode disabled.

For those who haven't read previous discussion on this topic, either on my blog, on pgsql-performance, or elsewhere around the Internet, enabling vm.zone_reclaim_mode can cause a lot of problems for applications, such as PostgreSQL, that make use of more page cache than will fit on a single NUMA node.  Pages may get evicted from memory in preference to using memory on other nodes, effectively resulting in a page cache that is much smaller than available free memory.  See the second of the two blog posts linked above for more details.

PostgreSQL isn't the only application that suffers from non-zero values of this setting, so I think a lot of people will be happy to see this change merged (like the guy who said that this setting is the essence of all evil).  It will doubtless take some time for this to make its way into mainstream Linux distributions, but getting the upstream change made is the first step.  Thanks to Mel Gorman for pursuing this.
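In the meantime, it's easy to check whether a given host is affected: both the current setting and the NUMA distance table that older kernels used to decide whether to switch zone reclaim on automatically are exposed under /proc and /sys. Here's a minimal sketch of such a check (assuming a Linux box with NUMA sysfs support; the threshold of 20 is the historical RECLAIM_DISTANCE default, hard-coded here rather than read from the kernel):

#!/usr/bin/env python3
# Minimal sketch: report the current zone reclaim setting and the largest
# NUMA node distance on this host. Assumes a Linux system exposing the
# standard /proc/sys/vm/zone_reclaim_mode sysctl and the sysfs NUMA tree.

from pathlib import Path

SYSCTL = Path("/proc/sys/vm/zone_reclaim_mode")
NODE_DIR = Path("/sys/devices/system/node")
RECLAIM_DISTANCE = 20  # historical kernel cutoff for "remote enough" nodes

def zone_reclaim_mode() -> int:
    """Current vm.zone_reclaim_mode; 0 if the sysctl is absent (no NUMA)."""
    try:
        return int(SYSCTL.read_text().strip())
    except FileNotFoundError:
        return 0

def max_node_distance() -> int:
    """Largest inter-node distance the firmware reports via sysfs."""
    distances = [
        int(d)
        for node in NODE_DIR.glob("node[0-9]*")
        for d in (node / "distance").read_text().split()
    ]
    return max(distances, default=0)

if __name__ == "__main__":
    mode = zone_reclaim_mode()
    dist = max_node_distance()
    print(f"vm.zone_reclaim_mode = {mode}, max NUMA node distance = {dist}")
    if mode != 0:
        print("Zone reclaim is enabled; 'sysctl -w vm.zone_reclaim_mode=0' "
              "turns it off for page-cache-heavy workloads like PostgreSQL.")
    elif dist > RECLAIM_DISTANCE:
        print("Zone reclaim is off, but kernels without this change would "
              "have switched it on automatically for this topology.")

Reading those files doesn't require any privileges; actually changing the setting with sysctl does, of course.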

3 comments:

  1. I never really understood why this was ever a feature in the first place. To save a slightly higher cost on remote-node memory accesses, the kernel trades it for transient reclaim latency that's orders of magnitude worse. What exactly is being saved, here?

    We've never been bitten by this because our CPUs just happen to be under that magical threshold of 20 (the kernel's RECLAIM_DISTANCE cutoff for NUMA node distance), but people have been complaining about this for years. It's good to see it's finally being put in its place.

  2. What is the FreeBSD equivalent?

  3. This is an old post, but a lot of FUD is still floating around about this. Zone-based reclamation exists for tightly optimized HPC-style code where all computation and memory fetches are synchronized across all cores, and latency variation snowballs into uselessness.

    It's a tiny minority of systems, so it makes sense to disable it by default, but it's *not* a bad feature; it's absolutely vital for tightly-tuned parallel simulations.

    Now that the pendulum has swung the other way, I'm running into system designers who are terrified to enable this feature even when the instrumentation shows that it's necessary, because it's "evil".
