I spend a lot of time answering questions about PostgreSQL, and one of the questions I get asked frequently is: how should I set shared_buffers? And, a bit less often, how should I set wal_buffers? I've got canned answers that I can rattle off so fast it'll make your head spin. Exceptions to my canned answers keep popping up, and it's starting to get hard to give an answer that actually captures all the complexity in this area, so here's a longer explanation.
For shared_buffers, the quick answer is to allocate about 25% of system memory to shared_buffers, as recommended by the official documentation and by the wiki article on Tuning Your PostgreSQL server, but not more than about 8GB on Linux or 512MB on Windows, and sometimes less. However, I've recently become aware of a number of cases which suggest that higher values may perform much better. PostgreSQL uses the operating system's memory for caching, but, as recent performance results have shown, there can be a huge performance benefit to having the entire database in shared_buffers as opposed to merely having it in RAM. Several EnterpriseDB customers or potential customers have reported excellent results from cranking up shared_buffers to very large values, even giving PostgreSQL the lion's share of system memory for its own use. This flies in the face of some conventional wisdom that PostgreSQL doesn't handle large shared_buffers settings well, but enough people have reported good results that it seems worth taking this strategy seriously.
Of course, the conventional wisdom is not without its reasons. One of the disadvantages of setting shared_buffers very large is that the memory is nailed down for page caching, and can't be used for other purposes, such as temporary memory for sorts, hashing, and materialization (work_mem) or vacuuming and index builds (maintenance_work_mem). If your can't comfortably fit the working set inside shared_buffers without leaving a generous amount of memory left over for other purposes, then this strategy is not for you. Also, as I previously blogged about, if you happen to be running a 32-bit build of PostgreSQL, then you'll find that it's dangerous to set shared_buffers to more than 2-2.5GB, because other memory requests from individual backends may cause them to run out of virtual address space and die.
And, if you can't fit the entire working set inside shared_buffers, then there are a number of reasons to keep it relatively small, in line with the formula above. If the working set is larger than shared_buffers, most buffer accesses will miss the database buffer cache and fault a page in from the OS; clearly, it makes no sense to allocate a large amount of memory to a cache with a low hit rate. Furthermore, as dirty pages get pushed out of PostgreSQL's cache, they'll enter the operating system cache. If that cache is relatively large, the OS has more freedom to reorder writes and thus schedule the I/O more efficiently. This is especially important around checkpoint time, when the amount of additional dirty data that is pushed out into the OS cache figures to increase with shared_buffers; thus, a higher value for shared_buffers can contribute to latency spikes around checkpoint time.
In tuning wal_buffers, it's helpful to understand how these buffers are used. PostgreSQL backend processes initially write their write-ahead log records into these buffers, and then the buffers are flushed to disk. Once the contents of any given 8kB buffer are durably on disk, the buffer can be reused. Since insertions and writes are both sequential, the WAL buffers are in effect a ring buffer, with insertions filling the buffer and WAL flushes draining it. Performance suffers when the buffer fills: no more WAL can be inserted until the current flush is complete. The effects are mitigated by the fact that, when synchronous_commit is not turned off, every transaction commit waits for its WAL record to be flushed to disk; thus, with small transactions at low concurrency levels, a large buffer is not critical.
However, the standard value in releases prior to PostgreSQL 9.1 was 64kB, which is an amazingly small value for any kind of buffer on a modern system. In PostgreSQL 9.1, prompted by some good advice from Greg Smith (blog, book), we changed the value to auto-tune to 3% of shared_buffers up to a maximum of 16MB, the size of a single WAL segment. This results in a substantially higher value than was used in previous releases, and that's a very good thing. If you are using an older release and haven't set this value manually, it's probably worth plugging your value of shared_buffers into this formula and configuring the resulting value. Even on versions which use this auto-tuning formula, I've discovered that an even higher value can sometimes improve performance dramatically. On one test, involving 32 concurrent clients, I found that wal_buffers=64MB doubled performance as compared with wal_buffers=16MB; however, on another system I found that no setting I tried produced more than a 10% improvement over the auto-tuning formula. Still, especially if you have a system with lots of concurrent activity, higher values may be worth experimenting with, particularly if you have frequent checkpoints: the first write to any given page after a checkpoint will emit a full-page image into WAL to protect against torn pages in the event of a crash, and these full-page writes can chew through WAL buffers very quickly.
If you're serious about tuning your system for optimal performance, it's probably worth trying several different settings for these parameters and determining experimentally which settings work best in your environment. Far and away, however, the biggest problem with these settings seems to be that people don't tune them at all, or that they raise them only slightly, leaving them still too low.
Is there some way to measure wal_buffer usage in real time, so that we could simply monitor it for some period of time, and then come up with a way of determining if the current setting is sufficient? This seems preferable to doing trial & error changes.ReplyDelete
yeah, install pg_buffercache contrib..Delete
the official dox suggest that pg_buffercache only reports shared_buffers, not wal_buffers:Delete
I installed it anyway, but I'm not seeing anything that looks like wal_buffer data. Am I missing something?
No, I don't think you're missing anything. pg_buffercache reports on shared buffers, but not wal_buffers.Delete
One thing you can do to inspect WAL usage is to inspect how many bytes of data are written to WAL over an interval of time. Try the following to get the number of bytes in 5 minutes.Delete
1. SELECT pg_current_wal_insert_lsn();
2. Wait 5 minutes
3. SELECT pg_current_wal_insert_lsn();
4. SELECT pg_wal_lsn_diff('', '');
From Pg13 you can emit WAL generation in the explain planDelete
Do certain types of work loads favor different settings? High concurrency, bulk loading, complex joins on large tables, etc.ReplyDelete
I haven't done much benchmarking on bulk loading, but my guess is that there's nothing particularly special about that case. High concurrency does seem to require a bit more in the way of wal_buffers.Delete
Thanks for the great article. I understand that shared_buffers is the most important parameter in PostgreSQL which can change the whole performance drastically.ReplyDelete
Well, I am glad to see others are finally getting on board the train. We've been running with shared_buffers above 8GB for years now and suggestions to reduce to those level have seemed crazy to me. For those wondering, our experience is it's primarily useful to do that on systems with large amounts of ram (think 128GB at a minimum) that also have corresponding high speed disk i/o (so that large buffer amounts don't end up causing a bottleneck).ReplyDelete
Also, what Robert mentions about "working set" is also very important; it's not as simple as "does my entire database fit into ram" (although if it does, you are likely in good shape), but you need to know how much disk i/o you're going to be needing (not just within the database. imagine if you run a tar command to produce a base backup what that does to your filesystem cache. (and yeah, you should use omnipitr instead))
On a final note wrt wal_buffers, while Greg wasn't wrong per say with his advice, I don't consider it good :-) As you have noted, people have gotten improved performance by going above the autotuned amounts, and so I keep telling people for 9.1 to do the same thing I have been telling them to do for years. Set your wal_buffers to 16MB and forget about it. Even a small amazon instance has like 1.7GB iirc, so quibbling over 16MB is just silly.
All IMHO :-)
Great article, and great comments - thanks. Out of curiosity, since you said frequent checkpoints might justify even higher wal_buffers, what do you consider "frequent checkpoints"? We get the occational 'checkpoints are occurring too frequently' messages (21 seconds apart) in the log files during big jobs every few months, but not usually. In general, how do I observe how often checkpoints are occurring? We have a big database, big server.ReplyDelete
I have a major client who recently did some benchmarking in this area. I can confirm that cranking shared_buffers way up so the database fits in RAM was the winner. In this case, the machine has 512GB of RAM, and we set shared_buffers to 128GB, i.e. 25% of RAM, , which comfortably accommodates their database. Dialling it back caused a very easily observable degradation in performance. So I too don't believe the conventional wisdom.ReplyDelete
For a long time, I also belived in large shared buffers, but recently I did some tests in this area. I have one table about 50 milions of rows. I was shocked to see that disabling shared buffers (set them to 128kB) is better than have them set about 1 GB. I tested only in case of read-only SELECT queries. Seems that OS-caching performs better than shared buffers caching (in terms of SELECT queries performance). This was very strange to me but difference was obvious (about 20% faster when shared buffers was tiny).Delete
Is there any version guidance? May be there is a change on new versions... Great article.ReplyDelete
Hi Robert, just a few questions :ReplyDelete
When you say :
"then there are a number of reasons to keep it relatively small, in line with the formula above. If the working set is larger than shared_buffers, most buffer accesses will miss the database buffer cache and fault a page in from the OS; clearly, it makes no sense to allocate a large amount of memory to a cache with a low hit rate. "
it sounds like there's no intelligence implemented in shared buffer cache management. I have not found any document explaining how buffers are managed. But i can't imagine there's not some kind of promoting/evicting buffer feature based on how often a specific bloc is requested. And in that case, if it's well implemented, even without being able to fit the entire dataset in memory, then we can expect to have a good hit ratio.
So first question is basically : Is there such kind of promoting/evicting buffers based on "popularity" ?
Second question : How buffers are organized ? chained lists ? If so does the number of chained lists increases according to the shared buffer size ? Is there some kind of limit where the max number of chained lists will not increase anymore but the chain list size increases (thus it takes more time to scan it and possibly leads to mutex being acquired for a longer time when scanning for a free buffer) ?
We keep track of the rate of access to blocks that are currently in cache, but not ones that have been previously paged out and brought back in again. I'm not sure I understand your second question; there are three locks (one lwlock and two spinlocks) per buffer, each serving a different purpose; and buffer lookups are done via a hash table with 16 partitions each having its own lock. We do use a linked list to track completely unused buffers, but except after a relation or database drop or at system startup, it's normally empty.Delete
You got it regarding my second question. Basically that's what i'm looking for : Detailed information about how buffer management is handled in postgresql. I know the code is available, but if one is not used to reading complex code, it's quite difficult to get a clear picture about how all this is working. There's some kind of paradox when it comes to open source software. Oracle is not open source, but the amount of documentation about internals and how to take advantage of this knowledge to fix performance issue is really much more important than what can be found about postgresql. I'm also a bit surprised that all the words we hear about shared_buffers setting is just based on rules of thumb and guesses or feedbacks (which are probably true but that can't be generalized). Any workload is different from another. When a specific workload might take advantage of larger shared_buffer, another might not. For sure, one can, for different workload try and test the completion time for different values of shared_buffer. But how can one figure out why increasing shared_buffer does not help for very specific conditions ? The database shared buffers are obviously configured for concurrent access. Concurrent access possibly means bottlenecks. A solid instrumentation framework as well as detailed explanations about how postgresql kernel works would drastically help to decide how to take advantage of large shared_buffers, and if not why, and possibly how to workaround it. I like postgresql, but when practicing it makes me feel like i'm back to the old Oracle 7 days, when tuning and performance troubleshooting was all about guessing, which leads to the well known "troubleshooting compulsive disorder", i.e. : No systematic approach but trying to fix without understanding, and preying it will fix the issue.
Robert, regarding these parameters (shared_buffers, wal_buffers) and also work_mem: if we use huge pages on Linux (huge_pages = on, PostgreSQL 9.6), will all of these parameters use the huge memory allocated previously in Linux, or just the 'shared_buffers' parameter? Thanks.ReplyDelete
Both shared_buffers and wal_buffers are part of the shared memory segment, so if huge pages are in use then they will cover both of those things. work_mem, however, is allocated from the private memory of individual processes, so the huge_pages setting won't affect what happens there.Delete