maintenance_work_mem controls the amount of memory that the system will allocate in two different cases which are basically unrelated to each other. First, it controls the maximum amount of memory that the system will use when building an index. In order to build a btree index, the input data must be sorted, and if the data to be sorted does not fit into maintenance_work_mem, it will be spilled to disk. At least in my experience, this does not tend to be a big problem. Index builds are rare in many installations, are generally initiated manually, and you can always raise the value just for your local session by issuing the SET command before CREATE INDEX if you have determined that it helps performance in your environment. The second thing maintenance_work_mem does is control the amount of memory that vacuum will allocate to store the TIDs of dead index tuples. It can be a little bit difficult to understand how to tune the amount of memory that is needed for this purpose.
In brief, vacuum scans the table and collects the TIDs of all the dead tuples. Then it scans the indexes to remove all entries for those TIDs from the indexes. If it runs out of memory to store the TIDs of dead tuples before it scans the whole table, it will stop the table scan, scan the indexes, discard the accumulate list of TIDs, and then resume scanning the table from the point where it left off. If there is even one more dead tuple encountered after that point, it will end up scanning all of the indexes a second time. On a large table, scanning the indexes more than once is really expensive, especially if the table has many of them. If your maintenance_work_mem setting is drastically too low, you could even get more than two index scans. That would stink. So, you want to set maintenance_work_mem high enough that you never get multiple index scans.
On the other hand, you also don't want to run out of memory. Note that the system will run as many autovacuum workers at a time as specified by autovacuum_max_workers, and each one will allocate an amount of memory equal to maintenance_work_mem. Currently, it always allocates the full amount; hopefully, that will be fixed in the future. So, if you have the default setting of autovacuum_max_workers = 3, and you set a value like maintenance_work_mem = 10GB, you can and very often will have 30GB of RAM dedicated just to autovacuum, not counting any foreground VACUUM or CREATE INDEX operations that you may run. You can very easily run a small system out of memory this way, and even a large system may have some performance problems.
Fortunately, it's possible to make a rough estimate of how much memory you are likely to actually need. By default, vacuum kicks in when the number of dead tuples is estimated to be about 20% of the size of the table (see the autovacuum_vacuum_scale_factor parameter). Each dead tuple will require 6 bytes of maintenance_work_mem. So, if you know about how many rows your largest table contains, and you haven't changed the autovacuum scale factor, you can just multiply that number by 1.2, add something for insurance, and set maintenance_work_mem to the resulting number of bytes. For instance, if your largest table contains 100 million tuples, 100 million * 1.2 = 120 MB, so maybe configure 192MB or even 256MB to be on the safe side. Note that in the case of a partitioned table, each partition will be vacuumed separately, so what matters is the individual partition with the largest number of tuples, not the total number of tuples in the entire partitioned table.
Sometimes it's easier to estimate based on the on-disk size of your largest relation rather than the number of row it contains. To do this, you'll have to make some estimate of how many tuples you have per 8kB page. For example, if you estimate that your average row is taking up about 128 bytes on disk, which is not an unreasonable estimate in general but will obviously vary based on what data you have stored in your table, then you have about 64 tuples per 8kB block, which works out to about 8192 tuples per 1MB of table size. So if your largest table is 10GB, that would be about 10240 megabytes * 8192 tuples/MB * 1.2 bytes/tuple = ~96 MB of maintenance_work_mem; perhaps configure 128MB or so to be conservative. You can see that you need a fairly big table to require more than the 1GB I recommended in the first paragraph, and hopefully this explanation makes it clear why this is so.
Bear in mind that this is not an exact science. Sometimes autovacuum may not start the instance the table hits the 20% threshold, for example because it's busy processing some other table. At other times vacuum may run before the table hits the 20% threshold, for example because the table needs to be vacuumed for wraparound, or because vacuum has been run manually. And of course you have to think about whether the calculated value is going to run your system out of memory. The calculation above is intended to give you a ballpark figure, not a precise value.
If you do have very large tables, one thing you can do to reduce the amount of maintenance_work_mem that will be required is to reduce autovacuum_vacuum_scale_factor to a smaller value. Instead of vacuuming the table when the number of dead rows is 20% of the table size, you might want to vacuum the table when the number of dead rows is 1% of the table size, or even 0.1% of the table size. This can be configured on a per-table basis, which is good, because you likely only want to reduce it for your very large tables. This might seem like a strange recommendation, since vacuuming the table more frequently and scanning the indexes once each time doesn't necessarily save any I/O vs. scanning it less frequently and scanning the indexes multiple times each time. However, that ignores the impact on your foreground tasks. Especially on very large tables, more frequent vacuuming tends to result in substantial improvements to query performance. If you only vacuum a 20MB table when it has 20MB * 20% = 4MB of dead tuples, that's OK, because 4MB isn't that much. If you only vacuum a 200GB table when it has 200GB * 20% = 40GB of dead tuples, that tends to be fairly painful, because every sequential scan of the table has to read and skip 40GB of extra stuff.
In addition to setting autovacuum_work_mem appropriately, you can also monitor to see whether you are getting the desired results. I strongly recommend setting log_autovacuum_min_duration to a non-default value, because it will give you a lot of insight into what autovacuum is doing, which can be very useful for troubleshooting not only problems with maintenance_work_mem but practically any autovacuum problem you may be having. My favorite value for this parameter is 0, which logs every action performed by autovacuum. If you have a large number of tables, though, that may generate an unreasonable number of log entries. In that case, choose some value that will cause the autovacuum operations on your larger tables to be logged without overwhelming you with messages. Each time an autovacuum operation is logged as a result of this setting, the log message will end with something like "index scans: 1". You might sometimes see "index scans: 0". If you see "index scans: 2" or any larger value, you are running out of maintenance_work_mem and should consider increasing the value.
I have a vague feeling that PostgreSQL could be smarter than it is in this area. In the example calculation above, the amount of maintenenance_work_mem that is required is about 1/100th of the size of the largest table. For a significant percentage of users, though certainly not all, the size of the active portion of the database is no more than a few times larger than physical memory. Such users would be quite happy if autovacuum just allocated a small amount of memory initially and grew that amount as necessary to avoid multiple index scans, with no upper limit at all. Despite the lack of an upper limit on memory consumption, they would never actually run out of memory, because only the amount of memory actually required would get allocated, and that's probably not too large. On the other hand, in the current system, because the configured amount is always allocated, they can quite easily run out of memory by setting the value higher than what is really required. On the other hand, a strategy of allocating with no upper limit would be terrible for people running really large PostgreSQL databases on modest hardware. For those people, the current system is superior. So, it's not entirely clear how to improve on what we have today.
Thanks for sharing !
While investigating on vacuum, a college of mine has found that if "relblocks * LAZY_ALLOC_TUPLES * 6" is smaller that maintenance_work_mem, this value is used instead. We confirmed it with valgrind later on.
Did we miss something ?
* LAZY_ALLOC_TUPLES was 291 on our setup
I don't know of a constant called LAZY_ALLOC_TUPLES, but yes there's a limit that we won't allocate more memory than could ever possibly needed. If every TID on every page were dead, you'd have 291 dead TIDs per page and 6 bytes per TID.Delete
"Each dead tuple will require 6 bytes of maintenance_work_mem"
Where do i find these information in the official documentation, or how do i verify it by myself?
6 bytes is the size of an ItemPointerData. The relevant code has been moved around a few times, but in the master branch as of this writing it's in vac_max_items_to_alloc_size.Delete