Robert Haas: Plans for Partitioning in v11

The reaction to the new table partitioning feature in PostgreSQL 10 has been overwhelmingly positive, but a number of people have already astutely observed that there is plenty of room for improvement. PostgreSQL 10, already at beta3, will go GA some time in the next month or two, and presumably once it does, more people will try out the new feature and find things they would like to see improved. The good news is that there is already a substantial list of people, many of them my colleagues at EnterpriseDB, working on various improvements. The following list of projects is complete to my knowledge, but of course there may be projects of which I'm unaware especially at companies where I don't work. If you're working on something else, please leave a comment about it!

[ Update 2018-05-10: Most of the projects mentioned below got committed to PostgreSQL 11 and are expected to appear in that release, barring late reverts. I've added updates below to cases where the feature didn't get committed, or where the project got taken over by someone other than the original author. ]

Amit Langote (NTT), the original author of the table partitioning feature, has just recently posted an important series of patches to expand inheritance hierarchies in bound order (hereinafter, EIBO). There are only minimal directly user-visible effects of this change: EXPLAIN output may be ordered differently, and results may show up in a different order (unless constrained using an ORDER BY clause). However, this change will make it much easier to implement various other desirable partitioning features, as will become clear from the discussion that follows. One such change is to allow bound-based partitioning pruning, a feature for which I understand that Amit is planning to post a patch in the near future. Currently, partitions are eliminated one-by-one by comparing the partition constraint (and any other constraints) against the query to determine which partitions can't possibly contain any matching rows. This means that the time to prune partitions is proportional to the number of partitions. With this feature, we'll be able to pick the exact partitions we want to keep and discard the rest in bulk, making this (hopefully) a lot faster.

Amit has also submitted a number of other patches around partitioning. One of these will re-enable INSERT .. ON CONFLICT DO NOTHING for partitioned tables; that was briefly enabled during v10 development but didn't end up enabled in the final version. Another, to which Ashutosh Bapat also contributed, will improve \d+ output for partitioned tables.

Beena Emerson (EnterpriseDB) is currently developing a prototype patch for execution-time partition pruning. Currently, partitions are pruned only at plan time, which means that when the value of the partitioning key isn't known until execution time, no pruning is possible. For example, consider a nested-loop join between two partitioned tables where the inner table is index-scanned using a value taken from the outer table; suppose the indexed column is also the partition key. Right now, we'll index-scan every partition for matches, even though there's only one partition that could conceivably contain a match. Index scans returning no rows are pretty fast, but skipping them altogether should be a lot faster, especially when there are many partitions. Prepared queries using generic plans will also benefit; in such cases, the planner doesn't know what value will be supplied and therefore can't prune, but at execution time the value will be known and pruning will be possible. Since the idea is to make run-time partition pruning bound-based, it will require EIBO. [ David Rowley took over this work and Álvaro Herrera committed it. ]

Amit Khandekar (EnterpriseDB) has written a patch to allow updates that move rows between partitions, which I think will be an extremely popular feature. Essentially, such updates will be broken into a DELETE operation followed by an INSERT operation. Unfortunately, at least at the READ COMMITTED isolation level, this will create new kinds of serialization anomalies that don't exist at present. This can't be completely prevented because the EvalPlanQual mechanism which PostgreSQL uses in such cases depends on each tuple in an update chain pointing to the next-newer version of that tuple using a block number and offset within the block, and there's no space to store the ID of another relation entirely in the on-disk format. However, what we can do is detect the problem and throw an error; one of my colleagues will tackle that problem if nobody else beats us to it. This patch also depends on EIBO getting committed.

Amul Sul (EnterpriseDB) has written a patch to allow hash partitioning. Initial discussion of this patch revealed that the PostgreSQL community would like to see a number of other things done first. For example, bound-based partition pruning should be committed first, because constraint exclusion won't work for hash partition constraints, and we want hash partitioning to have working partition pruning. Another problem is that the hash functions which PostgreSQL uses might not always be completely portable. Integers, for example, hash differently on big-endian and little-endian platforms, and strings might hash differently under different encodings. To help work around this problem, Rushabh Lathia (EnterpriseDB) has written a patch for pg_dump that makes it possible to dump all the partitions in such a way that the data will be reloaded into the root of the partitioning hierarchy. This will allow tuple-routing to restore that row into the partition to which it now belongs rather than the one to which it belonged before. This isn't only useful once we have hash partitioning: it could also happen with range partitioning on a text column if the collation definitions are different on the source and target systems. The problem will probably be more likely to arise with hash partitioning, though.

There are a few other problems, too. Right now, the proposed patch for hash partitioning uses the same hash functions that we use for other purposes, such as hash indexes, hash joins, and hash aggregates. However, it would be better to use a different function, because otherwise a hash index built on a hash partition will have mostly-empty buckets; and similar problems can occur in other cases where hash functions are used. If we're installing new hash functions anyway, we might want to consider making them faster or more portable, or define operator families that are consistent across more types. Jeff Davis (whose corporate affiliation on PostgreSQL's contributors page appears to be out of date) was the first to point out these problems, and may do some work on them. I am thinking about them as well. [ This didn't get addressed for v11, but was judged to be non-critical. ]

Another frequently-requested feature is default partitioning. Rahila Syed (EnterpriseDB) and Jeevan Ladhe (EnterpriseDB) have written a patch to implement default partitioning for list-partitioned tables, and Beena Emerson has written a patch which extends this to range-partitioned tables. A default partition accepts all rows that don't map to any other partition. Note that a range-partitioned table may have multiple ranges not covered by any other partition, and a default partition will accept rows that fall into any of those ranges; this feature couldn't exist in the form proposed here on databases that don't allow "gaps" in the key space not covered by any partition.

Maksim Milyutin (Postgres Pro) has started to tackle the important problem of defining indexes on partitioned tables. In v10, while individual partitions can have indexes, there is no such thing as an index on the partitioned table itself. However, it would be convenient for users to be able to define an index on the partitioned table and have it cascade down to all the partitions, creating a matching index on each one. That way, new partitions would automatically end up with the correct indexes, and generally a lot of typing would be saved. [ Álvaro Herrera took over this work and committed it. ]

While foreign tables can be partitions in v10, they are currently subject to an unfortunate limitation: tuple inserts that route to a foreign table will fail, since this feature has not been implemented yet. Etsuro Fujita (NTT) is working to address this limitation by revising code originally written by Amit Langote.

Ashutosh Bapat (EnterpriseDB) is continuing to work on partition-wise join, which was originally submitted for PostgreSQL 10 but wasn't ready to commit as of the feature freeze date. This feature allows a join between two compatibly-partitioned tables to be done partition-by-partition, which can sometimes be a lot faster. The initial version of this feature is likely to leave out a lot of important things, such as optimizations to save memory, and being able to interleave partitioned and unpartitioned joins. Both Ashutosh and I would like to see such optimizations included eventually; but even without them, this is a very powerful feature. This feature doesn't have EIBO has a hard prerequisite, but it will be faster and cleaner if EIBO is done first.

Ashutosh also has a patch for an optimization of the existing constraint exclusion logic which could speed up partition pruning when no partitions at all need to be scanned.

Antonin Houska (Cybertec) and Jeevan Chalke (EnterpriseDB) have both done work on partition-wise aggregate, which means aggregating the partitions of a table one-by-one instead of aggregating all at once. If the partition key matches the grouping key, every partition will produce a discrete set of groups, and this technique should almost always be a win. If not, it can still be used by leveraging the multi-step aggregation infrastructure that is currently used for parallel aggregate: first, perform a PartialAggregate step on each partition, and then collect the results and perform a FinalizeAggregate. This will tend to win when individual partitions contain many values with the same grouping key, but it will be inferior to the straightforward approach when each partition contains, say, only a single value in each output group. [ Antonin's work was not committed to v11, but will hopefully be back for a future release; it has less overlap with Jeevan's work than I thought when I wrote this post originally. ]

Ronan Dunklau (Dalibo) and Julien Rouhaud (Dalibo) have another interesting idea for which they've got a draft patch: if we need ordered data from a partitioned table, and the ordering matches the partitioning scheme, we could skip the MergeAppend that is normally needed to produce ordered data in such circumstances. Instead, we could just scan the partitions in the correct order and do a simple Append of the results. I suspect that EIBO and maybe partition-wise join will make this easier to implement. [ Not done for v11. ]

While those are all of the things I know about right now, I think it's nearly certain that many other patches for partitioning improvements will show up as we get deeper into the release cycle; remember, all of these patches showed up before v11 development even started (that's due to happen any hour now). I'm quite excited to see how partitioning gets better in v11, v12, and beyond, and I expect that I and my colleagues at EnterpriseDB will be continuing to put a lot of effort in this for quite a while. We'll also be doing our best to help review patches from others in the area of partitioning, and suggestions and feedback are most welcome.

Robert Haas

Monday, August 14, 2017

Plans for Partitioning in v11

4 comments: