Parallel query for PostgreSQL - for which this is the first step - has been a long-time dream of mine, and I have been working on it for several years, starting really in the PostgreSQL 9.4 release cycle, where I added dynamic background workers and dynamic shared memory; continuing through the PostgreSQL 9.5 release cycle, where I put in place a great deal of additional fundamental infrastructure for parallelism; and most recently today's commits. I'd like to tell you a little bit about today's commits, and what comes next.
But first, I'd like to give credit where credit is due. First, Amit Kapila has been a tremendous help in completing this project. Both Amit and I wrote large amounts of code that ended up being part of this feature, and that code is spread across many commits over the last several years. Both of us also write large amounts of code that did not end up being part of what got committed. Second, I'd like to thank Noah Misch, who helped me very much in the early stages of this project, when I was trying to get my heads around the problems that needed to be solved. Third, I'd like to thank the entire PostgreSQL community and in particular all of the people who helped review and test patches, suggested improvements, and in many other ways made this possible.
Just as importantly, however, I'd like to thank EnterpriseDB. Without the support of EnterpriseDB management, first Tom Kincaid and more recently Marc Linster, among others, it would not have been possible for me to devote the amount of my time and Amit's time to this project that was necessary to make it a success. Equally, without the support of my team at EnterpriseDB, who have patiently covered for me in many ways whenever I was too busy with this work to handle other issues, this project could not have gotten done. Thanks to all.
OK, time for a demo:
rhaas=# \timing
Timing is on.
rhaas=# select * from pgbench_accounts where filler like '%a%';
aid | bid | abalance | filler
-----+-----+----------+--------
(0 rows)
Time: 743.061 ms
rhaas=# set max_parallel_degree = 4;
SET
Time: 0.270 ms
rhaas=# select * from pgbench_accounts where filler like '%a%';
aid | bid | abalance | filler
-----+-----+----------+-------
(0 rows)
Time: 213.412 ms
Time: 743.061 ms
rhaas=# set max_parallel_degree = 4;
SET
Time: 0.270 ms
rhaas=# select * from pgbench_accounts where filler like '%a%';
aid | bid | abalance | filler
-----+-----+----------+-------
(0 rows)
Time: 213.412 ms
Here's how the plan looks:
rhaas=# explain (costs off) select * from pgbench_accounts where filler like '%a%';
QUERY PLAN
---------------------------------------------
Gather
Number of Workers: 4
-> Parallel Seq Scan on pgbench_accounts
Filter: (filler ~~ '%a%'::text)
(4 rows)
The Gather node launches a number of workers, and those workers all execute the subplan in parallel. Because the subplan is a Parallel Seq Scan rather than an ordinary Seq Scan, the workers coordinate with each other so that each block in the relation is scanned just once. Each worker therefore produces on a subset of the final result set, and the Gather node collects all of those results.
One rather enormous limitation of the current feature is that we only generate Gather nodes immediately on top of Parallel Seq Scan nodes. This means that this feature doesn't currently work for inheritance hierarchies (which are used to implement partitioned tables) because there would be an Append node in between. Nor is it possible to push a join down into the workers at present. The executor infrastructure is capable of running plans of either type, but the planner is currently too stupid to generate them. This is something I'm hoping to fix before we run out of time in the 9.6 release cycle, but we'll see how that goes. With things as they are, about the only case that benefits from this feature is a sequential scan of a table that cannot be index-accelerated but can be made faster by having multiple workers test the filter condition in parallel. Pushing joins beneath the Gather node would make this much more widely applicable.
Also, my experience so far is that adding a few workers tends to help a lot, but the benefits do not scale very well to a large number of workers. More investigation is needed to figure out why this is happening and how to improve on it. As you can see, even a few workers can improve performance quite a bit, so this isn't quite so critical to address as the previous limitation. However, it would be nice to improve it as much as we can; CPU counts are growing all the time!
Finally, I'd like to note that there are still a number of loose ends that need to be tied up before we can really call this feature, even in basic form, totally done. There are, likely, also bugs. Testing is very much appreciated, and please report issues you find to pgsql-hackers (at) postgresql.org. Thanks.
 
9.6 NO *CRY**CRY**CRY*
ReplyDelete+10;
ReplyDeleteFinally something user-visible! Thank you guys.
ReplyDeleteI'm crying out for parallel query so much I'm tempted to go back helping to test. Thanks much needed feature.
ReplyDeletePlease, please help test.
DeleteDo it!
DeleteBloody fantastic. Will this also work with partitioned tables?
ReplyDeleteNot yet. That's on my short list, but { long discussion of why that's not trivial omitted }.
Deletecould you please talk about any relationship between max_parallel_degree and effective_io_concurrency?
ReplyDeleteThere is basically no relation between those things. effective_io_concurrency controls prefetch depth for bitmap index scans. max_parallel_degree limits the number of workers per Gather node.
DeleteDoes that work with aggregates?
ReplyDeleteA parallel sequential scan can feed an aggregate, but it is not yet possible to do the aggregation itself in parallel. Obviously, that's a limitation I'd like to lift at some point.
DeleteThank you!
ReplyDeleteIf i set max_parallel_degree = 4 so have 4 workers. Are they using separate cores of my machine?
ReplyDeleteThat's up to the operating system scheduler, but presumably yes. If the total number of processes trying to run on the machine is greater than the number of CPU cores, then they can't all run simultaneously, so in that case you could potentially end up with workers that aren't actually running, which would be just a waste of resources. But assuming that's not the case, the system will presumably run all the workers at once.
DeleteThanks for explanation :)
DeleteCongratulations Robert! Incredible determination by you Noah, Amit and the entire PostgreSQL community. Very happy for all of you!! I will definitely download and play with it in the near future.
ReplyDeleteThanks, Tom.
DeleteTom, let me be the first commenter to say thanks to EnterpriseDB, even great free software coders need to eat.
DeleteHow do we deal with varying I/O overhead?
ReplyDeleteE.g. on a 16 core box I'd love to have 10+ workers doing parallel sequential scan over data that is already in memory.
But in case the data must be read from storage I want to throttle workers count so that they do not compete for I/O. Micromanagement such details for each query seems to be tedious.
Is there any chance PostgreSQL would figure out best plan (vary workers count) automatically?
There is a ton of interesting research work to do be done to solve problems like this. Right now, it's very much unsophisticated. The emphasis up until now has been on getting it to work at all - from here, we need to shift to more of a performance focus, but ironing out complicated problems like this will take time and, importantly, reports of real-world experiences from the field.
DeleteI agree with anonymous and I hope you get time to make some progress on optimizing the number of workers for a given query. The `max_parallel_degree` setting should be a maximum number of workers as the name implies, not a fixed number. Presume this is why you gave it this name?
Deleteseems you guys are trying to match citus with their efforts in PSS and pushing down queries to the worker nodes.
ReplyDeleteare you guys looking to match their features 100% or which features can't you do as of 9.6 or are looking to do with 9.8?
either way, I applaud this as scaling out postgres is a much needed feature!
Parallel query is more of a vertical scalability solution than a horizontal scalability solution. The goal is to effectively use multiple CPUs on a single machine, not multiple machines as CitusDB does. Using multiple machines would be a cool project, too, but it wouldn't be this project, even though many of the query optimization problems are related.
DeleteFor horizontal scalabilityI would rather refer to Postgres-X2, and esp XL which has similar Scatter-Gather approach with the optimization that each worker node can push down the work to other worker nodes, so it is more like a mesh/grid and not a tree. I do hope XL, X2 and core will be merged at some point. Robert, I missed the last eu pgconf unconference - I think it was supposed to be discussed there?
DeleteBig thanks for your work of course!
@Robert Haas. CitusDB does vertical scalability out of the box for the worker nodes. So if you had 4 x 4 core boxes. You have at your disposal 16 cores working in unison for queries.
DeleteBut for open source purposes, having PSS available. I cannot wait :)
> adding a few workers tends to help a lot, but the benefits do not scale very well to a large number of workers.
ReplyDeleteThis is because with a large number of concurrent readings you start to get IO waiting. Basically, for a data stored on a single disk, 4 workers would be optimal if: (time of reading a block of data from disk) == 3 x (time of processing a block of data).
But isn't a DB sitting on a single (rotational) disk rather small-potatoes these days?
DeleteYes it is, but the OS is actually a fairly big bottleneck on IO these days, and until that overhead is solved we can't really take advantage of our arrays of ram disks and SSDs
DeleteCongratulations PostgreSQL ! Thanks a lot Robert, Amit and Noah !
ReplyDeleteIs it possible to use parallel sequential scan in vaccuum?
ReplyDeleteThanks to make it in!! With all you guys I can hope great future to PostgreSQL and In coming days PostgreSQL will be best in performance among all top RDBMS.
ReplyDeletethis is really great feature!
ReplyDeletedo you think it makes sense to put in a setting on total number of parallel workers?
my concern is that if each query can spawn 4 threads and we get 100 connections - we may end up with 400 threads..
it would be nice to have a gloabal limit on PQs.
then we would not end up like work_mem which can eat all your memory..
thanks
max_worker_processes limits the total number of worker processes that can exist in the system at any one time for any purpose. Parallel worker processes count against this limit, as do any other background worker processes that are running.
DeleteWhat will happen to a query requesting parallel worker processes if the limit is reached?
DeleteI just tested that myself on a fresh build :)
DeleteIt seems it automatically picks the highest available number, in my case 5 which seems to be max_worker_processes - autovacuum_max_workers (8 - 3 on a default install).
Fantastic!
ReplyDeleteVibhor Kumar shared the news yesterday, that today made HackerNews home page. This is a huge game changer! I'm pretty sure it also impacts development and testing complexity. I hope it evolves into a horizontal scalability solution sometime soon. Thanks everyone!
ReplyDeletevery nice!
ReplyDeleteData point: for my work improvement partitioning support is more important than supporting aggregations. Also indexonly scan is pretty important, too.
I suggest you to use some kind of "Like" or "+1", maybe there will be way more people here just to say: thank you guys!
ReplyDeleteIs it possible to enable this on a per-query basis? Or is the threading overhead small enough to not make a difference when you're dealing with a small dataset?
ReplyDeleteAs shown in the example above it can be set inside a session.
Deletehow large can one create the dynamic shared memory which is used internally for communication and transferring of tuples form workers to main process ? And, does increasing them give better performance ?
ReplyDeletepostgres=# set parallel_tuple_cost=0.00001;
ReplyDeleteSET
postgres=# set max_parallel_degree=2;
SET
postgres=# explain (verbose,analyze,costs,timing,buffers) select count(*) from test;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------
Aggregate (cost=0.00..0.01 rows=1 width=0) (actual time=0.009..0.009 rows=1 loops=1)
Output: count(*)
-> Seq Scan on public.test (cost=0.00..0.00 rows=1 width=0) (actual time=0.003..0.003 rows=0 loops=1)
Output: id
Planning time: 2.815 ms
Execution time: 0.097 ms
(6 rows)
postgres=# insert into test select generate_series(1,10000000);
INSERT 0 10000000
postgres=# explain (verbose,analyze,costs,timing,buffers) select count(*) from test;
WARNING: terminating connection because of crash of another server process
DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT: In a moment you should be able to reconnect to the database and repeat your command.
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
!> \q
OOM occure.
use less data
ReplyDeletepostgres=# insert into test select generate_series(1,1000000);
INSERT 0 1000000
postgres=# explain (verbose,analyze,costs,timing,buffers) select count(*) from test;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=9611.49..9611.50 rows=1 width=0) (actual time=1613.600..1613.600 rows=1 loops=1)
Output: count(*)
Buffers: shared hit=4692
-> Gather (cost=1000.00..7011.38 rows=1040045 width=0) (actual time=6.200..1470.014 rows=1040000 loops=1)
Output: id
Number of Workers: 2
Buffers: shared hit=4692
-> Parallel Seq Scan on public.test (cost=0.00..6000.98 rows=1040045 width=0) (actual time=0.023..489.118 rows=1774500 loops=1)
Output: id
Buffers: shared hit=7852
Planning time: 0.081 ms
Execution time: 1614.764 ms
(12 rows)
postgres=# set max_parallel_degree=0;
SET
postgres=# explain (verbose,analyze,costs,timing,buffers) select count(*) from test;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=17602.56..17602.57 rows=1 width=0) (actual time=171.148..171.148 rows=1 loops=1)
Output: count(*)
Buffers: shared hit=4602
-> Seq Scan on public.test (cost=0.00..15002.45 rows=1040045 width=0) (actual time=0.033..87.773 rows=1040000 loops=1)
Output: id
Buffers: shared hit=4602
Planning time: 0.101 ms
Execution time: 171.208 ms
(8 rows)
postgres=# \dt+
ReplyDeleteList of relations
Schema | Name | Type | Owner | Size | Description
--------+------+-------+----------+-------+-------------
public | test | table | postgres | 36 MB |
(1 row)
how large can one create the dynamic shared memory which is used internally for communication and transferring of tuples form workers to main process ? And, does increasing them give better performance ?
ReplyDeletewhat a boss; outstanding!!! Where do I donate :)?
ReplyDeleteThat's awesome, thank you for your work!
ReplyDeleteExcellent news. You made _The Reg_
ReplyDeletehttp://www.theregister.co.uk/2015/11/13/postgresql_learns_to_walk_and_chew_gum/
Oh god I'm so excited. Great work!
ReplyDeleteDo you have plans to make this work for index scans?
ReplyDeleteRobert and everyone who helped, Thank you!
ReplyDeleteThe need for this has been around for a long time, I am so happy to see progress.
i'vet max_parallel_degree = 4, but only 2 workers running, why? (i have 4 cores)
ReplyDeletepostgres=# explain analyse select * from foo where val = .4678;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
Gather (cost=1000.00..7960.90 rows=1 width=12) (actual time=79.873..79.873 rows=0 loops=1)
Number of Workers: 2
-> Parallel Seq Scan on foo (cost=0.00..6960.80 rows=1 width=12) (actual time=63.170..359.470 rows=0 loops=1)
Filter: (val = '0.4678'::double precision)
Rows Removed by Filter: 1633820
Planning time: 0.090 ms
Execution time: 81.180 ms
got the answer: it depends on the number of rows within the table ;-)
DeleteCongratz! =)
ReplyDeleteThe ability to "nice" queries (without lock starvation), so that analysts or other time-insensitive queries could be made not to starve important OLTP work of resources
ReplyDeleteHow can we implement parallel queries in 9.4 version. I tried finding all but all blog refer to 9.6.
ReplyDeleteActually we will be scaling more than 1000000 users so we need to improve performance for concurrent users as the response time is almost shoot to 15 s per request, which gradually goes to time out.
Can you please advice?
Parallel query isn't supported on PostgreSQL 9.4. You must upgrade to PostgreSQL 9.6 or, better yet, 10. See http://rhaas.blogspot.com/2017/04/new-features-coming-in-postgresql-10.html
Delete