Robert Haas: Parallel Sequential Scan is Committed!

Wednesday, November 11, 2015

Parallel Sequential Scan is Committed!

I previously suggested that we might be able to get parallel sequential scan committed to PostgreSQL 9.5. That did not happen. However, I'm pleased to report that I've just committed the first version of parallel sequential scan to PostgreSQL's master branch, with a view toward having it included in the upcoming PostgreSQL 9.6 release.

Parallel query for PostgreSQL - for which this is the first step - has been a long-time dream of mine, and I have been working on it for several years, starting really in the PostgreSQL 9.4 release cycle, where I added dynamic background workers and dynamic shared memory; continuing through the PostgreSQL 9.5 release cycle, where I put in place a great deal of additional fundamental infrastructure for parallelism; and most recently today's commits. I'd like to tell you a little bit about today's commits, and what comes next.

But first, I'd like to give credit where credit is due. First, Amit Kapila has been a tremendous help in completing this project. Both Amit and I wrote large amounts of code that ended up being part of this feature, and that code is spread across many commits over the last several years. Both of us also write large amounts of code that did not end up being part of what got committed. Second, I'd like to thank Noah Misch, who helped me very much in the early stages of this project, when I was trying to get my heads around the problems that needed to be solved. Third, I'd like to thank the entire PostgreSQL community and in particular all of the people who helped review and test patches, suggested improvements, and in many other ways made this possible.

Just as importantly, however, I'd like to thank EnterpriseDB. Without the support of EnterpriseDB management, first Tom Kincaid and more recently Marc Linster, among others, it would not have been possible for me to devote the amount of my time and Amit's time to this project that was necessary to make it a success. Equally, without the support of my team at EnterpriseDB, who have patiently covered for me in many ways whenever I was too busy with this work to handle other issues, this project could not have gotten done. Thanks to all.

OK, time for a demo:

rhaas=# \timing
Timing is on.
rhaas=# select * from pgbench_accounts where filler like '%a%';
aid | bid | abalance | filler
-----+-----+----------+--------

(0 rows)

Time: 743.061 ms
rhaas=# set max_parallel_degree = 4;
SET
Time: 0.270 ms
rhaas=# select * from pgbench_accounts where filler like '%a%';
aid | bid | abalance | filler
-----+-----+----------+--------
(0 rows)

Time: 213.412 ms

Here's how the plan looks:

rhaas=# explain (costs off) select * from pgbench_accounts where filler like '%a%';
                 QUERY PLAN
---------------------------------------------
Gather
   Number of Workers: 4
   -> Parallel Seq Scan on pgbench_accounts
         Filter: (filler ~~ '%a%'::text)
(4 rows)

The Gather node launches a number of workers, and those workers all execute the subplan in parallel. Because the subplan is a Parallel Seq Scan rather than an ordinary Seq Scan, the workers coordinate with each other so that each block in the relation is scanned just once. Each worker therefore produces on a subset of the final result set, and the Gather node collects all of those results.

One rather enormous limitation of the current feature is that we only generate Gather nodes immediately on top of Parallel Seq Scan nodes. This means that this feature doesn't currently work for inheritance hierarchies (which are used to implement partitioned tables) because there would be an Append node in between. Nor is it possible to push a join down into the workers at present. The executor infrastructure is capable of running plans of either type, but the planner is currently too stupid to generate them. This is something I'm hoping to fix before we run out of time in the 9.6 release cycle, but we'll see how that goes. With things as they are, about the only case that benefits from this feature is a sequential scan of a table that cannot be index-accelerated but can be made faster by having multiple workers test the filter condition in parallel. Pushing joins beneath the Gather node would make this much more widely applicable.

Also, my experience so far is that adding a few workers tends to help a lot, but the benefits do not scale very well to a large number of workers. More investigation is needed to figure out why this is happening and how to improve on it. As you can see, even a few workers can improve performance quite a bit, so this isn't quite so critical to address as the previous limitation. However, it would be nice to improve it as much as we can; CPU counts are growing all the time!

Finally, I'd like to note that there are still a number of loose ends that need to be tied up before we can really call this feature, even in basic form, totally done. There are, likely, also bugs. Testing is very much appreciated, and please report issues you find to pgsql-hackers (at) postgresql.org. Thanks.

59 comments:

a-altairNovember 11, 2015 11:36 AM
9.6 NO *CRY**CRY**CRY*
ReplyDelete
Replies
AnonymousNovember 11, 2015 12:17 PM
+10;
ReplyDelete
Replies
filpremNovember 11, 2015 1:42 PM
Finally something user-visible! Thank you guys.
ReplyDelete
Replies
AnonymousNovember 11, 2015 1:56 PM
I'm crying out for parallel query so much I'm tempted to go back helping to test. Thanks much needed feature.
ReplyDelete
Replies
AnonymousNovember 11, 2015 2:12 PM
Bloody fantastic. Will this also work with partitioned tables?
ReplyDelete
Replies
AnonymousNovember 11, 2015 2:43 PM
could you please talk about any relationship between max_parallel_degree and effective_io_concurrency?
ReplyDelete
Replies
AnonymousNovember 11, 2015 2:48 PM
Does that work with aggregates?
ReplyDelete
Replies
AnonymousNovember 11, 2015 3:11 PM
Thank you!
ReplyDelete
Replies
UnknownNovember 11, 2015 4:23 PM
If i set max_parallel_degree = 4 so have 4 workers. Are they using separate cores of my machine?
ReplyDelete
Replies
Tom KincaidNovember 11, 2015 6:26 PM
Congratulations Robert! Incredible determination by you Noah, Amit and the entire PostgreSQL community. Very happy for all of you!! I will definitely download and play with it in the near future.
ReplyDelete
Replies
AnonymousNovember 11, 2015 10:42 PM
How do we deal with varying I/O overhead?
E.g. on a 16 core box I'd love to have 10+ workers doing parallel sequential scan over data that is already in memory.
But in case the data must be read from storage I want to throttle workers count so that they do not compete for I/O. Micromanagement such details for each query seems to be tedious.
Is there any chance PostgreSQL would figure out best plan (vary workers count) automatically?
ReplyDelete
Replies
UnknownNovember 11, 2015 10:46 PM
seems you guys are trying to match citus with their efforts in PSS and pushing down queries to the worker nodes.

are you guys looking to match their features 100% or which features can't you do as of 9.6 or are looking to do with 9.8?

either way, I applaud this as scaling out postgres is a much needed feature!
ReplyDelete
Replies
AnonymousNovember 12, 2015 12:36 AM
> adding a few workers tends to help a lot, but the benefits do not scale very well to a large number of workers.
This is because with a large number of concurrent readings you start to get IO waiting. Basically, for a data stored on a single disk, 4 workers would be optimal if: (time of reading a block of data from disk) == 3 x (time of processing a block of data).
ReplyDelete
Replies
AnonymousNovember 12, 2015 2:30 AM
Congratulations PostgreSQL ! Thanks a lot Robert, Amit and Noah !
ReplyDelete
Replies
AnonymousNovember 12, 2015 2:40 AM
Is it possible to use parallel sequential scan in vaccuum?
ReplyDelete
Replies
AnonymousNovember 12, 2015 2:53 AM
Thanks to make it in!! With all you guys I can hope great future to PostgreSQL and In coming days PostgreSQL will be best in performance among all top RDBMS.
ReplyDelete
Replies
SlavaNovember 12, 2015 3:38 AM
this is really great feature!
do you think it makes sense to put in a setting on total number of parallel workers?
my concern is that if each query can spawn 4 threads and we get 100 connections - we may end up with 400 threads..
it would be nice to have a gloabal limit on PQs.
then we would not end up like work_mem which can eat all your memory..
thanks
ReplyDelete
Replies
JacekNovember 12, 2015 5:46 AM
Fantastic!
ReplyDelete
Replies
Vlad SanchezNovember 12, 2015 8:48 AM
Vibhor Kumar shared the news yesterday, that today made HackerNews home page. This is a huge game changer! I'm pretty sure it also impacts development and testing complexity. I hope it evolves into a horizontal scalability solution sometime soon. Thanks everyone!
ReplyDelete
Replies
Merlin MoncureNovember 12, 2015 10:01 AM
very nice!

Data point: for my work improvement partitioning support is more important than supporting aggregations. Also indexonly scan is pretty important, too.
ReplyDelete
Replies
Daniel CristianNovember 12, 2015 10:22 AM
I suggest you to use some kind of "Like" or "+1", maybe there will be way more people here just to say: thank you guys!
ReplyDelete
Replies
ToastBustersNovember 12, 2015 12:16 PM
Is it possible to enable this on a per-query basis? Or is the threading overhead small enough to not make a difference when you're dealing with a small dataset?
ReplyDelete
Replies
AnonymousNovember 12, 2015 10:43 PM
how large can one create the dynamic shared memory which is used internally for communication and transferring of tuples form workers to main process ? And, does increasing them give better performance ?
ReplyDelete
Replies
UnknownNovember 13, 2015 12:47 AM
postgres=# set parallel_tuple_cost=0.00001;
SET
postgres=# set max_parallel_degree=2;
SET
postgres=# explain (verbose,analyze,costs,timing,buffers) select count(*) from test;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------
Aggregate (cost=0.00..0.01 rows=1 width=0) (actual time=0.009..0.009 rows=1 loops=1)
Output: count(*)
-> Seq Scan on public.test (cost=0.00..0.00 rows=1 width=0) (actual time=0.003..0.003 rows=0 loops=1)
Output: id
Planning time: 2.815 ms
Execution time: 0.097 ms
(6 rows)

postgres=# insert into test select generate_series(1,10000000);
INSERT 0 10000000
postgres=# explain (verbose,analyze,costs,timing,buffers) select count(*) from test;
WARNING: terminating connection because of crash of another server process
DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT: In a moment you should be able to reconnect to the database and repeat your command.
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
!> \q
OOM occure.
ReplyDelete
Replies
UnknownNovember 13, 2015 12:47 AM
use less data
postgres=# insert into test select generate_series(1,1000000);
INSERT 0 1000000
postgres=# explain (verbose,analyze,costs,timing,buffers) select count(*) from test;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=9611.49..9611.50 rows=1 width=0) (actual time=1613.600..1613.600 rows=1 loops=1)
Output: count(*)
Buffers: shared hit=4692
-> Gather (cost=1000.00..7011.38 rows=1040045 width=0) (actual time=6.200..1470.014 rows=1040000 loops=1)
Output: id
Number of Workers: 2
Buffers: shared hit=4692
-> Parallel Seq Scan on public.test (cost=0.00..6000.98 rows=1040045 width=0) (actual time=0.023..489.118 rows=1774500 loops=1)
Output: id
Buffers: shared hit=7852
Planning time: 0.081 ms
Execution time: 1614.764 ms
(12 rows)

postgres=# set max_parallel_degree=0;
SET
postgres=# explain (verbose,analyze,costs,timing,buffers) select count(*) from test;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=17602.56..17602.57 rows=1 width=0) (actual time=171.148..171.148 rows=1 loops=1)
Output: count(*)
Buffers: shared hit=4602
-> Seq Scan on public.test (cost=0.00..15002.45 rows=1040045 width=0) (actual time=0.033..87.773 rows=1040000 loops=1)
Output: id
Buffers: shared hit=4602
Planning time: 0.101 ms
Execution time: 171.208 ms
(8 rows)
ReplyDelete
Replies
UnknownNovember 13, 2015 12:48 AM
postgres=# \dt+
List of relations
Schema | Name | Type | Owner | Size | Description
--------+------+-------+----------+-------+-------------
public | test | table | postgres | 36 MB |
(1 row)
ReplyDelete
Replies
AnonymousNovember 13, 2015 1:33 AM
how large can one create the dynamic shared memory which is used internally for communication and transferring of tuples form workers to main process ? And, does increasing them give better performance ?
ReplyDelete
Replies
UnknownNovember 13, 2015 1:38 AM
what a boss; outstanding!!! Where do I donate :)?
ReplyDelete
Replies
AnonymousNovember 13, 2015 5:53 AM
That's awesome, thank you for your work!
ReplyDelete
Replies
Mark CallaghanNovember 13, 2015 11:58 AM
Excellent news. You made _The Reg_
http://www.theregister.co.uk/2015/11/13/postgresql_learns_to_walk_and_chew_gum/
ReplyDelete
Replies
AnonymousNovember 13, 2015 1:38 PM
Oh god I'm so excited. Great work!
ReplyDelete
Replies
AnonymousNovember 13, 2015 2:42 PM
Do you have plans to make this work for index scans?
ReplyDelete
Replies
AnonymousNovember 13, 2015 4:05 PM
Robert and everyone who helped, Thank you!

The need for this has been around for a long time, I am so happy to see progress.

ReplyDelete
Replies
akretschmerNovember 15, 2015 6:34 AM
i'vet max_parallel_degree = 4, but only 2 workers running, why? (i have 4 cores)

postgres=# explain analyse select * from foo where val = .4678;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
Gather (cost=1000.00..7960.90 rows=1 width=12) (actual time=79.873..79.873 rows=0 loops=1)
Number of Workers: 2
-> Parallel Seq Scan on foo (cost=0.00..6960.80 rows=1 width=12) (actual time=63.170..359.470 rows=0 loops=1)
Filter: (val = '0.4678'::double precision)
Rows Removed by Filter: 1633820
Planning time: 0.090 ms
Execution time: 81.180 ms

ReplyDelete
Replies
RWRZNovember 18, 2015 12:08 PM
Congratz! =)
ReplyDelete
Replies
UnknownJanuary 13, 2016 2:44 AM
The ability to "nice" queries (without lock starvation), so that analysts or other time-insensitive queries could be made not to starve important OLTP work of resources
ReplyDelete
Replies
UnknownJanuary 16, 2018 1:31 AM
How can we implement parallel queries in 9.4 version. I tried finding all but all blog refer to 9.6.
Actually we will be scaling more than 1000000 users so we need to improve performance for concurrent users as the response time is almost shoot to 15 s per request, which gradually goes to time out.

Can you please advice?
ReplyDelete
Replies

Add comment