Wednesday, November 11, 2015

Parallel Sequential Scan is Committed!

I previously suggested that we might be able to get parallel sequential scan committed to PostgreSQL 9.5.  That did not happen.  However, I'm pleased to report that I've just committed the first version of parallel sequential scan to PostgreSQL's master branch, with a view toward having it included in the upcoming PostgreSQL 9.6 release.

Parallel query for PostgreSQL - for which this is the first step - has been a long-time dream of mine, and I have been working on it for several years, starting really in the PostgreSQL 9.4 release cycle, where I added dynamic background workers and dynamic shared memory; continuing through the PostgreSQL 9.5 release cycle, where I put in place a great deal of additional fundamental infrastructure for parallelism; and most recently today's commits.   I'd like to tell you a little bit about today's commits, and what comes next.

But first, I'd like to give credit where credit is due.  First, Amit Kapila has been a tremendous help in completing this project.  Both Amit and I wrote large amounts of code that ended up being part of this feature, and that code is spread across many commits over the last several years.  Both of us also write large amounts of code that did not end up being part of what got committed.   Second, I'd like to thank Noah Misch, who helped me very much in the early stages of this project, when I was trying to get my heads around the problems that needed to be solved.  Third, I'd like to thank the entire PostgreSQL community and in particular all of the people who helped review and test patches, suggested improvements, and in many other ways made this possible.

Just as importantly, however, I'd like to thank EnterpriseDB.  Without the support of EnterpriseDB management, first Tom Kincaid and more recently Marc Linster, among others, it would not have been possible for me to devote the amount of my time and Amit's time to this project that was necessary to make it a success.  Equally, without the support of my team at EnterpriseDB, who have patiently covered for me in many ways whenever I was too busy with this work to handle other issues, this project could not have gotten done.  Thanks to all.

OK, time for a demo:

rhaas=# \timing
Timing is on.
rhaas=# select * from pgbench_accounts where filler like '%a%';
 aid | bid | abalance | filler
-----+-----+----------+--------

(0 rows)

Time: 743.061 ms
rhaas=# set max_parallel_degree = 4;
SET
Time: 0.270 ms
rhaas=# select * from pgbench_accounts where filler like '%a%';
 aid | bid | abalance | filler
-----+-----+----------+-------
-
(0 rows)

Time: 213.412 ms

Here's how the plan looks:

rhaas=# explain (costs off) select * from pgbench_accounts where filler like '%a%';
                 QUERY PLAN                 
---------------------------------------------
 Gather
   Number of Workers: 4
   ->  Parallel Seq Scan on pgbench_accounts
         Filter: (filler ~~ '%a%'::text)
(4 rows)


The Gather node launches a number of workers, and those workers all execute the subplan in parallel.  Because the subplan is a Parallel Seq Scan rather than an ordinary Seq Scan, the workers coordinate with each other so that each block in the relation is scanned just once.  Each worker therefore produces on a subset of the final result set, and the Gather node collects all of those results.

One rather enormous limitation of the current feature is that we only generate Gather nodes immediately on top of Parallel Seq Scan nodes.  This means that this feature doesn't currently work for inheritance hierarchies (which are used to implement partitioned tables) because there would be an Append node in between.  Nor is it possible to push a join down into the workers at present.  The executor infrastructure is capable of running plans of either type, but the planner is currently too stupid to generate them.  This is something I'm hoping to fix before we run out of time in the 9.6 release cycle, but we'll see how that goes.  With things as they are, about the only case that benefits from this feature is a sequential scan of a table that cannot be index-accelerated but can be made faster by having multiple workers test the filter condition in parallel.  Pushing joins beneath the Gather node would make this much more widely applicable.

Also, my experience so far is that adding a few workers tends to help a lot, but the benefits do not scale very well to a large number of workers.  More investigation is needed to figure out why this is happening and how to improve on it.  As you can see, even a few workers can improve performance quite a bit, so this isn't quite so critical to address as the previous limitation.  However, it would be nice to improve it as much as we can; CPU counts are growing all the time!

Finally, I'd like to note that there are still a number of loose ends that need to be tied up before we can really call this feature, even in basic form, totally done.  There are, likely, also bugs.  Testing is very much appreciated, and please report issues you find to pgsql-hackers (at) postgresql.org.  Thanks.

58 comments:

  1. Finally something user-visible! Thank you guys.

    ReplyDelete
  2. I'm crying out for parallel query so much I'm tempted to go back helping to test. Thanks much needed feature.

    ReplyDelete
  3. Bloody fantastic. Will this also work with partitioned tables?

    ReplyDelete
    Replies
    1. Not yet. That's on my short list, but { long discussion of why that's not trivial omitted }.

      Delete
  4. could you please talk about any relationship between max_parallel_degree and effective_io_concurrency?

    ReplyDelete
    Replies
    1. There is basically no relation between those things. effective_io_concurrency controls prefetch depth for bitmap index scans. max_parallel_degree limits the number of workers per Gather node.

      Delete
  5. Does that work with aggregates?

    ReplyDelete
    Replies
    1. A parallel sequential scan can feed an aggregate, but it is not yet possible to do the aggregation itself in parallel. Obviously, that's a limitation I'd like to lift at some point.

      Delete
  6. If i set max_parallel_degree = 4 so have 4 workers. Are they using separate cores of my machine?

    ReplyDelete
    Replies
    1. That's up to the operating system scheduler, but presumably yes. If the total number of processes trying to run on the machine is greater than the number of CPU cores, then they can't all run simultaneously, so in that case you could potentially end up with workers that aren't actually running, which would be just a waste of resources. But assuming that's not the case, the system will presumably run all the workers at once.

      Delete
  7. Congratulations Robert! Incredible determination by you Noah, Amit and the entire PostgreSQL community. Very happy for all of you!! I will definitely download and play with it in the near future.

    ReplyDelete
    Replies
    1. Tom, let me be the first commenter to say thanks to EnterpriseDB, even great free software coders need to eat.

      Delete
  8. How do we deal with varying I/O overhead?
    E.g. on a 16 core box I'd love to have 10+ workers doing parallel sequential scan over data that is already in memory.
    But in case the data must be read from storage I want to throttle workers count so that they do not compete for I/O. Micromanagement such details for each query seems to be tedious.
    Is there any chance PostgreSQL would figure out best plan (vary workers count) automatically?

    ReplyDelete
    Replies
    1. There is a ton of interesting research work to do be done to solve problems like this. Right now, it's very much unsophisticated. The emphasis up until now has been on getting it to work at all - from here, we need to shift to more of a performance focus, but ironing out complicated problems like this will take time and, importantly, reports of real-world experiences from the field.

      Delete
    2. I agree with anonymous and I hope you get time to make some progress on optimizing the number of workers for a given query. The `max_parallel_degree` setting should be a maximum number of workers as the name implies, not a fixed number. Presume this is why you gave it this name?

      Delete
  9. seems you guys are trying to match citus with their efforts in PSS and pushing down queries to the worker nodes.

    are you guys looking to match their features 100% or which features can't you do as of 9.6 or are looking to do with 9.8?

    either way, I applaud this as scaling out postgres is a much needed feature!

    ReplyDelete
    Replies
    1. Parallel query is more of a vertical scalability solution than a horizontal scalability solution. The goal is to effectively use multiple CPUs on a single machine, not multiple machines as CitusDB does. Using multiple machines would be a cool project, too, but it wouldn't be this project, even though many of the query optimization problems are related.

      Delete
    2. For horizontal scalabilityI would rather refer to Postgres-X2, and esp XL which has similar Scatter-Gather approach with the optimization that each worker node can push down the work to other worker nodes, so it is more like a mesh/grid and not a tree. I do hope XL, X2 and core will be merged at some point. Robert, I missed the last eu pgconf unconference - I think it was supposed to be discussed there?
      Big thanks for your work of course!

      Delete
    3. @Robert Haas. CitusDB does vertical scalability out of the box for the worker nodes. So if you had 4 x 4 core boxes. You have at your disposal 16 cores working in unison for queries.

      But for open source purposes, having PSS available. I cannot wait :)

      Delete
  10. > adding a few workers tends to help a lot, but the benefits do not scale very well to a large number of workers.
    This is because with a large number of concurrent readings you start to get IO waiting. Basically, for a data stored on a single disk, 4 workers would be optimal if: (time of reading a block of data from disk) == 3 x (time of processing a block of data).

    ReplyDelete
    Replies
    1. But isn't a DB sitting on a single (rotational) disk rather small-potatoes these days?

      Delete
    2. Yes it is, but the OS is actually a fairly big bottleneck on IO these days, and until that overhead is solved we can't really take advantage of our arrays of ram disks and SSDs

      Delete
  11. Congratulations PostgreSQL ! Thanks a lot Robert, Amit and Noah !

    ReplyDelete
  12. Is it possible to use parallel sequential scan in vaccuum?

    ReplyDelete
  13. Thanks to make it in!! With all you guys I can hope great future to PostgreSQL and In coming days PostgreSQL will be best in performance among all top RDBMS.

    ReplyDelete
  14. this is really great feature!
    do you think it makes sense to put in a setting on total number of parallel workers?
    my concern is that if each query can spawn 4 threads and we get 100 connections - we may end up with 400 threads..
    it would be nice to have a gloabal limit on PQs.
    then we would not end up like work_mem which can eat all your memory..
    thanks

    ReplyDelete
    Replies
    1. max_worker_processes limits the total number of worker processes that can exist in the system at any one time for any purpose. Parallel worker processes count against this limit, as do any other background worker processes that are running.

      Delete
    2. What will happen to a query requesting parallel worker processes if the limit is reached?

      Delete
    3. I just tested that myself on a fresh build :)
      It seems it automatically picks the highest available number, in my case 5 which seems to be max_worker_processes - autovacuum_max_workers (8 - 3 on a default install).

      Delete
  15. Vibhor Kumar shared the news yesterday, that today made HackerNews home page. This is a huge game changer! I'm pretty sure it also impacts development and testing complexity. I hope it evolves into a horizontal scalability solution sometime soon. Thanks everyone!

    ReplyDelete
  16. very nice!

    Data point: for my work improvement partitioning support is more important than supporting aggregations. Also indexonly scan is pretty important, too.

    ReplyDelete
  17. I suggest you to use some kind of "Like" or "+1", maybe there will be way more people here just to say: thank you guys!

    ReplyDelete
  18. Is it possible to enable this on a per-query basis? Or is the threading overhead small enough to not make a difference when you're dealing with a small dataset?

    ReplyDelete
    Replies
    1. As shown in the example above it can be set inside a session.

      Delete
  19. how large can one create the dynamic shared memory which is used internally for communication and transferring of tuples form workers to main process ? And, does increasing them give better performance ?

    ReplyDelete
  20. postgres=# set parallel_tuple_cost=0.00001;
    SET
    postgres=# set max_parallel_degree=2;
    SET
    postgres=# explain (verbose,analyze,costs,timing,buffers) select count(*) from test;
    QUERY PLAN
    -----------------------------------------------------------------------------------------------------------
    Aggregate (cost=0.00..0.01 rows=1 width=0) (actual time=0.009..0.009 rows=1 loops=1)
    Output: count(*)
    -> Seq Scan on public.test (cost=0.00..0.00 rows=1 width=0) (actual time=0.003..0.003 rows=0 loops=1)
    Output: id
    Planning time: 2.815 ms
    Execution time: 0.097 ms
    (6 rows)

    postgres=# insert into test select generate_series(1,10000000);
    INSERT 0 10000000
    postgres=# explain (verbose,analyze,costs,timing,buffers) select count(*) from test;
    WARNING: terminating connection because of crash of another server process
    DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
    HINT: In a moment you should be able to reconnect to the database and repeat your command.
    server closed the connection unexpectedly
    This probably means the server terminated abnormally
    before or while processing the request.
    The connection to the server was lost. Attempting reset: Failed.
    !> \q
    OOM occure.

    ReplyDelete
  21. use less data
    postgres=# insert into test select generate_series(1,1000000);
    INSERT 0 1000000
    postgres=# explain (verbose,analyze,costs,timing,buffers) select count(*) from test;
    QUERY PLAN
    -------------------------------------------------------------------------------------------------------------------------------------------
    Aggregate (cost=9611.49..9611.50 rows=1 width=0) (actual time=1613.600..1613.600 rows=1 loops=1)
    Output: count(*)
    Buffers: shared hit=4692
    -> Gather (cost=1000.00..7011.38 rows=1040045 width=0) (actual time=6.200..1470.014 rows=1040000 loops=1)
    Output: id
    Number of Workers: 2
    Buffers: shared hit=4692
    -> Parallel Seq Scan on public.test (cost=0.00..6000.98 rows=1040045 width=0) (actual time=0.023..489.118 rows=1774500 loops=1)
    Output: id
    Buffers: shared hit=7852
    Planning time: 0.081 ms
    Execution time: 1614.764 ms
    (12 rows)

    postgres=# set max_parallel_degree=0;
    SET
    postgres=# explain (verbose,analyze,costs,timing,buffers) select count(*) from test;
    QUERY PLAN
    ----------------------------------------------------------------------------------------------------------------------------
    Aggregate (cost=17602.56..17602.57 rows=1 width=0) (actual time=171.148..171.148 rows=1 loops=1)
    Output: count(*)
    Buffers: shared hit=4602
    -> Seq Scan on public.test (cost=0.00..15002.45 rows=1040045 width=0) (actual time=0.033..87.773 rows=1040000 loops=1)
    Output: id
    Buffers: shared hit=4602
    Planning time: 0.101 ms
    Execution time: 171.208 ms
    (8 rows)

    ReplyDelete
  22. postgres=# \dt+
    List of relations
    Schema | Name | Type | Owner | Size | Description
    --------+------+-------+----------+-------+-------------
    public | test | table | postgres | 36 MB |
    (1 row)

    ReplyDelete
  23. how large can one create the dynamic shared memory which is used internally for communication and transferring of tuples form workers to main process ? And, does increasing them give better performance ?

    ReplyDelete
  24. what a boss; outstanding!!! Where do I donate :)?

    ReplyDelete
  25. That's awesome, thank you for your work!

    ReplyDelete
  26. Excellent news. You made _The Reg_
    http://www.theregister.co.uk/2015/11/13/postgresql_learns_to_walk_and_chew_gum/

    ReplyDelete
  27. Oh god I'm so excited. Great work!

    ReplyDelete
  28. Do you have plans to make this work for index scans?

    ReplyDelete
  29. Robert and everyone who helped, Thank you!

    The need for this has been around for a long time, I am so happy to see progress.

    ReplyDelete
  30. i'vet max_parallel_degree = 4, but only 2 workers running, why? (i have 4 cores)

    postgres=# explain analyse select * from foo where val = .4678;
    QUERY PLAN
    -------------------------------------------------------------------------------------------------------------------
    Gather (cost=1000.00..7960.90 rows=1 width=12) (actual time=79.873..79.873 rows=0 loops=1)
    Number of Workers: 2
    -> Parallel Seq Scan on foo (cost=0.00..6960.80 rows=1 width=12) (actual time=63.170..359.470 rows=0 loops=1)
    Filter: (val = '0.4678'::double precision)
    Rows Removed by Filter: 1633820
    Planning time: 0.090 ms
    Execution time: 81.180 ms


    ReplyDelete
    Replies
    1. got the answer: it depends on the number of rows within the table ;-)

      Delete
  31. The ability to "nice" queries (without lock starvation), so that analysts or other time-insensitive queries could be made not to starve important OLTP work of resources

    ReplyDelete