Thursday, April 21, 2016

PostgreSQL 9.6 with Parallel Query vs. TPC-H

I decided to try out parallel query, as implemented in PostgreSQL 9.6devel, on the TPC-H queries.  To do this, I followed the directions at https://github.com/tvondra/pg_tpch - thanks to Tomas Vondra for those instructions.  I did the test on an IBM POWER7 server provided to the PostgreSQL community by IBM.  I scaled the database to use 10GB of input data; the resulting database size was 22GB, of which 8GB was indexes.  I tried out each query just once without really tuning the database at all, except for increasing shared_buffers to 8GB.  Then I tested them again after enabling parallel query by configuring max_parallel_degree = 4.

Of the 22 queries, 17 switched to a parallel plan, while the plans for the other 5 were unchanged.  Of the 17 queries where the plan changed, 15 got faster, 1 ran at the same speed, and 1 got slower.  11 of the queries ran at least twice as fast with parallelism as they did without parallelism.  Here are the comparative results for the queries where the plan changed:

Q1: 229 seconds → 45 seconds (5.0x)
Q3: 45 seconds → 17 seconds (2.6x)
Q4: 12 seconds → 3 seconds (4.0x)
Q5: 38 seconds → 17 seconds (2.2x)
Q6: 17 seconds → 6 seconds (2.8x)
Q7: 41 seconds → 12 seconds (3.4x)
Q8: 10 seconds → 4 seconds (2.5x)
Q9: 81 seconds → 61 seconds (1.3x)
Q10: 37 seconds → 18 seconds (2.0x)
Q12: 34 seconds → 7 seconds (4.8x)
Q15: 33 seconds → 24 seconds (1.3x)
Q16: 17 seconds → 16 seconds (1.0x)
Q17: 140 seconds → 55 seconds (2.5x)
Q19: 2 seconds → 1 second (2.0x)
Q20: 70 seconds → 70 seconds (1.0x)
Q21: 80 seconds → 99 seconds (0.8x)
Q22: 4 seconds → 3 seconds (1.3x)

Linear scaling with a leader process and 4 workers would mean a 5.0x speedup, which we achieved in only one case.  However, for many users, that won't matter: if you have CPUs that would otherwise be sitting idle, it's better to get some speedup than no speedup at all.

Of course, I couldn't resist analyzing what went wrong here, especially for Q21, which actually got slower.  To some degree, that's down to misconfiguration: I ran this test with the default value of work_mem=4MB, but Q21 chooses a plan that builds a hash table on the largest table in the database, which is about 9.5GB in this test.   Therefore, it ends up doing a 1024-batch hash join, which is somewhat painful under the best of circumstances.  With work_mem=1GB, the regression disappears, and it's 6% faster with parallel query than without.

However, there's a deeper problem, which is that while PostgreSQL 9.6 can perform a hash join in parallel, each process must build its own copy of the hash table.   That means we use N times the CPU and N times the memory, and we may induce I/O contention, locking contention, or memory pressure as well.  It would be better to have the ability to build a shared hash table, and EnterpriseDB is working on that as a feature, but it won't be ready in time for PostgreSQL 9.6, which is already in feature freeze.  Since Q21 needs a giant hash table, this limitation really stings.  In fact, there are a number of queries here where it seems like building a shared hash table would speed things up significantly: Q3, Q5, Q7, Q8, and Q21.

An even more widespread problem is that, at present, the driving table for a parallel query must be accessed via a parallel sequential scan; that's the only operation we have that can partition the input data.  Many of these queries - Q4, Q5, Q6, Q7, Q14, Q15, and Q20 - would have been better off using a bitmap index scan on the driving table, but unfortunately that's not supported in PostgreSQL 9.6.  We still come out ahead on these queries in terms of runtime because the system simply substitutes raw power for finesse: with enough workers, we can scan the whole table quicker than a single process can scan the portion identified as relevant by the index.  However, it would clearly be nice to do better.

Four queries - Q2, Q15, Q16, Q22 - were parallelized either not at all or only to a limited degree due to restrictions related to the handling of subqueries, about which the current implementation of parallel query is not always smart.  Three queries - Q2, Q13, and Q15 - made no or limited use of parallelism because the optimal join strategy is a merge join, which can't be made parallel in a trivial way.  One query - Q17 - managed to perform the same an expensive sort twice, once in the workers and then again in the leader.  This is because the Gather operation reads tuples from the workers in an arbitrary and not necessarily predictable order; so even if each worker's stream of tuples is sorted, the way those streams get merged together will probably destroy the sort ordering.

There are no doubt other issues here that I haven't found yet, but on the whole I find these results pretty encouraging.  Parallel query basically works, and makes queries that someone thought were representative of real workloads significantly faster.  There's a lot of room for further improvement, but that's likely to be true of the first version of almost any large feature.

(If you liked this post, you may want to read my previous blog posts on parallelism.)

10 comments:

  1. Impressive! Could you please run the same test for different values of max_parallel_degree? I would love to see how does it behaves.

    Thank you!

    ReplyDelete
  2. In your list of queries (Q1 - Q20), Q2 is missing.

    ReplyDelete
    Replies
    1. That's one of the ones for which the plan didn't change. I didn't show those in the list.

      Delete
  3. >Many of these queries - Q4, Q5, Q6, Q7, Q14, Q15, and Q20 -
    >would have been better off using a bitmap index scan
    >on the driving table, but unfortunately that's
    >not supported in PostgreSQL 9.6.

    the new Bloom index (9.6) can help here ?
    http://www.postgresql.org/docs/devel/static/bloom.html

    ReplyDelete
    Replies
    1. No, the problem is we don't support any sort of index scan on the driving table for parallel queries yet.

      Delete
  4. In the MS SQL Server world, they've supported adjustable parallelism for more than a decade, and...it has stung us from time to time.

    In a complex query, both in real time and ETL/Batch scenarios with parallelism enabled (query hint option maxdop = n), we'd start seeing cxpacket wait types appearing. Research showed that a query split into, say, 5 threads/cores, would wait for the longest running thread to complete before continuing to the next step...and sometimes that long running thread would be dependent on something that took a long time to resolve so it looked liked the query had stalled. Our only solution for that type of query was to force it to run on a single thread/core (maxdop=1).

    Hopefully the new Postgres parallelism finds a way to avoid that, since I've got some very large, complex queries coming up for dev and it will be interesting to see if they can make it through faster via parallel.

    Great article and research, Robert! You are always meticulous and detailed, and I appreciate the effort you take to create these reports.

    ReplyDelete
  5. Hello, can you tell me how and why you selected your queries? Why 22 and what is specific for each one of them? Are they just some random queries?

    ReplyDelete
    Replies
    1. They are the queries from the TPC-H benchmark.

      Delete
  6. Hi dear Hass

    My question is how to set parallel query in Windows OS?

    I set max_parallel_degree to 4 ,but no execute with parallel query.

    Thank you

    ReplyDelete
    Replies
    1. There's no difference in the correct method of configuring this on Linux vs. on Windows. It may be, however, that your queries are not parallel-safe (e.g. because they write data or call a parallel-unsafe function) or that the planner decided there was no parallel plan faster than the serial plan.

      Delete