Thursday, June 21, 2018

Using force_parallel_mode Correctly

I admit it: I invented force_parallel_mode.  I believed then, and still believe now, that it is valuable for testing purposes.  Certainly, testing using force_parallel_mode=on or force_parallel_mode=regress has uncovered many bugs in PostgreSQL's parallel query support that would otherwise have been very difficult to find.  At the same time, it's pretty clear that this setting has caused enormous confusion, even among PostgreSQL experts.  In fact, in my experience, almost everyone who sets force_parallel_mode is doing so for the wrong reasons.

At least in my experience, what typically happens is that some query which a user believes should have used parallel query does not do so.  The user then tries to fix the problem by setting force_parallel_mode=on.  In reality, this does not solve the problem at all.  It only makes things worse.  However, it sometimes gives users the impression that they have solved the problem, which may be why people keep doing it (or maybe we need better documentation -- suggestions welcome).  Let's taken an example.


rhaas=# explain select * from pgbench_accounts;
                                 QUERY PLAN                                 
----------------------------------------------------------------------------
 Seq Scan on pgbench_accounts  (cost=0.00..105574.00 rows=4000000 width=97)
(1 row)

rhaas=# set force_parallel_mode = on;
SET
rhaas=# explain select * from pgbench_accounts;
                                    QUERY PLAN                                    
----------------------------------------------------------------------------------
 Gather  (cost=1000.00..506574.00 rows=4000000 width=97)
   Workers Planned: 1
   Single Copy: true
   ->  Seq Scan on pgbench_accounts  (cost=0.00..105574.00 rows=4000000 width=97)
(4 rows)

There's a very good reason why the query planner doesn't try to use parallel query here: it won't make the query run faster.  Parallel query speeds up queries where the amount of work that must be performed is large compared to the number of output tuples.  Here, because there are no joins, no filter conditions, and no aggregates, there is very little work to do per output tuple: the system just needs to fetch each one and send it to the user.  Adding parallel workers won't make that any faster.  A single process is already very capable of sending tuples to the user as fast as the user can read them.  And, even if it isn't, adding more workers won't help.  When you do, instead of the main process needing to fetch each tuple from the table, it will need to fetch each tuple from a worker.  It's not really doing any less work.  And the workers are now using up memory and CPU time, possibly competing with the leader.  So adding parallel workers to this plan will actually make it slower.

This is a pretty clear example of a case where you just shouldn't use (or want to use) parallel query, but there are other examples that are less clear-cut.  If you encounter one of those cases, you might very reasonably want to convince the planner to try out a parallel plan.  After all, the planner can be wrong.  Maybe the plan which the planner thinks is cheapest does not use parallel query, but the plan that actually runs the fastest does use parallel query.  It can be very valuable to adjust planner settings to try out alternative plans in the situation, but setting force_parallel_mode is not the right way to do it, for a couple of reasons.

First, setting force_parallel_mode will only ever give you 1 parallel worker.  Second, setting force_parallel_mode creates a plan where the leader just sits around reading tuples from the worker while the worker does all of the real work.  In a regular parallel query plan, all of the processes cooperates, so the leader may help to run the parallel portion of the plan if it's not too busy reading tuples, and the workers also run the parallel portion of the plan.  In a plan created by setting force_parallel_mode, this doesn't happen.  There's actually a clue right in the plan itself.  It says "Single Copy: true".  What that means is that only a single copy of the plan beneath the Gather is permitted to be executed.  Therefore, the single worker is going to run the plan, and the leader is going to sit around, wait for tuples to show up, and then relay them to the client.  This is slower than just having the leader do all the work itself, which is what would have happened if you had not specified force_parallel_mode.

In short, setting force_parallel_mode is useless for the purpose for which most people seem to be trying to use it.  It will not make your queries run faster; it will slow them down.  The correct way to test out plans that use parallelism more aggressively is to reduce the planner's estimate of how much parallelism costs.  By default, parallel_setup_cost=1000 and parallel_tuple_cost=0.1.  If you want more of your query plans to run in parallel, reduce these values.  Let's see what happens if we try that:


rhaas=# set parallel_setup_cost = 10;
SET
rhaas=# set parallel_tuple_cost = 0.001;
SET
rhaas=# explain select * from pgbench_accounts;
                                        QUERY PLAN                                        
------------------------------------------------------------------------------------------
 Gather  (cost=10.00..86250.67 rows=4000000 width=97)
   Workers Planned: 2
   ->  Parallel Seq Scan on pgbench_accounts  (cost=0.00..82240.67 rows=1666667 width=97)
(3 rows)


As you can see, the results here are different than when I used force_parallel_mode.  First, the plan underneath the Gather is now a Parallel Seq Scan, not just a regular Seq Scan.  That means that the plan is safe for multiple workers to execute in parallel.  As a result, the Single Copy flag on the Gather has disappeared.  This means that all of the workers, as well as the leader, will be able to cooperate in executing the plan, so there is a chance that this plan will be faster, whereas when we set force_parallel_mode, there was no chance.  Finally, we now have 2 workers rather than just 1.

In fact, because this query is unsuitable for parallel query, neither approach speeds it up.  On my laptop, the first plan (non-parallel) takes 3.26 seconds, the second plan (force_parallel_mode) takes 3.49 seconds, and the third plan (real parallelism) takes 3.55 seconds.  Therefore the best option here happens to be what the planner chose originally: no parallelism.  Of course, results may vary in other cases, but the thing to remember is that real parallelism is what you want, and force_parallel_mode will not give it to you.  If you cannot get real parallelism even after reducing parallel_setup_cost and parallel_tuple_cost, you can also try adjusting min_parallel_table_scan_size and min_parallel_index_scan_size (on 9.6, there is only one setting, and it is called min_parallel_relation_size).  If you still can't get parallelism, it might be disabled, or it might not be supported for your query type.  See When Can Parallel Query Be Used? in the PostgreSQL documentation for more details.

So, if force_parallel_mode doesn't make queries run faster, what does it do, and when would anyone ever want to use it?  I designed force_parallel_mode not as a tool for speeding up queries, but as a tool for finding cases when parallel query doesn't give the right answers.  For example, suppose you do this:


rhaas=# create or replace function give10(int) returns void as $$update pgbench_accounts set abalance = abalance + 10 where aid = $1$$ language sql parallel safe;
CREATE FUNCTION
rhaas=# select give10(aid) from pgbench_accounts where aid = 10;
 give10 
--------

(1 row)

If you are familiar with the limitations of parallel query, you may see that there's a problem here.  I have marked the function I created as PARALLEL SAFE, but it does an UPDATE, and parallel query does not support UPDATE.  If the planner ever tries to use the give10(int) function inside a parallel query plan, it's going to fail.  However, in this simple test case, everything appears to work, because the SELECT query I've used as a test only fetches one row, and therefore the planner chooses an Index Scan and does not attempt to use parallel query.

This situation can be a sort of ticking time bomb.  It may seem to work for a while, but the planner might later switch to a different plan, and that new plan may use parallel query where the old one did not, and now things will start to fail.  In the worst case, this could result in a production outage.  If I test with force_parallel_mode, I will find the problem right away:


rhaas=# set force_parallel_mode = on;
SET
rhaas=# select give10(aid) from pgbench_accounts where aid = 10;
ERROR:  cannot execute UPDATE during a parallel operation
CONTEXT:  SQL function "give10" during startup
parallel worker

Setting force_parallel_mode caused the query planner to place a Single-Copy Gather node on top of the plan it would otherwise have used, which caused give10(int) to execute in the worker, rather than the leader.  Then it failed.  Now the problem is easy to fix:


rhaas=# alter function give10(int) parallel unsafe;
ALTER FUNCTION
rhaas=# set force_parallel_mode = on;
SET
rhaas=# select give10(aid) from pgbench_accounts where aid = 10;
 give10 
--------

(1 row)

Because the function is now (correctly) marked as parallel-unsafe, the planner won't use parallelism even with force_parallel_mode=on.  In other words, the value of testing with force_parallel_mode=on is that it may allow you to identify functions that are incorrectly marked as parallel-safe.  You can then adjust the markings on those functions and avoid unexpected failures later.

No comments:

Post a Comment