Guenther's law models two effects. First, in a concurrent system, some percentage of the work must be done serially rather than in parallel. For example, if we imagine a number of PostgreSQL backends all performing a workload where 5% of the work can't be parallelized, then, no matter how many processes we add, the overall throughput can never be more than 20 times the throughput of a single process. This is the parameter that is called alpha in the above link to Wikipedia, and sigma in the Percona white paper on this topic.
Second, the "law" also models the effect of cache coherency. As I understand it, this means that even in the absence of lock contention, operations that access shared state are going to become slower as the number of processors or threads or whatever increases, because the overhead of maintaining cache coherency is going to go up. This parameter is called beta in the above link to Wikipedia, and kappa in the Percona white paper.
I happen to have a graph that I recently made, showing how well PostgreSQL scales on a read-only workload consisting of lots and lots of little tiny queries. In the graph below, the blue line is PostgreSQL 9.1, the green line is PostgreSQL 9.2 just after I committed my patch to add a fast-path for relation locking, and the red line is a recent snapshot of the development tree.
Now, there are a couple of interesting things about this graph, aside from the obvious fact that the green line looks a lot better than the blue line, and the red line looks better than the green line. First, of course, both the green and red lines flatten off at 32 cores and gradually descend thereafter. Since these results were collected on a 32-core machine, this isn't surprising. The blue line peaks around 20 cores, then drops and levels off. Second, if you look, you can see that the green line is actually descending reasonably quickly after hitting its peak, whereas the red line - and, even more, the blue line - decline more slowly.
Something that's a little harder to see on this graph is that even at 1 client, performance on the latest 9.2devel sources is about 2.9% better than on 9.1, and at 4 clients, the difference grows to 13%. Because of the scale of the graph, these improvements at lower concurrencies are are hard to see, but they're nothing to sneeze at. I'm wondering whether some of the single-client performance improvement may be related to Tom Lane's recent rewrite of the plan cache, but I haven't had a chance to test that theory yet.
Anyway, after listening to Baron's talk, I was curious how well or poorly this day would fit the Universal Scalability Law, I got to wondering what would happen if we fed this data into that model. As it turns out, Baron has written a tool called "usl" which does just that. To avoid confusing the tool, I just fed it the data points for N=32, since what's going on above 32 clients is a completely different phenomenon that the tool, not knowing we're dealing with a 32-core server, won't be able to cope with. For PostgreSQL 9.1, the curve fits pretty well:
But something weird happens when I feed in either of the other two data sets. Here are the results with almost-current PostgreSQL 9.2 (the red line from the first graph, above):
What's going on here? Clearly, peak throughput is not at 0 clients, so the tool is confused. But if you look at the graph, you might start to get confused, too: at the higher client counts, performance appears to be increasing more than linearly as we add clients. And surely the cache coherency overhead can't be negative. But in fact, the underlying data shows the same super-linear scaling -- the "% scale" columns in the following table show how the performance compares to a linear multiple of the single-client performance.
Clients | PG 9.1 | PG 9.1 Scale % | PG 9.2 Fast Locks | PG 9.2 Fast Locks Scale % | PG 9.2 Current | PG 9.2 Current % Scale |
1 | 4373.300345 | 100.00 | 4439.850447 | 100.00 | 4503.456893 | 100.00 |
4 | 15582.906721 | 89.08 | 17111.286051 | 96.35 | 17630.978751 | 97.87 |
8 | 27353.511970 | 78.18 | 33305.862725 | 93.77 | 34566.946364 | 95.95 |
12 | 37502.231910 | 71.46 | 47466.026409 | 89.09 | 49015.229229 | 90.70 |
16 | 45365.164245 | 64.83 | 61403.716549 | 86.44 | 63773.262719 | 88.51 |
20 | 46926.751545 | 53.65 | 73229.068052 | 82.47 | 79503.733257 | 88.27 |
24 | 42854.194540 | 40.83 | 97529.101266 | 91.53 | 105359.667045 | 97.48 |
28 | 39835.953877 | 32.53 | 143119.867343 | 115.13 | 153593.427863 | 121.81 |
32 | 38862.979179 | 27.77 | 183640.425642 | 129.26 | 223763.180683 | 155.27 |
36 | 38303.048286 | 24.33 | 186552.784323 | 116.72 | 220246.876666 | 135.85 |
40 | 37881.287214 | 21.65 | 187370.087094 | 105.50 | 219766.959422 | 122.00 |
44 | 37647.482071 | 19.56 | 188295.567647 | 96.39 | 217214.242070 | 109.62 |
48 | 37379.048176 | 17.81 | 184799.330961 | 86.71 | 221445.000980 | 102.44 |
52 | 37421.439302 | 16.46 | 182925.811356 | 79.23 | 222896.505365 | 95.18 |
56 | 37306.105593 | 15.23 | 181790.112309 | 73.12 | 218147.067243 | 86.50 |
60 | 37235.200604 | 14.19 | 176109.852522 | 66.11 | 218513.111556 | 80.87 |
64 | 37220.375141 | 13.30 | 176334.823058 | 62.06 | 216918.748924 | 75.26 |
68 | 37045.137424 | 12.46 | 171278.400935 | 56.73 | 215065.702108 | 70.23 |
72 | 36793.404693 | 11.68 | 168922.211243 | 52.84 | 213124.312388 | 65.73 |
76 | 36998.599400 | 11.13 | 165651.641194 | 49.09 | 215062.957555 | 62.84 |
80 | 36734.524626 | 10.50 | 164238.547823 | 46.24 | 213838.588913 | 59.35 |
The numbers in red are the dramatically odd cases: at 32 clients, the latest code is more than 50% faster than what you'd expect given perfect linear scalability. Your first intuition might be to suspect that these results are a fluke, but I've seen similar numbers in many other test runs, so I think the effect is real, even though I have no idea what causes it.
Assuming all these processes are acessing mostly the same memory for code as well as data, and also assuming they use more memory than the L1 cache can hold they might profit from mutual "cache prefetching".
ReplyDeleteIt is absolutely amazing!
ReplyDeleteCan't wait 9.2 final to see it in real action. Thnx for your work.
This is without a patched lseek, right? If so you should get a bit more from that once the patch lands in Linux.
ReplyDeleteWe will try it with OpenERP as soon it's released. I'm curious to see how much faster it is with OpenERP 6.1
ReplyDeleteThe disadvantage of the tool I created that uses gnuplot is... it uses gnuplot, which can't place constraints on parameters. It makes no sense for there to be negative seriality or coherence, but I can't tell gnuplot not to go negative. The solution is to use R instead.
ReplyDeleteIt is possible for systems to have better than linear scalability if there is an effect of "economies of scale," that is a resource that is more efficient when shared than when used singly. I think this is relatively rare. The USL does not model this. I think this is a shortcoming of the USL model (all models are wrong, some models are useful).
You can find another example of this here: http://mikaelronstrom.blogspot.com/2011/05/better-than-linear-scaling-is-possible.html
I think the USL needs another parameter to reflect what I am calling "economies of scale."
Maybe I'm stating the obvious, but I've seen similar effects due to power management features of recent hardware and kernels. I'm not sure about your configuration, so this may not apply.
ReplyDeleteThe biggest suspect is CPU frequency scaling; whenever the kernel detects that a CPU isn't active enough, it downclocks it, but there's always some lag before the clock is raised again. When you reach a stage where all CPUs in your system are mostly busy, the kernel stops downclocking and everything suddenly goes faster. Fortunately it's very easy to turn this off.
Besides that, there are more subtle power management techniques -- PCIe controllers and RAM modules include PM features too these days.
I've seen the superlinear scaling as well.
ReplyDeleteI attribute it to CPU affinity. When you have 1 pgbench thread and 1 backend process per CPU, the kernel migrates them so each thread is on the same CPU as the backend it drives.
But when threads+backend < #CPU, the kernel tries to give each thing its own CPU, breaking up the driver/driven pairing.
Cheers,
Jeff Janes
Robert, Baron, et al.,
ReplyDeleteCome and take a gander at PostgreSQL Scalability Analysis Deconstructed.
Comments welcomed.
--njg
I know this could be a dumb question. But I heard somewhere that MySQL does not scale well with cores. If I need to put up a sever with multiple databases(different web sites running django, LAMP, Rails etc opting for a dedicated database server), and if I have a 24 core (2x 12 core Opteron) for database server, will Postgres scale betetr, and if so by what magnitude
ReplyDeleteDear Mr. Robert, please look at the papers "A more robust regression approach to estimate the parameters of super serial scalability law for noisy data" and "Mythbuster for Guerrillas" presented at CMG 2012.
ReplyDelete-Jayanta Choudhury