Friday, September 30, 2011

Scalability, in Graphical Form, Analyzed

I'm at Surge this week, where I just listened to Baron Schwartz give a talk about scalability and performance.  As usual, Baron was careful to distinguish between performance (how fast the system is) and scalability (how much faster you can make it by adding more resources).  One of the things Baron talked about was Neil Gunther's Universal Scalability Law, which attempts to model the behavior of complex systems (such as database systems) as you add threads, or nodes, or users.

Gunther's law models two effects.  First, in a concurrent system, some percentage of the work must be done serially rather than in parallel.  For example, if we imagine a number of PostgreSQL backends all performing a workload where 5% of the work can't be parallelized, then, no matter how many processes we add, the overall throughput can never be more than 20 times the throughput of a single process.  This is the parameter that is called alpha in the above link to Wikipedia, and sigma in the Percona white paper on this topic.

Second, the "law" also models the effect of cache coherency.  As I understand it, this means that even in the absence of lock contention, operations that access shared state are going to become slower as the number of processors or threads or whatever increases, because the overhead of maintaining cache coherency is going to go up.  This parameter is called beta in the above link to Wikipedia, and kappa in the Percona white paper.
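Putting the two effects together, the model (as I understand it from the Wikipedia article and the Percona white paper) says that capacity at concurrency N, relative to a single process, is N / (1 + alpha*(N-1) + beta*N*(N-1)).  Here's a tiny sketch in Python just to make the shape of the curve concrete; the parameter values below are invented purely for illustration:

    # A sketch of the Universal Scalability Law as described above.  alpha is
    # the serial fraction and beta the coherency penalty (sigma and kappa in
    # the Percona paper); the values used here are made up for illustration.
    def usl(n, alpha, beta):
        # Relative capacity at concurrency n.
        return n / (1 + alpha * (n - 1) + beta * n * (n - 1))

    # With a 5% serial fraction and no coherency cost, this is just Amdahl's
    # law: throughput tops out at 1/0.05 = 20 times the single-process rate.
    print(usl(1000000, 0.05, 0.0))    # just under 20
    # Even a tiny coherency term makes throughput peak and then fall off.
    print(usl(32, 0.05, 0.001))       # ~9.0
    print(usl(64, 0.05, 0.001))       # ~7.8
    print(usl(128, 0.05, 0.001))      # ~5.4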

I happen to have a graph that I recently made, showing how well PostgreSQL scales on a read-only workload consisting of lots and lots of little tiny queries.  In the graph below, the blue line is PostgreSQL 9.1, the green line is PostgreSQL 9.2 just after I committed my patch to add a fast-path for relation locking, and the red line is a recent snapshot of the development tree.

Now, there are a couple of interesting things about this graph, aside from the obvious fact that the green line looks a lot better than the blue line, and the red line looks better than the green line.  First, of course, both the green and red lines flatten off at 32 clients and gradually descend thereafter.  Since these results were collected on a 32-core machine, this isn't surprising.  The blue line peaks around 20 clients, then drops and levels off.  Second, if you look, you can see that the green line is actually descending reasonably quickly after hitting its peak, whereas the red line - and, even more, the blue line - decline more slowly.

Something that's a little harder to see on this graph is that even at 1 client, performance on the latest 9.2devel sources is about 2.9% better than on 9.1, and at 4 clients, the difference grows to 13%.  Because of the scale of the graph, these improvements at lower concurrencies are hard to see, but they're nothing to sneeze at.  I'm wondering whether some of the single-client performance improvement may be related to Tom Lane's recent rewrite of the plan cache, but I haven't had a chance to test that theory yet.

Anyway, after listening to Baron's talk, I got to wondering how well or poorly this data would fit the Universal Scalability Law, and what would happen if we fed it into that model.  As it turns out, Baron has written a tool called "usl" which does just that.  To avoid confusing the tool, I just fed it the data points up to N=32, since what's going on above 32 clients is a completely different phenomenon that the tool, not knowing we're dealing with a 32-core server, won't be able to cope with.  For PostgreSQL 9.1, the curve fits pretty well:
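As an aside, if you don't have Baron's usl tool handy, a few lines of Python can do a broadly similar least-squares fit.  This is only a sketch of the general idea, not what usl actually does internally; the starting guesses and the non-negativity constraint are mine, and the data points are the PostgreSQL 9.1 numbers from the table below, rounded.

    # Rough least-squares fit of the USL curve to the 9.1 results up to N=32.
    # This is a sketch only, not a description of how the usl tool works.
    import numpy as np
    from scipy.optimize import curve_fit

    clients = np.array([1, 4, 8, 12, 16, 20, 24, 28, 32])
    tps = np.array([4373.30, 15582.91, 27353.51, 37502.23, 45365.16,
                    46926.75, 42854.19, 39835.95, 38862.98])

    def usl_tps(n, alpha, beta, c1):
        # c1 is the single-client throughput; alpha and beta as above.
        return c1 * n / (1 + alpha * (n - 1) + beta * n * (n - 1))

    # Constrain alpha and beta to be non-negative: negative seriality or
    # coherency overhead wouldn't mean anything physically.
    params, _ = curve_fit(usl_tps, clients, tps,
                          p0=[0.05, 0.001, tps[0]],
                          bounds=([0, 0, 0], [1, 1, np.inf]))
    print("alpha=%.4f  beta=%.5f  C(1)=%.1f" % tuple(params))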

But something weird happens when I feed in either of the other two data sets.  Here are the results with almost-current PostgreSQL 9.2 (the red line from the first graph, above):


What's going on here?  Clearly, peak throughput is not at 0 clients, so the tool is confused.  But if you look at the graph, you might start to get confused, too: at the higher client counts, performance appears to be increasing more than linearly as we add clients.  And surely the cache coherency overhead can't be negative.  But in fact, the underlying data shows the same super-linear scaling -- the "Scale %" columns in the following table show how the performance at each client count compares to a linear multiple of the single-client performance.

Clients         PG 9.1   Scale %   PG 9.2 Fast Locks   Scale %   PG 9.2 Current   Scale %
      1    4373.300345    100.00         4439.850447    100.00      4503.456893    100.00
      4   15582.906721     89.08        17111.286051     96.35     17630.978751     97.87
      8   27353.511970     78.18        33305.862725     93.77     34566.946364     95.95
     12   37502.231910     71.46        47466.026409     89.09     49015.229229     90.70
     16   45365.164245     64.83        61403.716549     86.44     63773.262719     88.51
     20   46926.751545     53.65        73229.068052     82.47     79503.733257     88.27
     24   42854.194540     40.83        97529.101266     91.53    105359.667045     97.48
     28   39835.953877     32.53       143119.867343    115.13    153593.427863    121.81
     32   38862.979179     27.77       183640.425642    129.26    223763.180683    155.27
     36   38303.048286     24.33       186552.784323    116.72    220246.876666    135.85
     40   37881.287214     21.65       187370.087094    105.50    219766.959422    122.00
     44   37647.482071     19.56       188295.567647     96.39    217214.242070    109.62
     48   37379.048176     17.81       184799.330961     86.71    221445.000980    102.44
     52   37421.439302     16.46       182925.811356     79.23    222896.505365     95.18
     56   37306.105593     15.23       181790.112309     73.12    218147.067243     86.50
     60   37235.200604     14.19       176109.852522     66.11    218513.111556     80.87
     64   37220.375141     13.30       176334.823058     62.06    216918.748924     75.26
     68   37045.137424     12.46       171278.400935     56.73    215065.702108     70.23
     72   36793.404693     11.68       168922.211243     52.84    213124.312388     65.73
     76   36998.599400     11.13       165651.641194     49.09    215062.957555     62.84
     80   36734.524626     10.50       164238.547823     46.24    213838.588913     59.35
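(To make the arithmetic behind that column concrete: Scale % is just the measured throughput divided by the client count times the single-client throughput.  For the latest sources at 32 clients, that's 223763.18 / (32 * 4503.46), which comes to about 1.55, i.e. 155% of perfect linear scaling.)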

The dramatically odd cases are the ones where Scale % exceeds 100: at 32 clients, the latest code is more than 50% faster than what you'd expect given perfect linear scalability.  Your first intuition might be to suspect that these results are a fluke, but I've seen similar numbers in many other test runs, so I think the effect is real, even though I have no idea what causes it.

10 comments:

  1. Assuming all these processes are accessing mostly the same memory for code as well as data, and also assuming they use more memory than the L1 cache can hold, they might profit from mutual "cache prefetching".

  2. It is absolutely amazing!
    Can't wait for 9.2 final to see it in real action.  Thanks for your work.

  3. This is without a patched lseek, right?  If so, you should get a bit more from that once the patch lands in Linux.

  4. We will try it with OpenERP as soon as it's released.  I'm curious to see how much faster it is with OpenERP 6.1.

  5. The disadvantage of the tool I created that uses gnuplot is... it uses gnuplot, which can't place constraints on parameters. It makes no sense for there to be negative seriality or coherence, but I can't tell gnuplot not to go negative. The solution is to use R instead.

    It is possible for systems to have better than linear scalability if there is an effect of "economies of scale," that is, a resource that is more efficient when shared than when used singly.  I think this is relatively rare.  The USL does not model this.  I think this is a shortcoming of the USL model (all models are wrong, some models are useful).

    You can find another example of this here: http://mikaelronstrom.blogspot.com/2011/05/better-than-linear-scaling-is-possible.html

    I think the USL needs another parameter to reflect what I am calling "economies of scale."

  6. Maybe I'm stating the obvious, but I've seen similar effects due to power management features of recent hardware and kernels. I'm not sure about your configuration, so this may not apply.

    The biggest suspect is CPU frequency scaling; whenever the kernel detects that a CPU isn't active enough, it downclocks it, but there's always some lag before the clock is raised again. When you reach a stage where all CPUs in your system are mostly busy, the kernel stops downclocking and everything suddenly goes faster. Fortunately it's very easy to turn this off.
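    For example, one quick way to see whether the governor is throttling you is to read the cpufreq settings from sysfs; this is just a rough sketch, assuming the usual Linux layout:

        # Rough sketch: print the cpufreq governor for each CPU (Linux sysfs).
        import glob
        for path in sorted(glob.glob(
                "/sys/devices/system/cpu/cpu*/cpufreq/scaling_governor")):
            with open(path) as f:
                print(path, f.read().strip())
        # "ondemand" or "powersave" means frequency scaling is active;
        # switching the governor to "performance" turns it off.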

    Besides that, there are more subtle power management techniques -- PCIe controllers and RAM modules include PM features too these days.

  7. I've seen the superlinear scaling as well.

    I attribute it to CPU affinity. When you have 1 pgbench thread and 1 backend process per CPU, the kernel migrates them so each thread is on the same CPU as the backend it drives.

    But when threads + backends < #CPUs, the kernel tries to give each thing its own CPU, breaking up the driver/driven pairing.

    Cheers,

    Jeff Janes

  8. Robert, Baron, et al.,

    Come and take a gander at PostgreSQL Scalability Analysis Deconstructed.

    Comments welcomed.

    --njg

  9. I know this could be a dumb question, but I heard somewhere that MySQL does not scale well with cores.  If I need to put up a server with multiple databases (different web sites running Django, LAMP, Rails, etc., each opting for a dedicated database server), and if I have a 24-core machine (2x 12-core Opteron) for the database server, will Postgres scale better, and if so, by what magnitude?

  10. Dear Mr. Robert, please look at the papers "A more robust regression approach to estimate the parameters of super serial scalability law for noisy data" and "Mythbuster for Guerrillas" presented at CMG 2012.

    -Jayanta Choudhury
