I have been working with my colleagues Tushar Ahuja, Jeevan Ladhe, and Dipesh Pandit to make some improvements to pg_basebackup for version 15. A lot of that work has felt a bit like boring but necessary refactoring, and it's easy to find yourself wondering whether it will really do anybody any good. I was feeling optimistic after today's commits, so I decided to give it a try.
I logged into a performance testing machine provided by my employer, EDB, and created a pgbench database there, scale factor 100. Then I tried to take a base backup from my EDB laptop. For security reasons, this required that I connect via a VPN and also use an ssh tunnel, sketched below.
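That tunnel is why the commands below use -p 51443 -h ::1. I haven't reproduced my exact invocation, but it looked roughly like this, with a placeholder host name and the server assumed to be listening on the default port 5432:

[rhaas ~]$ ssh -N -L 51443:localhost:5432 perf-machine.example.com

With that setup in place, I was able to take a backup: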
[rhaas pgsql]$ time pg_basebackup --compress none -D h1 -cfast -Xfetch -Ft -p 51443 -h ::1 -U robert.haas
That's not great. Let's see what happens if I enable server-side compression using LZ4:
[rhaas pgsql]$ time pg_basebackup --compress server-lz4 -D h2 -cfast -Xfetch -Ft -p 51443 -h ::1 -U robert.haas
Wow. That's a lot better. Let's see what happened to the backup size:
[rhaas pgsql]$ ls -lh h1 h2
h1:
-rw------- 1 rhaas staff 177K Feb 11 11:25 backup_manifest
-rw------- 1 rhaas staff 1.5G Feb 11 11:25 base.tar

h2:
-rw------- 1 rhaas staff 177K Feb 11 11:27 backup_manifest
-rw------- 1 rhaas staff 169M Feb 11 11:27 base.tar.lz4
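Incidentally, if you later need the uncompressed archive from a server-lz4 backup, the stock lz4 command-line utility can produce it, assuming you have it installed:

[rhaas pgsql]$ lz4 -d h2/base.tar.lz4 h2/base.tar

After that, base.tar can be extracted with tar as usual.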
Admittedly, this is a very sympathetic test case. The data set created by pgbench is highly repetitive, so it compresses well, and my network bandwidth was quite limited: my home Internet connection is not especially fast, and I was tunneling through ssh on top of it. The gains you see if you try this yourself may therefore be smaller, or absent altogether. On the other hand, I don't think this is an unrealistic test case. Backing up a database over a slow WAN link is something people genuinely want to do, and some of those databases are highly compressible. Furthermore, even if you got half the gain that I saw here, it would still be quite worthwhile.
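As an aside, if your network is fast but backup storage is your concern, PostgreSQL 15 also supports compressing on the client side instead, which spares the server's CPU at the cost of sending the data uncompressed over the wire. A hypothetical invocation, reusing the same connection details with h3 as a stand-in target directory:

[rhaas pgsql]$ pg_basebackup --compress client-lz4 -D h3 -cfast -Xfetch -Ft -p 51443 -h ::1 -U robert.haas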
Now you might complain that you don't like the tar backup format, and would really prefer a plain backup. So you want to do something like this:
[rhaas pgsql]$ time pg_basebackup -D e1 -cfast -Xfetch -Fp -p 51443 -h ::1 -U robert.haas
It turns out we've got that covered, too. You can use LZ4 compression on the server side and then decompress and extract the archive on the client side:
[rhaas pgsql]$ time pg_basebackup -D e2 --compress server-lz4 -cfast -Xfetch -Fp -p 51443 -h ::1 -U robert.haas
The speedup is less here, because extracting the archive takes some work that is not reduced by compression, and the decompression itself also takes time. However, it's still quite a nice improvement. You might wonder, though, whether this feature is just cheating. Did we really end up with a proper backup?
[rhaas pgsql]$ pg_verifybackup e2
backup successfully verified
Looks like we did. Nice!
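Since these backups were taken with -Xfetch, the plain-format directory already contains the WAL needed to reach consistency, so one further sanity check is to start a throwaway server directly on it. Something along these lines, where 5555 is just an arbitrary free port and e2.log a scratch log file:

[rhaas pgsql]$ pg_ctl -D e2 -o '-p 5555' -l e2.log start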
All the usual caveats apply here: these patches are committed to the PostgreSQL development branch, but there is no guarantee that they will appear in PostgreSQL 15. It is possible that they might turn out to have bugs or design problems which could result in them being reverted. If that doesn't happen, though, I expect these changes to help a lot of people once PostgreSQL 15 becomes generally available.
If you're interested in learning more about the work that my colleagues and I have been doing in this area, my colleague Hettie Dombrovskaya is hosting a meetup next week, at which I will be speaking. You can find all the details here: