I decided to do a little more research on the performance of server-side backup compression, which will be a new feature in PostgreSQL 15 unless, for some reason, the changes need to be reverted before release. The network link I used for my previous testing was, as I mentioned, rather slow, and handicapped by both a VPN link and an SSH tunnel. Furthermore, I was testing using pgbench data, which is extremely compressible. In addition, at the time I did those tests, we had added support for LZ4 compression but not yet for Zstandard compression. Now, however, we not only have Zstandard as an option, but it is also possible to use the library's multi-threading capabilities. So, I wanted to find out how things would work out on a faster network link, with a better test data set, and with all of the compression algorithms that we now have available.
To try to figure that out, I downloaded the UK land registry data mentioned at https://wiki.postgresql.org/wiki/Sample_Databases and loaded it into a PostgreSQL instance built from a recent commit of the master branch that will eventually become v15. The resulting database is 3.8GB. I then tried a few backups between this machine and another machine located on the same EDB internal subnet. Both machines report having 10 Gigabit Ethernet, and iperf reports a bandwidth of 8.73 gigabits per second between them.
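If you want to take the same sort of measurements yourself, it looks roughly like this; the host name and database name below are placeholders, not the ones I actually used:

```
# On one machine, start an iperf server:
iperf -s

# On the other machine, measure throughput to it
# ("other-host" is a placeholder host name):
iperf -c other-host

# After loading the data, check the database size from psql
# ("landreg" is a placeholder database name):
psql -d landreg -c "SELECT pg_size_pretty(pg_database_size(current_database()));"
```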
I tried backups with both -Ft (tar format) and -Fp (plain format), in each case testing out various forms of server-side compression. When the backup is taken in tar format, pg_basebackup is basically just writing the server-generated files to disk. When it's taken in plain format, pg_basebackup decompresses and extracts the archive.
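The invocations looked roughly like the following; the host name and output directory are placeholders, the specific compression level and worker count are just examples, and the option syntax is the one pg_basebackup accepts in PostgreSQL 15:

```
# Tar format: the server-compressed archive is written to disk as-is.
pg_basebackup -h source-host -D /path/to/backup -Ft \
    --compress=server-zstd:level=5,workers=4

# Plain format: the same server-side compression on the wire, but
# pg_basebackup decompresses and extracts the archive on the client.
pg_basebackup -h source-host -D /path/to/backup -Fp \
    --compress=server-lz4
```

Here are the results: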
Basically, the gains are attributable to pg_basebackup needing to do less work. When the backup size drops from 3.8GB to 1.3GB, pg_basebackup has 2.5 fewer gigabytes of data to receive from the network, and 2.5 fewer gigabytes of data to write out to disk. The network is fast enough that the cost of transmitting 2.5 gigabytes of data isn't a real issue, but the kernel still has to copy all of the data received from the network into pg_basebackup's address space, and then it must turn around and copy it from pg_basebackup's address space into the kernel buffer cache so that it can be written out to disk; finally, that data has to actually be written to disk. Both the extra copying and the actual disk writes are costly.
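To get a rough sense of just the disk-write component, you can time writing a comparable volume of data through the page cache; this is only an illustrative measurement with GNU dd, not part of the test above:

```
# Write 2.5GB through the kernel buffer cache and force it out to
# disk before dd reports its timing (the output path is a placeholder):
dd if=/dev/zero of=/tmp/ddtest bs=1M count=2560 conv=fdatasync
rm /tmp/ddtest
```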