There were some interesting comments on my previous blog post on logical replication, which have inspired me to write a bit more about this topic. Exactly how should this actually work? Every time a tuple is inserted, updated, or deleted, we already write to the transaction log the entire contents of any new tuple, and the physical position of any old tuple. It would be nice to piggyback somehow on the work that's already being done there, but it's hard to see exactly how. I see three main problems.
First, we need to know more about any old tuple than its physical position. Specifically, we probably want to know its primary key. In the case of an update, this is only a problem for the current representation if the primary key has been changed; in the case of a delete, it's always a problem. It probably wouldn't be all that difficult or expensive to have a mode where we write the old tuple, or the relevant columns from it, to the log as well. So this seems manageable.
Second, unless we want to be able to replicate only to a cluster that is guaranteed to be exactly binary-compatible with the one we're replicating from, and with identical type definitions, we're going to need to serialize the tuple data to text. This is hard. You can't make sense of the tuple data on its own - you need the column definitions for the table it came from. The data isn't self-identifying: a four-byte integer is represented by four bytes, period. Two two-byte integers could look just the same; and it could equally well be the beginning of a variable-length value. This wouldn't be a problem if you were willing to lock out all concurrent schema changes on tables with replication enabled, but otherwise, you need some way to make sure that each tuple gets decoded using the tuple descriptor that was in effect at the time the WAL log entry was made, which might not be the same as the current tuple descriptor.
Third, large values are going to get pushed out to the TOAST table, so what we'll actually see in the WAL log is a series of insertions to the TOAST table and an insert into the parent table containing a TOAST pointer. Some kind of logic to reassemble all of those physical inserts into a single logical operation would be needed.
The alternative to all of this is to give up on reconstructing the tuples from the WAL data, and instead have a separate replication log that gets updated in parallel with WAL. That seems architecturally a lot simpler, but I think the performance will be quite a lot worse.