Wednesday, September 25, 2019

Synchronous Replication is a Trap

Almost ten years ago, I wrote a blog post -- cautiously titled What Kind of Replication Do You Need? -- in which I suggested that the answer was probably "asynchronous." At that time, synchronous replication was merely a proposed feature that did not exist in any official release of PostgreSQL; now, it's present in all supported versions and has benefited from several rounds of enhancements. Demand for this feature seems to be high, and there are numerous blog posts about it available (EnterpriseDB, Cybertec, Ashnik, OpsDash), but in my opinion, there is not nearly enough skepticism about the intrinsic value of the technology. I think that a lot of people are using this technology and getting little or no benefit out of it, and some are actively hurting themselves.

If you're thinking about deploying synchronous replication -- or really any technology -- you should start by thinking carefully about exactly what problem you are hoping to solve. Most people who are thinking about synchronous replication seem to be worried about data durability; that is, they want to minimize the chances that a transaction will be lost in the event of a temporary or permanent server loss. This is where I think most people hope for more than the feature is really capable of delivering; more on that below. However, some people are concerned with data consistency; that is, they want to make sure that if they update data on the master and then immediately query the data on a slave, the answer they get is guaranteed to reflect the update. At least one person with whom I spoke was concerned with replication lag; in that environment, the master could do more work than the standby could replay, and synchronous replication kept the two from diverging arbitrarily far from each other.

I have few reservations about the use of synchronous replication for data consistency.  For this to work, you need to configure the master with synchronous_commit = 'remote_apply' and set synchronous_standby_names to a value that will cause it to wait for all of the standbys to respond to every commit (see the documentation for details). You'll need PostgreSQL 9.6 or higher for this to work. Also, don't forget testing and monitoring. Remember that if one of your standbys goes down, commits will stall, and you'll need to update synchronous_standby_names to remove the failed standby (and reduce the number of servers for which you are to wait by one).  You can reverse those changes once the standby is back online and caught up with the master. Perhaps in the future we'll be able to do this sort of thing with less manual reconfiguration, but I think this kind of solution is already workable for many people. If the performance hit from enabling synchronous replication is acceptable to you, and if the benefit of not having to worry about the data on the standby being slightly stale is useful to you, this kind of configuration is definitely worth considering.

I also don't think it's a big problem to use synchronous replication to control replication lag. It doesn't seem like an ideal tool, because if your goal is to prevent the standby from getting excessively far behind the master, you would probably be willing to accept a certain amount of lag when a burst of activity occurs on the master, as long as it doesn't continue for too long. Synchronous replication will not give you that kind of behavior.  It will wait at every commit for that commit to be received, written, or applied on the remote side (depending on the value you choose for synchronous_standby_names; see documentation link above). Perhaps someday we'll have a feature that slows down the master only when lag exceeds some threshold; that would be a nicer solution. In the meantime, using synchronous replication is a reasonable stopgap.

Where I think a lot of people go wrong is when they think about using synchronous replication for data durability. Reliable systems that don't lose data are built out of constituent parts that are unreliable and do lose data; durability and reliability are properties of the whole system, not a single component. When we say that a software solution such as synchronous replication improves data durability, what we really mean is that it helps you avoid the situation where you think the data is reliably persisted but it really isn't. After all, neither synchronous replication nor any other software system can prevent a disk from failing or a network connection from being severed; they can only change the way the system responds when such an event occurs.

The baseline expectation of a software system that talks to PostgreSQL is - or ought to be - that when you send a COMMIT command and the database confirms that the command was executed successfully, the changes made by that transaction have been persisted. In the default configuration, "persisted" means "persisted to the local disk." If you set synchronous_commit = off, you weaken that guarantee to "persisted in memory, and we'll get it on disk as soon as we can." If you set synchronous_standby_names, you strengthen it to "persisted to the local disk and also some other server's disk" - or perhaps multiple servers, depending on the value you configure. But the key thing here is that any configuration changes that you make in this area only affect the behavior of the COMMIT statement, and therefore they only have any value to the extent that your application pays attention to what happens when it runs COMMIT.

To make this clearer, let's take an example. Suppose there's a human being - I'll call her Alice - who sits at a desk somewhere. People give orders (which are critical data!) to Alice, and she enters them into a web application, which stores the data into a PostgreSQL database. We have basically three components here: Alice, the web application, and the database. Any of them can fail, and nothing we do can prevent them from failing. The database server can go down due to a hardware or software fault; similarly for the web server; and Alice can get sick. We can reduce the probability of such failures by techniques such as RAID 10 and vaccinations, but we can't eliminate it entirely. What we can try to do is create a system that copes with such failures without losing any orders.

If there is a transient or permanent failure of Alice, it's probably safe to say that no orders will be lost. Maybe the people who normally give orders to Alice will leave them in a pile on her desk, or maybe they'll notice that Alice is out and come back later in the hopes of handing them to her directly once she's back in the office. Either way, the orders will eventually get entered. There are potential failure modes here, such as the building burning down and taking the unentered order papers with it, and there are things that can be done to mitigate such risks, but that's outside the scope of this blog post.

A transient or permanent of failure of the web server is a more interesting case. A total failure of the web server is unlikely to cause any data loss, because Alice will be aware that the failure has occurred. If she goes to the web application where she spends most of her time and it fails to load, she'll hold on to any pending order papers at her desk until the application comes back up, and then deal with the backlog. Even if she's already got the web page open loaded, she'll certainly notice if she hits the button to save the latest order and gets an error page back from the browser. Really, the only way things can go wrong here is if the web application experience some kind of partial failure wherein it fails to save the order to the database but doesn't make it clear to Alice that something has gone wrong. In that case, Alice might discard the order paper on the erroneous belief that the data has been saved.  But otherwise, we should be OK.

Notice that the key here is good error reporting: as long as Alice knows whether or not a particular transaction succeeded or failed, she'll know what to do.  Even if she's uncertain, that's OK: she can go search the database for the order that she just tried to enter, and see if it's there.  If it's not, she can enter it again before discarding the order paper.

Now, let's think about what happens if the database fails catastrophically, such that the entire server is lost. Obviously, if we don't have a replica, we will have lost data. If we do have a replica, we will probably have the data on the replica, and everything will be OK. However, if some orders were entered on the master but not yet replicated to the standby at the time of the failure, they might no longer exist on the new master. If Alice still has the order papers and hears about the failure, we're still OK: she can just reenter them. However, if she destroys every order paper as soon as the web application confirms that it's been saved, then we've got a problem.

This is where synchronous replication can help. If the database administrator enables synchronous replication, then the database server won't acknowledge a COMMIT from the web application until the commit has been replicated to the standby. If the application is well-designed, it won't tell Alice that the order is saved until the database acknowledges the COMMIT. Therefore, when Alice gets a message saying that the order has been saved, it's guaranteed to be saved on both the master and the standby, and she can destroy the original order paper with no risk of data loss - unless both the master and standby fail, but if you're worried about that scenario, you should have multiple standbys.  So, we seem to have constructed a pretty reliable system here, and synchronous replication is an important part of what makes it reliable.

Notice, however, the critical role of the application here. If the application tells Alice that the order is saved before the commit is acknowledged, then the whole thing falls apart. If Alice doesn't pay attention to whether or not the order was confirmed as saved, then the whole thing falls apart. She might, for example, close frozen browser window and fail to recheck whether that order went through after reopening the application. Every step of the pipeline has to be diligent about reporting failures back to earlier stages, and there has to be a retry mechanism if things do fail, or if they may have failed. Remember, there's nothing at all we can do to remove the risk that the database server will fail, or that the web application will fail, or even that Alice will fail. What we can do is make sure that if the database fails, the web application knows about it; if the web application fails, Alice knows about it; and if Alice fails, the people submitting the orders know about it. That's how we create a reliable system.

So, the "trap" of synchronous replication is really that you might focus on a particular database feature and fail to see the whole picture. It's a useful tool that can supply a valuable guarantee for applications that are built carefully and need it, but a lot of applications probably don't report errors reliably enough, or retry transactions carefully enough, to get any benefit.  If you have an application that's not careful about such things, turning on synchronous replication may make you feel better about the possibility of data loss, but it won't actually do much to prevent you from losing data.

9 comments:

  1. This is really good read on understanding how Application,Database and Disk mirroring can play a vital role on saving data from loss.

    ReplyDelete
  2. Totally misleading title. Every feature is good when you use appropriately

    ReplyDelete
  3. I also think very misleading title.

    ReplyDelete
  4. Thanks a lot Robert for the write up. It's very helpful.

    ReplyDelete
  5. Agreed with other commenters. Great article but the title is misleading!

    ReplyDelete
  6. Good write up. In these days companies are asking to be a full stack developer ignoring in depth database or middleware knowledge required.
    What about using an replication through distributed message bus like kafka where master will acknowledge a commit when it writes transaction to the message bus but not to standby yet. Standby will pickup from each individual's last read. There will be a lag but much less chance of data loss.

    ReplyDelete
  7. Replies
    1. Well, I guess I should have chosen a different title, but I'm glad that the comments are all complaining about my choice of title rather than the substance...

      Delete