Wednesday, July 07, 2010

Distributed Serialization Anomalies

One of the more difficult responsibilities of a database is to provide you with the illusion that transactions on the system are executed sequentially, one after another, while in fact allowing as much parallelism as possible. PostgreSQL's MVCC implementation does this using "snapshots": each statement (or, if you choose the serializable isolation level, each transaction), upon first access to the database, records which transactions have committed as of that moment, and everything it does afterwards will see the effect of those transactions, but not any transactions committed later. (There are some exceptions to this rule when using READ COMMITTED mode with INSERT, UPDATE, or DELETE statements.)
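The idea can be sketched in a few lines of Python. This is purely illustrative (PostgreSQL's real snapshot machinery tracks xmin/xmax bounds and in-progress transaction lists, not a simple set), but it captures the visibility rule described above: a snapshot freezes the set of committed transactions, and a row version is visible only if its creating transaction is in that set.

```python
# Minimal sketch of MVCC snapshot visibility (illustrative only; this is
# not PostgreSQL's actual data structure). A snapshot records which
# transactions had committed when it was taken; a row version created by
# transaction `row_xid` is visible iff that transaction is in the set.

class Snapshot:
    def __init__(self, committed_xids):
        # Freeze the set of committed transactions at snapshot time.
        self.committed = frozenset(committed_xids)

    def is_visible(self, row_xid):
        # Transactions that commit after the snapshot was taken
        # remain invisible to it.
        return row_xid in self.committed

committed = {1, 2, 3}          # transactions committed so far
snap = Snapshot(committed)     # a statement takes its snapshot here
committed.add(4)               # xid 4 commits afterwards

print(snap.is_visible(2))      # True: committed before the snapshot
print(snap.is_visible(4))      # False: committed after the snapshot
```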

This produces, more or less, the illusion that SQL statements execute sequentially, with each one completing its work before the next one begins. This illusion is extremely important and valuable in many real-world applications. For example, if you transfer money from your checking account to your savings account, a banking application might insert two new rows into the "banking_transactions" table: one to show the debit from checking, and another to show the credit to savings. It wouldn't be good if some other query saw just one of these two new rows: it would look as if the money disappeared from checking but did not appear in savings, or perhaps as if it had appeared in savings without disappearing from checking. You'd be unhappy about the first scenario, and the bank would be unhappy about the second one. This type of scenario is called a serialization anomaly, and databases are responsible for preventing them. In this case, it's pretty easy to make sure this problem can't happen: just do both inserts within a single transaction, and then commit it.

Things get a little trickier when there's more than one database involved. Suppose that I'm moving money from my account (at one branch) to my friend Magnus's account (at a different branch of the same bank). As before, we must make two transaction entries: one showing the debit to my account, and the other showing the credit to his account. We can start transactions on both nodes and do the inserts, but it's not possible to commit both transactions at the very same instant: there could always be a crash after one transaction commits, but before the other one commits.

We can work around this problem to some extent using a protocol called two-phase commit: we'll issue a "PREPARE TRANSACTION" command in both transactions, which should be enough to guarantee that a subsequent "COMMIT PREPARED" command, even after an intervening crash, has no chance of failure. So, we start a transaction on each database, do an insert on each database, prepare both transactions, and then commit both transactions. If there's a crash (or loss of connectivity) after either transaction is prepared but before both transactions are committed, we can still get things back to a consistent state once things are back up again. How? We look to see if either transaction committed; if so, we commit the other one. If not, we see whether both transactions were successfully prepared; if so, we can commit or abort both; if not, we must abort both.
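The recovery rule at the end of that paragraph is simple enough to write down directly. Here is a toy Python sketch of the decision procedure for our two-node transfer (a real implementation would read these states from durable two-phase-commit state on each node):

```python
# Recovery decision for a two-node transfer using two-phase commit,
# following the rule in the text: if either branch committed, commit the
# other; if both are prepared, either outcome is safe; otherwise abort.
ACTIVE, PREPARED, COMMITTED = "active", "prepared", "committed"

def recovery_decision(state_a, state_b):
    """Decide the fate of both branches once the nodes are back up."""
    if COMMITTED in (state_a, state_b):
        return "commit both"           # one side committed: finish the job
    if state_a == PREPARED and state_b == PREPARED:
        return "commit or abort both"  # both prepared: either is safe
    return "abort both"                # at least one side never prepared

print(recovery_decision(COMMITTED, PREPARED))  # commit both
print(recovery_decision(PREPARED, PREPARED))   # commit or abort both
print(recovery_decision(PREPARED, ACTIVE))     # abort both
```

The key property that makes this work is the one stated above: once a transaction is prepared, a later COMMIT PREPARED cannot fail, so "commit both" is always achievable when one side has already committed.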

This solves the problem of making sure that no money can be permanently lost (or created), but there will still be a period of time during which we can see inconsistent views of the system as a whole. Imagine that the bank auditor comes along and runs a report across all bank branches adding up the bank's assets and liabilities. It's possible that he'll query one of the two databases involved in our hypothetical funds transfer before the transaction commits on that node, but by the time he visits the other one, it's committed - therefore he'll see the transferred funds either in both accounts, or in neither one, depending on the order in which he hits the different branches. This is a distributed serialization anomaly.

Distributed serialization anomalies are much harder to avoid than regular serialization anomalies (which are a hard problem all by themselves). One method - which is used by Postgres-XC - is to have a single authority (which Postgres-XC calls a global transaction manager) which hands out snapshots and transaction IDs across all nodes in the cluster; regrettably, there is a potential for this to become a bottleneck, or a single point of failure (see Postgres-XC_Write-Scalable_Cluster.pptx, slides 10 and following).

Unfortunately, there may not be many good alternatives. There is a technology called commitment ordering which seems to have a long paper trail[1] in the academic literature, and which has been studied in relation to MVCC. The good news is that commitment ordering does not require a global coordinator of any kind; each node operates independently and does not even need to know the identities of the other nodes, or even how many exist. It requires no additional communication of any kind. The bad news is that it operates by aborting potentially problematic transactions, and it might end up aborting quite a lot of them. The rule is simply that the serialization order must match the commit order; so if transaction A reads x and writes y, transaction B reads y, and transaction A then commits, the system will abort B (because there could be a read-write dependency cycle between A and B involving another database).
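A toy sketch of that rule, under the reading that B's read of the old version of y places B before A in the serialization order (a read-write antidependency): when a transaction commits, any still-undecided transaction that must serialize before it gets aborted, since it could no longer commit in a compatible position. This is illustrative only, not a full commitment-ordering implementation.

```python
# Sketch of the commitment-ordering rule: the commit order must match
# the serialization (conflict) order, so committing a transaction aborts
# every undecided transaction that must serialize *before* it.

class COScheduler:
    def __init__(self):
        self.predecessors = {}  # txn -> txns that must serialize before it
        self.decided = {}       # txn -> "committed" or "aborted"

    def add_conflict(self, first, second):
        # `first` must come before `second` in any serial order, e.g.
        # `first` read a row version that `second` overwrote.
        self.predecessors.setdefault(second, set()).add(first)

    def commit(self, txn):
        # Committing txn fixes its place in the commit order; an
        # undecided predecessor can no longer commit in a compatible
        # position, so it is aborted. No global knowledge is needed.
        for t in self.predecessors.get(txn, ()):
            if t not in self.decided:
                self.decided[t] = "aborted"
        self.decided[txn] = "committed"

s = COScheduler()
s.add_conflict("B", "A")   # B read y before A's write: B precedes A
s.commit("A")              # A commits first...
print(s.decided["B"])      # ...so undecided predecessor B is aborted
```

Note that if B had already committed before A, no abort would occur: the commit order would already agree with the serialization order.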

Another alternative is to build up a commit-order dependency graph that spans all the databases involved in the transaction. That is, we imagine a graph with each unaborted transaction as a vertex. If A reads or updates a row and B subsequently updates it, we add an edge from A to B. If A updates a row and B subsequently reads the updated version (or a later version), we also add an edge from A to B. If, at any time, adding an edge to the graph would create a cycle, we abort one of the constituent transactions. Kevin Grittner and Emmanuel Cecchet pointed out a paper by Michael Cahill on this topic[2]; one of the advantages of this approach is that it is possible to prevent all serialization anomalies, which our current approach does not. Kevin and Dan Ports have proposed a patch for 9.1 which would implement true serializability for a single PostgreSQL database, but it's not clear that this would scale well to a distributed system.
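The core of the graph-based approach is a cycle test at edge-insertion time. Here is a minimal Python sketch; a real implementation would also have to prune committed transactions from the graph and exchange edges between nodes, which is exactly the part that makes distribution hard.

```python
# Sketch of the dependency-graph approach: edges record commit-order
# constraints between transactions, and an edge that would close a cycle
# forces one of the transactions involved to abort instead.

def creates_cycle(edges, src, dst):
    # Adding src -> dst closes a cycle iff dst already reaches src.
    stack, seen = [dst], set()
    while stack:
        node = stack.pop()
        if node == src:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(edges.get(node, ()))
    return False

def add_edge(edges, src, dst):
    if creates_cycle(edges, src, dst):
        return False               # caller must abort a transaction
    edges.setdefault(src, set()).add(dst)
    return True

g = {}
print(add_edge(g, "A", "B"))   # True: A -> B is fine
print(add_edge(g, "B", "C"))   # True: B -> C is fine
print(add_edge(g, "C", "A"))   # False: C -> A would close a cycle
```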

[1] e.g. The Principle of Commitment Ordering, or Guaranteeing Serializability in a Heterogeneous Environment of Multiple Autonomous Resource-Managers, Yoav Raz, 1990 [PDF].
[2] Serializable Isolation for Snapshot Databases, Michael J. Cahill, Uwe Röhm, Alan D. Fekete, 2008 [PDF].


  1. Weikum & Vossen, chaps. 18 and 19, have all the detail, and maths, one could want.

  2. Just having true serializability in an ordinary Postgres database would be really nice; otherwise you always have to consider concurrent transactions and the possibility of serialization anomalies.
    Guaranteeing correct execution of concurrent transactions would be nearly impossible (without taking full table locks) when there are complicated transactions, and perhaps dynamic SQL queries, under a lower isolation level like the snapshot isolation that is available now.

    Dan S

  3. You cannot get better performance than what Commitment ordering provides in a distributed environment. If you try to coordinate transactions' precedence order to avoid aborts due to serializability violations, you take an unacceptable performance hit. Thus you should not coordinate (by any global graph or entity, or timestamps, or direct communication). Thus Commitment ordering achieves the best performance, with the unavoidable price of aborts, as is common in all 2PL databases (which all commercial systems support). The number of aborts is typically small, and all database systems live with it well.

    Weikum & Vossen give pure nonsense about Commitment ordering and distributed transaction management.

  4. Transactions are aborted for serializability violations only because of the order in which their operations are scheduled. Commitment ordering (CO) is the only technique that is completely indifferent to that order, and thus can be combined with any method/scheduling. It only orders commits, and aborts when a violation is discovered in the operation scheduling order that happened earlier (possibly produced by another mechanism, or at random in optimistic CO). Thus CO is not the reason for such aborts. 2PL prevents serializability aborts only to replace them with deadlocks: the scheduling, if it does not generate a cycle in the precedence graph, generates one in the wait-for graph, and an abort to break the cycle is unavoidable! See the Serializability article in Wikipedia.

  5. Yes.

    The theory of commitment ordering introduces a new graph, the augmented conflict graph, which is the union of the precedence graph and the wait-for graph. This graph captures all conflicts, both materialized and non-materialized. A cycle in it is the ultimate test for either a serializability violation or a deadlock, regardless of the concurrency-control method: 2PL, optimistic, whatever. A cycle is determined only by the order in which transactions request I/O. Explained in Wikipedia's articles.

  6. In the article above there is a mistake, which is common: the two-phase commit protocol (2PC) by itself does not solve the serializability anomaly! Only 2PC+CO solves it. It is true that most databases are strong strict 2PL (SS2PL), and thus also CO; thus SS2PL+2PC solves the problem, and this is the source of the article's mistake (crediting 2PC by itself), since SS2PL is taken for granted (it should not be; mechanisms other than SS2PL may exist).

  7. Serializable SI [2] does not mention distribution and does not seem to scale. CO [1] is intended for distribution and seems to be the only existing general solution for global serializability that scales.

  8. SerializableSI does not comply with CO (MVCO) and thus does not scale effectively. There is no use combining SerializableSI with MVCO.

    Combining regular SI with MVCO makes more sense and scales, but it is likely that some of the simplicity and effectiveness of SI is lost through the need to explicitly record and follow (MV) conflicts to determine CO. However, for scalability there is no way around MVCO, given the way transactions are normally scheduled. MVCO is the price to pay for SI scalability, and it is a very reasonable price.

  9. In the above answer

    SI is Snapshot isolation
    SerializableSI is Serializable snapshot isolation in [2]
    MV is multi-version

  10. The MVCO theorem prohibits SerializableSI from scaling while preserving database autonomy, but it does not prohibit scalability of SI itself (without full serializability; whether and how it scales still needs checking), because with SI we are not talking about local and global serializability. However, combining SI with MVCO (COSI) provides both serializability and scalability with database autonomy (no need to distribute concurrency-control information).