Thursday, October 14, 2010

Choosing a Datastore

In thinking about which database might be best for any particular job, it's easy to get lost in the PR. Advocates of traditional relational database systems like Oracle and PostgreSQL tend to focus on the fact that systems are feature-rich and provide features such as atomicity, consistency, isolation, and durability (ACID), while advocates of document databases (like MongoDB) and key-value stores (memcached, Dynamo, Riak, and many others) tend to focus on performance, horizontal scalability, and ease of configuration.  This is obviously an apples-and-oranges comparison, and a good deal of misunderstanding and finger-pointing can result.  Of course, the real situation is a bit more complicated: everyone really wants to have all of these features, and any trade-off between them is bound to be difficult.

It seems to me that a big part of the decision comes down to which problems you want to solve within the database and which problems you want to solve using some other part of your architecture.  While there may be a few large web sites that have absolutely no data that requires ACID semantics, I suspect there aren't very many.  For example, if users are making purchases through your web site, you're not going to want to take any risk of charging the user's credit card and then failing to ship the product, or, just as bad, the other way around.  Whatever system you use to track that information needs to be extremely reliable, and not prone to losing transactions in the event of, say, an unexpected power failure.  On the other hand, if you're hoping that a large number of users will visit your web site (and if it's a public-facing web site, you probably are) you're going to need to think about scalability. Assuming that you can run the whole system off of one gigantic database is not necessarily a practical, or cost-effective, solution.  It would be much nicer to have a system that allows you to add new nodes, preferably on the fly, and preferably really quickly, so that if a link to your site gets posted on Slashdot, you can rent a few more virtual machines from your favorite cloud provider and ramp up to meet demand.

Unfortunately, with the products that are available on the market today, you can't really have everything all at once.  There are tools (like pgpool-II and GridSQL) that are designed to work with PostgreSQL to improve scalability; and there is a profusion of new products that attack the problem from the other direction - say, by starting with a distributed hash table like memcached, and then attempting to add durability characteristics.  In both cases, there are some rough edges.  Over time, I suspect we'll see some convergence, as some of the best ideas from different products get folded together to create new solutions that combine the best characteristics of what's out there today.  In the meantime, though, we'll need to decide between the tools that are available now.  I'll freely admit to being biased: I'm a ten-year veteran of PostgreSQL, and I love the feature set it provides.  But there's something else I like about it, too: it's fourteen years old.

Old is good, because old gives you time to find the bugs.  We're still finding 2-3 bugs per month in PostgreSQL that have existed since the release of PostgreSQL 7.4.0 in November 2003, as a quick perusal of the commit log for the REL7_4_STABLE branch (just recently end-of-lifed) will demonstrate.  Of course, the newer releases, while they have many great new features and performance improvements that make an upgrade highly worthwhile, have even more bug fixes, because they contain a small percentage of new code and, of course, that new code is where the new bugs are most likely to be uncovered. And this problem is certainly not unique to PostgreSQL: every software development project inevitably has bugs due to some combination of sloppy coding and design choices that in retrospect turn out to be wrong.  In the words of Frederick Brooks:
All programmers are optimists. Perhaps this modern sorcery especially attracts those who believe in happy endings and fairy godmothers. Perhaps the hundreds of nitty frustrations drive away all but those who habitually focus on the end goal. Perhaps it is merely that computers are young, programmers are younger, and the young are always optimists. But however the selection process works, the result is indisputable: 'This time it will surely run' or 'I just found the last bug'.
Reality has a way of intruding on our optimism.  All systems do fail in real life, and not only because of software bugs.  The best we can do is try to anticipate when and how they will fail.  Another advantage of running old software is that it's possible, with proper training, to have a pretty good idea where the bodies are buried - to know what is likely to go wrong, and how it's likely to go wrong, and what can be done about it.  With new systems and new technology, that's sometimes a little less clear, as it seems to have been in the case of Foursquare's lengthy outage earlier this month (which is futher discussed on the MongoDB Google Group).  Of course, even with supposedly well-tested technology, really bad things can happen, as in the case of American Airlines Flight 587, which Bryan Cantrill recently discussed at Surge: airplanes should be designed so that the tail fin doesn't snap off, even if you stomp on the rudder really hard.  All things being equal, though, such design flaws are more likely to exist in newer technology.

In the end, I think that the choice of datastore is likely to be influenced by many factors: the design of the application, the character of the underlying data, and of course the preferences, knowledge, and skillset of the person making the decision.  But as you go about making that decision, it's worth thinking about the fact that relational databases such as PostgreSQL, MySQL, and Oracle have been
around for a very long time, have very stable code bases, and their failure modes are well-understood and well-documented.


  1. There's a well understood principle that the more safety you need the more you over-engineer things. If you're building a bridge you should design it to withstand much more stress than it might possibly bear. If it needs to bear a maximum load of 3000 tons, you build it bear, say, 10,000 tons.

    The same principle applies to software if it's running a pacemaker or controlling a nuclear power station. As we get away from things that could cost lives, the need to over-engineer gets fuzzier.

    Banks used to be paranoid about software correctness, but seem less so these days, sadly. And for many large data stores, a loss of service or coherence won't be a tragedy. If Google or Facebook lose track of a few nodes for a while, the world won't end. loss of coherence in secondary data stores like data warehouses doesn't matter as much either.

    Very few people will need shopping carts spread over large numbers of nodes, though. A single instance of PostgreSQL can handle a staggeringly large volume of sales transactions with ease. I strongly suspect many people are jumping into what Richard Hipp calls the Post-Modern Databases because it's sexy and has buzz, not for any great technical reason.

  2. I think any mention of scaling out with PostgreSQL should at least mention that plproxy exists and is a reliable solution.