Wednesday, November 17, 2010

When Your Data Isn't Made of Gold

Josh Berkus' recent blog posting on What We Should Be Learning from MySQL, part 2 includes the following quote: "We tend to treat all data in Postgres as if it were made of gold, and not all data is equally valuable."  He goes on to wonder what we can do to better provide for the case where your data isn't made of gold.


As it happens, I have an answer to that question - one answer, anyway.  I just recently submitted a patch implementing unlogged tables, which I hope will become part of PostgreSQL 9.1.  Unlogged tables are tables for which data changes are not written to the write-ahead log, meaning that they are not replicated and cannot survive a crash; instead, an unlogged table is automatically truncated at database startup.  They therefore are appropriate for non-critical data which you can afford to lose in the event of unexpected downtime, such as user session information.  What you get in exchange for giving up durability is better performance (disclaimer: not all workloads will show as much benefit as Andy Colson found here).
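
For concreteness, here is a minimal sketch of what using such a table might look like, assuming the CREATE UNLOGGED TABLE syntax from the submitted patch; the table and column names are invented for the example:

    -- Session state we can afford to lose after a crash; the UNLOGGED
    -- keyword is the only difference from an ordinary table definition.
    CREATE UNLOGGED TABLE user_sessions (
        session_id text PRIMARY KEY,
        user_id    integer NOT NULL,
        payload    text,
        last_seen  timestamptz NOT NULL DEFAULT now()
    );

    -- Reads and writes work as usual, but generate no WAL; after a crash,
    -- the table comes back empty.
    INSERT INTO user_sessions (session_id, user_id, payload)
    VALUES ('d2f1c7', 42, 'cart=3');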

Of course, in existing versions of PostgreSQL, it is already possible to configure your database for fast, unreliable operation, as I've blogged about in the past.  Run at top speed and, if the database crashes, just blow it away and make a new one.  Unlogged tables, however, are in many ways even better suited to this use case.  First, they should be even faster than running with fsync turned off.  Instead of writing WAL but not making sure it hits the disk, we simply don't write WAL at all.  Second, you can mix more and less important data in the same database and get the durability guarantee that you want for each type.  As far as I know, no other database product provides such a feature.  And third, even if you aren't mixing unlogged tables with ordinary tables, it's nicer to have your tables get truncated (and you can just start loading data into them again) than to have your whole database be corrupted and need to be blown away and recreated from scratch (postgresql.conf, user provisioning, table schemas, etc.).
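
For comparison, the whole-cluster approach amounts to roughly the following postgresql.conf settings (a rough sketch; see the earlier post for the full discussion).  These apply to every table in the cluster, whereas unlogged tables let you make the choice per table:

    # Cluster-wide "fast but unsafe" operation -- a rough sketch.
    fsync = off                 # don't force WAL (or data files) to disk
    synchronous_commit = off    # don't wait for commit records to be flushed
    full_page_writes = off      # skip torn-page protection in WAL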

If unlogged tables aren't right for your use case, another important knob to consider - which already exists in released versions - is synchronous_commit.  When synchronous_commit is disabled, which can be done either globally or only for certain transactions, the write-ahead log entry for a transaction's commit is written to disk asynchronously.  If your system is properly configured, and synchronous_commit=on, then when you get that COMMIT message, your transaction is absolutely guaranteed to be durably committed.  With synchronous_commit=off, there is a very short window of time after you've received the COMMIT during which your transaction could still be lost in the event of a system crash.  Typically, synchronous_commit=off provides a significant performance boost - if it doesn't, chances are your system has a WAL reliability problem.  For many applications, especially web applications, a short window of time during which a transaction can be lost is an acceptable price to pay for a large performance increase.
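
Here is a quick sketch of how that knob can be used for an individual transaction (the table and columns are invented for the example):

    -- Relax durability for just this transaction; everything else in the
    -- system keeps the normal guarantee.
    BEGIN;
    SET LOCAL synchronous_commit = off;
    INSERT INTO page_views (user_id, url, viewed_at)
    VALUES (42, '/pricing', now());
    COMMIT;  -- returns without waiting for the WAL flush

    -- Or flip it for the whole session (or globally in postgresql.conf):
    SET synchronous_commit = off;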

I think that, in the future, we may be able to provide more options to allow people to relax the data integrity guarantees that PostgreSQL provides in controlled ways.  For example, I can imagine a "dirty read" table, where transactions are not used; instead, rows become visible as soon as they're inserted, and disappear as soon as they're deleted.  Such a table would be unsuitable for many business applications, but if your application only does single-row operations indexed by primary key, it might work just fine; and it would open up a number of interesting optimization opportunities that aren't available for ordinary tables.  Or, you might have a "no snapshot" table, where rows don't become visible until the inserting transaction commits, but we make no attempt to guarantee serializability: rows pop into existence the instant they're committed, and disappear out from under you the instant a deleting transaction commits.

Of course, it also bears mentioning that there are many cases where treating your data as if it were made of gold is a good thing.  Unlogged tables provide an option to weaken ACID semantics to improve performance; there is simultaneous work going on to allow trade-offs in the other direction: giving up some performance for even stronger ACID semantics.  There is also general performance and reliability work going on where the community is simply making the product better, without giving up anything.  Ultimately, the goal is to develop a product that can be useful to many different people with many different use cases.  Ideas for further improvement are always welcome.

6 comments:

  1. -- Second, you can mix more and less important data in the same database and get the durability guarantee that you want for each type. As far as I know, no other database product provides such a feature.

    Not that I've used MySQL in anything less than a dog's age, but their docs (and such) claim that tables can be served by whatever "storage engine" one specifies; so MyISAM for lead and InnoDB for gold.

  2. Good work Robert. Completely agree.

    Not really sure what Josh was talking about either!

    PostgreSQL has supported facilities for altering the performance/durability options for data since 8.3, when I wrote asynchronous commit. So we've been able to offer both MyISAM and InnoDB behaviour, but at the transaction level, not just the table level.

    @RYoung - Oracle also provides these features, though they are not well publicised. One negative aspect of their implementation is that you are required to explicitly state what you want on the COMMIT command. So it requires application changes to implement, and most of their own tools don't even support it, so in effect it doesn't exist.

    That concept of letting the user decide the exact mix of performance/durability they would like is also the basis for my work on synchronous replication. And before you ask, yes, still working on it!

  3. How does this feature work with streaming replication?

    Can I create an unlogged table on the master, and will it be visible on the client? I think yes, right? But is it possible to insert data into this table ON THE CLIENT?

    That would be a nice feature, for instance for load balancing: storing session data on an SR client, or running reporting tasks on an SR client.

    Thx for your work!

    Regards, Andreas 'akretschmer' Kretschmer

  4. @Simon Riggs:
    MyISAM is dead for most serious MySQL users. Even for users who don't care about durability, its horrible performance under concurrency and its corruption issues make it painful to use at large scale.

    InnoDB offers similar options as PG in terms of durability. See http://dev.mysql.com/doc/refman/5.1/en/innodb-parameters.html#sysvar_innodb_flush_log_at_trx_commit . It also has something akin to fsync=off, but that footgun is not documented.

    ...At some point I will respond to all the rest of the hubbub I stirred up...

  5. @andreas

    Since replication is based on the write-ahead log, unlogged tables will not be replicated.

    Currently, no writes at all are possible on the slave, not even to temporary tables. There are some technical challenges that make that difficult to implement (notably, XID assignment) but perhaps we'll overcome those in a future release.

  6. Nice work, Robert! We've (omniti) only been asking for this for 3 or 4 years now. On one system in particular we have about 500 tables, all of which are constantly being turned over from a master, "golden" copy, and it generates hundreds of GB of xlogs every day. In case of a crash, we toss the data anyway and reload, so for us this will be awesome.

    A more general use case for others, but along the same lines, is for people with replicated systems using something like Slony, where you can set up the slaves not to generate xlogs for any of the tables. This will seriously cut down on the amount of write traffic on those servers. It does make your read instances throw-away nodes, but as you've noted, this is perfectly fine in a lot of cases.
