Monday, November 29, 2010

MySQL vs. PostgreSQL, Part 1: Table Organization

I'm going to be starting an occasional series of blog postings comparing MySQL's architecture to PostgreSQL's architecture.  Regular readers of this blog will already be aware that I know PostgreSQL far better than MySQL, having last used MySQL a very long time ago when both products were far less mature than they are today.  So, my discussion of how PostgreSQL works will be based on first-hand knowledge, but discussion of how MySQL works will be based on research and - insofar as I'm can make it happen - discussion with people who know it better than I do.  (Note: If you're a person who knows MySQL better than I do and would like to help me avoid making stupid mistakes, drop me an email.)

In writing these posts, I'm going to try to avoid making value judgments about which system is "better", and instead focus on describing how the architecture differs, and maybe a bit about the advantages of each architecture.  I can't promise that it will be entirely unbiased (after all, I am a PostgreSQL committer, not a MySQL committer!) but I'm going to try to make it as unbiased as I can.  Also, bearing in mind what I've recently been told by Baron Schwartz and Rob Wultsch, I'm going to focus completely on InnoDB and ignore MyISAM and all other storage engines.  Finally, I'm going to focus on architectural differences.  People might choose to use PostgreSQL because they hate Oracle, or MySQL because it's easier to find hosting, or either product because they know it better, and that's totally legitimate and perhaps worth talking about, but - partly in the interests of harmony among communities that ought to be allies - it's not what I'm going to talk about here.

So, all that having been said, what I'd like to talk about in this post is the way that MySQL and PostgreSQL store tables and indexes on disk.  In PostgreSQL, table data and index data are stored in completely separate structures.  When a new row is inserted, or when an existing row is updated, the new row is stored in any convenient place in the table.  In the case of an update, we try to store the new row on the same page as the old row if there's room; if there isn't room or if it's an insert, we pick a page that has adequate free space and use that, or failing all else extend the table by one page and add the new row there.  Once the table row is added, we cycle through all the indexes defined for the table and add an index entry to each one pointing at the physical position of the table row.  One index may happen to be the primary key, but that's a fairly nominal distinction - all indexes are basically the same.

Under MySQL's InnoDB, the table data and the primary key index are stored in the same data structure.  As I understand it, this is what Oracle calls an index-organized table.  Any additional ("secondary") indexes refer to the primary key value of the tuple to which they point, not the physical position, which can change as leaf pages in the primary key index are split.  Since this architecture requires every table to have a primary key, an internal row ID field is used as the primary key if no explicit primary key is specified.

Since Oracle supports both options, they are probably both useful.  An index-organized table seems particularly likely to be useful when most lookups are by primary key, and most of the data in each row is part of the primary key anyway, either because the primary key columns are long compared with the remaining columns, or because the rows, overall, are short.  Storing the whole row in the index avoids storing the same data twice (once in the index and once in the table), and the gain will be larger when the primary key is a substantial percentage of the total data.  Furthermore, in this situation, the index page still holds as many, or almost as many, keys as it would if only a pointer were stored in lieu of the whole row, so one fewer random I/Os will be needed to access a given row.

When accessing an index-organized table via a secondary index, it may be necessary to traverse both the B-tree in the secondary-index, and the B-tree in the primary index.  As a result, queries involving secondary indexes might be slower.  However, since MySQL has index-only scans (PostgreSQL does not [update: now it does]), it can sometimes avoid traversing the secondary index.  So in MySQL, adding additional columns to an index might very well make it run faster, if it causes the index to function as a covering index for the query being executed.  But in PostgreSQL, we frequently find ourselves telling users to pare down the columns in the index to the minimum set that is absolutely necessary, often resulting in dramatic performance gains.  This is an interesting example of how the tuning that is right for one database may be completely wrong for another database.

I've recently learned that neither InnoDB nor PostgreSQL supports traversing an index in physical order, only in key order.  For InnoDB, this means that ALL scans are performed in key order, since the table itself is, in essence, also an index.  As I understand it, this can make a large sequential scan quite slow, by defeating the operating system's prefetch logic.  In PostgreSQL, however, because tables are not index-organized, sequential scans are always performed in physical order, and don't require looking at the indexes at all; this also means we can skip any I/O or CPU cost associated with examining non-leaf index pages.  Traversing in physical order is apparently difficult from a locking perspective, although it must be possible, because Oracle supports it.  It would be very useful to see this support in MySQL, and once PostgreSQL has index-only scans, it would be a useful improvement for PostgreSQL, too.

One final difficulty with an index-organized table is that you can't add, drop, or change the primary key definition without a full-table rewrite.  In PostgreSQL, on the other hand, this can be done - even while allowing concurrent read and write activity.  This is a fairly nominal advantage for most use cases since the primary key of a table rarely changes - I think it's happened to me only once or twice in the last ten years - but it is useful when it does comes up.

I hope that the above is a fair and accurate summary of the topic, but I'm sure I've missed a few things and covered others incompletely or in less detail than might be helpful.  Please feel free to respond with a comment below or a blog post of your own if I've missed something.

[Update: See also part 2 of this series, discussing VACUUM vs. Purge]


  1. "[MySQL] can sometimes avoid traversing the secondary index"

    I believe this is a typo: don't you mean "primary index"?

    Also, I heard that creating even a secondary index on a table in MySQL involves a rewrite. Is that only true for MyISAM? Not true at all?

  2. @Jeff Davis: Yeah, I think I started by writing "avoid the second index traversal" and it got mangled.

    As to your second point, I think I've heard that rumor too, but I don't know whether it's true or not.

  3. So does that mean postgresql is actively working toward index-only scans?

  4. @Jeff Davis:
    Traditionally adding - or even dropping - a secondary index to MySQL always rewrites the entire table. In MySQL 5.1 with the InnoDB plugin (but not the older "builtin" InnoDB that's still enabled by default) "fast alter table" was introduced which improves this significantly:

    There are still many cases where "fast alter table" can't (or won't) be used, however.

  5. Good post! A bit confusing, though. A more structured discussion that first states how the thing is implemented in postgresql, advantages and disadvantages, how it is implemented in mysql, advantages and disadvantages, then discusses room for future improvements on both and does any comparisons would be easier to read I guess. Still, a very good read!

  6. MS SQL Server also supports both kinds of tables, depending on if you add a "clustered index" or not.

  7. Hey there's a missing R in your "about myself":

    "Database Architect @ EntepriseDB, PostgreSQL Committer."

  8. A couple of notes on top of those above:

    "Under MySQL's InnoDB, the table data and the primary key index are stored in the same data structure. As I understand it, this is what Oracle calls an index-organized table."

    Indeed, though another phrase, that is perhaps more popular in the MySQL world (and other databases), is "clustered index".

    "As I understand it, this can make a large sequential scan quite slow, by defeating the operating system's prefetch logic."

    InnoDB actually has it's own prefetch logic:

    "There are two read-ahead heuristics in InnoDB:

    In sequential read-ahead, if InnoDB notices that the access pattern to a segment in the tablespace is sequential, it posts in advance a batch of reads of database pages to the I/O system.

    In random read-ahead, if InnoDB notices that some area in a tablespace seems to be in the process of being fully read into the buffer pool, it posts the remaining reads to the I/O system."

  9. Oh, and read-ahead changed in the recent InnoDB plugin / 5.5 updates, a better description for the latest read-ahead algorithms is here:

  10. Nice post.

    Is this a typo? I know that full table scans for InnoDB are in key order but I thought full table scans for PG use the heap and are in file order?

    "I've recently learned that neither InnoDB nor PostgreSQL supports traversing an index in physical order, only in key order."

    Index-only scans are very important to me. However, index-organized tables are not free. Random inserts into a b-tree leave it ~70% full (in theory and practice) so InnoDB ends up using more space. It can also use more space when the primary key is big and there are several secondary indexes as it it repeated in all secondary index entries. We deal with that by try to limit to one of many secondary indexes or big primary keys.

    InnoDB has an "adaptive hash index" to map keys to leaf pages currently in the buffer pool. This reduces some of the overhead of navigating the b-tree. The mapping is a guess (it doesn't have to be correct). This doesn't work as great as it did before the arrival of large multi-core servers as the hash table can be a source of mutex contention. But I think that can be fixed and MySQL 5.5 already has a lock free hash table implementation. Regardless, this is a very clever feature.

    A lot of DDL (add/drop colum, add/drop index) use to require a table copy and rebuild of all indexes via incremental maintenance for InnoDB tables. That was very painful. With the arrival of the InnoDB plugin in MySQL 5.1 there is less pain as "fast index create" does what it should do in many more cases -- only the desired index is impacted, table copy is not done and external merge sort is used to sort the (secondary key, primary key) values as input to build the index.

  11. @Mark Callaghan: Thanks. No, it's not a typo. We can't traverse an INDEX in physical order, only key order. But the table itself isn't an index, and we do traverse that in physical order.

  12. Great Post.

    Mleith's comment above makes it clear to me that one of the bits of your post misses the point. Traversing an index in key order is slow not because it defeats the OS's prefetching algorithm, but becuase it requires seeks which are a lot slower than sequential reads. MySQL could prefetch all the blocks of the index but as long as it's doing it in random order it'll still be slow.

  13. I loved the article. I wish you could cover TokuDB, as it's really different from MySQL, PostgreSQL and most other DBs.

  14. Lawrence KestelootFebruary 02, 2011 1:56 PM

    One interesting use of index-organized table is to set up a slave with a different primary key than the master. Queries for row ranges for that key can be done on the slave, and they're much faster because all the rows are together. You don't need one seek per row. Updates to that column are slower, but maybe that's okay for your application if such changes are rare.

  15. Interesting that you mention both postgress and innodb inability to read in physical order. By it's very nature, SQL will not return records in any guaranteed order if no order by is specified. I have worked with DB's that allow record retrieval in physical order, but for that to be useful, you can't actually delete records from the tables, or should I say you can't reuse those rows, because then you loose the ability to process the records in FIFO or LIFO order. Most of us using databases that support such things are moving to a more SQL paradyme, and I can't remember the last time I actually used that feature. Probably not something that is important to worry about.