Tuesday, February 01, 2011

MySQL vs. PostgreSQL, Part 2: VACUUM vs. Purge

Almost two months ago, I wrote part one of what I indicated would be an occasional series of blog posts comparing the architecture of PostgreSQL to that of MySQL.  Here's part two.  Please note that the caveats set forth in part one apply to this and all future installments as well, so if you haven't read part one already, please click on the link above and read at least the first two paragraphs before reading this post.

Both InnoDB and PostgreSQL - as well as many other databases - use a technique called multi-version concurrency control (MVCC) to provide transaction isolation: transactions should not see the work of other, uncommitted transactions.  MVCC means that, when a row is updated, the database stores both the old and new versions of the row.  The alternative would be to lock the updated row until the inserting transaction committed, so that no one else could see the uncommitted update - but this is an unpopular approach, because a long-running transaction can acquire many locks and bring the system grinding to a halt.  Even after a transaction commits, the old version of the row may continue to be seen by commands or transactions that began before the commit, so that their view of the database doesn't suddenly change under them.  Abrupt changes in the way the client sees the database can lead to very surprising results, so both databases try to avoid letting that happen.
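The core MVCC idea can be sketched in a few lines of Python.  This is purely illustrative (neither database's actual implementation): each row version records the transaction that created it (xmin, borrowing PostgreSQL's terminology) and, once superseded, the transaction that replaced it (xmax), and a reader's snapshot decides which version it sees.

```python
# Toy sketch of MVCC visibility: row versions carry creation/supersession
# transaction ids, and each reader's snapshot picks the right version.
from dataclasses import dataclass
from typing import Optional

@dataclass
class RowVersion:
    value: str
    xmin: int                    # transaction id that created this version
    xmax: Optional[int] = None   # transaction id that superseded it, if any

def visible(version: RowVersion, snapshot_xid: int) -> bool:
    """Visible if created at or before the snapshot and not yet superseded
    (or superseded only after the snapshot was taken)."""
    created_before = version.xmin <= snapshot_xid
    not_superseded = version.xmax is None or version.xmax > snapshot_xid
    return created_before and not_superseded

# Transaction 1 inserted the row; transaction 5 updated it, so the old
# version is stamped xmax=5 and a new version is stored alongside it.
versions = [RowVersion("old", xmin=1, xmax=5), RowVersion("new", xmin=5)]

def read(snapshot_xid: int) -> str:
    return next(v.value for v in versions if visible(v, snapshot_xid))

print(read(3))  # a snapshot taken before the update still sees "old"
print(read(7))  # a snapshot taken after the update sees "new"
```

Note how a transaction whose snapshot predates the update keeps seeing the old version even after the updater commits — exactly the "no abrupt changes" behavior described above.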

There's a fairly obvious problem with this scheme, though: if an update preserves the old version of the row while also storing the new one, then something's going to have to eventually go back and get rid of the old row, or the database will just keep growing and growing as rows are updated.  That something, in the case of InnoDB, is purge, and in the case of PostgreSQL, is VACUUM.  In both cases, the old version of a row that has been updated can be removed once all transactions or commands that could possibly wish to refer to the old version have completed.  In both cases, also, deleted rows are kept around until any commands or transactions that might wish to see them have completed, and those must be removed as well.  However, the implementations are quite different in detail.

In InnoDB, only the most recent version of an updated row is retained in the table itself.  Old versions of updated rows are moved to the rollback segment, while deleted row versions are left in place and marked for future cleanup.  Thus, purge must get rid of any deleted rows from the table itself, and clear out any old versions of updated rows from the rollback segment.  All the information necessary to find the deleted records that might need to be purged is also written to the rollback segment, so it's quite easy to find the rows that need to be cleaned out; and the old versions of the updated records are all in the rollback segment itself, so those are easy to find, too.  One small downside of this approach is that performing an update means writing two tuples - the old one must be copied to the undo tablespace, and the new one must be written in its place.

PostgreSQL takes a completely different approach.  There is no rollback segment, or anything similar.  When a row is updated, the old version is left in place; the new version is simply written into the table alongside it.  As in InnoDB, deleted records are marked for future cleanup, but no record is written to a rollback segment.  Both of these differences mean slightly less work when the operation is initially performed, but the payback is that the eventual cleanup is more expensive.  Lacking a centralized record of what must be purged, PostgreSQL's VACUUM has historically needed to scan the entire table to look for records that might require cleanup.  Beginning in PostgreSQL 8.3, an optimization called HOT (for "heap-only tuple") allows some vacuuming to be done on the fly in single-page increments; beginning in PostgreSQL 8.4, the system maintains a bitmap, called the visibility map, which indicates which pages of the table might possibly contain tuples in need of cleanup, so VACUUM can scan only those pages.  However, a full scan of each index is still required during each VACUUM, making it still a somewhat expensive operation for large tables.
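The visibility-map idea can be sketched like this.  Again, this mirrors the concept only, not PostgreSQL's on-disk format: a per-page flag records whether a page is known to need no cleanup, and VACUUM skips flagged pages entirely.

```python
# Sketch of visibility-map-driven vacuuming: only pages that might contain
# dead tuples are visited, instead of scanning the whole table.
pages = {0: ["live"], 1: ["live", "dead"], 2: ["live"]}
all_visible = {0: True, 1: False, 2: True}  # True = page needs no cleanup

def vacuum():
    scanned = []
    for page_no, tuples in pages.items():
        if all_visible[page_no]:
            continue                  # the map lets VACUUM skip this page
        scanned.append(page_no)
        pages[page_no] = [t for t in tuples if t != "dead"]
        all_visible[page_no] = True   # page is clean until next update/delete
    return scanned

print(vacuum())  # only page 1 is scanned: [1]
```

On a mostly-static terabyte table this is the difference between touching every page and touching a handful — though, as noted above, the index scans remain unconditional.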

Since both systems can potentially store multiple versions of any given tuple, both can lead to "bloat", where the size of a table grows far beyond the amount of data actually stored in it; however, the bloat shows up in different places.  Under InnoDB, most of the bloat (with the exception of any not-yet-removable deleted rows) is in the rollback segment, whereas in PostgreSQL it's mixed in with the actual table data.  In either case, the problem occurs mostly (a) in the presence of long-running transactions or (b) when VACUUM or purge can't keep up with the rate at which old row versions accumulate and must be removed.  Beginning in PostgreSQL 8.3, VACUUM is typically performed automatically in the background, using multiple (three by default) concurrent worker processes.  MySQL performs purges in the background, but purge is single-threaded.  Percona Server, and possibly other MySQL forks, offer multiple purge threads.

As in my previous post, I want to avoid making statements about which system is better.  It's probably fairly clear from the previous paragraphs that I am much more knowledgeable about both the improvements in recent PostgreSQL releases and the limitations that remain than I am about anything on the MySQL side.  Please feel free to offer comments, corrections, or additional details below.


  1. Nice write-up, Robert. Good to see an up-to-date, clear, and consciously unbiased comparison of the two architectures. I appreciate the anonymous comment option too.


  2. In InnoDB, is the "free list" of space marked deleted just kept around in the rollback segment for as long as the space is there? Also, are rows the same size, or else does InnoDB prefer to overwrite in-place? Is there a fragmentation issue? A separate compaction process?

    Also, for PG: (non-FULL) VACUUM just scans to build a free list, right? Could this be maintained on-the-fly instead?

  3. PostgreSQL makes updates faster, and MySQL makes purge faster than VACUUM. So which is more important to be fast: an update or a vacuum? That's the question.

  4. @Anonymous: Thanks.

    @Yang: It's not exactly a free list, but it does mark the space as available for reuse. A HOT prune can do that part on the fly to a limited degree; however, there's still the problem of reclaiming index entries. It's hard to do the whole thing on the fly due to MVCC visibility rules - the tuple actually becomes dead when the last snapshot that can see it is released, which is totally disconnected from the event of marking the tuple deleted.
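    Robert's point here — a deleted tuple only becomes removable when the last snapshot that could see it goes away, not when it is marked deleted — can be sketched like this (all names are illustrative):

    ```python
    # A tuple deleted by xid 5 is still needed by any snapshot taken
    # before that deletion; it becomes truly dead only when the last
    # such snapshot is released.
    active_snapshots = {3, 8}        # xids at which open snapshots began

    def is_dead(deleted_by_xid):
        """Removable only if no active snapshot predates the deletion."""
        return all(snap >= deleted_by_xid for snap in active_snapshots)

    assert not is_dead(5)            # snapshot 3 can still see the tuple
    active_snapshots.discard(3)      # that transaction finishes...
    assert is_dead(5)                # ...and only now is the tuple dead
    ```

    This is why the cleanup can't simply be piggybacked on the DELETE itself: the event that makes the tuple reclaimable happens in some other session, possibly much later.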

  5. @Hugo Rafael Lesme Marquez: I'm not sure that question has one right answer.

  6. I believe that fast updates matter most when you perform a lot of them, while VACUUM speed matters more in environments with many terabytes of data in use 24 hours a day, because VACUUM on very large tables can cause slowdowns. In environments that use the database 22 or fewer hours per day, I would without hesitation choose faster updates, and schedule VACUUM to run in the hours left over each day. But if VACUUM has never bothered you, then simply schedule it to run continuously and automatically on the larger tables. All of this also depends on whether the server uses a fast or slow storage system.

  7. In InnoDB the additional IO is normally just theoretical. Modifications go through the InnoDB buffer pool, and assuming you don't have extremely long transactions, purge will often remove the record before it has actually been written to disk.

    If there is a really long transaction, then it might not remain in the buffer pool. Cases like this are where the single purge thread can really hurt, since purge becomes disk-bound and very slow at catching up.

  8. For the vast majority of cases, single-threaded purge isn't even a problem in InnoDB. The real problem was when purge was done as an intermittent task by InnoDB's main thread, among several other tasks it performed in a loop. Having a dedicated purge thread, even if it's only one thread, is sufficient for every case I've ever seen personally. Harrison, does Facebook use multiple purge threads?

  9. Except InnoDB clusters the base table by primary key and PG does not. So if you want an access path across the primary key in PG you manage the heap and the index, while in InnoDB it's just the index.

  10. You got a little typo:

    "MySQL performs performs purges ..."

  11. @Harrison: Understood. I was actually referring to CPU and buffer management cost, not I/O cost. It's probably small enough not to be terribly noticeable, but it can't be free; and I do think that avoiding the need for that sort of copying is one reason for the PostgreSQL design, though it certainly doesn't mean that the PostgreSQL design is better; I'm not convinced that it is.

    A small PostgreSQL table can get vacuumed before it ever leaves the buffer pool, too, but I don't think small tables are a major concern in either system any more. It's the big ones that cause problems - where you need to evict the pages and then eventually read some or all of them back in for cleanup.

    @Baron: Interesting. So why does Percona Server offer multiple threads?

    @Sergei: I'm not sure that has much to do with this particular issue, although I did discuss it in my earlier post.

    @Martin LeBlanc: Thanks, fixed.

  12. Robert, it was this statement that caught my attention:

    > One small downside of this approach is that performing an update means writing two tuples - the old one must be copied to the undo tablespace, and the new one must be written in its place.

    Let's say I have a simple schema:
    create table foo(a int primary key);

    In InnoDB the write goes into the clustered index, redo log, and the rollback segment.

    In PG, the write goes into the heap, the WAL, and the index.

    Same number of writes.

  13. @Sergei: After thinking about this for a bit, I believe that in the case of a non-HOT update your analysis is basically correct, but for a HOT update PostgreSQL only does two writes, since there's no index update in that case.

    I'm reluctant to rely on this path of analysis for very much, though, because there are a lot of other things that go into performance, and really the only way to know what's going to work better for your workload is to try it. The depth of the index isn't necessarily the same in PostgreSQL and InnoDB (which one is deeper? I don't know, and it may depend on the width of the primary key relative to the table row), and the chances that an insert will require a page split are probably also different (and I'm not sure which one will need to do that more often in real-world workloads). All of these factors affect how much work actually will need to get done in a particular case.

  14. It seems like the InnoDB approach would complicate reads, as well. I don't know how it works, but I assume it has to be careful not to miss tuples in a scan as they are being moved around.

    Also, what if the old tuple fits in the page and the new one doesn't? Does it just do a DELETE/INSERT instead? And if so, does it have a third version in the rollback segment, or does it optimize that away?

  15. @Jeff Davis: AIUI, it scans the table in key order, rather than physical order. See the "part 1" post in this series. If the tuple doesn't fit into the page, the page must be split - but that's OK, because the secondary indexes point back to the PK, not the physical location (CTID) as they do in PostgreSQL.
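    The pointer difference Robert describes can be illustrated with a small sketch (the names are made up; this is conceptual, not either engine's layout): InnoDB-style secondary index entries store the primary key, a logical pointer that survives page splits, while PostgreSQL-style entries store a physical location (a CTID), which must be repointed if the row moves.

    ```python
    # Logical (PK) vs. physical (CTID-style) secondary index pointers.
    pk_index = {10: "row-data"}           # clustered index: PK -> full row
    sec_index_innodb = {"alice": 10}      # secondary entry stores the PK
    heap = {("page1", 0): "row-data"}     # heap table: rows at physical slots
    sec_index_pg = {"alice": ("page1", 0)}  # secondary entry stores a CTID

    # Suppose the row moves physically (e.g. a page split in the clustered
    # index).  The PK-based entry needs no change; the CTID-based entry
    # would have to be updated to the new location.
    heap[("page2", 0)] = heap.pop(("page1", 0))
    sec_index_pg["alice"] = ("page2", 0)

    assert pk_index[sec_index_innodb["alice"]] == "row-data"
    assert heap[sec_index_pg["alice"]] == "row-data"
    ```

    (In PostgreSQL, heap rows don't move during index page splits — this trade-off instead shows up when an UPDATE relocates a row; logical pointers add an extra PK lookup to every secondary-index read, while physical pointers make reads direct but moves expensive.)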

  16. Robert, Percona Server offers multiple threads because at a minimum, we don't like hard-coding things; history has proven that "oh, X is enough for this" is wrong. But again, some of my colleagues who actually created that functionality might have needed multiple threads, I don't know. I just know that the really grievous problems I've seen were solved with one dedicated thread.

    One thing I'd like to caution about analysis of the amount of work it takes to make a change in InnoDB. It has had something called the "insert buffer" forever, and in recent releases it's changed to the "change buffer" because it delays the work involved in more than just inserts. This design can significantly reduce the amount of work required.

    About heaps and b-trees and CTIDs and such, there is one bit of trivia that I want to mention; each leaf node of InnoDB's clustered index (which is the table, and is a b+tree) actually contains a heap of records. So it's a b-tree until you get to the leaf node (== page), and then the records on the page are organized in a heap. Hope that makes sense.

  17. @Baron, Facebook does not use multiple purge threads. We have seen a few cases where it might have been useful, but normally those point to a bigger issue which we fix (really long transactions and saturated disk I/O).

    @Robert, It seems more likely that an undo segment would remain in memory compared to large portions of indexes which need to be scanned with vacuum. I agree that it doesn't make total sense to discuss cache use with regards to this though.

    The size of the undo segments remains roughly fixed based on transaction length and rate of changes, not based on the size of the table. Index size naturally grows with table size.

    The visibility map optimization did a lot to make vacuum more manageable; the primary thing that doesn't seem to scale well now is the full index scans.

  18. I have not had experience managing MySQL, but it uses the same technique as Oracle: rollback segments.

    Here is what I have found:
    1) Oracle rollback segments tend to require manual management, and tend to cause issues with large or long-running queries: either running out of rollback space, or failing with a "snapshot too old" error.

    2) The PostgreSQL method tends to cause the core table space to grow over time (VACUUM, even VACUUM FULL, seems to miss something).

    Both methods have their issues. The Oracle method tends to be more controlled on how much disk is used, but takes more administrator intervention. PostgreSQL is easier to "set and forget" but does not seem to clean up after itself as well. With the price of disk and labor lately, I guess that automation wins out, but it is close.

  19. Oracle rollback segments have some issues. If they are too small then large transactions will fail because they will run out of space. If they are too large you get other problems.

    Oracle has switched from Rollback Segments to an Undo Tablespace, which I understand gets rid of these issues.

  20. @Grant: While the concept is similar, InnoDB rollback segments aren't managed the same way as Oracle's. They are stored inside of a tablespace and can grow and contract as needed. It is much more automatic and normally doesn't have any of the traditional rollback-segment-related Oracle problems. There is no "snapshot too old" type error in InnoDB.

  21. Grant, MySQL rollback segments don't have a size limit and will grow to fill the hard drive if the server never catches up. I've seen that happen about twice in six years, and after the first of them we added an option, innodb_max_purge_lag, to increase the priority of purge. Now the dedicated purge thread should make that even less likely, though setting innodb_max_purge_lag to 10 million or 100 million is still of possible use to cover cases where it might not do the job; benchmark to be sure.

    All of this is irrelevant for almost all users. It's just handled automatically, and that works. Production servers tend to have far more banal issues: the usual sort of query optimisation, or badly wrong settings. Knowing that that's all that's wrong, and eliminating the rare cases, is where the knowledge comes in.

    This is just my opinion, not the official view of the company. Talk to the press people if you want the latter.

    James Day, MySQL Principal Support Engineer, Oracle