Comments on Robert Haas: MySQL vs. PostgreSQL, Part 2: VACUUM vs. Purge

Grant, MySQL rollback segments don't have a si...

2011-02-03T17:30:34.198-05:00

Grant, MySQL rollback segments don't have a size limit and will grow to fill the hard drive if the server never catches up. I've seen it about twice in six years and we added an option innodb_max_purge_lag to increase priority for purge after the first of them. Now the independent purge thread should make that even less likely, though setting innodb_max_purge_lag to 10 million or 100 million is still of possible use to cover cases where it might not do the job. Benchmarks.

All of this is irrelevant for almost all users. It's just handled automatically and that works. Production servers tend to have far more banal issues. The usual sort of query optimisation or badly wrong settings. Knowing that that's all that's wrong and eliminating the rare cases is where the knowledge comes in.

This is just my opinion, not the official view of the company. Talk to the press people if you want the latter.

James Day, MySQL Principal Support Engineer, Oracle

@Grant: While the concept is similar, InnoDB rollb...

2011-02-03T13:29:43.095-05:00

@Grant: While the concept is similar, InnoDB rollback segments aren't managed the same as Oracle. They are stored inside of a tablespace and can grow and contract as needed. It is much more automatic and normally doesn't have any of the traditional rollback segment related Oracle problems. There is no 'snapshot is too old' type error in InnoDB.

Oracle rollback segments have some issues. If the...

2011-02-02T17:55:31.913-05:00

Oracle rollback segments have some issues. If they are too small then large transactions will fail because they will run out of space. If they are too large you get other problems.

Oracle has switched from Rollback Segments to an Undo Tablespace, which I understand gets rid of these issues.

I have not had experience in managing MySQL, but i...

2011-02-02T16:39:00.749-05:00

I have not had experience in managing MySQL, but it uses the same technique as Oracle, namely rollback segments.

Here is what I have found:
1) Oracle rollback segments tend to take manual management, and tend to cause issues with large or long running queries, either running out of rollback space, or timing out giving a snapshot too old error.

2) The PostgreSQL method tends to cause the core table space to grow over time (vacuum, even full, seems to miss something).

Both methods have their issues. The Oracle method tends to be more controlled on how much disk is used, but takes more administrator intervention. PostgreSQL is easier to "set and forget" but does not seem to clean up after itself as well. With the price of disk and labor lately, I guess that automation wins out, but it is close.

@Baron, Facebook does not use multiple purge thre...

2011-02-02T15:30:47.268-05:00

@Baron, Facebook does not use multiple purge threads. We have seen a few cases where it might have been useful, but normally those point to a bigger issue which we fix (really long transactions and saturated disk I/O).

@Robert, It seems more likely that an undo segment would remain in memory compared to large portions of indexes which need to be scanned with vacuum. I agree that it doesn't make total sense to discuss cache use with regards to this though.

The size of the undo segments remains roughly fixed based on transaction length and rate of changes, not based on the size of the table. Index size naturally grows with table size.

The visibility map optimization did a lot to make vacuum more manageable; the primary thing that doesn't seem to scale well now is the full index scans.
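
To make that point concrete, here is a rough first-order sketch. It assumes a steady change rate and one undo record per row change; the function name and the numbers are hypothetical, not InnoDB internals. Note that table size is not an input at all.

```python
def pending_undo_records(changes_per_second, oldest_open_txn_seconds):
    """Rough count of undo records that purge cannot remove yet.

    Assumption: every row change leaves one undo record, and purge can only
    discard records older than the oldest transaction that is still open.
    """
    return changes_per_second * oldest_open_txn_seconds


# Same workload on a small table or a huge one gives the same estimate,
# because table size never enters the calculation.
short_txn = pending_undo_records(changes_per_second=1000, oldest_open_txn_seconds=5)
long_txn = pending_undo_records(changes_per_second=1000, oldest_open_txn_seconds=3600)
```

Under these assumptions, only a longer-running transaction or a higher change rate makes the backlog grow.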

Robert, Percona Server offers multiple threads bec...

2011-02-02T13:23:28.753-05:00

Robert, Percona Server offers multiple threads because at a minimum, we don't like hard-coding things; history has proven that "oh, X is enough for this" is wrong. But again, some of my colleagues who actually created that functionality might have needed multiple threads, I don't know. I just know that the really grievous problems I've seen were solved with one dedicated thread.

One thing I'd like to caution about is analysis of the amount of work it takes to make a change in InnoDB. It has had something called the "insert buffer" forever, and in recent releases it's changed to the "change buffer" because it delays the work involved in more than just inserts. This design can significantly reduce the amount of work required.

About heaps and b-trees and CTIDs and such, there is one bit of trivia that I want to mention; each leaf node of InnoDB's clustered index (which is the table, and is a b+tree) actually contains a heap of records. So it's a b-tree until you get to the leaf node (== page), and then the records on the page are organized in a heap. Hope that makes sense.

@Jeff Davis: AIUI, it scans the table in key order...

2011-02-02T13:05:54.641-05:00

@Jeff Davis: AIUI, it scans the table in key order, rather than physical order. See the "part 1" post in this series. If the tuple doesn't fit into the page, the page must be split - but that's OK, because the secondary indexes point back to the PK, not the physical location (CTID) as they do in PostgreSQL.
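
A toy sketch of that difference, with plain dicts standing in for the index structures. All class and method names are hypothetical, and neither engine is implemented this way; the sketch only shows which structures have to change when a row lands in a new physical location.

```python
class InnoDBLikeTable:
    """Secondary index entries store the primary key, not a physical location."""

    def __init__(self):
        self.clustered = {}   # pk -> (page_no, row)
        self.secondary = {}   # indexed value -> pk

    def insert(self, pk, value, page_no):
        self.clustered[pk] = (page_no, {"pk": pk, "value": value})
        self.secondary[value] = pk

    def move_row(self, pk, new_page_no):
        # E.g. after a page split: only the clustered index entry changes.
        _, row = self.clustered[pk]
        self.clustered[pk] = (new_page_no, row)
        return 0  # number of secondary index entries touched

    def lookup_by_value(self, value):
        pk = self.secondary[value]       # first hop: secondary index -> pk
        return self.clustered[pk][1]     # second hop: clustered index -> row


class PostgresLikeTable:
    """Index entries store the physical location (CTID) of the heap tuple."""

    def __init__(self):
        self.heap = {}    # ctid (page_no, slot) -> row
        self.index = {}   # indexed value -> ctid

    def insert(self, pk, value, ctid):
        self.heap[ctid] = {"pk": pk, "value": value}
        self.index[value] = ctid

    def move_row(self, value, new_ctid):
        # The row now lives at a new CTID, so the index entry must follow it.
        old_ctid = self.index[value]
        self.heap[new_ctid] = self.heap.pop(old_ctid)
        self.index[value] = new_ctid
        return 1  # number of index entries touched

    def lookup_by_value(self, value):
        return self.heap[self.index[value]]  # single hop via the CTID
```

The trade-off in this model is one extra lookup hop on reads through a secondary index, in exchange for not touching secondary indexes when a row moves.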

It seems like the InnoDB approach would complicate...

2011-02-02T12:52:07.705-05:00

It seems like the InnoDB approach would complicate reads, as well. I don't know how it works, but I assume it has to be careful not to miss tuples in a scan as they are being moved around.

Also, what if the old tuple fits in the page and the new one doesn't? Does it just do a DELETE/INSERT instead? And if so, does it have a third version in the rollback segment, or does it optimize that away?

@Sergei: After thinking about this for a bit, I be...

2011-02-02T12:49:45.825-05:00

@Sergei: After thinking about this for a bit, I believe that in the case of a non-HOT update your analysis is basically correct, but for a HOT update PostgreSQL only does two writes, since there's no index update in that case.

I'm reluctant to rely on this path of analysis for very much, though, because there are a lot of other things that go into performance, and really the only way to know what's going to work better for your workload is to try it. The depth of the index isn't necessarily the same in PostgreSQL and InnoDB (which one is deeper? I don't know, and it may depend on the width of the primary key relative to the table row), and the chances that an insert will require a page split are probably also different (and I'm not sure which one will need to do that more often in real-world workloads). All of these factors affect how much work actually will need to get done in a particular case.
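
For reference, the write targets counted in this exchange can be tallied as below. This only restates the counts from the comments for the single-column `foo` example; the dictionary keys are hypothetical labels, and the tally says nothing about the cost of each write.

```python
# Write targets as named in this thread for one row change on
# "create table foo(a int primary key)". Not a performance model.
WRITE_TARGETS = {
    "innodb": ["clustered index", "redo log", "rollback segment"],
    "postgres_non_hot_update": ["heap", "WAL", "index"],
    "postgres_hot_update": ["heap", "WAL"],  # no index update in the HOT case
}


def write_count(kind):
    """Number of distinct write targets for the given kind of change."""
    return len(WRITE_TARGETS[kind])


counts = {kind: write_count(kind) for kind in WRITE_TARGETS}
```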

Robert, it was this statement that caught my atten...

2011-02-02T12:24:16.771-05:00

Robert, it was this statement that caught my attention:

> One small downside of this approach is that performing an update means writing two tuples - the old one must be copied to the undo tablespace, and the new one must be written in its place.

Let's say I have a simple schema:
create table foo(a int primary key);

In InnoDB the write goes into the clustered index, redo log, and the rollback segment.

In PG, the write goes into the heap, the WAL, and the index.

Same number of writes.

@Harrison: Understood. I was actually referring t...

2011-02-02T12:09:47.388-05:00

@Harrison: Understood. I was actually referring to CPU and buffer management cost, not I/O cost. It's probably small enough not to be terribly noticeable, but it can't be free; and I do think that avoiding the need for that sort of copying is one reason for the PostgreSQL design, though it certainly doesn't mean that the PostgreSQL design is better; I'm not convinced that it is.

A small PostgreSQL table can get vacuumed before it ever leaves the buffer pool, too, but I don't think small tables are a major concern in either system any more. It's the big ones that cause problems - where you need to evict the pages and then eventually read some or all of them back in for cleanup.

@Baron: Interesting. So why does Percona server offer multiple threads?

@Sergei: I'm not sure that has much to do with this particular issue, although I did discuss it in my earlier post.

@Martin LeBlanc: Thanks, fixed.

You got a little typo: "MySQL performs perfo...

2011-02-02T11:07:12.990-05:00

You got a little typo:

"MySQL performs performs purges ..."

Except InnoDB clusters the base table by primary k...

2011-02-02T10:38:55.458-05:00

Except InnoDB clusters the base table by primary key and PG does not. So if you want an access path across the primary key in PG you manage the heap and the index, while in InnoDB it's just the index.

For the vast majority of cases, single-threaded pu...

2011-02-02T10:37:29.849-05:00

For the vast majority of cases, single-threaded purge isn't even a problem in InnoDB. What is (was) the problem is purge being done as an intermittent task by InnoDB's main thread, among several other tasks it did in a loop. Having a dedicated purge thread, even if it's only one thread, is sufficient for every case I've ever seen personally. Harrison, does Facebook use multiple purge threads?

In InnoDB the additional IO is normally just theor...

2011-02-02T09:59:26.614-05:00

In InnoDB the additional IO is normally just theoretical. Modifications go through the InnoDB buffer pool, and assuming you don't have extremely long transactions, purge will often remove the record before it has actually been written to disk.

If there is a really long transaction, then it might not remain in the buffer pool. Cases like this are where the single-threaded purge thread can really hurt, since purge becomes disk bound and very slow at catching up.

I believe that the Updates are most important when...

2011-02-01T15:28:06.801-05:00

I believe that update speed matters most when you update a lot, and that vacuum matters more in environments where databases hold many terabytes of data and are in use 24 hours a day, because vacuum can cause slowdowns when it runs on very large tables. In environments that use the database 22 or fewer hours per day, I would choose faster updates without hesitation; vacuum can be run as a scheduled task in the two hours left over each day. But if you never want to be bothered by vacuum, schedule a task that vacuums the larger tables continuously and automatically. All of this also depends on whether the server's storage system is fast or slow.

@Hugo Rafael Lesme Marquez: I'm not sure that ...

2011-02-01T15:10:18.779-05:00

@Hugo Rafael Lesme Marquez: I'm not sure that question has one right answer.

@Anonymous: Thanks. @Yang: It's not exactly a...

2011-02-01T15:09:33.558-05:00

@Anonymous: Thanks.

@Yang: It's not exactly a free list, but it does mark the space as available for reuse. A HOT prune can do that part on the fly to a limited degree; however, there's still the problem of reclaiming index entries. It's hard to do the whole thing on the fly due to MVCC visibility rules - the tuple actually becomes dead when the last snapshot that can see it is released, which is totally disconnected from the event of marking the tuple deleted.
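
A simplified sketch of that visibility rule follows. It is not PostgreSQL's actual implementation: it assumes the deleting transaction has committed, and it reduces each snapshot to a single hypothetical "horizon" id, meaning the snapshot sees the effects of transactions with a smaller id only.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TupleVersion:
    # Id of the (committed) transaction that deleted this version,
    # or None if the version has not been deleted.
    deleted_by: Optional[int] = None


def is_dead(tup: TupleVersion, open_snapshot_horizons: List[int]) -> bool:
    """True once no open snapshot can still see the deleted tuple."""
    if tup.deleted_by is None:
        return False
    # A snapshot whose horizon is at or below the deleter's id does not see
    # the delete, so it still sees the tuple and the space cannot be reused.
    return all(tup.deleted_by < horizon for horizon in open_snapshot_horizons)


t = TupleVersion(deleted_by=100)
still_visible = is_dead(t, [90, 120])   # False: the snapshot at 90 still sees it
reclaimable = is_dead(t, [120])         # True: the old snapshot was released
```

In this model the tuple changes from "deleted" to "dead" when the snapshot list changes, not when `deleted_by` is set, which is the disconnect described above.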

PostgreSQL makes fastest updates and MySQL makes t...

2011-02-01T15:07:19.285-05:00

PostgreSQL makes updates faster, and MySQL makes purge faster than vacuum. Which is more important to be fast: an update or vacuum? That's the question.

In InnoDB, is the "free list" of space m...

2011-02-01T14:19:45.217-05:00

In InnoDB, is the "free list" of space marked deleted just kept around in the rollback segment for as long as the space is there? Also, are rows the same size, or else does InnoDB prefer to overwrite in-place? Is there a fragmentation issue? A separate compaction process?

Also, for PG: (non-FULL) VACUUM just scans to build a free list, right? Could this be maintained on-the-fly instead?

Nice write up Robert. Good to see an up to date, c...

2011-02-01T11:31:04.314-05:00

Nice write up Robert. Good to see an up to date, clear and consciously unbiased comparison of the two architectures. Appreciate the anonymous comment option too.

-eyecue