Monday, November 22, 2010

Index-Only Scans

There seems to be a lot of interest in the as-yet-unimplemented performance feature called index-only scans, so I thought it would be useful to explain a little bit more about what this feature is, how it will help PostgreSQL, and where the bodies are buried.

First, the name.  What do we mean by an index-only scan?  In PostgreSQL today, an index scan always accesses both the index itself and the underlying table.  You might think this unnecessary.  For example, if you have the query SELECT name FROM table WHERE id = 10, and there is an index on (id, name), you might assume that we could use the index to check for tuples with id = 10, and the if one is found, return the name directly from the index tuple, without consulting the underlying table.  Unfortunately, this does not work, because that tuple might not actually be one that the SELECT statement can see.  If the tuple was inserted by a transaction which began after the SELECT statement took its MVCC snapshot, or deleted by a transaction which committed before the SELECT statement took its MVCC snapshot, then the SELECT statement must not return it.  If it did, we would quickly get very surprising wrong answers out of the database.  So PostgreSQL first looks at the index tuple, and then the heap (table) tuple, decides what the right thing to do is, and does it.  By an index ONLY scan, we mean one which will look at just the index, and not at the corresponding table; the trick is to figure out how to make that happen without returning wrong answers.

Given PostgreSQL's overall architecture, there doesn't seem to be much hope of optimizing away the MVCC visibility checks in the cases where the table is undergoing significant churn.  Duplicating the MVCC visibility information in the index would make indexes much larger and less efficient, so it's very unclear that it would be a good trade-off even if we could work out the implementation details.  However, in many cases, a large table will have only intermittent write activity or no write activity at all.  For example, if we took the system down to single user mode and vacuumed an individual table, we would then be guaranteed that every tuple in that table would be visible to every subsequent transaction until the next time the table was modified - so an index-only scan would be safe and fast.  Of course, this is a pretty impractical maintenance procedure, so we need a better solution.

As it turns out, much of the legwork needed for a better solution has already been done.  In PostgreSQL 8.4, Heikki Linnakangas introduced a structure called the visibility map.  The visibility map stores one bit per table page (so it's very small compared to the underlying table, and easily fits in cache) indicating whether all the tuples on the page are known to be visible to all current and future transactions on the system.  Vacuum - or, more usually, autovacuum - sets this bit, and write activity on the page clears it.  This bit is used as a hint to avoid repeatedly vacuuming pages that are known not to need it.  This gives us some of the benefits of the rollback-segment model used by MySQL and Oracle - we have some idea where the old versions needing cleanup might be, versus needing to scan the entire table every time.  However, the visibility map has one major problem that makes it unsuitable for index-only scans: it's not crash safe.  A crash at an inconvenient time could leave the visibility map bit set while the page does in fact contain dead tuples.  For vacuum, that's not critical: the worst thing that can happen is that a few dead tuples stick around wasting space.  Your database isn't likely to crash often enough for this to become a problem in practice, and we still very occasionally do full-table vacuums to prevent a phenomenon called "transaction ID wraparound", so it will eventually get cleaned up.  However, for query execution, it's not OK: we don't ever want to take the risk of returning wrong answers.

So, the basic roadmap for index-only scans looks like this:

1. Make the visibility map crash-safe, so we can rely on it for query execution.  This is probably the hard part.

2. When performing an index scan that only needs to return attributes which are present in the index, check the visibility map bit before examining the table page.  If the visibility map bit is set, then return the data directly from the index tuple; otherwise, examine the table page and proceed as we do today.

The index-only scan, therefore, will only be entirely index-only if every relevant bit in the visibility map is set.  We'll still access the table to the extent necessary to be certain of returning correct answers.  However, if most of the visibility map bits are set (which should be a fairly common scenario), it will still save a great deal of I/O, and some CPU time, too.

One relatively small fly in the ointment is that the planner's costing model needs to reflect the actual costs of performing an index scan.  If the executor knows that it can sometimes skip reading from the table, but the planner isn't aware of this effect, then the planner will sometimes pick the wrong plan, thinking that the index scan will be more expensive than it really will be.  To estimate this properly,  I'm guessing that we're going to need VACUUM or ANALYZE to gather one additional statistic on each table: the estimated fraction of pages, or tuples, for which the visibility map bit is set.

Once this basic infrastructure is in place, there are a number of possible opportunities for further improvement:

1. Large insert-only tables.  Large insert-only tables are not automatically vacuumed (except for transaction-ID wraparound), because autovacuum is triggered by updates and deletes.  This is generally a good thing, because it saves a great deal of not-very-useful work.  However, it's problematic for index-only scans, because it also means the visibility map bits won't get set.  I don't have a very clear idea what to do about this, but it's likely to need thought and work.  For a first version of this feature, we'll likely have to rely on users to do a manual VACUUM in this case.

2. Delayed heap fetches.  Consider a query of the form SELECT * FROM foo, bar WHERE foo.x = bar.x and foo.y = 1, and assume there's an index on foo (y, x).  On the surface, this doesn't seem like a good candidate for an index-only scan, because we need to return all the attributes of each tuple in foo, not just x and y.  But we could do this: find all the index tuples that have foo.y = 1, then perform the join against bar using the join clause foo.x = bar.x (could be a merge, hash, or nestloop join), and then only at the end fetch the other columns of foo from the underlying table (MVCC visibility checks would also be performed at this time).  This would be advantageous if most rows in foo don't have a matching row in bar - the join will throw away a bunch of tuples from foo, reducing the number of table pages we need to examine.  On the other hand, you could also have a situation where each row in foo is expected to match multiple rows in bar - in which case postponing the heap fetches until after the join would slow things down.  The query planning problems around this optimization seem fiercely difficult; and there are also some security concerns here: letting user-defined functions view tuple data that hasn't been checked for MVCC visibility yet seems scary.  But there are certainly queries that would benefit if we could make this happen.

3. Gratuitous use of "useless" indexes.  Consider a query like SELECT count(*) FROM foo.  If most of the visibility-map bits are set for foo, it might be faster to scan an index and count the index tuples (consulting the underlying table only for pages where the visibility-map bits are unset) rather than scanning the whole table.  Right now, the query planner will observe that the query contains no joins and has no ORDER BY clause, and decide that an index can't possibly make it faster; and right now that's true.  But it won't necessarily be true any more once index-only scans are in place.

4. Sequential index scans.  Building on the idea in #3, right now, indexes are always traversed in index order, which is not the same as physical order.  The previous example could potentially be made even faster if the index could be scanned in physical order.  However, our current btree indexes are not designed to be traversed in this way, so there are likely some concurrency problems in trying to do this.  I don't have a clear feeling for whether or how those problems are solvable, but I'm certain there will be further curiosity about this once the basic infrastructure is in place.


  1. "1. Make the visibility map crash-safe, so we can rely on it for query execution. This is probably the hard part."

    Can you clarify?

    WAL logging the set/unset seems like it would solve that problem. And it doesn't immediately strike me as too expensive, because it's at the page level.

  2. @Jeff Davis: Any action that can result in a visibility map getting cleared is already WAL-logged, and we can and already do clear the bit on redo. We currently don't XLOG setting the bit, so the PD_ALL_VISIBLE flag on the page can get out of sync with the copy in the visibility map, which is where things get sketchy: a subsequent update will see that PD_ALL_VISIBLE is unset on the page and will therefore not clear it in the visibility map either.

    To escape that, we'll need to WAL-log setting the visibility map bit, but doing that once per page would might be too expensive, and doing it for a range of pages seems complex.

  3. Robert,

    It still doesn't strike me as a big problem compared with the rest of the patch. I suppose I should look at the archives and the code to get more details.

  4. Interesting. How is this done in InnoDB's MVCC architecture?

  5. This would be the very significant performance upgrade, especially for analytical queries.
    In some cases, I have seen query sitting at 100% I/O doing random reads both on index and query.
    So for DW-type of data when partitions are created by one process and then never (or seldom) modified, could we implement something like "read-only" tables or even read-only tablespaces, like in oracle? Then we wouldn't need to worry about visibility map for these read-only tables and there is still significant advantage for DW-type queries.

  6. This is good news to me.

    There are a few details on how InnoDB supports index only scans on the InnoDB blog. It doesn't guarantee that scans will be index only but it makes it very likely that they will be.

    I think I will add code to my MySQL tree to count the number of times when it is and is not index only.

  7. Can't we just log the whole map or fsync it to disk only on checkpoints. That way it will be updated with the redo.

    The wal may not practical for huge databases but disk image would be fine. A 64TB DB would have 1GB visibility map.

  8. A very simple solution:

    During crash recovery, mark every page as potentially having dead tuples. Vacuum will run a little slower initially, but quickly, things will get back in sync.

    It is crash safe now, with a minimum of effort, and no chance I see for it to ever do the wrong thing.

  9. It seems as if being able to set just the index in question to read only mode would provide one simple way to implement the feature with maximum performance. Methods which monitor the update page flag for the table data would seem a significantly more expensive option which would enable index physical scans in the less common case where a read only lock of the index in question is unacceptable.
    OTOH it might be nice to mmap() index such that the storage media wouldn't be touched for read only data selects, and the query optimizer could automatically infer the ability to pull fields from the most appropriate index.

  10. I haven't dug too deep into this but "Transactions" are run-time phenomena. Immediately after a server boots up isn't EVERY "current" record visible to all future transactions and all older record no longer necessary? Kinda like a VCS - once your release is deployed the history and concurrent updates, while interesting, are irrelevant since the "TRUE" view of the system has already been chosen.

  11. Let's take the read-only index a little further... what about a read-only table? This would be a useful baby step toward IOTs and other larger goals:


    Now PG should be able to say "this is a static table, no need to validate index lookups"

  12. See newer post