There seems to be a lot of interest in the as-yet-unimplemented performance feature called index-only scans, so I thought it would be useful to explain a little bit more about what this feature is, how it will help PostgreSQL, and where the bodies are buried.
First, the name. What do we mean by an index-only scan? In PostgreSQL today, an index scan always accesses both the index itself and the underlying table. You might think this unnecessary. For example, if you have the query SELECT name FROM table WHERE id = 10, and there is an index on (id, name), you might assume that we could use the index to check for tuples with id = 10, and the if one is found, return the name directly from the index tuple, without consulting the underlying table. Unfortunately, this does not work, because that tuple might not actually be one that the SELECT statement can see. If the tuple was inserted by a transaction which began after the SELECT statement took its MVCC snapshot, or deleted by a transaction which committed before the SELECT statement took its MVCC snapshot, then the SELECT statement must not return it. If it did, we would quickly get very surprising wrong answers out of the database. So PostgreSQL first looks at the index tuple, and then the heap (table) tuple, decides what the right thing to do is, and does it. By an index ONLY scan, we mean one which will look at just the index, and not at the corresponding table; the trick is to figure out how to make that happen without returning wrong answers.
Given PostgreSQL's overall architecture, there doesn't seem to be much hope of optimizing away the MVCC visibility checks in the cases where the table is undergoing significant churn. Duplicating the MVCC visibility information in the index would make indexes much larger and less efficient, so it's very unclear that it would be a good trade-off even if we could work out the implementation details. However, in many cases, a large table will have only intermittent write activity or no write activity at all. For example, if we took the system down to single user mode and vacuumed an individual table, we would then be guaranteed that every tuple in that table would be visible to every subsequent transaction until the next time the table was modified - so an index-only scan would be safe and fast. Of course, this is a pretty impractical maintenance procedure, so we need a better solution.
As it turns out, much of the legwork needed for a better solution has already been done. In PostgreSQL 8.4, Heikki Linnakangas introduced a structure called the visibility map. The visibility map stores one bit per table page (so it's very small compared to the underlying table, and easily fits in cache) indicating whether all the tuples on the page are known to be visible to all current and future transactions on the system. Vacuum - or, more usually, autovacuum - sets this bit, and write activity on the page clears it. This bit is used as a hint to avoid repeatedly vacuuming pages that are known not to need it. This gives us some of the benefits of the rollback-segment model used by MySQL and Oracle - we have some idea where the old versions needing cleanup might be, versus needing to scan the entire table every time. However, the visibility map has one major problem that makes it unsuitable for index-only scans: it's not crash safe. A crash at an inconvenient time could leave the visibility map bit set while the page does in fact contain dead tuples. For vacuum, that's not critical: the worst thing that can happen is that a few dead tuples stick around wasting space. Your database isn't likely to crash often enough for this to become a problem in practice, and we still very occasionally do full-table vacuums to prevent a phenomenon called "transaction ID wraparound", so it will eventually get cleaned up. However, for query execution, it's not OK: we don't ever want to take the risk of returning wrong answers.
So, the basic roadmap for index-only scans looks like this:
1. Make the visibility map crash-safe, so we can rely on it for query execution. This is probably the hard part.
2. When performing an index scan that only needs to return attributes which are present in the index, check the visibility map bit before examining the table page. If the visibility map bit is set, then return the data directly from the index tuple; otherwise, examine the table page and proceed as we do today.
The index-only scan, therefore, will only be entirely index-only if every relevant bit in the visibility map is set. We'll still access the table to the extent necessary to be certain of returning correct answers. However, if most of the visibility map bits are set (which should be a fairly common scenario), it will still save a great deal of I/O, and some CPU time, too.
One relatively small fly in the ointment is that the planner's costing model needs to reflect the actual costs of performing an index scan. If the executor knows that it can sometimes skip reading from the table, but the planner isn't aware of this effect, then the planner will sometimes pick the wrong plan, thinking that the index scan will be more expensive than it really will be. To estimate this properly, I'm guessing that we're going to need VACUUM or ANALYZE to gather one additional statistic on each table: the estimated fraction of pages, or tuples, for which the visibility map bit is set.
Once this basic infrastructure is in place, there are a number of possible opportunities for further improvement:
1. Large insert-only tables. Large insert-only tables are not automatically vacuumed (except for transaction-ID wraparound), because autovacuum is triggered by updates and deletes. This is generally a good thing, because it saves a great deal of not-very-useful work. However, it's problematic for index-only scans, because it also means the visibility map bits won't get set. I don't have a very clear idea what to do about this, but it's likely to need thought and work. For a first version of this feature, we'll likely have to rely on users to do a manual VACUUM in this case.
2. Delayed heap fetches. Consider a query of the form SELECT * FROM foo, bar WHERE foo.x = bar.x and foo.y = 1, and assume there's an index on foo (y, x). On the surface, this doesn't seem like a good candidate for an index-only scan, because we need to return all the attributes of each tuple in foo, not just x and y. But we could do this: find all the index tuples that have foo.y = 1, then perform the join against bar using the join clause foo.x = bar.x (could be a merge, hash, or nestloop join), and then only at the end fetch the other columns of foo from the underlying table (MVCC visibility checks would also be performed at this time). This would be advantageous if most rows in foo don't have a matching row in bar - the join will throw away a bunch of tuples from foo, reducing the number of table pages we need to examine. On the other hand, you could also have a situation where each row in foo is expected to match multiple rows in bar - in which case postponing the heap fetches until after the join would slow things down. The query planning problems around this optimization seem fiercely difficult; and there are also some security concerns here: letting user-defined functions view tuple data that hasn't been checked for MVCC visibility yet seems scary. But there are certainly queries that would benefit if we could make this happen.
3. Gratuitous use of "useless" indexes. Consider a query like SELECT count(*) FROM foo. If most of the visibility-map bits are set for foo, it might be faster to scan an index and count the index tuples (consulting the underlying table only for pages where the visibility-map bits are unset) rather than scanning the whole table. Right now, the query planner will observe that the query contains no joins and has no ORDER BY clause, and decide that an index can't possibly make it faster; and right now that's true. But it won't necessarily be true any more once index-only scans are in place.
4. Sequential index scans. Building on the idea in #3, right now, indexes are always traversed in index order, which is not the same as physical order. The previous example could potentially be made even faster if the index could be scanned in physical order. However, our current btree indexes are not designed to be traversed in this way, so there are likely some concurrency problems in trying to do this. I don't have a clear feeling for whether or how those problems are solvable, but I'm certain there will be further curiosity about this once the basic infrastructure is in place.