Robert Haas: Fast Counting

Since I wrote my previous blog entry on index-only scans, quite a bit of additional work has been done. Tom Lane cleaned up the code and improved the costing model, but possibly the most interesting thing he did was to allow index-only scans to be used for queries that don't involve an indexable condition at all. The classic example is SELECT COUNT(*) FROM table. In previous versions of PostgreSQL, there's just one way to implement this: sequential scan the table and count 'em up. In PostgreSQL 9.2, that method will still, of course, be available, but now there will be another choice: pick any index you like and do a full index scan, checking whether each tuple is all-visible either using the visibility map or via a heap fetch. So, how well does it work?

I did some benchmarking. Here's the short version: If the index fits in memory but the table doesn't, the index-only scan comes out way ahead, because it avoids hitting the disk, which is obviously a huge win. If the index fits in shared_buffers but the table doesn't, the index-only scan still comes out significantly ahead, because buffer eviction isn't free. If the table fits in shared buffers, and is actually fully resident in shared buffers, it's faster to just sequentially scan it. As Tom Lane pointed out, the sequential scan code has been with us far longer, and is far better optimized, than the index-only scan code. Even once the code has been well-tuned (we found one place to save some cycles already), scanning the table will likely still be faster in many cases: it has the advantage of being dead simple.

Of course, since the query planner considers both plans, there shouldn't be a big performance regression even if the index-only scan method is far slower: the planner will just pick a sequential scan instead. My testing thus far suggests that the planner won't always get it right. Which plan is faster depends not only on the size of the table and the index (which the planner knows) but also on exactly what the contents of the buffer cache are at the time the plan is executed (which the planner does not know; and to some degree can't really know since a single query can change the situation considerably). But the planner should at least pick up the really egregious cases, and if it misses occasionally when the two plans are close to equal, it shouldn't hurt that much.

And, of course, the real point is that the people who have been clamoring for index-only scans both in general and as a means of speeding up COUNT(*) are not primarily worried about very small tables anyway: they're worried about big tables that don't fit in memory, and those cases already show substantial speed-up. Trimming CPU overhead usage probably won't make these cases much better than they already are.

In the long run, I think there's probably some room for further improvement. One idea, which I mentioned in my first-ever blog post on this topic, is to read the index in physical order rather than logical order. That might be quicker, especially if the index has to be read in from disk, but there are difficulties: we'd somehow need to make sure that we didn't read the same tuple more than once, and since tuples can move around in an index as a result of inserts and updates, that seems tricky. And it's probably only going to help on really, really big tables, since if the index is all in cache then there shouldn't be much effect, unless the physical order scan can manage to be more CPU-efficient.

I think this code could also be improved by adding some pre-fetching logic. If we know we're going to need particular heap (table), visibility map, or index page, we could kick off an asynchronous I/O request to pull in the data before we actually need it, in the hope of finding it available by the time we really do need it. The trick is figuring out when and how much to prefetch.

I'm not sure how much of this tuning will get done in time for PostgreSQL 9.2. The trick with a new feature like this is that we don't really have much field experience with it yet; the problems we run into in the lab may or may not be representative of what happens in the real world. The good news is that even without any further optimization, there should be many real-world cases where the code as it stands to day will be a big improvement over previous versions.

Robert Haas

Monday, October 31, 2011

Fast Counting

5 comments: