Monday, June 21, 2021

Talking about the PostgreSQL Optimizer at CMU

Professor Andy Pavlo, at CMU, seems to be a regular organizer of technical talks about databases; this year, he organized the vaccination database tech talks, and invited me to give one about the PostgreSQL query optimizer. So I did. It was great. There were a few PostgreSQL community members present, but more importantly, a bunch of smart people who know a lot about other database systems showed up to the talk, including Andy Pavlo himself, and I got some feedback on where PostgreSQL could perhaps be improved.  Here are the highlights, with links to the relevant portion of the YouTube video.

12m11s. Andy seems interested to hear that GEQO's cross-breeding algorithm seems not to be successful in improving plan quality. I hope I'm not wrong about this; it's my recollection from mailing list discussions years ago. Certainly, it doesn't seem like GEQO gets a lot of love. We didn't get into the question of whether other databases do better in this area.

19m49s. I start out complaining that when planning a sub-join we have little information about other relations involved in the query, which leads to a discussion of the merits of top-down planning vs. bottom-up planning. The Cascades framework gets mentioned. I muse about whether top-down might be better, leading Andy to say "You're the first person I've heard to say 'yeah, top down might be the right thing to do' when they already have a bottom-up implementation."

31m43s. It is proposed that perhaps PostgreSQL should use "sketches" like HyperLogLog or a Count-min to estimate row counts during planning. Since I'm unfamiliar with the term, I get to demonstrate my ignorance. The idea is interesting, though. As far as I know this is something that has never been proposed for inclusion in PostgreSQL, but apparently other systems have found success with it.

41m51s. Andy says that commercial systems can gather correlated statistics across tables, unlike PostgreSQL, where CREATE STATISTICS currently only works if all of the columns are in the same table.

48m17s. PostgreSQL gets panned for not supporting query hints. In a case like SELECT * FROM foo WHERE a = 1 ORDER BY b LIMIT 1 the optimizer might make the wrong choice about whether to use an index on a or an index on b, and a hint can fix it. Even if this is a subquery, it's being planned separately anyway, so the rest of the plan isn't going to get messed up by forcing a plan at this level. The discussion resumes at 52m41s.

58m15s. We discuss selectivity estimation for arbitrary expressions. Andy says SQL server can inline even procedural code, making it able to estimate selectivity for user-defined functions. Later he put out a tweet about this, with a link to a blog post mentioning some of the optimizations that can help in such scenarios. He also mentioned to me a paper on the topic called How Good Are Query Optimizers, Really?. I have unfortunately lost track of where some of the discussion on this point happened during the talk, but apparently SQL server handles selectivity estimation for arbitrary expressions by running them on a sample of the data and seeing what happens. That seems like it might be expensive, but it's easy to see how it could be a lot better than PostgreSQL's strategy of ... using an arbitrary constant.

This is all interesting stuff. I don't mind at all that PostgreSQL came in for some criticism in the course of the talk; I don't pretend that everything it does is state of the art, and I think it's smart to learn from the experience of others. "Not invented here" is a rarely a winning attitude.

In closing, I would like to give a shout-out to the Steven Moy Foundation for Keeping It Real, both for sponsoring this lecture series and also for having a very fun name. I would also like to again thank Andy Pavlo for the invitation. It was a great experience!

No comments:

Post a Comment