Thursday, April 22, 2010

Materialized Views in PostgreSQL

About 9 months ago, Peter Eisentraut reported on the results of a UserVoice survey he had set up. The purpose of the survey was to gather information from PostgreSQL users on which features they were most interested in seeing added to PostgreSQL. Of course, to what extent the features that got high marks in that survey represent the real interests of the community is questionable: certainly, not every PostgreSQL user (or potential user) reads Peter's blog. Still, the results seem to indicate a lot of interest in materialized views.

Recently, Pavel Baros made a proposed a possible implementation of materialized views on the pgsql-hackers mailing list as a Google Summer of Code project. What he's proposing to do is just about the simplest possible implementation of materialized views I can imagine, comprising just two commands:

CREATE MATERIALIZED VIEW name AS SELECT ...
ALTER MATERIALIZED VIEW name REFRESH;

Of course, right away this opens a can of worms. Basically, there are two things that you might mean when you talk about a "materialized view":

1. A materialized view that is always up to date. In other words, whenever the underlying tables are updated, there is some incremental-update process that concurrently modifies the materialized view data so that it remains up to date also. The tricky part is that it's not easy to update a materialized view of an arbitrary query for a reasonable cost. If the query is complex, the only feasible way to keep the materialized view up to date may be to re-execute the entire query whenever any of the underlying tables are modified (or when any of the functions or operators are redefined), which is probably not feasible for most people. But the plus side is that, when a cheap incremental update IS possible, you don't really need to know that you're working with a materialized view at all. It's indistinguishable from a regular view, up to performance.

2. A materialized view that isn't always up to date. This would include the proposed implementation, which is manual-refresh-only, as well as, for example, views that are refreshed on a regular schedule. (It's pretty easy to convert manual-refresh-only to refreshed-on-a-regular-schedule if you have access to cron!) Materialized views of this type are possibly useful in cases where incremental refresh isn't easily done, such as aggregates over a huge data set that produce a much smaller output table. Of course, you can do this kind of thing today using triggers, but maybe it would be nice to have something in the database that would do it for you automatically.

In the end, I suspect we'll want to support both of these features in PostgreSQL, but I'm curious which one people think is more useful. Our initial implementation of either feature is likely to be a bit rough around the edges, so don't assume we're going to provide all the optimizations we'd ever like to have on day one!

13 comments:

  1. #1 is a fundamentally profound addition to the the capabilities of the database engine.

    #2 sounds to me like ... less so.

    IIRC, the infrastructure that Microsoft built into MSSQL to support Indexed Views paved the way for other functionality, like Notification Services.

    If I had to pick which one of these things I'd like to see the core PostgreSQL team tackle, I pick #1.

    That said, I never had occasion to use Indexed Views in my years of abusing SQL Server. It was a cool sounding feature, I just never needed it.

    ReplyDelete
  2. The argument for MV's is that they are cheaper on read (and update, for that matter), and support complex join semantics without the performance hit when referenced.

    The notion of "cheap" conflicts with the actual implementation, of course. While I've never written any part of an engine, I'd say that the "cheap" way to implement MV's is through indirection. The MV at creation would run as a normal query. On completion, the MV would consist of a set of RowId's (by whatever name). The engine then has the task of managing the delta set of rows with respect to the MV's query; removing when the underlying table deletes and adding when inserted. Updates come out in the wash, which is why the MV is considered "cheap".

    From a practical point of view, such an implementation would make sense for MV's which are heavily updated, but not heavily modified. And therefore, MV's would make sense for such join queries.

    ReplyDelete
  3. #2 is only a shorthand notation for "create table as select ..." : the result set is stored in a table(-space) and can be queried, referenced by synonyms, granted etc just like normal views and tables. "Alter materialized view ... refresh" is shorthand notation for transactionally (i hope) replacing the entire content of the table (optimized "begin tx; drop * from mview; insert ...; insert ...; commit;").

    While simple, these functions are surprisingly useful for lots of business analysis work, collecting data from several system's schemas and performing select/joins across data that are collected from different sources and already partially adapted.

    One caveat is that separate systems tend to have separate database instances. This will be overcome when postgres adds database links as well (will that be version 10?). The combination of these two rather simple features (if you disallow select joins across links, the latter is actually only to integrate the client driver into the server) are amazingly powerful for all kinds of readonly integration scenarios and data warehousing.

    #1 is an esoteric feature. Everybody knows only microsoft allows update of views (it's been like that for decades already). It is cool, I admit, but still it seems to have very little use.

    As an experienced user: If you ever add automatic (timed) refresh of mviews to postgres, don't ever strive for compatibility with oracle's syntax and mode of operation. It's cruel.

    ReplyDelete
  4. @erikma
    Everybody knows only microsoft allows update of views.</quote
    I doubt that is actually true. Also, updatable views specified in the SQL:2003 standard.

    ReplyDelete
  5. -- updatable views specified in the SQL:2003 standard.

    ... and by Codd in the Relational Model, although I don't remember which paper off the top of my head.

    DB2, and others, have supported updateable views defined as projections/restrictions on a single table for years.

    ReplyDelete
  6. From my point of view, the real advantage of MVs is only seen when the engine can rewrite queries that access underlying tables and thus can speed up existing queries without the need to rewrite the application.

    This "magic" is what is so impressive about the MV support in Oracle.

    Unless the engine supports transparent rewriting I don't really see the advantage of MVs.

    ReplyDelete
  7. I think that most users (at least the ones I know) would like to have something like the Materialized Views provided by Oracle (up to date if the user decides so).
    The "not always up to date" MV can be useful too, but also confusing for users if they await something else.

    ReplyDelete
  8. As a long time SQL Server DBA, live updating materialized views make it possible to perform a large amount of analytical processing without a massive investment in SQL Server Integration Services, triggers, summary jobs, or mastering SQL Server Analysis Services and MDX/DMX.

    Implementation #2 is some neat syntactic sugar, but it is, as eirikima hints, just that: syntactic sugar.

    Personally, I make substantial use of indexed views in SQL Server to provide a large number of convenience aggregations. I would think that this feature adds a lot to PostgreSQL, especially if, as shammat mentions, the engine is able to re-write queries to use the materialized view. Within SQL Server, this alone makes materialized views worth some of the configuration trouble.

    ReplyDelete
  9. Agreed with most of the comments above. What I would ideally like to see would be a simple set of refresh trigger options.

    CREATE MATERIALIZED VIEW name AS SELECT ... REFRESH ON OPTION

    Refresh options I think would be useful are:
    * QUERY: Everytime the MV is queried

    * CHANGE: Any of the underlying data tables is changed

    * NONE: MV never updates (default option)

    * TIME: Updates on a scheduled basis. TIME requires an interval (in days) and optionally a starting time (uses current time if not specified). For example: "TIME 1/24" would be once an hour starting now; "TIME 1 2010-05-14T16:00" would be once a day starting at 16:00 on 2010-05-14.

    ReplyDelete
  10. If implementing #1,-
    Would the materialized data be maintained within the user transaction?
    If so, would it be necessary for the user to wait until all aggregates are maintained before getting the confirmation of the commit?
    If not being a part of the user transaction, how could it be possible to grantee consistency?
    #2 seems to be a lot simpler to implement, letting the application developer decide when to use server power to maintain the materialized data, and, doing it totally separated from the user transaction.

    ReplyDelete
  11. If implementing #1,-
    Would the materialized data be maintained within the user transaction?
    If so, would it be necessary for the user to wait until all aggregates are maintained before getting the confirmation of the commit?
    If not being a part of the user transaction, how could it be possible to grantee consistency?
    #2 seems to be a lot simpler to implement, letting the application developer decide when to use server power to maintain the materialized data, and, doing it totally separated from the user transaction.

    ReplyDelete
  12. Second to shammat.
    There is no point in implementing MV without query rewrite capabilities in optimizer.

    ReplyDelete
  13. I think materialized views fit the needs of people currently migrating to "NoSQL" due to its Map/Reduce capabilities.

    In CouchDB the "map" stage is basically a manifest view of the table with an index and "reduce" is an aggregation of that view with an index.

    Now in CouchDB your map "view" can't do any joins so this isn't quite the same as a full implementation of Materialized views but it is still very useful.

    ReplyDelete