Re: Materialized views WIP patch

Lists: pgsql-hackers
From: "Kevin Grittner" <kgrittn(at)mail(dot)com>
To: "Josh Berkus" <josh(at)agliodbs(dot)com>,pgsql-hackers(at)postgresql(dot)org
Subject: Re: Materialized views WIP patch
Date: 2012-11-16 16:25:57
Message-ID: 20121116162558.90150@gmx.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Josh Berkus wrote:

>> 1. CREATE MATERIALIZED VIEW syntax is stolen directly from CREATE
>> TABLE AS, with all the same clauses supported. That includes
>> declaring a materialized view to be temporary or unlogged.
>
> What use would a temporary matview be?

It would be essentially like a temporary table, with all the same
persistence options. I'm not really sure how often it will be more
useful than a temporary table before we have incremental maintenance
of materialized views; once we have that, though, it seems likely
that there could be reasonable use cases.

> Unlogged is good.

I agree that there are likely to be more use cases for this than
temp MVs. Unfortunately, I've had a hard time figuring out how to
flag an MV which is empty because its contents were lost after a
crash with preventing people from using an MV which hasn't been
populated, which has the potential to silently return incorrect
results.

>> 2. MVs don't support inheritance.
>
> In which direction? Can't inherit, or can't be inherited from?

The table inheritance has not been implemented in either direction
for MVs. It didn't seem clear to me that there were reasonable use
cases. Do you see any?

>> 9. MVs can't directly be used in a COPY statement, but can be the
>> source of data using a SELECT.
>
> Hmmm? I don't understand the reason for this.

Consistency. There are other object types which seem to enforce this
rule for no reason that I can see beyond maybe a desire to have both
directions of COPY work with the same set of objects. If I remember
correctly, allowing this would eliminate one line of code from the
patch, so if sentiment is in favor of it, it is very easily done.

>> 13. pg_class now has a relisvalid column, which is true if an MV is
>> truncated or created WITH NO DATA. You can not scan a relation
>> flagged as invalid.
>
> What error would a user see?

I can more directly answer that on Monday. If you enable the body of
the function which makes the relisvalid check you can see the messages.
I commented it out because I have not yet figured out how to suppress
the check for a LOAD MV command.

>> 14. ALTER MATERIALIZED VIEW is supported for the options that seemed
>> to make sense. For example, you can change the tablespace or
>> schema, but you cannot add or drop column with ALTER.
>
> How would you change the definition of an MV then?

At this point you would need to drop and re-create the MV. If we
want to add columns to an MV or change what an existing column holds,
perhaps we could have an ALTER MV which changed the SELECT statement
that populates the MV? I would prefer to leave that to a later patch,
though -- it seems like a bit of a minefield compared to what is
being implemented in this patch.

>> 16. To get new data into the MV, the command is LOAD MATERIALIZED
>> VIEW mat view_name. This seemed more descriptive to me that the
>> alternatives and avoids declaring any new keywords beyond
>> MATERIALIZED. If the MV is flagged as relisvalid == false, this
>> will change it to true.
>
> UPDATE MATERIALIZED VIEW was problematic?

Not technically, really, but I saw two reasons that I preferred LOAD MV:

1. It seems to me to better convey that the entire contents of the MV
  will be built from scratch, rather than incrementally adjusted.
2. We haven't hashed out the syntax for more aggressive maintenance of
  an MV, and it seemed like UPDATE MV might be syntax we would want to
  use for something which updated selected parts of an MV when we do.

> Does LOAD automatically TRUNCATE the view before reloading it? If not,
> why not?

It builds a new heap and moves it into place. When the transaction
running LMV commits, the old heap is deleted. In implementation it is
closer to CLUSTER or the new VACUUM FULL than TRUNCATE followed by
creating a new table. This allows all permissions, etc., to stay in
place.

>> It would be good to have some discussion to try to reach a consensus
>> about whether we need to differentiate between *missing* datat (where
>> a materialized view which has been loaded WITH NO DATA or TRUNCATEd
>> and has not been subsequently LOADed) and potentially *stale* data.
>> If we don't care to distinguish between a view which generated no
>> rows when it ran and a one for which the query has not been run, we
>> can avoid adding the relisvalid flag, and we could support UNLOGGED
>> MVs. Perhaps someone can come up with a better solution to that
>> problem.
>
> Hmmm. I understand the distinction you're making here, but I'm not sure
> it actually matters to the user. MVs, by their nature, always have
> potentially stale data. Being empty (in an inaccurate way) is just one
> kind of stale data.

Robert feels the same way, but I disagree. Some MVs will not be terribly
volatile. In my view there is a big difference between having a "top ten"
list which might be based on yesterday's base tables rather than the base
table states as of this moment, and having a "top ten" list with no
entries. If you want to, for example, take some action if an order comes
in for one of your top customers, and a different path for other
customers, suddenly treating all of your long-time top customers as not
being so, without any squawk from the database, seems dangerous.

> It would be nice for the user to have some way to know that a matview is
> empty due to never being LOADed or recently being TRUNCATEd. However, I
> don't think that relisvalid flag -- and preventing scanning the relation
> -- is a good solution. What I'd rather have instead is a timestamp of
> when the MV was last LOADed. If the MV was never loaded (or was
> truncated) that timestamp would be NULL. Such a timestamp would allow
> users to construct all kinds of ad-hoc refresh schemes for MVs which
> would not be possible without it.

I see your point there; I'll think about that. My take was more that MVs
would often be refreshed by crontab, and that you would want to keep
subsequent steps from running and generating potentially plausible but
completely inaccurate results if the LMV failed.

> I don't see how this relates to UNLOGGED matviews either way.

UNLOGGED tables and indexes are made empty on crash recovery by copying
the initialization fork over the "normal" relations. Care was taken to
avoid needing to connect to each database in turn to complete that
recovery. This style of recovery can't really set the relisvalid flag, as
far as I can see; which leaves us choosing between unlogged MVs and
knowing whether they hold valid data -- unless someone has a better idea.

-Kevin


From: Thom Brown <thom(at)linux(dot)com>
To: Kevin Grittner <kgrittn(at)mail(dot)com>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Materialized views WIP patch
Date: 2012-11-16 17:04:43
Message-ID: CAA-aLv6pF=yrKqR85fp=2utvf5yLZNZhc6=Rf799SpMkhy+dRA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 16 November 2012 16:25, Kevin Grittner <kgrittn(at)mail(dot)com> wrote:

> Josh Berkus wrote:
>
> > Unlogged is good.
>
> I agree that there are likely to be more use cases for this than
> temp MVs. Unfortunately, I've had a hard time figuring out how to
> flag an MV which is empty because its contents were lost after a
> crash with preventing people from using an MV which hasn't been
> populated, which has the potential to silently return incorrect
> results.
>

pg_class.relisvalid = false.. getting rid of the use for truncated MVs?

--
Thom


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Kevin Grittner" <kgrittn(at)mail(dot)com>
Cc: "Josh Berkus" <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Materialized views WIP patch
Date: 2012-11-16 17:08:26
Message-ID: 16871.1353085706@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

"Kevin Grittner" <kgrittn(at)mail(dot)com> writes:
> Josh Berkus wrote:
>> What use would a temporary matview be?

> It would be essentially like a temporary table, with all the same
> persistence options. I'm not really sure how often it will be more
> useful than a temporary table before we have incremental maintenance
> of materialized views; once we have that, though, it seems likely
> that there could be reasonable use cases.

One of the principal attributes of a temp table is that its contents
aren't (reliably) accessible from anywhere except the owning backend.
Not sure where you're going to hide the incremental maintenance in
that scenario.

> The table inheritance has not been implemented in either direction
> for MVs. It didn't seem clear to me that there were reasonable use
> cases. Do you see any?

We don't have inheritance for views, so how would we have it for
materialized views?

regards, tom lane


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: Kevin Grittner <kgrittn(at)mail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Materialized views WIP patch
Date: 2012-11-16 21:20:02
Message-ID: 50A6AE02.7000102@agliodbs.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Kevin,

> I agree that there are likely to be more use cases for this than
> temp MVs. Unfortunately, I've had a hard time figuring out how to
> flag an MV which is empty because its contents were lost after a
> crash with preventing people from using an MV which hasn't been
> populated, which has the potential to silently return incorrect
> results.

See below.

>>> 2. MVs don't support inheritance.
>>
>> In which direction? Can't inherit, or can't be inherited from?
>
> The table inheritance has not been implemented in either direction
> for MVs. It didn't seem clear to me that there were reasonable use
> cases. Do you see any?

No, I just wanted clarity on this. I can see a strong case for
eventually supporting CREATE MATERIALIZED VIEW matview_1 LIKE matview,
in order to "copy" mativews, though.

> Consistency. There are other object types which seem to enforce this
> rule for no reason that I can see beyond maybe a desire to have both
> directions of COPY work with the same set of objects. If I remember
> correctly, allowing this would eliminate one line of code from the
> patch, so if sentiment is in favor of it, it is very easily done.

There's going to be a pretty strong demand for COPY FROM matviews.
Forcing the user to use COPY FROM ( SELECT ... ) will be seen as
arbitrary and unintuitive.

>> How would you change the definition of an MV then?
>
> At this point you would need to drop and re-create the MV. If we
> want to add columns to an MV or change what an existing column holds,
> perhaps we could have an ALTER MV which changed the SELECT statement
> that populates the MV? I would prefer to leave that to a later patch,
> though -- it seems like a bit of a minefield compared to what is
> being implemented in this patch.

I agree that it should be a later patch.

> Not technically, really, but I saw two reasons that I preferred LOAD MV:
>
> 1. It seems to me to better convey that the entire contents of the MV
> will be built from scratch, rather than incrementally adjusted.
> 2. We haven't hashed out the syntax for more aggressive maintenance of
> an MV, and it seemed like UPDATE MV might be syntax we would want to
> use for something which updated selected parts of an MV when we do.

Hmmm, I see your point. So "LOAD" would recreate, and (when supported)
UPDATE would incrementally update?

> It builds a new heap and moves it into place. When the transaction
> running LMV commits, the old heap is deleted. In implementation it is
> closer to CLUSTER or the new VACUUM FULL than TRUNCATE followed by
> creating a new table. This allows all permissions, etc., to stay in
> place.

OK, so same effect as a truncate.

> Robert feels the same way, but I disagree. Some MVs will not be terribly
> volatile. In my view there is a big difference between having a "top ten"
> list which might be based on yesterday's base tables rather than the base
> table states as of this moment, and having a "top ten" list with no
> entries. If you want to, for example, take some action if an order comes
> in for one of your top customers, and a different path for other
> customers, suddenly treating all of your long-time top customers as not
> being so, without any squawk from the database, seems dangerous.

Right, but a relisvalid flag just tells me that the matview was updated
at sometime in the past, and not *when* it was updated. It could have
been 3 years ago. The fact that it was updated at some indefinite time
is fairly valueless information.

There's a rule in data warehousing which says that it's better to have
no data (and know that you have no data) than to have incorrect data.

> I see your point there; I'll think about that. My take was more that MVs
> would often be refreshed by crontab, and that you would want to keep
> subsequent steps from running and generating potentially plausible but
> completely inaccurate results if the LMV failed.

Yeah, that too. Also, a timestamp it would make it easy to double-check
if the cron job was failing or had been disabled.

> UNLOGGED tables and indexes are made empty on crash recovery by copying
> the initialization fork over the "normal" relations. Care was taken to
> avoid needing to connect to each database in turn to complete that
> recovery. This style of recovery can't really set the relisvalid flag, as
> far as I can see; which leaves us choosing between unlogged MVs and
> knowing whether they hold valid data -- unless someone has a better idea.

Yeah, well, whether we have relisvalid or mvlastupdate, we're going to
have to work out some way to have that field react to changes to the
table overall. I don't know *how*, but it's something we'll have to solve.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


From: Dimitri Fontaine <dimitri(at)2ndQuadrant(dot)fr>
To: "Kevin Grittner" <kgrittn(at)mail(dot)com>
Cc: "Josh Berkus" <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Materialized views WIP patch
Date: 2012-11-16 22:12:21
Message-ID: m27gplyz7e.fsf@2ndQuadrant.fr
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

"Kevin Grittner" <kgrittn(at)mail(dot)com> writes:
>> UPDATE MATERIALIZED VIEW was problematic?
>
> Not technically, really, but I saw two reasons that I preferred LOAD MV:
>
> 1. It seems to me to better convey that the entire contents of the MV
>   will be built from scratch, rather than incrementally adjusted.
> 2. We haven't hashed out the syntax for more aggressive maintenance of
>   an MV, and it seemed like UPDATE MV might be syntax we would want to
>   use for something which updated selected parts of an MV when we do.

Good point, and while I'm in the mood for some grammar input, here's a
try:

ALTER MATERIALIZED VIEW foo RESET;
ALTER MATERIALIZED VIEW foo UPDATE;

I think such wholesale operations make more sense as ALTER statement
than as UPDATE statements.

> It builds a new heap and moves it into place. When the transaction
> running LMV commits, the old heap is deleted. In implementation it is
> closer to CLUSTER or the new VACUUM FULL than TRUNCATE followed by
> creating a new table. This allows all permissions, etc., to stay in
> place.

When you say closer to CLUSTER, do you include the Access Exclusive Lock
that forbids reading the previous version's data while you prepare the
new one? That would be very bad and I wouldn't understand the need to,
in the scope of MATERIALIZED VIEWs which are by definition lagging
behind…

If as I think you don't have that limitation in your implementation,
it's awesome and just what I was hoping to read :)

Regards,
--
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: Kevin Grittner <kgrittn(at)mail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Materialized views WIP patch
Date: 2012-11-16 22:13:01
Message-ID: 22182.1353103981@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Josh Berkus <josh(at)agliodbs(dot)com> writes:
> There's going to be a pretty strong demand for COPY FROM matviews.
> Forcing the user to use COPY FROM ( SELECT ... ) will be seen as
> arbitrary and unintuitive.

You could make that same claim about plain views, but in point of
fact the demand for making them work in COPY has been minimal.
So I'm not convinced this is an essential first-cut feature.
We can always add it later.

regards, tom lane


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Kevin Grittner <kgrittn(at)mail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Materialized views WIP patch
Date: 2012-11-17 01:41:55
Message-ID: 50A6EB63.9020208@agliodbs.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


> You could make that same claim about plain views, but in point of
> fact the demand for making them work in COPY has been minimal.
> So I'm not convinced this is an essential first-cut feature.
> We can always add it later.

Of course. I just had the impression that we could support COPY FROM by
*deleting* a couple lines from Kevin's patch, rather than it being extra
work.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: Kevin Grittner <kgrittn(at)mail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Materialized views WIP patch
Date: 2012-11-17 15:48:21
Message-ID: 8967.1353167301@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Josh Berkus <josh(at)agliodbs(dot)com> writes:
>> You could make that same claim about plain views, but in point of
>> fact the demand for making them work in COPY has been minimal.
>> So I'm not convinced this is an essential first-cut feature.
>> We can always add it later.

> Of course. I just had the impression that we could support COPY FROM by
> *deleting* a couple lines from Kevin's patch, rather than it being extra
> work.

Even if it happens to be trivial in the current patch, it's an added
functional requirement that we might later regret having cavalierly
signed up for. And, as noted upthread, relations that support only
one direction of COPY don't exist at the moment; that would be adding
an asymmetry that we might later regret, too.

regards, tom lane


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Kevin Grittner <kgrittn(at)mail(dot)com>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Materialized views WIP patch
Date: 2012-11-18 11:37:58
Message-ID: CA+U5nMLcz2X_zR2nxxAF03ETW1oxp7TyO3+Vs1ZqKMfp9KZxGQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 16 November 2012 11:25, Kevin Grittner <kgrittn(at)mail(dot)com> wrote:

>>> 16. To get new data into the MV, the command is LOAD MATERIALIZED
>>> VIEW mat view_name. This seemed more descriptive to me that the
>>> alternatives and avoids declaring any new keywords beyond
>>> MATERIALIZED. If the MV is flagged as relisvalid == false, this
>>> will change it to true.
>>
>> UPDATE MATERIALIZED VIEW was problematic?
>
> Not technically, really, but I saw two reasons that I preferred LOAD MV:
>
> 1. It seems to me to better convey that the entire contents of the MV
> will be built from scratch, rather than incrementally adjusted.
> 2. We haven't hashed out the syntax for more aggressive maintenance of
> an MV, and it seemed like UPDATE MV might be syntax we would want to
> use for something which updated selected parts of an MV when we do.
>
>> Does LOAD automatically TRUNCATE the view before reloading it? If not,
>> why not?
>
> It builds a new heap and moves it into place. When the transaction
> running LMV commits, the old heap is deleted. In implementation it is
> closer to CLUSTER or the new VACUUM FULL than TRUNCATE followed by
> creating a new table. This allows all permissions, etc., to stay in
> place.

This seems very similar to the REPLACE command we discussed earlier,
except this is restricted to Mat Views.

If we're going to have this, I would prefer a whole command.

e.g. REPLACE matviewname REFRESH

that would also allow

REPLACE tablename AS query

Same thing under the covers, just more widely applicable and thus more useful.

Either way, I don't much like overloading the use of LOAD, which
already has a very different meaning.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: "Albe Laurenz" <laurenz(dot)albe(at)wien(dot)gv(dot)at>
To: "Kevin Grittner *EXTERN*" <kgrittn(at)mail(dot)com>, "Josh Berkus" <josh(at)agliodbs(dot)com>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Materialized views WIP patch
Date: 2012-11-19 08:41:37
Message-ID: D960CB61B694CF459DCFB4B0128514C208B87E13@exadv11.host.magwien.gv.at
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Josh Berkus wrote:
>> It would be nice for the user to have some way to know that a matview is
>> empty due to never being LOADed or recently being TRUNCATEd. However, I
>> don't think that relisvalid flag -- and preventing scanning the relation
>> -- is a good solution. What I'd rather have instead is a timestamp of
>> when the MV was last LOADed. If the MV was never loaded (or was
>> truncated) that timestamp would be NULL. Such a timestamp would allow
>> users to construct all kinds of ad-hoc refresh schemes for MVs which
>> would not be possible without it.

+1

Kevin Grittner wrote:
> I see your point there; I'll think about that. My take was more that MVs
> would often be refreshed by crontab, and that you would want to keep
> subsequent steps from running and generating potentially plausible but
> completely inaccurate results if the LMV failed.

If one of these subsequent steps doesn't care if refresh
failed once, it shouldn't be forced to fail. I imagine
that for many applications yesterday's data can be good enough.

Those that care should check the timestamp.

Yours,
Laurenz Albe