Re: [COMMITTERS] pgsql: Add support for REFRESH MATERIALIZED VIEW CONCURRENTLY.

Lists: pgsql-committerspgsql-hackers
From: Kevin Grittner <kgrittn(at)postgresql(dot)org>
To: pgsql-committers(at)postgresql(dot)org
Subject: pgsql: Add support for REFRESH MATERIALIZED VIEW CONCURRENTLY.
Date: 2013-07-16 18:29:58
Message-ID: E1UzA0c-0007ir-J7@gemulon.postgresql.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-committers pgsql-hackers

Add support for REFRESH MATERIALIZED VIEW CONCURRENTLY.

This allows reads to continue without any blocking while a REFRESH
runs. The new data appears atomically as part of transaction
commit.

Review questioned the Assert that a matview was not a system
relation. This will be addressed separately.

Reviewed by Hitoshi Harada, Robert Haas, Andres Freund.
Merged after review with security patch f3ab5d4.

Branch
------
master

Details
-------
http://git.postgresql.org/pg/commitdiff/cc1965a99bf87005f431804bbda0f723887a04d6

Modified Files
--------------
doc/src/sgml/mvcc.sgml | 3 +-
doc/src/sgml/ref/refresh_materialized_view.sgml | 34 +-
src/backend/commands/cluster.c | 27 +-
src/backend/commands/matview.c | 524 +++++++++++++++++++++--
src/backend/commands/tablecmds.c | 3 +-
src/backend/executor/execMain.c | 10 +-
src/backend/executor/nodeModifyTable.c | 5 +-
src/backend/nodes/copyfuncs.c | 1 +
src/backend/nodes/equalfuncs.c | 1 +
src/backend/parser/gram.y | 7 +-
src/bin/psql/tab-complete.c | 17 +
src/include/commands/cluster.h | 3 +-
src/include/commands/matview.h | 2 +
src/include/nodes/parsenodes.h | 1 +
src/test/regress/expected/matview.out | 38 +-
src/test/regress/sql/matview.sql | 29 +-
16 files changed, 646 insertions(+), 59 deletions(-)


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Kevin Grittner <kgrittn(at)postgresql(dot)org>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: [COMMITTERS] pgsql: Add support for REFRESH MATERIALIZED VIEW CONCURRENTLY.
Date: 2013-07-17 14:11:28
Message-ID: 5876.1374070288@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-committers pgsql-hackers

Kevin Grittner <kgrittn(at)postgresql(dot)org> writes:
> Add support for REFRESH MATERIALIZED VIEW CONCURRENTLY.

The buildfarm members that use -DCLOBBER_CACHE_ALWAYS say this patch
is broken.

regards, tom lane


From: Hitoshi Harada <umi(dot)tanuki(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Kevin Grittner <kgrittn(at)postgresql(dot)org>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [COMMITTERS] pgsql: Add support for REFRESH MATERIALIZED VIEW CONCURRENTLY.
Date: 2013-07-18 04:39:37
Message-ID: CAP7Qgmk46XrhV-ycGZPknVG0pZnT0hO46NysQty_R7g1bNzCiQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-committers pgsql-hackers

On Wed, Jul 17, 2013 at 7:11 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Kevin Grittner <kgrittn(at)postgresql(dot)org> writes:
>> Add support for REFRESH MATERIALIZED VIEW CONCURRENTLY.
>
> The buildfarm members that use -DCLOBBER_CACHE_ALWAYS say this patch
> is broken.
>

Looks like rd_indpred is not correct if index relation is fresh.
Something like this works for me.

diff --git a/src/backend/commands/matview.c b/src/backend/commands/matview.c
index edd34ff..46149ee 100644
--- a/src/backend/commands/matview.c
+++ b/src/backend/commands/matview.c
@@ -634,7 +634,7 @@ refresh_by_match_merge(Oid matviewOid, Oid tempOid)

/* Skip partial indexes. */
indexRel = index_open(index->indexrelid,
RowExclusiveLock);
- if (indexRel->rd_indpred != NIL)
+ if (RelationGetIndexPredicate(indexRel) != NIL)
{
index_close(indexRel, NoLock);
ReleaseSysCache(indexTuple);

--
Hitoshi Harada


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Hitoshi Harada <umi(dot)tanuki(at)gmail(dot)com>
Cc: Kevin Grittner <kgrittn(at)postgresql(dot)org>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [COMMITTERS] pgsql: Add support for REFRESH MATERIALIZED VIEW CONCURRENTLY.
Date: 2013-07-18 04:51:36
Message-ID: 28882.1374123096@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-committers pgsql-hackers

Hitoshi Harada <umi(dot)tanuki(at)gmail(dot)com> writes:
> Looks like rd_indpred is not correct if index relation is fresh.
> Something like this works for me.

> - if (indexRel->rd_indpred != NIL)
> + if (RelationGetIndexPredicate(indexRel) != NIL)

Hm, yeah, the direct access to rd_indpred is certainly wrong.
Will apply, thanks!

regards, tom lane


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Kevin Grittner <kgrittn(at)postgresql(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [COMMITTERS] pgsql: Add support for REFRESH MATERIALIZED VIEW CONCURRENTLY.
Date: 2013-07-22 22:01:50
Message-ID: 20130722220150.GF752@alap2.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-committers pgsql-hackers

On 2013-07-17 10:11:28 -0400, Tom Lane wrote:
> Kevin Grittner <kgrittn(at)postgresql(dot)org> writes:
> > Add support for REFRESH MATERIALIZED VIEW CONCURRENTLY.
>
> The buildfarm members that use -DCLOBBER_CACHE_ALWAYS say this patch
> is broken.

Jagarundi still tells that story. At least something like the following
patch seems to be required.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachment Content-Type Size
fix-matview.patch text/x-patch 1.1 KB

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Kevin Grittner <kgrittn(at)postgresql(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [COMMITTERS] pgsql: Add support for REFRESH MATERIALIZED VIEW CONCURRENTLY.
Date: 2013-07-22 22:11:02
Message-ID: CA+TgmoZSwfAAF0wOhCXvALcNYzgVvDqq25ekHwHjJW=EKJCAgA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-committers pgsql-hackers

On Mon, Jul 22, 2013 at 6:01 PM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> On 2013-07-17 10:11:28 -0400, Tom Lane wrote:
>> Kevin Grittner <kgrittn(at)postgresql(dot)org> writes:
>> > Add support for REFRESH MATERIALIZED VIEW CONCURRENTLY.
>>
>> The buildfarm members that use -DCLOBBER_CACHE_ALWAYS say this patch
>> is broken.
>
> Jagarundi still tells that story. At least something like the following
> patch seems to be required.

Thanks, I was noticing that breakage today, too.

I committed this.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Kevin Grittner <kgrittn(at)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [COMMITTERS] pgsql: Add support for REFRESH MATERIALIZED VIEW CONCURRENTLY.
Date: 2013-07-22 23:09:13
Message-ID: 28546.1374534553@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-committers pgsql-hackers

Andres Freund <andres(at)2ndquadrant(dot)com> writes:
> On 2013-07-17 10:11:28 -0400, Tom Lane wrote:
>> The buildfarm members that use -DCLOBBER_CACHE_ALWAYS say this patch
>> is broken.

> Jagarundi still tells that story.

Uh, no. Jagarundi was perfectly happy for several build cycles after
I committed 405a468. The buildfarm history shows that it started
complaining again after this change:

f01d1ae Add infrastructure for mapping relfilenodes to relation OIDs

> At least something like the following patch seems to be required.

This does look like a necessary change, but I suspect there is also
something rotten in f01d1ae.

regards, tom lane


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Kevin Grittner <kgrittn(at)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [COMMITTERS] pgsql: Add support for REFRESH MATERIALIZED VIEW CONCURRENTLY.
Date: 2013-07-22 23:18:58
Message-ID: 20130722231858.GB22776@alap2.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-committers pgsql-hackers

On 2013-07-22 19:09:13 -0400, Tom Lane wrote:
> Andres Freund <andres(at)2ndquadrant(dot)com> writes:
> > On 2013-07-17 10:11:28 -0400, Tom Lane wrote:
> >> The buildfarm members that use -DCLOBBER_CACHE_ALWAYS say this patch
> >> is broken.
>
> > Jagarundi still tells that story.
>
> Uh, no. Jagarundi was perfectly happy for several build cycles after
> I committed 405a468. The buildfarm history shows that it started
> complaining again after this change:
>
> f01d1ae Add infrastructure for mapping relfilenodes to relation OIDs

Huh? That's rather strange. No code from that patch is even exececuted
until the alter_table regression tests way later. So I can't see how
that patch could cause the matview error. Which is pretty clearly using
an freed relcache entry.
I think this may be a timing issue.

> > At least something like the following patch seems to be required.
>
> This does look like a necessary change, but I suspect there is also
> something rotten in f01d1ae.

The alter table failure lateron seems to be "ERROR: relation "tmp"
already exists", which just seems to be a consequence of incomplete
cleanup due to the earlier crashes.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Kevin Grittner <kgrittn(at)postgresql(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [COMMITTERS] pgsql: Add support for REFRESH MATERIALIZED VIEW CONCURRENTLY.
Date: 2013-07-23 00:38:53
Message-ID: 20130723003853.GA21996@alap2.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-committers pgsql-hackers

On 2013-07-23 00:01:50 +0200, Andres Freund wrote:
> On 2013-07-17 10:11:28 -0400, Tom Lane wrote:
> > Kevin Grittner <kgrittn(at)postgresql(dot)org> writes:
> > > Add support for REFRESH MATERIALIZED VIEW CONCURRENTLY.
> >
> > The buildfarm members that use -DCLOBBER_CACHE_ALWAYS say this patch
> > is broken.
>
> Jagarundi still tells that story. At least something like the following
> patch seems to be required.

Hm. There seems to be more things that need some more improvement from a
quick look.

First, I have my doubts of the general modus operandy of CONCURRENTLY
here. I think in many, many cases it would be faster to simply swap in
the newly built heap + indexes in concurrently. I think I have seen that
mentioned in the review while mostly skipping the discussion, but still.
That would be easy enough as of Robert's mvcc patch.

* Shouldn't the GetUserIdAndSecContext(), NewGUCNestLevel() dance be
done a good bit earlier? We're executing queries before that.
* The loop over indexes in refresh_by_match_merge should index_open(ExclusiveLock) the
indexes initially instead of searching the syscache manually. They are
opened inside the loop in many cases anyway. There probably aren't any
hazards currently, but I am not even sure about that. The index_open()
in the loop at the very least processes the invalidation messages
other backends send...
I'd even suggest using BuildIndexInfo() or such on the indexes, then
you could use ->ii_Expressions et al instead of peeking into indkeys
by hand.
* All not explicitly looked up operators should be qualified using
OPERATOR(). It's easy to define your own = operator for tid and set
the search_path to public,pg_catalog. Now, the whole thing is
restricted to talbe owners afaics, making this decidedly less dicey,
but still. But if anyobdy actually uses a search_path like the above
you can easily hurt them.
* Same seems to hold true for the y = x and y.* IS DISTINCT FROM x.*
operations.
* (y.*) = (x.*) also needs to do proper operator lookups.
* OpenMatViewIncrementalMaintenance() et al seem to be underdocumented.
* why is it even allowed that matview_maintenance_depth is > 1? Unless I
miss something there doesn't seem to be support for a recursive
refresh whatever that would mean?
* INSERT and DELETE should also alias the table names, to make sure there cannot
be issues with the nested queries.
* I'd strongly suggest more descriptive table aliases than x, y,
d. Those make the statements rather hard to parse.
* SPI_exec() is deprecated in favor of SPI_execute()
* ISTM deferred uniqueness checks and similar things should be enforced
to run ub refresh_by_match_merge()

I think this patch/feature needs a good bit of work...

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Kevin Grittner <kgrittn(at)ymail(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Kevin Grittner <kgrittn(at)postgresql(dot)org>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [COMMITTERS] pgsql: Add support for REFRESH MATERIALIZED VIEW CONCURRENTLY.
Date: 2013-07-23 16:27:34
Message-ID: 1374596854.54580.YahooMailNeo@web162904.mail.bf1.yahoo.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-committers pgsql-hackers

Andres Freund <andres(at)2ndquadrant(dot)com> wrote:

> Hm. There seems to be more things that need some more improvement
> from a quick look.
>
> First, I have my doubts of the general modus operandy of
> CONCURRENTLY here. I think in many, many cases it would be faster
> to simply swap in the newly built heap + indexes in concurrently.
> I think I have seen that mentioned in the review while mostly
> skipping the discussion, but still. That would be easy enough as
> of Robert's mvcc patch.

Yeah, in many cases you would not want to choose to specify this
technique. The point was to have a technique which could run
without blocking while a heavy load of queries (including perhaps a
very long-running query) was executing.  If subsequent events
provide an alternative way to do that, it's worth looking at it;
but my hope was to get this out of the way to allow time to develop
incremental maintenance of matviews for the next CF.  If we spend
too much time reworking this to micro-optimize it for new patches,
incremental maintenance might not happen for 9.4.

The bigger point is perhaps syntax.  If MVCC catalogs are really at
a point where we can substitute a new heap and new indexes for a
matview without taking an AccessExclusiveLock, then we should just
make the current code do that.  I think there is still a place for
a differential update for large matviews which only need small
changes on a REFRESH, but perhaps the right term for that is not
CONCURRENTLY.

> * Shouldn't the GetUserIdAndSecContext(), NewGUCNestLevel() dance
>   be done a good bit earlier? We're executing queries before
>   that.

This can't be in effect while we are creating temp tables.  If
we're going to support a differential REFRESH technique, it must be
done more "surgically" than to wrap the whole REFRESH statement
execution.

> * The loop over indexes in refresh_by_match_merge should
>   index_open(ExclusiveLock) the indexes initially instead of
>   searching the syscache manually. They are opened inside the
>   loop in many cases anyway. There probably aren't any hazards
>   currently, but I am not even sure about that. The index_open()
>   in the loop at the very least processes the invalidation
>   messages other backends send...

Fair enough.  That seems like a simple change.

>   I'd even suggest using BuildIndexInfo() or such on the indexes,
>   then you could use ->ii_Expressions et al instead of peeking
>   into indkeys by hand.

Given that the function is in a source file described as containing
"code to create and destroy POSTGRES index relations" and the
comments for that function say that it "stores the information
about the index that's needed by FormIndexDatum, which is used for
both index_build() and later insertion of individual index tuples,"
and we're not going to create or destroy an index or call
FormIndexDatum or insert individual index tuples from this code, it
would seem to be a significant expansion of the charter of that
function.  What benefits do you see to using that level?

> * All not explicitly looked up operators should be qualified
>   using OPERATOR(). It's easy to define your own = operator for
>   tid and set the search_path to public,pg_catalog. Now, the
>   whole thing is restricted to talbe owners afaics, making this
>   decidedly less dicey, but still. But if anyobdy actually uses a
>   search_path like the above you can easily hurt them.
> * Same seems to hold true for the y = x and y.* IS DISTINCT FROM
>   x.* operations.
> * (y.*) = (x.*) also needs to do proper operator lookups.

I wasn't aware that people could override the equality operators
for tid and RECORD, so that seemed like unnecessary overhead.  I
can pull the operators from the btree AM if people can do that.

> * OpenMatViewIncrementalMaintenance() et al seem to be
>   underdocumented.

If the adjacent comment for
MatViewIncrementalMaintenanceIsEnabled() left you at all uncertain
about what OpenMatViewIncrementalMaintenance() and
CloseMatViewIncrementalMaintenance() are for, I can add comments.
At the time I wrote it, it seemed to me to be redundant to have
such comments, and would cause people to waste more time looking at
them then would be saved by anything they could add to what the
function names and one-line function bodies conveyed; but if there
was doubt in your mind about when they should be called, I'll add
something.

Would it help to move the static functions underneath the
(documented) public function and its comment?

> * why is it even allowed that matview_maintenance_depth is > 1?
>   Unless I miss something there doesn't seem to be support for a
>   recursive refresh whatever that would mean?

I was trying to create the function in a way that would be useful
for incremental maintenance, too.  Those could conceivably recurse.
Also, this way bugs in usage are more easily detected.

> * INSERT and DELETE should also alias the table names, to make
>   sure there cannot be issues with the nested queries.

I don't see the hazard -- could you elaborate?

> * I'd strongly suggest more descriptive table aliases than x, y,
>   d. Those make the statements rather hard to parse.

I guess.  Those are intended to be internal, but I guess there's no
reason not to be more verbose in the aliases.

> * SPI_exec() is deprecated in favor of SPI_execute()

According to what?  The documentation of SPI_exec doesn't say
anything about it being deprecated; it does say:

| SPI_exec is the same as SPI_execute, with the latter's read_only
| parameter always taken as false.

I didn't see any particular benefit to using the more verbose form,
but it's certainly easy enough to change if people don't like the
shorter form.

> * ISTM deferred uniqueness checks and similar things should be
>   enforced to run ub refresh_by_match_merge()

"ub"?  That looks like a typo, but I'm not sure what you mean.
Could you show a case where there is a hazard?

There was also a mention of a temp table left around.  I had
intended to specify ON COMMIT DROP for the internally used temp
tables, but apparently lost track of that.

I'll wait for the thread to settle down to offer a patch (or set of
patches).

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Kevin Grittner <kgrittn(at)ymail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Kevin Grittner <kgrittn(at)postgresql(dot)org>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [COMMITTERS] pgsql: Add support for REFRESH MATERIALIZED VIEW CONCURRENTLY.
Date: 2013-07-31 01:39:14
Message-ID: 20130731013914.GD19053@alap2.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-committers pgsql-hackers

Hi Kevin,

On 2013-07-23 09:27:34 -0700, Kevin Grittner wrote:
> Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
>
> > Hm. There seems to be more things that need some more improvement
> > from a quick look.
> >
> > First, I have my doubts of the general modus operandy of
> > CONCURRENTLY here. I think in many, many cases it would be faster
> > to simply swap in the newly built heap + indexes in concurrently.
> > I think I have seen that mentioned in the review while mostly
> > skipping the discussion, but still. That would be easy enough as
> > of Robert's mvcc patch.
>
> Yeah, in many cases you would not want to choose to specify this
> technique. The point was to have a technique which could run
> without blocking while a heavy load of queries (including perhaps a
> very long-running query) was executing.  If subsequent events
> provide an alternative way to do that, it's worth looking at it;
> but my hope was to get this out of the way to allow time to develop
> incremental maintenance of matviews for the next CF.

I thought there was a pretty obvious way to do this concurrently. But
after mulling over it for a few days I either miss a connection I made
previously or, more likely, I have missed some fundamental difficulties
previously.

I'd would be fairly easy if indexes were attached to a relations
relfilenode, not a relations oid... But changing that certainly doesn't
fall in the "easy enough" category anymore.

> The bigger point is perhaps syntax.  If MVCC catalogs are really at
> a point where we can substitute a new heap and new indexes for a
> matview without taking an AccessExclusiveLock, then we should just
> make the current code do that.  I think there is still a place for
> a differential update for large matviews which only need small
> changes on a REFRESH, but perhaps the right term for that is not
> CONCURRENTLY.

Hm. Yes. Especially as any CONCURRENTLY implementation building and
swapping in a new heap that I can think of requires to be run outside a
transaction (like CIC et al).

> > * Shouldn't the GetUserIdAndSecContext(), NewGUCNestLevel() dance
> >   be done a good bit earlier? We're executing queries before
> >   that.
>
> This can't be in effect while we are creating temp tables.  If
> we're going to support a differential REFRESH technique, it must be
> done more "surgically" than to wrap the whole REFRESH statement
> execution.

Sure, it cannot be active when creating the temp table. But it's
currently inactive while you SPI_execute queries. That can't be right.

> >   I'd even suggest using BuildIndexInfo() or such on the indexes,
> >   then you could use ->ii_Expressions et al instead of peeking
> >   into indkeys by hand.
>
> Given that the function is in a source file described as containing
> "code to create and destroy POSTGRES index relations" and the
> comments for that function say that it "stores the information
> about the index that's needed by FormIndexDatum, which is used for
> both index_build() and later insertion of individual index tuples,"
> and we're not going to create or destroy an index or call
> FormIndexDatum or insert individual index tuples from this code, it
> would seem to be a significant expansion of the charter of that
> function.  What benefits do you see to using that level?

I'd prevent you from needing to peek into indkeys. Note that it's
already used in various places.

> > * All not explicitly looked up operators should be qualified
> >   using OPERATOR(). It's easy to define your own = operator for
> >   tid and set the search_path to public,pg_catalog. Now, the
> >   whole thing is restricted to talbe owners afaics, making this
> >   decidedly less dicey, but still. But if anyobdy actually uses a
> >   search_path like the above you can easily hurt them.
> > * Same seems to hold true for the y = x and y.* IS DISTINCT FROM
> >   x.* operations.
> > * (y.*) = (x.*) also needs to do proper operator lookups.
>
> I wasn't aware that people could override the equality operators
> for tid and RECORD, so that seemed like unnecessary overhead.  I
> can pull the operators from the btree AM if people can do that.

Sure, You don't need any special privs for it:
CREATE OPERATOR public.= (
LEFTARG = tid,
RIGHTARG = tid,
PROCEDURE = tideq);
That's obviously not doing anything nefarious, since it calls the
builtin function, but you get the idea.

I think for the cases where you're comparing tids it's fine to just to
hardcode the operator to OPERATOR(pg_catalog.=).

> > * OpenMatViewIncrementalMaintenance() et al seem to be
> >   underdocumented.
>
> If the adjacent comment for
> MatViewIncrementalMaintenanceIsEnabled() left you at all uncertain
> about what OpenMatViewIncrementalMaintenance() and
> CloseMatViewIncrementalMaintenance() are for, I can add comments.
> At the time I wrote it, it seemed to me to be redundant to have
> such comments, and would cause people to waste more time looking at
> them then would be saved by anything they could add to what the
> function names and one-line function bodies conveyed; but if there
> was doubt in your mind about when they should be called, I'll add
> something.

I am not asking for comments that basically paraphrase the functionname,
I don't like those. More for something like:

"Normally DML statements aren't allowed on materialized views, but since
REFRESH CONCURRENTLY internally is implemented by running a bunch of
queries executing DML (c.f. refresh_by_match_merge()) we need to
temporarily allow it while refreshing. Callsites preventing DML and
similar on materialized views thus need to check whether we're currently
refreshing."

> Would it help to move the static functions underneath the
> (documented) public function and its comment?

Yes.

> > * why is it even allowed that matview_maintenance_depth is > 1?
> >   Unless I miss something there doesn't seem to be support for a
> >   recursive refresh whatever that would mean?
>
> I was trying to create the function in a way that would be useful
> for incremental maintenance, too.  Those could conceivably recurse.
> Also, this way bugs in usage are more easily detected.

Also, that belongs in a comment.

> > * INSERT and DELETE should also alias the table names, to make
> >   sure there cannot be issues with the nested queries.
>
> I don't see the hazard -- could you elaborate?

I don't think there is an actual hazard *right now*. But at least for
the DELETE case it took me a second or two to make sure there's no case
a matview called 'd' cannot cause problems.

> > * I'd strongly suggest more descriptive table aliases than x, y,
> >   d. Those make the statements rather hard to parse.
>
> I guess.  Those are intended to be internal, but I guess there's no
> reason not to be more verbose in the aliases.

Well, for one, other people will read that code every now and then. I am
not 100% convinced that all the statements are correct, and it's more
effort to do so right now. Also, those statements will show up in error
messages.

I first wanted to amend the above to say that they will also show up in
pg_stat_activity, but it doesn't look like it atm, you're never setting
anything. Which is consistent with other commands, even if it strikes me
as suboptimal.

> > * SPI_exec() is deprecated in favor of SPI_execute()

> According to what?  The documentation of SPI_exec doesn't say
> anything about it being deprecated; it does say:

I only remembered the following comment from spi.c:
/* Obsolete version of SPI_execute */

I guess, if the docs don't show that...

> > * ISTM deferred uniqueness checks and similar things should be
> >   enforced to run ub refresh_by_match_merge()
>
> "ub"?  That looks like a typo, but I'm not sure what you mean.
> Could you show a case where there is a hazard?

I think I wanted to type "in" but moved one row to far to the
left... What I am thinking of is that you'll get a successfull REFRESH
CONCURRENTLY but it will later error out at COMMIT time because there
were constraint violations. Afaik we don't have any such behaviour for
existing DDL and I don't like introducing it.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Noah Misch <noah(at)leadboat(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>, Kevin Grittner <kgrittn(at)ymail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Kevin Grittner <kgrittn(at)postgresql(dot)org>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [COMMITTERS] pgsql: Add support for REFRESH MATERIALIZED VIEW CONCURRENTLY.
Date: 2013-08-01 04:22:16
Message-ID: 20130801042216.GA331932@tornado.leadboat.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-committers pgsql-hackers

On Wed, Jul 31, 2013 at 03:39:14AM +0200, Andres Freund wrote:
> On 2013-07-23 09:27:34 -0700, Kevin Grittner wrote:
> > Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> > > * Shouldn't the GetUserIdAndSecContext(), NewGUCNestLevel() dance
> > >   be done a good bit earlier? We're executing queries before
> > >   that.
> >
> > This can't be in effect while we are creating temp tables.  If
> > we're going to support a differential REFRESH technique, it must be
> > done more "surgically" than to wrap the whole REFRESH statement
> > execution.

That InSecurityRestrictedOperation() check happens in DefineRelation(); I
advise bypassing it by calling heap_create_with_catalog() directly.
make_new_heap() is a comparable example. It's a good thing for more reasons
than the one prompting this. For example, the current implementation will
fail if the caller lacks TEMP permission on the database. Though it's
probably a rare thing to deny TEMP but allow creating permanent objects such
as a MV, that's a wart.

> Sure, it cannot be active when creating the temp table. But it's
> currently inactive while you SPI_execute queries. That can't be right.

It may be okay once the operator search issues Andres notes later are fixed.
The queries that now run outside a security restricted context _should_ call
only functions associated with default B-tree or hash equality operators.
Since only superusers can define operator classes, we can assume that those
operators never do something unseemly. But if the code will indeed be safe
for that reason, I would hope for a comment explaining as much. That would
also make the SetUserIdAndSecContext() calls later in refresh_by_match_merge()
unnecessary. All that being said, the implementation would be more robust if
we remove the SQL-level use of CREATE TEMP TABLE and return to running the
entire operation as the MV owner.

> > > * All not explicitly looked up operators should be qualified
> > >   using OPERATOR(). It's easy to define your own = operator for
> > >   tid and set the search_path to public,pg_catalog. Now, the
> > >   whole thing is restricted to talbe owners afaics, making this
> > >   decidedly less dicey, but still. But if anyobdy actually uses a
> > >   search_path like the above you can easily hurt them.
> > > * Same seems to hold true for the y = x and y.* IS DISTINCT FROM
> > >   x.* operations.
> > > * (y.*) = (x.*) also needs to do proper operator lookups.

A few supplements to Andres's comments:

1. ri_triggers.c is the model for constructing queries in a manner that
guarantees uniform semantics in the face of overloading and search_path
variations. Check out how it handles these issues.

2. DISTINCT FROM and IN syntax cannot be used in this kind of query at all,
because they always depend on the search path. However, you could construct
your own DistinctExpr node referencing the operator you want.

Here's an SQL statement generated by this code that currently uses "rowvar IS
DISTINCT FROM rowvar":

CREATE TEMP TABLE pg_temp_1.pg_temp_41108_2 AS
SELECT x.ctid AS tid, y
FROM public.mv x FULL JOIN pg_temp_1.pg_temp_41108 y ON
(y.key OPERATOR(pg_catalog.=) x.key AND y = x)
WHERE (y.*) IS DISTINCT FROM (x.*)
ORDER BY tid

The point of the "(y.*) IS DISTINCT FROM (x.*)" is just to filter out the
inner portion of the full join, right? Could write that a different way.

> I think for the cases where you're comparing tids it's fine to just to
> hardcode the operator to OPERATOR(pg_catalog.=).

Yes, that's fine whenever you're dealing with a known data type that uses "="
as its default B-tree equality operator (e.g. nearly every built-in type). So
"(x.*) OPERATOR(pg_catalog.=) (y.*)" for records is fine, too.

Kevin, this code is aware that "rowvar = rowvar" uses DISTINCT FROM semantics
for individual columns, right? The block comment at refresh_by_match_merge()
talks about NULL comparisons, but I couldn't tell definitively whether it
recognizes that fact.

Thanks,
nm

--
Noah Misch
EnterpriseDB http://www.enterprisedb.com


From: Kevin Grittner <kgrittn(at)ymail(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, Noah Misch <noah(at)leadboat(dot)com>
Subject: Re: [COMMITTERS] pgsql: Add support for REFRESH MATERIALIZED VIEW CONCURRENTLY.
Date: 2013-08-05 15:37:57
Message-ID: 1375717077.37431.YahooMailNeo@web162901.mail.bf1.yahoo.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-committers pgsql-hackers

Some of the issues raised by Andres and Noah have been addressed.
These all seemed simple and non-controversial, so I've just applied
the suggested fixes.

Some issues remain, such as how best to create the temp table used
for the "diff" data, and the related simplification of the security
context swapping that might become possible with a change there.
Review of that area has raised a couple other questions that I'm
looking into.  These all probably amount to enough that I will add
the patch(es) to address them to the next CF.

Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> On 2013-07-23 09:27:34 -0700, Kevin Grittner wrote:
>> Andres Freund <andres(at)2ndquadrant(dot)com> wrote:

>>> The loop over indexes in refresh_by_match_merge should
>>> index_open(ExclusiveLock) the indexes initially instead of
>>> searching the syscache manually. They are opened inside the
>>> loop in many cases anyway. There probably aren't any hazards
>>> currently, but I am not even sure about that. The index_open()
>>> in the loop at the very least processes the invalidation
>>> messages other backends send...

Fixed.

>>>   I'd even suggest using BuildIndexInfo() or such on the indexes,
>>>   then you could use ->ii_Expressions et al instead of peeking
>>>   into indkeys by hand.
>>
>> Given that the function is in a source file described as containing
>> "code to create and destroy POSTGRES index relations" and the
>> comments for that function say that it "stores the information
>> about the index that's needed by FormIndexDatum, which is used for
>> both index_build() and later insertion of individual index tuples,"
>> and we're not going to create or destroy an index or call
>> FormIndexDatum or insert individual index tuples from this code, it
>> would seem to be a significant expansion of the charter of that
>> function.  What benefits do you see to using that level?
>
> I'd prevent you from needing to peek into indkeys. Note that it's
> already used in various places.

I looked at where it was and wasn't used, and continue to be
skeptical.  For example, both techniques are used in tablecmds.c;
the technique you suggest is used where an index is being created,
dropped, or index tuples are being manipulated, while the
Form_pg_index structure is being used when the definition of the
index is being examined without directly manipulating it.  Compare
what is being done in my code with the existing code for
ATExecDropNotNull(), for example.

>>> [while comparison of indexed columns used OPERATOR() correctly,
>>> comparison of tid and rowvar values did not]

>> I wasn't aware that people could override the equality operators
>> for tid and RECORD

>> [example proving the possibility]

>> I think for the cases where you're comparing tids it's fine to just
>> to hardcode the operator to OPERATOR(pg_catalog.=).

Done.

>>> * I'd strongly suggest more descriptive table aliases than x, y,
>>>   d. Those make the statements rather hard to parse.
>>
>> I guess.  Those are intended to be internal, but I guess there's no
>> reason not to be more verbose in the aliases.
>
> Well, for one, other people will read that code every now and then. I am
> not 100% convinced that all the statements are correct, and it's more
> effort to do so right now. Also, those statements will show up in error
> messages.

Done.

> What I am thinking of is that you'll get a successfull REFRESH
> CONCURRENTLY but it will later error out at COMMIT time because there
> were constraint violations. Afaik we don't have any such behaviour for
> existing DDL and I don't like introducing it.

REFRESH MATERIALIZED VIEW CONCURRENTLY is definitely not DDL.  It
is DML, and behavior should be consistent with that.  (Note that
the definition of the matview remains exactly the same after the
statement executes as it was before; only the data is modified.)
Without the CONCURRENTLY clause it's in the same sort of gray area
as TRUNCATE TABLE, where it is essentially DML, but the
implementation details are similar to that of DDL, so it may
sometimes be hard to avoid DDL-like behaviors, even though it would
be best to do so.  We have no such problem here.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Kevin Grittner <kgrittn(at)ymail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, Noah Misch <noah(at)leadboat(dot)com>
Subject: Re: [COMMITTERS] pgsql: Add support for REFRESH MATERIALIZED VIEW CONCURRENTLY.
Date: 2013-08-05 16:14:39
Message-ID: 20130805161439.GB19850@alap2.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-committers pgsql-hackers

Hi
On 2013-08-05 08:37:57 -0700, Kevin Grittner wrote:
> Some of the issues raised by Andres and Noah have been addressed.
> These all seemed simple and non-controversial, so I've just applied
> the suggested fixes.

Cool!

> >>>   I'd even suggest using BuildIndexInfo() or such on the indexes,
> >>>   then you could use ->ii_Expressions et al instead of peeking
> >>>   into indkeys by hand.
> >>
> >> Given that the function is in a source file described as containing
> >> "code to create and destroy POSTGRES index relations" and the
> >> comments for that function say that it "stores the information
> >> about the index that's needed by FormIndexDatum, which is used for
> >> both index_build() and later insertion of individual index tuples,"
> >> and we're not going to create or destroy an index or call
> >> FormIndexDatum or insert individual index tuples from this code, it
> >> would seem to be a significant expansion of the charter of that
> >> function.  What benefits do you see to using that level?
> >
> > I'd prevent you from needing to peek into indkeys. Note that it's
> > already used in various places.
>
> I looked at where it was and wasn't used, and continue to be
> skeptical.  For example, both techniques are used in tablecmds.c;
> the technique you suggest is used where an index is being created,
> dropped, or index tuples are being manipulated, while the
> Form_pg_index structure is being used when the definition of the
> index is being examined without directly manipulating it.  Compare
> what is being done in my code with the existing code for
> ATExecDropNotNull(), for example.

The RelationGetIndexExpressions() you mentioned in the commit sounds
like a good plan to me. Didn't remember that existed.

Don't think the ATExecDropNotNull() comparison is really valid, we need
to know more details there, but anyway, you've found something a good
bit better.

> > What I am thinking of is that you'll get a successfull REFRESH
> > CONCURRENTLY but it will later error out at COMMIT time because there
> > were constraint violations. Afaik we don't have any such behaviour for
> > existing DDL and I don't like introducing it.0
>
> REFRESH MATERIALIZED VIEW CONCURRENTLY is definitely not DDL.  It
> is DML, and behavior should be consistent with that.  (Note that
> the definition of the matview remains exactly the same after the
> statement executes as it was before; only the data is modified.)
> Without the CONCURRENTLY clause it's in the same sort of gray area
> as TRUNCATE TABLE, where it is essentially DML, but the
> implementation details are similar to that of DDL, so it may
> sometimes be hard to avoid DDL-like behaviors, even though it would
> be best to do so.  We have no such problem here.

But there's no usecase that makes deferred checks and similar useful
afaics. And it seems to me like it certainly will confuse users that a
second RMVC fails (via CheckTableNotInUse()) because there's still a
deferred trigger queue.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services