Re: [COMMITTERS] pgsql: Add support for REFRESH MATERIALIZED VIEW CONCURRENTLY.

From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Kevin Grittner <kgrittn(at)ymail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Kevin Grittner <kgrittn(at)postgresql(dot)org>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [COMMITTERS] pgsql: Add support for REFRESH MATERIALIZED VIEW CONCURRENTLY.
Date: 2013-07-31 01:39:14
Message-ID: 20130731013914.GD19053@alap2.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-committers pgsql-hackers

Hi Kevin,

On 2013-07-23 09:27:34 -0700, Kevin Grittner wrote:
> Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
>
> > Hm. There seems to be more things that need some more improvement
> > from a quick look.
> >
> > First, I have my doubts of the general modus operandy of
> > CONCURRENTLY here. I think in many, many cases it would be faster
> > to simply swap in the newly built heap + indexes in concurrently.
> > I think I have seen that mentioned in the review while mostly
> > skipping the discussion, but still. That would be easy enough as
> > of Robert's mvcc patch.
>
> Yeah, in many cases you would not want to choose to specify this
> technique. The point was to have a technique which could run
> without blocking while a heavy load of queries (including perhaps a
> very long-running query) was executing.  If subsequent events
> provide an alternative way to do that, it's worth looking at it;
> but my hope was to get this out of the way to allow time to develop
> incremental maintenance of matviews for the next CF.

I thought there was a pretty obvious way to do this concurrently. But
after mulling over it for a few days I either miss a connection I made
previously or, more likely, I have missed some fundamental difficulties
previously.

I'd would be fairly easy if indexes were attached to a relations
relfilenode, not a relations oid... But changing that certainly doesn't
fall in the "easy enough" category anymore.

> The bigger point is perhaps syntax.  If MVCC catalogs are really at
> a point where we can substitute a new heap and new indexes for a
> matview without taking an AccessExclusiveLock, then we should just
> make the current code do that.  I think there is still a place for
> a differential update for large matviews which only need small
> changes on a REFRESH, but perhaps the right term for that is not
> CONCURRENTLY.

Hm. Yes. Especially as any CONCURRENTLY implementation building and
swapping in a new heap that I can think of requires to be run outside a
transaction (like CIC et al).

> > * Shouldn't the GetUserIdAndSecContext(), NewGUCNestLevel() dance
> >   be done a good bit earlier? We're executing queries before
> >   that.
>
> This can't be in effect while we are creating temp tables.  If
> we're going to support a differential REFRESH technique, it must be
> done more "surgically" than to wrap the whole REFRESH statement
> execution.

Sure, it cannot be active when creating the temp table. But it's
currently inactive while you SPI_execute queries. That can't be right.

> >   I'd even suggest using BuildIndexInfo() or such on the indexes,
> >   then you could use ->ii_Expressions et al instead of peeking
> >   into indkeys by hand.
>
> Given that the function is in a source file described as containing
> "code to create and destroy POSTGRES index relations" and the
> comments for that function say that it "stores the information
> about the index that's needed by FormIndexDatum, which is used for
> both index_build() and later insertion of individual index tuples,"
> and we're not going to create or destroy an index or call
> FormIndexDatum or insert individual index tuples from this code, it
> would seem to be a significant expansion of the charter of that
> function.  What benefits do you see to using that level?

I'd prevent you from needing to peek into indkeys. Note that it's
already used in various places.

> > * All not explicitly looked up operators should be qualified
> >   using OPERATOR(). It's easy to define your own = operator for
> >   tid and set the search_path to public,pg_catalog. Now, the
> >   whole thing is restricted to talbe owners afaics, making this
> >   decidedly less dicey, but still. But if anyobdy actually uses a
> >   search_path like the above you can easily hurt them.
> > * Same seems to hold true for the y = x and y.* IS DISTINCT FROM
> >   x.* operations.
> > * (y.*) = (x.*) also needs to do proper operator lookups.
>
> I wasn't aware that people could override the equality operators
> for tid and RECORD, so that seemed like unnecessary overhead.  I
> can pull the operators from the btree AM if people can do that.

Sure, You don't need any special privs for it:
CREATE OPERATOR public.= (
LEFTARG = tid,
RIGHTARG = tid,
PROCEDURE = tideq);
That's obviously not doing anything nefarious, since it calls the
builtin function, but you get the idea.

I think for the cases where you're comparing tids it's fine to just to
hardcode the operator to OPERATOR(pg_catalog.=).

> > * OpenMatViewIncrementalMaintenance() et al seem to be
> >   underdocumented.
>
> If the adjacent comment for
> MatViewIncrementalMaintenanceIsEnabled() left you at all uncertain
> about what OpenMatViewIncrementalMaintenance() and
> CloseMatViewIncrementalMaintenance() are for, I can add comments.
> At the time I wrote it, it seemed to me to be redundant to have
> such comments, and would cause people to waste more time looking at
> them then would be saved by anything they could add to what the
> function names and one-line function bodies conveyed; but if there
> was doubt in your mind about when they should be called, I'll add
> something.

I am not asking for comments that basically paraphrase the functionname,
I don't like those. More for something like:

"Normally DML statements aren't allowed on materialized views, but since
REFRESH CONCURRENTLY internally is implemented by running a bunch of
queries executing DML (c.f. refresh_by_match_merge()) we need to
temporarily allow it while refreshing. Callsites preventing DML and
similar on materialized views thus need to check whether we're currently
refreshing."

> Would it help to move the static functions underneath the
> (documented) public function and its comment?

Yes.

> > * why is it even allowed that matview_maintenance_depth is > 1?
> >   Unless I miss something there doesn't seem to be support for a
> >   recursive refresh whatever that would mean?
>
> I was trying to create the function in a way that would be useful
> for incremental maintenance, too.  Those could conceivably recurse.
> Also, this way bugs in usage are more easily detected.

Also, that belongs in a comment.

> > * INSERT and DELETE should also alias the table names, to make
> >   sure there cannot be issues with the nested queries.
>
> I don't see the hazard -- could you elaborate?

I don't think there is an actual hazard *right now*. But at least for
the DELETE case it took me a second or two to make sure there's no case
a matview called 'd' cannot cause problems.

> > * I'd strongly suggest more descriptive table aliases than x, y,
> >   d. Those make the statements rather hard to parse.
>
> I guess.  Those are intended to be internal, but I guess there's no
> reason not to be more verbose in the aliases.

Well, for one, other people will read that code every now and then. I am
not 100% convinced that all the statements are correct, and it's more
effort to do so right now. Also, those statements will show up in error
messages.

I first wanted to amend the above to say that they will also show up in
pg_stat_activity, but it doesn't look like it atm, you're never setting
anything. Which is consistent with other commands, even if it strikes me
as suboptimal.

> > * SPI_exec() is deprecated in favor of SPI_execute()

> According to what?  The documentation of SPI_exec doesn't say
> anything about it being deprecated; it does say:

I only remembered the following comment from spi.c:
/* Obsolete version of SPI_execute */

I guess, if the docs don't show that...

> > * ISTM deferred uniqueness checks and similar things should be
> >   enforced to run ub refresh_by_match_merge()
>
> "ub"?  That looks like a typo, but I'm not sure what you mean.
> Could you show a case where there is a hazard?

I think I wanted to type "in" but moved one row to far to the
left... What I am thinking of is that you'll get a successfull REFRESH
CONCURRENTLY but it will later error out at COMMIT time because there
were constraint violations. Afaik we don't have any such behaviour for
existing DDL and I don't like introducing it.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

In response to

Responses

Browse pgsql-committers by date

  From Date Subject
Next Message Fujii Masao 2013-07-31 13:37:53 pgsql: Fix inaccurate description of tablespace.
Previous Message Noah Misch 2013-07-31 00:03:56 pgsql: Restore REINDEX constraint validation.

Browse pgsql-hackers by date

  From Date Subject
Next Message Andres Freund 2013-07-31 02:14:34 Re: Computer VARSIZE_ANY(PTR) during debugging
Previous Message Amit Langote 2013-07-31 00:57:50 Re: Computer VARSIZE_ANY(PTR) during debugging