Re: refresh materialized view concurrently

From: Kevin Grittner <kgrittn(at)ymail(dot)com>
To: Hitoshi Harada <umi(dot)tanuki(at)gmail(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: refresh materialized view concurrently
Date: 2013-06-26 08:38:33
Message-ID: 1372235913.24059.YahooMailNeo@web162901.mail.bf1.yahoo.com
Lists: pgsql-hackers

Hitoshi Harada <umi(dot)tanuki(at)gmail(dot)com> wrote:

> I spent a few hours to review the patch.

Thanks!

> As far as I can tell, the overall approach is as follows.
>
> - create a new temp heap as non-concurrent does, but with
> ExclusiveLock on the matview, so that reader wouldn't be blocked

The non-concurrent path creates the new heap in the matview's
tablespace and namespace, so the "temp" part is one way the
concurrent path differs.  That difference is part of why concurrent
can be faster when few rows change.

Also, before the next step there is an ANALYZE of the temp table,
so the planner can make good choices in the next step.

> - with this temp table and the matview, query FULL JOIN and
> extract difference between original matview and temp heap (via SPI)

Right; into another temp table.
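As a rough sketch of that step (the table and column names here are
illustrative, not the patch's internals): given a matview mv(id, val)
with a unique index on id, and newdata as the freshly built and
analyzed temp copy, the diff query has this general shape:

```sql
-- Hypothetical names; a sketch of the FULL JOIN diff, not the
-- patch's generated SQL.  tid remembers the ctid of the existing
-- matview row, or is NULL for rows present only in the new data.
CREATE TEMP TABLE diff AS
SELECT mv.ctid AS tid, newdata.*
  FROM mv
  FULL JOIN newdata
    ON mv.id = newdata.id
   AND mv.val IS NOT DISTINCT FROM newdata.val
 WHERE mv.ctid IS NULL      -- row only in newdata: to be inserted
    OR newdata.id IS NULL;  -- row only in mv: to be deleted
```

A row whose values changed shows up twice in the diff: once with tid
set (the old version) and once with tid NULL (the new version), so
both sides of the change are captured.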

> - this operation requires unique index for performance reason (or
> correctness reason too)

It is primarily for correctness in the face of duplicate rows which
have no nulls.  Do you think the reasons need to be better
documented with comments?

> - run ANALYZE on this diff table
>
> - run UPDATE, INSERT and DELETE via SPI, to do the merge
>
> - close these temp heap

Right, then drop them.
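Sketched in SQL with hypothetical names: mv is the matview and
diff(tid, id, val) is the diff temp table described above, where tid
is the ctid of an existing matview row or NULL for a row that exists
only in the new data.  The patch issues UPDATEs as well; this sketch
folds updates into delete-plus-insert just to show the shape of the
merge:

```sql
-- Remove matview rows that are absent from (or stale in) the new
-- data; their ctids were captured in the diff.
DELETE FROM mv
 WHERE ctid IN (SELECT tid FROM diff WHERE tid IS NOT NULL);

-- Add rows that exist only in the new data.
INSERT INTO mv (id, val)
SELECT id, val
  FROM diff
 WHERE tid IS NULL;

-- Finally the temp tables (newdata being the hypothetical temp
-- copy) are dropped.
DROP TABLE newdata, diff;
```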

> Assuming I'm asking something wrong and going for the current
> approach, here's what I've found in the patch.

> - I'm not sure if unique key index requirement is reasonable or
> not, because users only add CONCURRENTLY option and not merging
> or incremental update.

The patch would need to be about an order of magnitude more complex
without that requirement due to the problems handling duplicate
rows.  This patch uses set logic, and set logic falls down badly in
the face of duplicate rows.  Consider how you would try to make
this technique work if the "old" data has 3 versions of a row and
the "new" data in the temp table has 4 versions of that same row.
You can play around with that by modifying the examples of the
logic using regular tables I included with the first version of the
patch.  A UNIQUE index is the only way to prevent duplicate rows.
The fact that there can be duplicates with NULL in one or more of
the columns which are part of a unique index is not a fatal
problem; it just means that those cases must be handled through
DELETE and re-INSERT of the rows containing NULL in any column
which is part of a unique index key.
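To see concretely why set logic breaks down on duplicates, here is a
small Python illustration (not from the patch) contrasting a
set-based diff, which is effectively what a join-based comparison
computes, with the multiset answer that would actually be needed:

```python
from collections import Counter

# "Old" matview contents: 3 identical versions of one row.
old = [("a", 1)] * 3
# "New" data: 4 identical versions of the same row.
new = [("a", 1)] * 4

# Set logic: rows present on only one side.  Duplicates collapse,
# so multiplicity information is lost.
set_diff_delete = set(old) - set(new)   # rows to DELETE
set_diff_insert = set(new) - set(old)   # rows to INSERT
# Both are empty: set logic concludes nothing changed, yet the
# matview actually needs one more copy of the row.

# A multiset diff sees the difference in row counts.
multiset_insert = Counter(new) - Counter(old)  # one extra copy needed
```

With a unique index, each key identifies at most one row on each
side, so the set-based diff is exact and none of this arises.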

> - As others pointed out, quoteOneName can be replaced with
> quote_identifier

OK, I did it that way.  The quote_identifier and
quote_qualified_identifier functions seem rather clumsy for the
case where you already know you have an identifier (because you are
dealing with an open Relation).  I think we should have different
functions for that case, but that should probably not be material
for this patch, so I changed it to use the existing tools.
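For what it's worth, the SQL-callable quote_ident(), which applies
the same quoting rules, shows the behavior: quotes are added only
when the identifier actually needs them.

```sql
SELECT quote_ident('simple');      -- simple
SELECT quote_ident('Mixed Case');  -- "Mixed Case"
SELECT quote_ident('select');      -- "select"  (reserved word)
```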

> - This looks harmless, but I wonder if you need to change relkind.
>
> *** 665,672 **** make_new_heap(Oid OIDOldHeap, Oid NewTableSpace)
>                                             OldHeap->rd_rel->relowner,
>                                             OldHeapDesc,
>                                             NIL,
> !                                           OldHeap->rd_rel->relkind,
> !                                           OldHeap->rd_rel->relpersistence,
>                                             false,
>                                             RelationIsMapped(OldHeap),
>                                             true,
> --- 680,687 ----
>                                             OldHeap->rd_rel->relowner,
>                                             OldHeapDesc,
>                                             NIL,
> !                                           RELKIND_RELATION,
> !                                           relpersistence,
>                                             false,
>                                             RelationIsMapped(OldHeap),
>                                             true,
>
>
> Since OldHeap->rd_rel->relkind has been working with 'm', too,
> not only 'r'?

No, the relation created by this is not going to be around when
we're done; we're either going to move its heap onto the existing
matview or drop the temp table.  Either way, it's OK for it to be a
table until we do.  I don't see the benefit of complicating the
code to do otherwise.

> - I found two additional parameters on make_new_heap ugly, but
> couldn't come up with better solution.  Maybe we can pass
> Relation of old heap to the function instead of Oid..

This was the cleanest way I could see.  Opening the relation
elsewhere and passing it in would not only be uglier IMO, it would
do nothing to tell the function that it should create a temporary
table.

I also modified the confusing error message to something close to
the suggestion from Robert.

What to do about the Assert that the matview is not a system
relation seems like material for a separate patch.  After review,
I'm inclined to remove the test altogether, so that extensions can
create matviews in pg_catalog.

New version attached.

--
Kevin Grittner
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment Content-Type Size
refresh-concurrently-v2.patch text/x-diff 36.4 KB
