Re: Support for REINDEX CONCURRENTLY

From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Simon Riggs <simon(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Support for REINDEX CONCURRENTLY
Date: 2013-01-25 04:48:50
Message-ID: CAB7nPqRwVtQcHWErUf9o0hrRGFyQ9xArk7K7jCLxqKLy_6CXPQ@mail.gmail.com
Lists: pgsql-hackers

All the comments are addressed in version 8 attached, except for the
hashtable part, which requires some heavy changes.

On Thu, Jan 24, 2013 at 3:41 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:

> On 2013-01-15 18:16:59 +0900, Michael Paquier wrote:
> > The code path used by REINDEX CONCURRENTLY permits creating an index in
> > parallel with an existing one, not a completely new index. Shouldn't this
> > work for indexes used by exclusion constraints also?
>
> But that fact might save things. I don't immediately see any reason that
> adding a
> if (!indisvalid)
> return;
> to check_exclusion_constraint wouldn't be sufficient if there's another
> index with an equivalent definition.
>
Indeed, this might be enough, as this code path cannot be taken for CREATE
INDEX CONCURRENTLY and only indexes created concurrently can be invalid.
Hence I am adding that check in the patch with a comment explaining why.
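For reference, the added check amounts to something like the sketch below
(illustrative only; the exact placement inside check_exclusion_constraint and
the surrounding variables follow the existing function, which is elided here):

    /*
     * Sketch: skip the check against an index that is not valid, e.g. the
     * transient index built by REINDEX CONCURRENTLY. An equivalent valid
     * index enforces the constraint in the meantime. "index" is assumed to
     * be the already-opened index Relation.
     */
    if (!index->rd_index->indisvalid)
        return;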

> > > > + /*
> > > > +  * Phase 2 of REINDEX CONCURRENTLY
> > > > +  *
> > > > +  * Build concurrent indexes in a separate transaction for each index to
> > > > +  * avoid having open transactions for an unnecessary long time. We also
> > > > +  * need to wait until no running transactions could have the parent table
> > > > +  * of index open. A concurrent build is done for each concurrent
> > > > +  * index that will replace the old indexes.
> > > > +  */
> > > > +
> > > > + /* Get the first element of concurrent index list */
> > > > + lc2 = list_head(concurrentIndexIds);
> > > > +
> > > > + foreach(lc, indexIds)
> > > > + {
> > > > +     Relation    indexRel;
> > > > +     Oid         indOid = lfirst_oid(lc);
> > > > +     Oid         concurrentOid = lfirst_oid(lc2);
> > > > +     Oid         relOid;
> > > > +     bool        primary;
> > > > +     LOCKTAG    *heapLockTag = NULL;
> > > > +     ListCell   *cell;
> > > > +
> > > > +     /* Move to next concurrent item */
> > > > +     lc2 = lnext(lc2);
> > > > +
> > > > +     /* Start new transaction for this index concurrent build */
> > > > +     StartTransactionCommand();
> > > > +
> > > > +     /* Get the parent relation Oid */
> > > > +     relOid = IndexGetRelation(indOid, false);
> > > > +
> > > > +     /*
> > > > +      * Find the locktag of parent table for this index, we need to
> > > > +      * wait for locks on it.
> > > > +      */
> > > > +     foreach(cell, lockTags)
> > > > +     {
> > > > +         LOCKTAG *localTag = (LOCKTAG *) lfirst(cell);
> > > > +         if (relOid == localTag->locktag_field2)
> > > > +             heapLockTag = localTag;
> > > > +     }
> > > > +
> > > > +     Assert(heapLockTag && heapLockTag->locktag_field2 != InvalidOid);
> > > > +     WaitForVirtualLocks(*heapLockTag, ShareLock);
> > >
> > > Why do we have to do the WaitForVirtualLocks here? Shouldn't we do this
> > > once for all relations after each phase? Otherwise the waiting time will
> > > really start to hit when you do this on a somewhat busy server.
> > >
> > Each new index is built and set as ready in a separate single transaction,
> > so doesn't it make sense to wait for the parent relation each time? It is
> > possible to wait for a parent relation only once during this phase, but in
> > this case all the indexes of the same relation need to be set as ready in
> > the same transaction. So here the choice is either to wait for the same
> > relation multiple times for a single index, or wait once for a parent
> > relation but build all the concurrent indexes within the same transaction.
> > Choice 1 makes the code clearer and more robust to my mind as phase 2 is
> > done clearly for each index separately. Thoughts?
>
> As far as I understand that code its purpose is to enforce that all
> potential users have an up2date definition available. For that we
> acquire a lock on all virtualxids of users using that table thus waiting
> for them to finish.
> Consider the scenario where you have a workload where most transactions
> are fairly long (say 10min) and use the same tables (a,b)/indexes(a_1,
> a_2, b_1, b_2). With the current strategy you will do:
>
> WaitForVirtualLocks(a_1) -- wait up to 10min
> index_build(a_1)
> WaitForVirtualLocks(a_2) -- wait up to 10min
> index_build(a_2)
>
> ...
>
> So instead of waiting up to 10 minutes for that phase you have to wait up
> to 40.
>
This is necessary if you want to process each index entry in a different
transaction, as WaitForVirtualLocks needs to wait for the locks held on the
parent table. If you want to do this wait only once per relation, the
solution would be to group the builds of all the indexes of a relation in the
same transaction. One index per transaction looks more solid in this case: if
there is a failure during the process, only one index will be incorrectly
built. Also, when you run a REINDEX CONCURRENTLY, you should not need to
worry about the time it takes. The point is that this operation is done in
the background and that the tables remain accessible during this time.
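For comparison, the grouped alternative being discussed would look roughly
like the sketch below (GetHeapLockTag, RelationGetIndexOids and
BuildConcurrentIndex are hypothetical helpers used only to illustrate the
control flow, not functions from the patch):

    /* One transaction and one wait per parent relation instead of per index */
    foreach(lc, relationIds)
    {
        Oid         relOid = lfirst_oid(lc);
        ListCell   *lc2;

        StartTransactionCommand();

        /* Wait a single time for transactions holding locks on this table */
        WaitForVirtualLocks(*GetHeapLockTag(relOid), ShareLock);

        /* ... then build every concurrent index of this relation ... */
        foreach(lc2, RelationGetIndexOids(relOid))
            BuildConcurrentIndex(lfirst_oid(lc2));

        CommitTransactionCommand();
    }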

> > > > +     indexRel = index_open(indOid, ShareUpdateExclusiveLock);
> > >
> > > I wonder if we should directly open it exclusive here given it's going
> > > to be opened exclusively in a bit anyway. Not that that will really
> > > reduce the deadlock likelihood since we already hold the
> > > ShareUpdateExclusiveLock in session mode ...
> > >
> > I tried to use an AccessExclusiveLock here but it happens that this is not
> > compatible with index_set_state_flags. Does taking an exclusive lock
> > increment the transaction ID of the running transaction? Because what I am
> > seeing is that taking AccessExclusiveLock on this index does a transaction
> > update.
>
> Yep, it does when wal_level = hot_standby because it logs the exclusive
> lock to wal so the startup process on the standby can acquire it.
>
> Imo that Assert needs to be moved to the existing callsites if there
> isn't an equivalent one already.
>
OK. Leaving the assertion inside index_set_state_flags makes the code more
consistent with CREATE INDEX CONCURRENTLY, so the existing behavior is fine.

>
> > For those reasons current code sticks with ShareUpdateExclusiveLock. Not a
> > big deal btw...
>
> Well, lock upgrades make deadlocks more likely.
>
> Ok, off to v7:
> + */
> +void
> +index_concurrent_swap(Oid newIndexOid, Oid oldIndexOid)
> ...
> + /*
> +  * Take a lock on the old and new index before switching their names. This
> +  * avoids having index swapping relying on relation renaming mechanism to
> +  * get a lock on the relations involved.
> +  */
> + oldIndexRel = relation_open(oldIndexOid, AccessExclusiveLock);
> + newIndexRel = relation_open(newIndexOid, AccessExclusiveLock);
> ..
> + /*
> +  * If the index swapped is a toast index, take an exclusive lock on its
> +  * parent toast relation and then update reltoastidxid to the new index
> +  * Oid value.
> +  */
> + if (get_rel_relkind(parentOid) == RELKIND_TOASTVALUE)
> + {
> +     Relation pg_class;
> +
> +     /* Open pg_class and fetch a writable copy of the relation tuple */
> +     pg_class = heap_open(parentOid, RowExclusiveLock);
> +
> +     /* Update the statistics of this pg_class entry with new toast index Oid */
> +     index_update_stats(pg_class, false, false, newIndexOid, -1.0);
> +
> +     /* Close parent relation */
> +     heap_close(pg_class, RowExclusiveLock);
> + }
>
> ISTM the RowExclusiveLock on the toast table should be acquired before
> the locks on the indexes.
>
Done.
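The reordering in v8 ends up along these lines (a condensed sketch, not the
literal patch code; variable declarations and error handling are elided):

    /*
     * Lock the parent toast relation before taking AccessExclusiveLock on
     * the two indexes being swapped, so the heavier index locks are only
     * acquired once the toast relation lock is already held.
     */
    if (get_rel_relkind(parentOid) == RELKIND_TOASTVALUE)
        toastRel = heap_open(parentOid, RowExclusiveLock);

    oldIndexRel = relation_open(oldIndexOid, AccessExclusiveLock);
    newIndexRel = relation_open(newIndexOid, AccessExclusiveLock);

    /* ... name swap and index_update_stats() on the toast relation ... */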

> +index_concurrent_set_dead(Oid indexId, Oid heapId, LOCKTAG locktag)
> +{
> +     Relation    heapRelation;
> +     Relation    indexRelation;
> +
> +     /*
> +      * Now we must wait until no running transaction could be using the
> +      * index for a query. To do this, inquire which xacts currently would
> +      * conflict with AccessExclusiveLock on the table -- ie, which ones
> +      * have a lock of any kind on the table. Then wait for each of these
> +      * xacts to commit or abort. Note we do not need to worry about xacts
> +      * that open the table for reading after this point; they will see the
> +      * index as invalid when they open the relation.
> +      *
> +      * Note: the reason we use actual lock acquisition here, rather than
> +      * just checking the ProcArray and sleeping, is that deadlock is
> +      * possible if one of the transactions in question is blocked trying
> +      * to acquire an exclusive lock on our table. The lock code will
> +      * detect deadlock and error out properly.
> +      *
> +      * Note: GetLockConflicts() never reports our own xid, hence we need
> +      * not check for that. Also, prepared xacts are not reported, which
> +      * is fine since they certainly aren't going to do anything more.
> +      */
> +     WaitForVirtualLocks(locktag, AccessExclusiveLock);
>
> Most of that comment seems to belong to WaitForVirtualLocks instead of
> this specific caller of WaitForVirtualLocks.
>
Done.

> A comment in the header that it is doing the waiting would also be good.
>
> In ReindexRelationsConcurrently I suggest s/bypassing/skipping/.
>
Done.

> Btw, seeing that we have an indisvalid check on the toast table's index, do
> we have any way to clean up such a dead index? I don't think it's allowed
> to drop the index of a toast table. I.e. we possibly need to relax that
> check for invalid indexes :/.
>
For the time being, no, I don't think so, except by doing a manual cleanup
and removing the invalid pg_class entry from the catalogs. One way to do that
cleanly could be to have autovacuum remove invalid toast indexes
automatically, but it is not dedicated to that task, and this is another
discussion.

> I think the usage of list_append_unique_oids in
> ReindexRelationsConcurrently might get too expensive in larger
> schemas. Its O(n^2) in the current usage and schemas with lots of
> relations/indexes aren't unlikely candidates for this feature.
> The easiest solution probably is to use a hashtable.
>
Hum... This requires some thinking that will change the basics inside
ReindexRelationsConcurrently...
Let me play a bit with the hashtable APIs and I'll come back to that later.
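Roughly, I am thinking of something like the sketch below, using the dynahash
API (hash_create/hash_search); the struct and function names here are only
illustrative, not necessarily what will land in the patch:

    /* Entry type: the OID itself is the hash key */
    typedef struct
    {
        Oid         relOid;
    } ReindexOidEntry;

    static HTAB *
    create_oid_table(void)
    {
        HASHCTL     ctl;

        MemSet(&ctl, 0, sizeof(ctl));
        ctl.keysize = sizeof(Oid);
        ctl.entrysize = sizeof(ReindexOidEntry);
        ctl.hash = oid_hash;

        return hash_create("ReindexRelationsConcurrently OIDs", 64, &ctl,
                           HASH_ELEM | HASH_FUNCTION);
    }

    /* Returns true if the OID was not already present */
    static bool
    add_unique_oid(HTAB *htab, Oid relOid)
    {
        bool        found;

        (void) hash_search(htab, (void *) &relOid, HASH_ENTER, &found);
        return !found;
    }

This keeps the de-duplication cost linear in the number of OIDs instead of
the O(n^2) behavior of repeated list_append_unique_oids calls.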

> ReindexRelationsConcurrently should do a CHECK_FOR_INTERRUPTS() every
> once in a while, it's currently not gracefully interruptible which
> probably is bad in a bigger schema.
>
Done. I added some checks at each phase before beginning a new transaction.
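Concretely, each phase loop now gets something along these lines (a sketch of
the intent; the exact placement in the patch may differ slightly):

    foreach(lc, indexIds)
    {
        /* Allow the whole operation to be cancelled between transactions */
        CHECK_FOR_INTERRUPTS();

        StartTransactionCommand();
        /* ... per-index work for this phase ... */
        CommitTransactionCommand();
    }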
--
Michael Paquier
http://michael.otacoo.com

Attachment Content-Type Size
20130125_reindex_concurrently_v8.patch application/octet-stream 81.1 KB
