Race condition within _bt_findinsertloc()? (new page split code)

From: Peter Geoghegan <pg(at)heroku(dot)com>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Race condition within _bt_findinsertloc()? (new page split code)
Date: 2014-05-27 06:17:59
Message-ID: CAM3SWZR77w4Fq26H0t=3n4d=142MWgHrH0EL4+1GpTFZb3HXAg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

While speccing out a new B-Tree verification tool, I had the
opportunity to revisit a thought I had during review concerning
_bt_findinsertloc(): that the new coding is unsafe in the event of
deferred split completion of a leaf page of a unique index. To recap,
we now do the following in _bt_findinsertloc():

/*
* If this page was incompletely split, finish the split now. We
* do this while holding a lock on the left sibling, which is not
* good because finishing the split could be a fairly lengthy
* operation. But this should happen very seldom.
*/
if (P_INCOMPLETE_SPLIT(lpageop))
{
_bt_finish_split(rel, rbuf, stack);
rbuf = InvalidBuffer;
continue;
}

The "left sibling" referred to here is "the first page this key could
be on", an important concept for unique index enforcement. It's the
first sibling iff we're on our first iteration of the nested for(;;)
loop in _bt_findinsertloc(). So the buffer lock held on this left
sibling may constitute what in the past I've called a "value lock";
we've established the right to insert our value into the unique index
at this point, and the lock will only be released when we're done
(regardless of whether or not that buffer/page value lock is on the
buffer/page we'll ultimately insert into, or an earlier one).

Anyway, the concern is that there may be problems when we call
_bt_finish_split() with that left sibling locked thoughout (i.e.
finish a split where the right sibling is BTP_INCOMPLETE_SPLIT, and
itself has a right sibling from the incomplete split (which is usually
the value lock page's right-right sibling)). I'm not concerned about
performance, since as the comment says it ought to be an infrequent
occurrence. I also believe that there are no deadlock hazards. But
consider this scenario:

* We insert the value 7 into an int4 unique index. We need to split
the leaf page. We run out of memory or something, and ours is an
incomplete page split. Our transaction is aborted. For the sake of
argument, suppose that there are also already a bunch of dead tuples
within the index with values of 7, 8 and 9.

* Another inserter of the value 7 comes along. It follows exactly the
same downlink as the first, now aborted inserter (suppose the
downlink's value is 9). It also locks the same leaf page to establish
a "value lock" in precisely the same manner. Finding no room on the
first page, it looks further right, maintaining its original "value
lock" throughout. It finishes the first inserter's split of the
non-value-lock page - a new downlink is inserted into the parent page,
with the value 8. It then releases all buffer locks except the first
one/original "value lock". A physical insertion has yet to occur.

* A third inserter of the value comes along. It gets to a later page
than the one locked by the second inserter, preferring the newer
downlink with value 8 (the internal-page _bt_binsrch() logic ensures
this). It exclusive locks that later page/buffer before the second guy
gets a chance to lock it once again. It establishes the right to
insert with _bt_check_unique(), undeterred by the second inserter's
buffer lock/"value lock". The value lock is effectively skipped over.

* Both the second and third inserters have "established the right" to
insert the same value, 7, and both do so. The unique index has an
MVCC-snapshot-wise spurious duplicate, and so is corrupt.

Regardless of whether or not I have these details exactly right (that
is, regardless of whether or not this scenario is strictly possible) I
suggest as a code-hardening measure that _bt_findinsertloc() release
its "value lock", upon realizing it must complete splits, and then
complete the split or splits known to be required. It would finally
report that it "couldn't find" an insertion location to
_bt_doinsert(), which would then retry from the start, just like when
_bt_check_unique() finds an inconclusive conflict. The only difference
is that we don't have an xact to wait on. We haven't actually done
anything extra that makes this later "goto top;" any different to the
existing one.

This should occur so infrequently that it isn't worth trying harder,
or worth differentiating between the UNIQUE_CHECK_NO and
!UNIQUE_CHECK_NO cases when retrying. This also removes the more
general risk of holding an extra buffer lock during page split
completion.

It kind of looks like _bt_findinsertloc() doesn't have this bug,
because in my scenario _bt_finish_split() is called with both the
value lock and its right page locked (so the right page is the left
page for _bt_finish_split()'s purposes). But when you take another
look, and realize that _bt_finish_split() releases its locks, and so
once again only the original value lock will be held when
_bt_finish_split() returns, and so the downlink is there to skip the
value locked page, you realize that the bug does exist (assuming that
I haven't failed to consider some third factor and am not otherwise
mistaken).

Thoughts?
--
Peter Geoghegan

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Simon Riggs 2014-05-27 06:57:36 Re: Spreading full-page writes
Previous Message Amit Kapila 2014-05-27 05:05:04 Re: postgresql.auto.conf read from wrong directory