Re: Race condition in b-tree page deletion

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Race condition in b-tree page deletion
Date: 2013-11-12 15:13:07
Message-ID: CA+TgmoYShtMdL-qikpYmHzC+94sb_oPRWGpCYfU-fBxnBmM=uw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Nov 11, 2013 at 3:03 AM, Heikki Linnakangas
<hlinnakangas(at)vmware(dot)com> wrote:
> On 10.11.2013 01:47, Robert Haas wrote:
>> I think we've tried pretty hard to avoid algorithms where the maximum
>> number of lwlocks that must be held at one time is not a constant, and
>> I think we're in for a bad time of it if we start to deviate from that
>> principal. I'm not sure what to do about this problem, but I think
>> locking N levels of the tree at once, where N can be as large as the
>> tree is deep, is probably a bad plan, whether the number of locks
>> required is N or 3N.
>
> I think I found a solution that accomplishes that. It's actually not that
> complicated:
>
> Like we currently do, first climb up the tree to check that it's safe to
> delete, ie. the downlink in the first non-empty parent is not the rightmost
> entry. But when we reach the level where the parent is non-empty - I'll call
> that the "topmost" parent - we keep that page locked. The leaf page is kept
> locked while we climb.
>
> This is enough to plug the race condition. As long as we hold a lock on the
> topmost parent containing the downlink to the branch of pages we're about to
> delete, it cannot become the rightmost entry. And as long as we hold a lock
> on the leaf page, no new insertions can happen on any of the internal pages
> in the branch, as insertions to internal pages only happen when a child is
> split. However, the rest of the algorithm needs to be slightly modified, as
> we cannot re-lock the lower-level pages until we release the lock on the
> topmost page, to avoid deadlock.
>
> So at this point, we hold two locks: the leaf page, and the topmost parent
> containing the downlink to the branch we're deleting. Next, we remove the
> downlink from the topmost parent, and mark the leaf page as half-dead in one
> atomic operation. Also, a pointer to the highest internal page in the branch
> we're deleting - the one the removed downlink pointed to - is put on the
> leaf page. We can now release the locks.
>
> At this point, searches and inserts work fine. The leaf page has been marked
> as half-dead, so any insertions to the deleted page's keyspace will go to
> its right sibling. The downlink is to the top of the branch is gone, so even
> if the right sibling is split many times, any keys in the transferred
> keyspace that propagate to the higher levels won't be out-of-order.
>
> All that is left is to unlink the all the lingering pages in the branch
> we're deleting from their left and right siblings. This can be done at any
> later time, and if we error out or crash for some reason, next vacuum that
> comes along can finish the job. This is done one level at a time. Lock the
> leaf page, and the internal page the leaf page points to, and the internal
> page's left and right siblings (in the right order, not this order). Change
> the left and right sibling's right- and left-links, mark the internal page
> as deleted, and update the pointer in the leaf page to point to the child of
> the deleted internal page. Then recurse to the child, until we reach the
> leaf level.
>
> This has the nice extra property that we don't need the incomplete-action
> tracking in WAL recovery. I'd like to get rid of that anyway.

I am not very knowledgeable of this code, but at least with my limited
knowledge I can't spot any problems with that approach. The only
thing that seems a little unfortunate is that the next vacuum may need
to finish cleaning things up; I thought at one point you were trying
to get rid of the amvacuumcleanup stuff. But maybe I'm
misremembering, and anyway it still seems like a step forward.

> I'm not sure what to do about stable branches. This could be back-patched,
> with the caveat that this introduces new WAL record types so standbys need
> to be upgraded before the master. But given the lack of field reports in the
> ten years this race condition has existed, I'm not sure it's worth the
> hassle.

In the absence of at least one (and maybe several) reports from the
field, -1 for back-patching this. At this point, it seems like far
more people will be confused by the need to upgrade things in order
than will benefit from the fix.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Bruce Momjian 2013-11-12 15:14:27 Re: MVCC snapshot timing
Previous Message Robert Haas 2013-11-12 15:04:11 Re: Relax table alias conflict rule in 9.3?