Race condition in b-tree page deletion

From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Race condition in b-tree page deletion
Date: 2013-11-09 16:02:02
Message-ID: 527E5C7A.3090902@vmware.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

The B-tree page deletion algorithm has a race condition. We don't allow
a page to be deleted if it's the rightmost child of its parent, but that
situation can change after we check for it.

Problem
-------

We check that the page to be deleted is not the rightmost child of its
parent, and then lock its left sibling, the page itself, its right
sibling, and the parent, in that order. However, if the parent page is
split after the check but before acquiring the locks, the target page
might become the rightmost child, if the split happens at the right
place. That leads to an error in vacuum (I reproduced this by setting a
breakpoint in debugger):

ERROR: failed to delete rightmost child 41 of block 3 in index "foo_pkey"

We currently re-check that the page is still the rightmost child, and
throw the above error if it's not. We could easily just give up rather
than throw an error, but that approach doesn't scale to half-dead pages.
To recap, although we don't normally allow deleting the rightmost child,
if the page is the *only* child of its parent, we delete the child page
and mark the parent page as half-dead in one atomic operation. But
before we do that, we check that the parent can later be deleted, by
checking that it in turn is not the rightmost child of the grandparent
(potentially recursing all the way up to the root). But the same
situation can arise there - the grandparent can be split while we're not
holding the locks. We end up with a half-dead page that we cannot delete.

To make things worse, the keyspace of the deleted page has already been
transferred to its right sibling. As the README points out, the keyspace
at the grandparent level is "out-of-whack" until the half-dead page is
deleted, and if enough tuples with keys in the transferred keyspace are
inserted, the page might get split and a downlink might be inserted into
the grandparent that is out-of-order. That might not cause any serious
problem if it's transient (as the README ponders), but is surely bad if
it stays that way.

Solutions
---------

1. The simplest solution I can see is to never delete internal pages,
avoiding the whole concept of half-dead pages. That obviously leads to
some bloat, though.

2. The second-simplest solution I see is to keep locked the whole chain
of pages that will be deleted, and delete all of them as one atomic
WAL-logged operation. Ie. the leaf page, and all the parent pages above
it that will become half-dead, and the parent of the last half-dead page.

3. Another approach would be to get rid of the "can't delete rightmost
child" limitation. We currently have that limitation because it ensures
that we never need to change the high-key of a page. If we delete a page
that is the rightmost child of its parent, we transfer the deleted
keyspace from the parent page to its right sibling. To do that, we need
to update the high key of the parent, as well as the downlink of the
right sibling at the grandparent level. That's a bit complicated,
because updating the high key might require splitting the page.

Still, I think it would be doable: When we delete a page that is the
rightmost child of its parent, we mark the right sibling with an
"out-of-whack" flag. It indicates that the keyspace of that page is not
in sync in the parent level. Searches work fine, as there are no keys in
the keyspace that is out-of-whack, but before allowing any insertions to
the page, the situation needs to be fixed. That is the responsibility of
the next insert to the page that comes along. To fix it, the parent's
left sibling's high key needs to be updated, as well as the parent's
downlink in the grandparent. We can do that in a few discrete steps;
first update the high key, then update the downlink, and finally once
the high key and parent's downlink are in sync with the reality at the
bottom level, clear the "out-of-whack" flag.

On reflection, approach 2 seems best. It's fairly simple, although it
potentially requires holding many pages locked simultaneously. In
practice, B-trees are typically not very deep. We have a limit of 4
full-page images in a single WAL record, but since all the pages are
empty, we can just WAL-log all the information needed to reconstruct the
pages in deleted state without using full-page images. This also might
be back-patchable, although it requires changing the WAL record format,
so it would break replication from higher minor version to lower.

- Heikki

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2013-11-09 16:24:46 Re: Race condition in b-tree page deletion
Previous Message Craig Ringer 2013-11-09 15:01:33 Re: Row-security writer-side checks proposal