Race condition in b-tree page deletion

From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Race condition in b-tree page deletion
Date: 2013-11-09 16:02:02
Message-ID: 527E5C7A.3090902@vmware.com
Lists: pgsql-hackers

The B-tree page deletion algorithm has a race condition. We don't allow
a page to be deleted if it's the rightmost child of its parent, but that
situation can change after we check for it.

Problem
-------

We check that the page to be deleted is not the rightmost child of its
parent, and then lock its left sibling, the page itself, its right
sibling, and the parent, in that order. However, if the parent page is
split after the check but before acquiring the locks, the target page
might become the rightmost child, if the split happens at the right
place. That leads to an error in vacuum (I reproduced this by setting a
breakpoint in a debugger):

ERROR: failed to delete rightmost child 41 of block 3 in index "foo_pkey"
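
Spelled out, the interleaving looks like this:

    VACUUM                              another backend
    ------                              ---------------
    check: target page is not the
      rightmost child of its parent
                                        split the parent, such that the
                                        target's downlink becomes the
                                        last one on its page
    lock left sibling, target page,
      right sibling, parent
    re-check: target page is now the
      rightmost child => ERROR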

We currently re-check that the page is still not the rightmost child,
and throw the above error if it is. We could easily just give up rather
than throw an error, but that approach doesn't scale to half-dead pages.
To recap, although we don't normally allow deleting the rightmost child,
if the page is the *only* child of its parent, we delete the child page
and mark the parent page as half-dead in one atomic operation. But
before we do that, we check that the parent can later be deleted, by
checking that it in turn is not the rightmost child of the grandparent
(potentially recursing all the way up to the root). But the same
situation can arise there - the grandparent can be split while we're not
holding the locks. We end up with a half-dead page that we cannot delete.

To make things worse, the keyspace of the deleted page has already been
transferred to its right sibling. As the README points out, the keyspace
at the grandparent level is "out-of-whack" until the half-dead page is
deleted, and if enough tuples with keys in the transferred keyspace are
inserted, the page might get split and a downlink might be inserted into
the grandparent that is out-of-order. That might not cause any serious
problem if it's transient (as the README ponders), but is surely bad if
it stays that way.

Solutions
---------

1. The simplest solution I can see is to never delete internal pages,
avoiding the whole concept of half-dead pages. That obviously leads to
some bloat, though.

2. The second-simplest solution I see is to keep locked the whole chain
of pages that will be deleted, and delete all of them as one atomic
WAL-logged operation. Ie. the leaf page, and all the parent pages above
it that will become half-dead, and the parent of the last half-dead page.

3. Another approach would be to get rid of the "can't delete rightmost
child" limitation. We currently have that limitation because it ensures
that we never need to change the high-key of a page. If we delete a page
that is the rightmost child of its parent, we transfer the deleted
keyspace from the parent page to its right sibling. To do that, we need
to update the high key of the parent, as well as the downlink of the
right sibling at the grandparent level. That's a bit complicated,
because updating the high key might require splitting the page.

Still, I think it would be doable: When we delete a page that is the
rightmost child of its parent, we mark the right sibling with an
"out-of-whack" flag. It indicates that the keyspace of that page is not
in sync in the parent level. Searches work fine, as there are no keys in
the keyspace that is out-of-whack, but before allowing any insertions to
the page, the situation needs to be fixed. That is the responsibility of
the next insert to the page that comes along. To fix it, the parent's
left sibling's high key needs to be updated, as well as the parent's
downlink in the grandparent. We can do that in a few discrete steps;
first update the high key, then update the downlink, and finally once
the high key and parent's downlink are in sync with the reality at the
bottom level, clear the "out-of-whack" flag.
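
In rough pseudocode, the insertion-time fix-up would be something like
this (hypothetical names; nothing like this exists in nbtree today):

    /* before inserting anything into a flagged page: */
    if (is_out_of_whack(page))
    {
        update_high_key(leftsib(parent(page)));  /* may split that page */
        update_downlink(grandparent(page), parent(page));
        clear_out_of_whack(page);   /* levels are back in sync */
    }
    /* only now may the insertion itself proceed */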

On reflection, approach 2 seems best. It's fairly simple, although it
potentially requires holding many pages locked simultaneously. In
practice, B-trees are typically not very deep. We have a limit of 4
full-page images in a single WAL record, but since all the pages are
empty, we can just WAL-log all the information needed to reconstruct the
pages in deleted state without using full-page images. This also might
be back-patchable, although it requires changing the WAL record format,
so it would break replication from a higher minor version to a lower one.
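
For illustration, such a record could be as compact as this (a
hypothetical layout, not a concrete proposal):

    typedef struct xl_btree_delete_branch   /* hypothetical */
    {
        BlockNumber leafblk;    /* the empty leaf page */
        BlockNumber topparent;  /* internal page losing the downlink */
        uint16      nbranch;    /* number of half-dead internal pages */
        /* followed by 'nbranch' block numbers; since the pages are
         * empty, this is enough to reconstruct them in deleted state */
    } xl_btree_delete_branch;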

- Heikki


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Race condition in b-tree page deletion
Date: 2013-11-09 16:24:46
Message-ID: 32194.1384014286@sss.pgh.pa.us
Lists: pgsql-hackers

Heikki Linnakangas <hlinnakangas(at)vmware(dot)com> writes:
> 2. The second-simplest solution I see is to keep locked the whole chain
> of pages that will be deleted, and delete all of them as one atomic
> WAL-logged operation. Ie. the leaf page, and all the parent pages above
> it that will become half-dead, and the parent of the last half-dead page.

This would be more tenable if we could put a known limit on the number of
pages to be changed at once. I'm not too awake at the moment, so maybe
this is not possible, but could we simply decide in advance that we won't
propagate page deletion up more than a-small-number of levels? It'd
require allowing a page to remain half-dead until some other vacuum comes
along and fixes it, but that seems OK.

regards, tom lane


From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Race condition in b-tree page deletion
Date: 2013-11-09 16:49:00
Message-ID: 527E677C.8040404@vmware.com
Lists: pgsql-hackers

On 09.11.2013 18:24, Tom Lane wrote:
> Heikki Linnakangas <hlinnakangas(at)vmware(dot)com> writes:
>> 2. The second-simplest solution I see is to keep locked the whole chain
>> of pages that will be deleted, and delete all of them as one atomic
>> WAL-logged operation. Ie. the leaf page, and all the parent pages above
>> it that will become half-dead, and the parent of the last half-dead page.
>
> This would be more tenable if we could put a known limit on the number of
> pages to be changed at once. I'm not too awake at the moment, so maybe
> this is not possible, but could we simply decide in advance that we won't
> propagate page deletion up more than a-small-number of levels? It'd
> require allowing a page to remain half-dead until some other vacuum comes
> along and fixes it, but that seems OK.

I don't feel comfortable leaving pages in a half-dead state. Looking back
at the archives, your original design ten years ago did exactly that
(http://www.postgresql.org/message-id/8281.1045089764@sss.pgh.pa.us),
but that turned out to be a bad idea
(http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=70ce5c908202ada7616f7afded8a91bbf2742471).
Mind you, even if we check that the half-dead page at some upper level
is eventually deletable because it's not the rightmost child, it might
become the rightmost child while we're not holding the lock, so that the
next vacuum cannot delete it, and we're back to square one.

We could just punt if more than X pages would need to be changed. That
would mean that we never delete pages at the top (h - X) levels of the
tree. In practice that should be fine if X is high enough.
As a data point, GIN list page deletion holds 16 pages locked at once
(GIN_NDELETE_AT_ONCE). Normally, a 16-level deep B-tree is pretty huge.
As another data point, in the worst case the keys are so wide that only
2 keys fit on each B-tree page. With that assumption, the tree can be at
most 32 levels deep if you just insert into it with no deletions, since
MaxBlockNumber ~= 2^32 (I may be off by one in either direction, not
sure). Deletions make it more complicated, but I would be pretty
surprised if you can construct a B-tree taller than, say, 40 levels.
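
To spell out the arithmetic, here's a trivial standalone check (my
approximation: minimum fanout of 2, MaxBlockNumber ~= 2^32):

    #include <stdio.h>

    int
    main(void)
    {
        unsigned long long pages = 1;
        int         depth = 0;

        while (pages < (1ULL << 32))    /* ~ MaxBlockNumber */
        {
            pages *= 2;                 /* worst case: 2 downlinks/page */
            depth++;
        }
        printf("max depth at fanout 2: %d levels\n", depth);    /* 32 */
        return 0;
    }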

Things get tricky if shared_buffers is very small; with
shared_buffers=16, you certainly can't hold more than 16 buffers pinned
at once. (In fact, I think ginfast.c already has a problem with that...)

- Heikki


From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Race condition in b-tree page deletion
Date: 2013-11-09 17:18:10
Message-ID: 527E6E52.5060402@vmware.com
Lists: pgsql-hackers

On 09.11.2013 18:49, Heikki Linnakangas wrote:
> We could just punt if more than X pages would need to be changed. That
> would mean that we never delete pages at the top (h - X) levels of the
> tree. In practice that should be fine if X is high enough.
> As a data point, GIN list page deletion holds 16 pages locked at once
> (GIN_NDELETE_AT_ONCE). Normally, a 16-level deep B-tree is pretty huge.
> As another data point, in the worst case the keys are so wide that only
> 2 keys fit on each B-tree page. With that assumption, the tree can be at
> most 32 levels deep if you just insert into it with no deletions, since
> MaxBlockNumber ~= 2^32 (I may be off by one in either direction, not
> sure). Deletions make it more complicated, but I would be pretty
> surprised if you can construct a B-tree taller than, say, 40 levels.

On further thought, it's worse than that. To delete a page, you need to
lock the left and right siblings, so you need 3 pages locked per
level you delete...

- Heikki


From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Race condition in b-tree page deletion
Date: 2013-11-09 17:40:29
Message-ID: 527E738D.20005@vmware.com
Lists: pgsql-hackers

On 09.11.2013 19:18, Heikki Linnakangas wrote:
> On 09.11.2013 18:49, Heikki Linnakangas wrote:
>> We could just punt if more than X pages would need to be changed. That
>> would mean that we never delete pages at the top (h - X) levels of the
>> tree. In practice that should be fine if X is high enough.
>> As a data point, GIN list page deletion holds 16 pages locked at once
>> (GIN_NDELETE_AT_ONCE). Normally, a 16-level deep B-tree is pretty huge.
>> As another data point, in the worst case the keys are so wide that only
>> 2 keys fit on each B-tree page. With that assumption, the tree can be at
>> most 32 levels deep if you just insert into it with no deletions, since
>> MaxBlockNumber ~= 2^32 (I may be off by one in either direction, not
>> sure). Deletions make it more complicated, but I would be pretty
>> surprised if you can construct a B-tree taller than, say, 40 levels.
>
> On further thought, it's worse than that. To delete a page, you need to
> lock the left and right siblings, so you need 3 pages locked per
> level you delete...

On further further thought, we don't need to unlink the pages
immediately. It's enough to mark them as half-dead and remove the
downlink to the upmost half-dead page. Marking a page as half-dead is as
good as deleting it outright as far as searches and insertions are
concerned. Actually unlinking the pages from the left and right siblings
can be done at any later time, and doesn't need to be done in any
particular order.

So, my original musings about the number of concurrent locks needed
still hold.

- Heikki


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Race condition in b-tree page deletion
Date: 2013-11-09 23:47:33
Message-ID: CA+TgmoZQRJyhwMc_y-2rugbOL7LpTp=3+yW2KaXvTkOJ653C1w@mail.gmail.com
Lists: pgsql-hackers

On Sat, Nov 9, 2013 at 12:40 PM, Heikki Linnakangas
<hlinnakangas(at)vmware(dot)com> wrote:
> On 09.11.2013 19:18, Heikki Linnakangas wrote:
>> On 09.11.2013 18:49, Heikki Linnakangas wrote:
>>> We could just punt if more than X pages would need to be changed. That
>>> would mean that we never delete pages at the top (h - X) levels of the
>>> tree. In practice that should be fine if X is high enough.
>>> As a data point, GIN list page deletion holds 16 pages locked at once
>>> (GIN_NDELETE_AT_ONCE). Normally, a 16-level deep B-tree is pretty huge.
>>> As another data point, in the worst case the keys are so wide that only
>>> 2 keys fit on each B-tree page. With that assumption, the tree can be at
>>> most 32 levels deep if you just insert into it with no deletions, since
>>> MaxBlockNumber ~= 2^32 (I may be off by one in either direction, not
>>> sure). Deletions make it more complicated, but I would be pretty
>>> surprised if you can construct a B-tree taller than, say, 40 levels.
>>
>> On further thought, it's worse than that. To delete a page, you need to
>> lock the left and right siblings, so you need 3 pages locked per
>> level you delete...
>
> On further further thought, we don't need to unlink the pages immediately.
> It's enough to mark them as half-dead and remove the downlink to the upmost
> half-dead page. Marking a page as half-dead is as good as deleting it
> outright as far as searches and insertions are concerned. Actually unlinking
> the pages from the left and right siblings can be done at any later time,
> and doesn't need to be done in any particular order.
>
> So, my original musings about the number of concurrent locks needed still
> hold.

I think we've tried pretty hard to avoid algorithms where the maximum
number of lwlocks that must be held at one time is not a constant, and
I think we're in for a bad time of it if we start to deviate from that
principle. I'm not sure what to do about this problem, but I think
locking N levels of the tree at once, where N can be as large as the
tree is deep, is probably a bad plan, whether the number of locks
required is N or 3N.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Race condition in b-tree page deletion
Date: 2013-11-11 08:03:00
Message-ID: 52808F34.6020000@vmware.com
Lists: pgsql-hackers

On 10.11.2013 01:47, Robert Haas wrote:
> I think we've tried pretty hard to avoid algorithms where the maximum
> number of lwlocks that must be held at one time is not a constant, and
> I think we're in for a bad time of it if we start to deviate from that
> principle. I'm not sure what to do about this problem, but I think
> locking N levels of the tree at once, where N can be as large as the
> tree is deep, is probably a bad plan, whether the number of locks
> required is N or 3N.

I think I found a solution that accomplishes that. It's actually not
that complicated:

Like we currently do, first climb up the tree to check that it's safe to
delete, ie. the downlink in the first non-empty parent is not the
rightmost entry. But when we reach the level where the parent is
non-empty - I'll call that the "topmost" parent - we keep that page
locked. The leaf page is kept locked while we climb.

This is enough to plug the race condition. As long as we hold a lock on
the topmost parent containing the downlink to the branch of pages we're
about to delete, it cannot become the rightmost entry. And as long as we
hold a lock on the leaf page, no new insertions can happen on any of the
internal pages in the branch, as insertions to internal pages only
happen when a child is split. However, the rest of the algorithm needs
to be slightly modified, as we cannot re-lock the lower-level pages
until we release the lock on the topmost page, to avoid deadlock.

So at this point, we hold two locks: the leaf page, and the topmost
parent containing the downlink to the branch we're deleting. Next, we
remove the downlink from the topmost parent, and mark the leaf page as
half-dead in one atomic operation. Also, a pointer to the highest
internal page in the branch we're deleting - the one the removed
downlink pointed to - is put on the leaf page. We can now release the locks.
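
In pseudocode, this first phase looks roughly like this (the helper
names are made up for illustration, locking details elided):

    /* Phase 1: detach the branch. At most two pages locked at a time. */
    lock(leaf);                         /* stays locked while we climb */
    top = first_nonempty_parent(leaf);  /* returned locked */
    if (downlink_is_rightmost(top, branch_top))
        give_up();                      /* cannot delete this branch yet */
    START_CRIT_SECTION();
    remove_downlink(top, branch_top);   /* branch_top = highest internal
                                         * page in the branch */
    mark_half_dead(leaf, branch_top);   /* leaf remembers the branch top */
    /* ...write one WAL record covering both changes... */
    END_CRIT_SECTION();
    unlock(top);
    unlock(leaf);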

At this point, searches and inserts work fine. The leaf page has been
marked as half-dead, so any insertions to the deleted page's keyspace
will go to its right sibling. The downlink to the top of the branch is
gone, so even if the right sibling is split many times, any keys in
the transferred keyspace that propagate to the higher levels won't be
out-of-order.

All that is left is to unlink all the lingering pages in the branch
we're deleting from their left and right siblings. This can be done at
any later time, and if we error out or crash for some reason, the next
vacuum that comes along can finish the job. This is done one level at a
time. Lock the leaf page, and the internal page the leaf page points to,
and the internal page's left and right siblings (in the right order, not
this order). Change the left and right sibling's right- and left-links,
mark the internal page as deleted, and update the pointer in the leaf
page to point to the child of the deleted internal page. Then recurse to
the child, until we reach the leaf level.
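
In the same made-up pseudocode, the unlink phase would be:

    /* Phase 2: unlink the branch one level at a time, top to bottom.
     * Never more than four pages locked simultaneously. */
    while (is_half_dead(leaf))
    {
        target = branch_top_of(leaf);   /* pointer stored in phase 1 */
        /* lock leaf, target and target's siblings, in a safe order */
        lock_for_unlink(leftsib(target), target, rightsib(target), leaf);
        set_rightlink(leftsib(target), rightlink(target));
        set_leftlink(rightsib(target), leftlink(target));
        mark_deleted(target);
        if (target == leaf)
            break;                      /* whole branch is now unlinked */
        set_branch_top(leaf, child_of(target));     /* step down a level */
        unlock_all();
    }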

This has the nice extra property that we don't need the
incomplete-action tracking in WAL recovery. I'd like to get rid of that
anyway.

I'm not sure what to do about stable branches. This could be
back-patched, with the caveat that this introduces new WAL record types
so standbys need to be upgraded before the master. But given the lack of
field reports in the ten years this race condition has existed, I'm not
sure it's worth the hassle.

- Heikki


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Race condition in b-tree page deletion
Date: 2013-11-12 15:13:07
Message-ID: CA+TgmoYShtMdL-qikpYmHzC+94sb_oPRWGpCYfU-fBxnBmM=uw@mail.gmail.com
Lists: pgsql-hackers

On Mon, Nov 11, 2013 at 3:03 AM, Heikki Linnakangas
<hlinnakangas(at)vmware(dot)com> wrote:
> On 10.11.2013 01:47, Robert Haas wrote:
>> I think we've tried pretty hard to avoid algorithms where the maximum
>> number of lwlocks that must be held at one time is not a constant, and
>> I think we're in for a bad time of it if we start to deviate from that
>> principle. I'm not sure what to do about this problem, but I think
>> locking N levels of the tree at once, where N can be as large as the
>> tree is deep, is probably a bad plan, whether the number of locks
>> required is N or 3N.
>
> I think I found a solution that accomplishes that. It's actually not that
> complicated:
>
> Like we currently do, first climb up the tree to check that it's safe to
> delete, ie. the downlink in the first non-empty parent is not the rightmost
> entry. But when we reach the level where the parent is non-empty - I'll call
> that the "topmost" parent - we keep that page locked. The leaf page is kept
> locked while we climb.
>
> This is enough to plug the race condition. As long as we hold a lock on the
> topmost parent containing the downlink to the branch of pages we're about to
> delete, it cannot become the rightmost entry. And as long as we hold a lock
> on the leaf page, no new insertions can happen on any of the internal pages
> in the branch, as insertions to internal pages only happen when a child is
> split. However, the rest of the algorithm needs to be slightly modified, as
> we cannot re-lock the lower-level pages until we release the lock on the
> topmost page, to avoid deadlock.
>
> So at this point, we hold two locks: the leaf page, and the topmost parent
> containing the downlink to the branch we're deleting. Next, we remove the
> downlink from the topmost parent, and mark the leaf page as half-dead in one
> atomic operation. Also, a pointer to the highest internal page in the branch
> we're deleting - the one the removed downlink pointed to - is put on the
> leaf page. We can now release the locks.
>
> At this point, searches and inserts work fine. The leaf page has been marked
> as half-dead, so any insertions to the deleted page's keyspace will go to
> its right sibling. The downlink to the top of the branch is gone, so even
> if the right sibling is split many times, any keys in the transferred
> keyspace that propagate to the higher levels won't be out-of-order.
>
> All that is left is to unlink all the lingering pages in the branch
> we're deleting from their left and right siblings. This can be done at any
> later time, and if we error out or crash for some reason, the next vacuum
> that comes along can finish the job. This is done one level at a time. Lock the
> leaf page, and the internal page the leaf page points to, and the internal
> page's left and right siblings (in the right order, not this order). Change
> the left and right sibling's right- and left-links, mark the internal page
> as deleted, and update the pointer in the leaf page to point to the child of
> the deleted internal page. Then recurse to the child, until we reach the
> leaf level.
>
> This has the nice extra property that we don't need the incomplete-action
> tracking in WAL recovery. I'd like to get rid of that anyway.

I am not very knowledgeable of this code, but at least with my limited
knowledge I can't spot any problems with that approach. The only
thing that seems a little unfortunate is that the next vacuum may need
to finish cleaning things up; I thought at one point you were trying
to get rid of the amvacuumcleanup stuff. But maybe I'm
misremembering, and anyway it still seems like a step forward.

> I'm not sure what to do about stable branches. This could be back-patched,
> with the caveat that this introduces new WAL record types so standbys need
> to be upgraded before the master. But given the lack of field reports in the
> ten years this race condition has existed, I'm not sure it's worth the
> hassle.

In the absence of at least one (and maybe several) reports from the
field, -1 for back-patching this. At this point, it seems like far
more people will be confused by the need to upgrade things in a
particular order than will benefit from the fix.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Race condition in b-tree page deletion
Date: 2013-11-13 16:04:12
Message-ID: 5283A2FC.6040606@vmware.com
Lists: pgsql-hackers

On 12.11.2013 17:13, Robert Haas wrote:
> On Mon, Nov 11, 2013 at 3:03 AM, Heikki Linnakangas
> <hlinnakangas(at)vmware(dot)com> wrote:
>> All that is left is to unlink all the lingering pages in the branch
>> we're deleting from their left and right siblings. This can be done at any
>> later time, and if we error out or crash for some reason, the next vacuum
>> that comes along can finish the job. This is done one level at a time. Lock the
>> leaf page, and the internal page the leaf page points to, and the internal
>> page's left and right siblings (in the right order, not this order). Change
>> the left and right sibling's right- and left-links, mark the internal page
>> as deleted, and update the pointer in the leaf page to point to the child of
>> the deleted internal page. Then recurse to the child, until we reach the
>> leaf level.
>>
>> This has the nice extra property that we don't need the incomplete-action
>> tracking in WAL recovery. I'd like to get rid of that anyway.
>
> I am not very knowledgeable of this code, but at least with my limited
> knowledge I can't spot any problems with that approach. The only
> thing that seems a little unfortunate is that the next vacuum may need
> to finish cleaning things up; I thought at one point you were trying
> to get rid of the amvacuumcleanup stuff. But maybe I'm
> misremembering, and anyway it still seems like a step forward.

Ah no, it only requires the next vacuum to clean up, if the first vacuum
is interrupted for some reason. Normally, all of the steps are performed
one after another in the same vacuum process, and half-dead pages will
only be seen by backends that look at the affected pages at the same
instant. But in case of an error or crash, the next vacuum will pick up
where the previous one left off.

Here's a patch implementing this scheme. One noteworthy detail if you're
familiar with the old scheme: in this patch, only leaf pages are marked
as half-dead. Never internal pages. This is the opposite of what we do
currently; currently only internal pages can be marked half-dead, never
leaf pages. This also gives us a chance to tell apart half-dead pages
lingering after pg_upgrade from an older version, and pages marked as
half-dead with the new code. The incomplete-action mechanism should have
ensured that the former don't exist, but in case something went wrong
some time before the upgrade and there is such a page in the tree after
all, the patch gives a LOG message about it, advising the user to reindex.
Searches will work fine in any case, but insertions to unfortunate pages
might cause page splits and out-of-order keys in the upper levels, if
the pre-upgrade half-dead pages are not cleaned up by reindexing.

- Heikki

Attachment: fix-btree-page-deletion-2.patch (text/x-diff, 69.7 KB)

From: Peter Eisentraut <peter_e(at)gmx(dot)net>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Race condition in b-tree page deletion
Date: 2013-11-15 14:30:51
Message-ID: 5286301B.7010004@gmx.net
Lists: pgsql-hackers

On 11/13/13, 11:04 AM, Heikki Linnakangas wrote:
> Here's a patch implementing this scheme.

Compiler warnings:

nbtpage.c: In function ‘_bt_pagedel’:
nbtpage.c:1695:24: warning: ‘metad’ may be used uninitialized in this function [-Wmaybe-uninitialized]
nbtpage.c:1414:18: note: ‘metad’ was declared here
nbtpage.c:1730:117: warning: ‘metapg’ may be used uninitialized in this function [-Wmaybe-uninitialized]
nbtpage.c:1413:7: note: ‘metapg’ was declared here


From: Kevin Grittner <kgrittn(at)ymail(dot)com>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Race condition in b-tree page deletion
Date: 2013-12-17 01:57:51
Message-ID: 1387245471.22433.YahooMailNeo@web162901.mail.bf1.yahoo.com
Lists: pgsql-hackers

Heikki Linnakangas <hlinnakangas(at)vmware(dot)com> wrote:

> Here's a patch implementing this scheme.

I've read through the papers on the btree technique, read the
thread, and have been reviewing the patch.  It will take me another
day or two to complete a close review and testing, but I wanted to
give preliminary results, and just let people know that I am
working on it.

After reading through the papers and the README for btree, I have
to say that the approach to deletion seemed like the weak link.
Searches, scans, and insertions all were pretty cohesive.  The
logic was all based on the fact that missing *upper* levels was OK
because the sibling pointers would be automatically used to search
until the correct page was found at each level.  The old deletion
approach turned this on its head by having missing pages at the
bottom, which introduces whole new classes of problems which
otherwise don't exist.  This patch makes interim states in the
deletion process look almost exactly the same as interim states in
the insertion process, which I think is A Very Good Thing.

The approach to limiting the number of locks to a finite (and
small) number is clever, but seems quite sound.  The locks are
always taken in an order which cannot cause a deadlock.

The write-up lacks any explicit mention of what happens if the
branch being considered for deletion reaches all the way to the
root.  Although an astute reader will notice that since the root page
will always be the only page at its level, it must always be the
rightmost page at its level, and therefore is correctly handled by
the logic dealing with that, I think it merits an explicit mention
in the README.

I agree with others that this patch is not suitable for
back-patching.  I wonder whether some other solution might be worth
back-patching, since it is clearly a bug.  The "never delete internal
pages" approach might be safe enough to consider.  It would lead to
some degree of bloat, but perhaps bloat is better than errors.  Of
course, the argument can certainly be made that people hit the bug
so rarely that we're better off just leaving it alone in stable
branches.  Anyway, that would be a different patch.

More on *this* patch in a day or two.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Jim Nasby <jim(at)nasby(dot)net>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Race condition in b-tree page deletion
Date: 2013-12-17 02:55:08
Message-ID: 52AFBD0C.5050208@nasby.net
Lists: pgsql-hackers

On 11/9/13, 10:02 AM, Heikki Linnakangas wrote:
> 3. Another approach would be to get rid of the "can't delete rightmost child" limitation. We currently have that limitation because it ensures that we never need to change the high-key of a page. If we delete a page that is the rightmost child of its parent, we transfer the deleted keyspace from the parent page to its right sibling. To do that, we need to update the high key of the parent, as well as the downlink of the right sibling at the grandparent level. That's a bit complicated, because updating the high key might require splitting the page.

Is the rightmost child issue likely to affect indexes on increasing values, like a queue table? Because we get significant bloat on our indexes, especially ones that are on things like timestamps in a queue table (FIFO queue, not a LIFO stack):

INFO: index "page_hits_pkey" now contains 6083 row versions in 9363 pages

INFO: vacuuming "cnu_stats.page_hits_queue"
INFO: scanned index "page_hits_pkey" to remove 4329 row versions
DETAIL: CPU 0.07s/0.02u sec elapsed 0.60 sec.
INFO: "page_hits_queue": removed 4329 row versions in 436 pages
DETAIL: CPU 0.00s/0.00u sec elapsed 0.01 sec.
INFO: index "page_hits_pkey" now contains 7891 row versions in 9363 pages
DETAIL: 4329 index row versions were removed.
9338 index pages have been deleted, 9227 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO: "page_hits_queue": found 4329 removable, 7891 nonremovable row versions in 548 out of 548 pages
DETAIL: 0 dead row versions cannot be removed yet.
There were 10210 unused item pointers.
0 pages are entirely empty.
CPU 0.08s/0.02u sec elapsed 0.62 sec.
INFO: vacuuming "pg_toast.pg_toast_25287"
INFO: index "pg_toast_25287_index" now contains 0 row versions in 2 pages
DETAIL: 0 index row versions were removed.
0 index pages have been deleted, 0 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO: "pg_toast_25287": found 0 removable, 0 nonremovable row versions in 0 out of 0 pages
DETAIL: 0 dead row versions cannot be removed yet.
There were 0 unused item pointers.
0 pages are entirely empty.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
VACUUM

--
Jim C. Nasby, Data Architect jim(at)nasby(dot)net
512.569.9461 (cell) http://jim.nasby.net


From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Kevin Grittner <kgrittn(at)ymail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Race condition in b-tree page deletion
Date: 2013-12-17 07:56:09
Message-ID: 52B00399.4020804@vmware.com
Lists: pgsql-hackers

On 12/17/2013 03:57 AM, Kevin Grittner wrote:
> I agree with others that this patch is not suitable for
> back-patching. I wonder whether some other solution might be worth
> back-patching, since it is clearly a bug. The "never delete
> internal pages" might be safe enough to consider. It would lead to
> some degree of bloat, but perhaps bloat is better than errors. Of
> course, the argument can certainly be made that people hit the bug
> so rarely that we're better off just leaving it alone in stable
> branches. Anyway, that would be a different patch.

I don't think "never delete internal pages" would be acceptable. In
particular, it would bloat indefinitely for a smallish FIFO queue table
with a timestamp column.

- Heikki


From: Peter Eisentraut <peter_e(at)gmx(dot)net>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Race condition in b-tree page deletion
Date: 2013-12-21 13:50:26
Message-ID: 1387633826.30013.3.camel@vanquo.pezone.net
Lists: pgsql-hackers

This patch didn't make it out of the 2013-11 commit fest. You should
move it to the next commit fest (probably with an updated patch)
before January 15th, if it is not resolved before then.


From: Peter Geoghegan <pg(at)heroku(dot)com>
To: Peter Eisentraut <peter_e(at)gmx(dot)net>
Cc: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Race condition in b-tree page deletion
Date: 2014-01-11 23:45:43
Message-ID: CAM3SWZTvA11sRYjZU+h60i+5Szcq=kmmvrWHDgi01m2x7iU-Zg@mail.gmail.com
Lists: pgsql-hackers

On Sat, Dec 21, 2013 at 5:50 AM, Peter Eisentraut <peter_e(at)gmx(dot)net> wrote:
> This patch didn't make it out of the 2013-11 commit fest. You should
> move it to the next commit fest (probably with an updated patch)
> before January 15th, if it is not resolved before then.

Uh, it's already in the January commitfest.

--
Peter Geoghegan


From: Kevin Grittner <kgrittn(at)ymail(dot)com>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Race condition in b-tree page deletion
Date: 2014-01-18 19:45:23
Message-ID: 1390074323.55520.YahooMailNeo@web122304.mail.ne1.yahoo.com
Lists: pgsql-hackers

Heikki Linnakangas <hlinnakangas(at)vmware(dot)com> wrote:

> Here's a patch implementing this scheme.

This patch fixes a race condition bug in btree entry deletion.  The
bug seems to be encountered only rarely in practice, or at least it is
generally not recognized as the cause when index corruption occurs.

The basic approach is sound.  The papers on which we based our
btree implementation did not discuss entry deletion, and our
current approach introduces new possible states into the on-disk
index structure which are not covered by the papers and are not
always handled correctly by the current code.  Specifically, entry
inserts can add new pages at the lower levels which were not (yet)
in any higher-level pages, but right-sibling pointers allow
the correct point in the index to be found.  Our existing deletion
approach turned this on its head by having missing pages at the
bottom, which introduces whole new classes of problems which
otherwise don't exist.  It is in handling these index states which
are not addressed in the academic papers that we have a race
condition.  This patch makes interim states in the deletion process
look almost exactly the same as interim states in the insertion
process, thus dodging the need to handle new states.

The approach to limiting the number of locks to a finite (and
small) number is clever, but seems quite sound.  The locks are
always taken in an order which cannot cause a deadlock.

The write-up lacks any explicit mention of what happens if the
branch being considered for deletion reaches all the way to the
root.  Although an astute reader will notice that since the root page
will always be the only page at its level, it must always be the
rightmost page at its level, and therefore is correctly handled by
the logic dealing with that, I think it merits an explicit mention
in the README.

The general feeling is that this patch is not suitable for
back-patching.  I agree with this, but this is a bug, existing in all
supported branches, which could lead to index corruption.
If nobody can suggest an alternative which we can back-patch, we
might want to reconsider this after this has been in production
releases for several months.

The patch still applies with a few offsets.  It compiles with these
warnings (which were also reported by Peter Eisentraut and should
be fixed):

nbtpage.c: In function ‘_bt_pagedel’:
nbtpage.c:1695:24: warning: ‘metad’ may be used uninitialized in this function [-Wmaybe-uninitialized]
nbtpage.c:1414:18: note: ‘metad’ was declared here
nbtpage.c:1730:4: warning: ‘metapg’ may be used uninitialized in this function [-Wmaybe-uninitialized]
nbtpage.c:1413:8: note: ‘metapg’ was declared here

I'm not a fan of the new retry label introduced in _bt_pagedel()
and the goto statement which references it.  I think a loop using
while() or for() would be cleaner.  The whole idea of describing
this logic recursively and then making a big deal about using tail
recursion as an optimization seems pointless and confusing.  It
could just as easily be described as an iterative approach and save
the extra logical step in explaining the code.  I think that would
be clearer and easier to understand.  It would also eliminate the
last parameter, which is always passed as NULL and only set to
something else internally.  I think it would be cleaner to make
that a local variable initialized to NULL as part of eliminating
the fiction of recursion here.
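
To illustrate the shape I have in mind (a generic sketch, not the
actual nbtpage.c code):

    /* before: a label and goto, plus a parameter always passed as NULL */
    int
    pagedel(Relation rel, Buffer buf, void *stack)
    {
    retry:
        /* ...delete one page, possibly deciding to continue upward... */
        if (more_to_do)
            goto retry;
        return ndeleted;
    }

    /* after: a plain loop, with 'stack' demoted to a local variable */
    int
    pagedel(Relation rel, Buffer buf)
    {
        void       *stack = NULL;

        for (;;)
        {
            /* ...delete one page... */
            if (!more_to_do)
                break;
        }
        return ndeleted;
    }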

There is a point in this loop where we return 0, but I think that
is wrong; I think we should break so that the pages already deleted
will be counted.  On a related note, the caller doesn't actually
use the count, but assumes that any non-zero value should be
treated as 1.

Similar issues with faux recursion exist in the calling function,
btvacuumpage().  In that case the unnecessary parameter which the
caller must get right is orig_blkno, which must be the same as
blkno on an actual call.  That could be a local variable
initialized to blkno within the function.  I think this would be
better described as iteration directly rather than claiming
recursions and whining about how the compiler is too dumb to turn
it into iteration for us.  It's not the job of this patch to fix
that in the caller, but let's not emulate it.

Other than that, I have not found any problems.

Since this is fixing a bug which is a hard-to-hit race condition,
it is hard to test.  Pretty much you need to get into a debugger
and set breakpoints to cause the right timing to be sure of hitting
it.  I'm still playing with that, but I figured I should post these
notes from reviewing the source code while I continue with that.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Kevin Grittner <kgrittn(at)ymail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Race condition in b-tree page deletion
Date: 2014-01-27 14:34:44
Message-ID: 52E66E84.2050109@vmware.com
Lists: pgsql-hackers

Thanks for the review!

On 01/18/2014 09:45 PM, Kevin Grittner wrote:
> The patch still applies with a few offsets. It compiles with these
> warnings (which were also reported by Peter Eisentraut and should
> be fixed):
>
> nbtpage.c: In function ‘_bt_pagedel’:
> nbtpage.c:1695:24: warning: ‘metad’ may be used uninitialized in this function [-Wmaybe-uninitialized]
> nbtpage.c:1414:18: note: ‘metad’ was declared here
> nbtpage.c:1730:4: warning: ‘metapg’ may be used uninitialized in this function [-Wmaybe-uninitialized]
> nbtpage.c:1413:8: note: ‘metapg’ was declared here

Fixed.

> I'm not a fan of the new retry label introduced in _bt_pagedel()
> and the goto statement which references it. I think a loop using
> while() or for() would be cleaner. The whole idea of describing
> this logic recursively and then making a big deal about using tail
> recursion as an optimization seems pointless and confusing. It
> could just as easily be described as an iterative approach and save
> the extra logical step in explaining the code. I think that would
> be clearer and easier to understand. It would also eliminate the
> last parameter, which is always passed as NULL and only set to
> something else internally. I think it would be cleaner to make
> that a local variable initialized to NULL as part of eliminating
> the fiction of recursion here.

Ok, changed it to a loop, and moved the responsibility to construct the
'stack' into _bt_pagedel().

> There is a point in this loop where we return 0, but I think that
> is wrong; I think we should break so that the pages already deleted
> will be counted.

Fixed.

> On a related note, the caller doesn't actually
> use the count, but assumes that any non-zero value should be
> treated as 1.

Hmm. The caller does this:

>     ndel = _bt_pagedel(rel, buf);
>
>     /* count only this page, else may double-count parent */
>     if (ndel)
>         stats->pages_deleted++;

The problem here is that btvacuumpage() might visit the parent page
again, and count it again. The code is erring on the side of
undercounting; I'm not sure which is better.

> Similar issues with faux recursion exist in the calling function,
> btvacuumpage(). In that case the unnecessary parameter which the
> caller must get right is orig_blkno, which must be the same as
> blkno on an actual call. That could be a local variable
> initialized to blkno within the function. I think this would be
> better described as iteration directly rather than claiming
> recursion and whining about how the compiler is too dumb to turn
> it into iteration for us. It's not the job of this patch to fix
> that in the caller, but let's not emulate it.

Agreed.

> Other than that, I have not found any problems.

While fixing the above issues, and reviewing this again in general, I
noticed that btree_xlog_unlink_page() was a few bricks shy of a load.
When unlinking an internal page, it didn't update the pointer to the new
topmost page in the to-be-deleted chain. Fixed that.

Attached is a new version. I also re-worded the README changes; I hope
it reads better now.

- Heikki

Attachment: fix-btree-page-deletion-3.patch (text/x-diff, 64.1 KB)

From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Jim Nasby <jim(at)nasby(dot)net>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Race condition in b-tree page deletion
Date: 2014-01-27 14:39:19
Message-ID: 52E66F97.9090804@vmware.com
Lists: pgsql-hackers

On 12/17/2013 04:55 AM, Jim Nasby wrote:
> On 11/9/13, 10:02 AM, Heikki Linnakangas wrote:
>> 3. Another approach would be to get rid of the "can't delete
>> rightmost child" limitation. We currently have that limitation
>> because it ensures that we never need to change the high-key of a
>> page. If we delete a page that is the rightmost child of its
>> parent, we transfer the deleted keyspace from the parent page to
>> its right sibling. To do that, we need to update the high key of
>> the parent, as well as the downlink of the right sibling at the
>> grandparent level. That's a bit complicated, because updating the
>> high key might require splitting the page.
>
> Is the rightmost child issue likely to affect indexes on increasing
> values, like a queue table?

No. In a FIFO queue, you keep adding new items to the right-end of the
index, so old pages become non-rightmost fairly quickly.

- Heikki


From: Peter Geoghegan <pg(at)heroku(dot)com>
To: Kevin Grittner <kgrittn(at)ymail(dot)com>
Cc: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Race condition in b-tree page deletion
Date: 2014-01-28 01:55:46
Message-ID: CAM3SWZRH2hXWBC+Q5DaYvSaRHV+u3pPfY=kNGpBBkPY+xwqiPA@mail.gmail.com
Lists: pgsql-hackers

On Sat, Jan 18, 2014 at 11:45 AM, Kevin Grittner <kgrittn(at)ymail(dot)com> wrote:
> Heikki Linnakangas <hlinnakangas(at)vmware(dot)com> wrote:
>
>> Here's a patch implementing this scheme.

I thought I'd weigh in here too, since this is closely tied to the
page split patch, which is dependent on this patch.

> The basic approach is sound. The papers on which we based our
> btree implementation did not discuss entry deletion, and our
> current approach introduces new possible states into the on-disk
> index structure which are not covered by the papers and are not
> always handled correctly by the current code.

Lehman and Yao don't discuss deletion. But V. Lanin and D. Shasha do,
in the paper "A Symmetric Concurrent B-Tree Algorithm"[1], as the
README notes - we currently follow a "simplified" version of that.
Apparently a number of people proposed solutions to address the
omission of L&Y, as one survey of those approaches notes [2]. One of
the approaches appearing in that survey is L&S.

The classic L&Y algorithm only has right-links, and not left-links.
The same applies to L&S. But I'd describe this patch as bringing what
we do closer to L&S, which is more or less a refinement of B-Link
trees (that is, L&Y B-Trees). L&S does have a concept of an outlink,
crucial to their "symmetric deletion approach", which is something
that we don't need, because we already have left-links (we have them
primarily to support backwards index scans). L&S says "In general, the
link technique calls for processes that change sections of a data
structure in a way that would cause other processes to fail to provide
a link to another section of the data structure where recovery is
possible". So loosely speaking we're doing a "conceptual
_bt_moveleft()" for deletion as the inverse of a literal _bt_moveright
for insertion.

L&S says "a deletion needs no more than two write locks at a time
during its ascent" (the descent is just a garden variety _bt_search()
insertion scankey descent). However, we currently may obtain as many
as 4 write locks for "page deletion". As the README notes:

"""
To delete an empty page, we acquire write lock on its left sibling (if
any), the target page itself, the right sibling (there must be one), and
the parent page, in that order.
"""

We obtain 2 write locks in the first phase (on target and parent). In
the second phase, we do what is described above: lock 4 pages in that
order. However, the reason for this difference from the L&S paper is
made clear in "2.5 Freeing Empty nodes", which kind of cops out of
addressing how to *unlink* empty/half-dead pages, with a couple of
brief descriptions of approaches that others have come up with that
are not at all applicable to us.

For that reason, might I suggest that you change this in the README:

+ Page Deletion
+ -------------

I suggest that it should read "Page Deletion and Unlinking" to
emphasize the distinction.

> The general feeling is that this patch is not suitable for
> back-patching. I agree with this, but this is a bug, existing in all
> supported branches, which could lead to index corruption.
> If nobody can suggest an alternative which we can back-patch, we
> might want to reconsider this after this has been in production
> releases for several months.

That was what I thought. Let's revisit the question of back-patching
this when 9.4 has been released for 6 months or so.

> Other than that, I have not found any problems.

Incidentally, during my research of these techniques, I discovered
that Johnson and Shasha argue that "for most concurrent B-tree
applications, merge-at-empty produces significantly lower
restructuring rates, and only a slightly lower space utilization, than
merge-at-half". This is covered in the paper "B-trees with inserts,
deletes, and modifies: Why Free-at-empty is Better than Merge-at-half"
[3]. I was pleased to learn that there was a sound, well regarded
theoretical basis for that behavior, even if the worst-case bloat can
be unpleasant. Though having said that, that paper doesn't weigh what
we do in the event of non-HOT updates, and that could change the
equation. That isn't an argument against merge-at-empty; it's an
argument against the current handling of non-HOT updates.

Currently, the comments above _bt_doinsert() say at one point:

* This routine is called by the public interface routines, btbuild
* and btinsert.

This is obsolete; since commit 89bda95d, only the insert public
interface routine calls _bt_doinsert(). Perhaps fix this in passing.

[1] L&S: https://archive.org/stream/symmetricconcurr00lani/symmetricconcurr00lani_djvu.txt

[2] http://www.dtic.mil/get-tr-doc/pdf?AD=ADA232287&Location=U2&doc=GetTRDoc.pdf
(Chapter 3, "The Coherent Shared Memory Algorithm")

[3] http://cs.nyu.edu/shasha/papers/bjournal.ps
--
Peter Geoghegan


From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Kevin Grittner <kgrittn(at)ymail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Race condition in b-tree page deletion
Date: 2014-01-28 11:13:43
Message-ID: 52E790E7.60400@vmware.com
Lists: pgsql-hackers

On 01/27/2014 04:34 PM, Heikki Linnakangas wrote:
> While fixing the above issues, and reviewing this again in general, I
> noticed that btree_xlog_unlink_page() was a few bricks shy of a load.
> When unlinking an internal page, it didn't update the pointer to the new
> topmost page in the to-be-deleted chain. Fixed that.
>
> Attached is a new version. I also re-worded the README changes; I hope
> it reads better now.

I dusted off the scripts I posted yesterday to visualize b-tree
structure using Graphviz, and tried them on this patch. I found a number
of bugs. Attached is a new version with those bugs fixed.

- Heikki

Attachment: fix-btree-page-deletion-4.patch (text/x-diff, 64.5 KB)