Nasty btree deletion bug

Lists: pgsql-hackers
From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql-hackers(at)postgreSQL(dot)org
Cc: "Ed L(dot)" <pgsql(at)bluepolka(dot)net>
Subject: Nasty btree deletion bug
Date: 2006-10-25 17:14:56
Message-ID: 2766.1161796496@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

I've been analyzing Ed L's recent report of index corruption:
http://archives.postgresql.org/pgsql-general/2006-10/msg01183.php
(thanks to Ed for letting me have the actual index to study).
I think I've figured out what happened to it. nbtree/README says

: The notion of a half-dead page means that the key space relationship between
: the half-dead page's level and its parent's level may be a little out of
: whack: key space that appears to belong to the half-dead page's parent on the
: parent level may really belong to its right sibling. We can tolerate this,
: however, because insertions and deletions on upper tree levels are always
: done by reference to child page numbers, not keys. The only cost is that
: searches may sometimes descend to the half-dead page and then have to move
: right, rather than going directly to the sibling page.

but unfortunately this analysis is too simplistic. In the situation
where a half-dead page is its parent's rightmost child (which is the
only case where we expect half-deadness to persist long), that page's
former key space now belongs to its right sibling, which means that
some keys that are less than the parent's "high key" now belong to the
keyspace of pages below the parent's right sibling. This is OK as far
as search behavior goes --- but suppose that we get a continuing stream
of insertions of new keys in that key range. This will result in page
splits that cause keys in that key range to bubble up into the upper
levels. If that keeps happening long enough, eventually we will split
the parent's right sibling at a key value less than the parent's high
key, and then we will insert that key into the grandparent level just
to the right of the parent's right sibling. Now we have an index
that's actually corrupt, because we have out-of-order index keys in
the grandparent level, which can cause searches for keys in their range
to fail (a search may descend too far to the right to find the entries
it should have found).

Since only internal pages can be half-dead, this failure requires at
least a three-level index, and it requires enough deletions within a
small range for for a level-1 page to become empty (hence half dead)
followed by a large number of insertions in that same range. Ed's
index was probably more prone to this than average because he was
indexing very wide values (~500 byte text), leading to low btree fanout
and a relatively narrow value range for a level-1 page. Still I'm a
bit surprised that we've not figured it out before, because the bug is
presumably present all the way back to 7.4 when the btree deletion code
was added.

I haven't thought of a suitable fix yet --- clearly we're going to have
to change the concept of half-deadness to some extent. But I have to
leave for a dentist appointment, so I figured I'd post what I know.

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Nasty btree deletion bug
Date: 2006-10-26 00:05:41
Message-ID: 6392.1161821141@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

I wrote:
> I've been analyzing Ed L's recent report of index corruption:
> http://archives.postgresql.org/pgsql-general/2006-10/msg01183.php

After some thought, it still seems to me to be impractical to delete the
rightmost child of a btree page while other children remain. Doing this
would require either moving the child's keyspace assignment left to its
left sibling, or moving that keyspace right to the first child of the
parent's right sibling, and either of these choices means having to
adjust page "high keys" in a way that's not certain to work. (For
instance, moving the keyspace left means assigning the victim page's
high key to its left sibling, and there might not be room in the left
sibling page if the desired key is longer than what's there. In the
other case the same problem of having to replace a key with a
potentially longer one exists, but it manifests at the grandparent
level.)

So I think the rule needs to be "don't delete the rightmost child unless
it's the only child, in which case you can delete the parent too --- but
the same restriction must be observed at the next level up".

I believe we can handle this by doing a precheck before starting a
delete of a level-zero page: scan up the stack to check that the
condition is satisfied for each level that needs to be deleted.
The tricky part is that we don't want to exclusive-lock all those
pages at once (we could do it without deadlock risk, but the concurrency
hit seems bad). I think though that we can assume that if the condition
is checked it will remain true until we finish the delete:

1. A page that isn't rightmost child can't become so while we're not
looking, because that would require a delete to occur, and only VACUUM
does index page deletes, and we already disallow two concurrent VACUUMs
on the same index.

2. A page that is an only child could acquire a sibling only if it's
split, but that would imply an insert (in fact multiple inserts) into
the original level-zero page. We'll recheck emptiness of the level-zero
page after we acquire write lock on it to begin the actual deletion,
at which point it's still safe to abandon the delete attempt.

The recent patch to allow ordinary non-vacuum processes to delete index
entries retail makes #2 a little trickier than meets the eye. One could
imagine a scenario where between the times VACUUM leaves the level-zero
page and reacquires lock on it, enough entries were inserted to split
the page and then they all got deleted again by that patch. However
that patch in its present form cannot leave a page in a completely
empty state, because it's only invoked as part of an insertion attempt.
(If it did manage to delete all the existing entries, then the same
process would insert a new entry onto the same page before unlocking
it.) So I think it's OK, but we'll need to be wary about any proposals
to remove index entries in other contexts.

The concept of a half-dead page would remain, but it'd be a transient
state that would normally only persist for a moment between atomic
page-delete actions. If we crash between two such actions, the
half-dead page would remain present, but would be found and cleaned up
by the next VACUUM. In the meantime it wouldn't cause any problem
because the keyspace it gives up will belong to a sibling of the same
parent at whatever level the delete is ultimately supposed to stop at,
and so inserts and even splits in that keyspace won't create an
inconsistency. Alternatively, we could have WAL crash recovery complete
the multilevel deletion using the same sort of remember-pending-actions
logic we use now to handle splits.

Comments? Have I missed anything?

regards, tom lane


From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Nasty btree deletion bug
Date: 2006-10-26 15:00:15
Message-ID: 4540CD7F.30909@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Tom Lane wrote:
> I wrote:
>> I've been analyzing Ed L's recent report of index corruption:
>> http://archives.postgresql.org/pgsql-general/2006-10/msg01183.php

Auch. That's nasty indeed.

> So I think the rule needs to be "don't delete the rightmost child unless
> it's the only child, in which case you can delete the parent too --- but
> the same restriction must be observed at the next level up".
> ....
> The concept of a half-dead page would remain, but it'd be a transient
> state that would normally only persist for a moment between atomic
> page-delete actions. If we crash between two such actions, the
> half-dead page would remain present, but would be found and cleaned up
> by the next VACUUM. In the meantime it wouldn't cause any problem
> because the keyspace it gives up will belong to a sibling of the same
> parent at whatever level the delete is ultimately supposed to stop at,
> and so inserts and even splits in that keyspace won't create an
> inconsistency.

I don't understand how this "in the meantime" thing works. I tried to
work out a step-by-step example, could you take a look at it? See
http://users.tkk.fi/~hlinnaka/pgsql/btree-deletion-bug/

> ...
>
> Comments? Have I missed anything?

It took me a lot of time with pen and paper to understand the issue. And
I'm not sure I still understood it fully. The logic is very complex,
which is bad for maintainability in itself :(.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Nasty btree deletion bug
Date: 2006-10-26 15:47:29
Message-ID: 21061.1161877649@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Heikki Linnakangas <heikki(at)enterprisedb(dot)com> writes:
> I don't understand how this "in the meantime" thing works. I tried to
> work out a step-by-step example, could you take a look at it? See
> http://users.tkk.fi/~hlinnaka/pgsql/btree-deletion-bug/

[ looks at that for a bit... ] Yeah, you're right. Once the deletion
is completed, the F lower-bound key will disappear from the grandparent,
which would restore consistency --- but we could have already delivered
wrong search answers, so that won't do.

In theory, given a slow-enough-moving VACUUM process, this could happen
even without a crash. So I think that means we have to go over to the
other plan of locking everything all the way up to the top of the
deletion before we start doing it --- and also, we'll need crash
recovery logic to complete the incomplete deletion, same as for
incomplete inserts. Yech --- more code than I was hoping to have to
write, and more concurrency hit too.

OTOH we might be able to get rid of the notion of "half dead", which
would allow some simplifications.

Thanks for taking a close look!

regards, tom lane


From: "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Nasty btree deletion bug
Date: 2006-10-26 16:31:30
Message-ID: 4540E2E2.6040305@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Tom Lane wrote:
> In theory, given a slow-enough-moving VACUUM process, this could happen
> even without a crash. So I think that means we have to go over to the
> other plan of locking everything all the way up to the top of the
> deletion before we start doing it --- and also, we'll need crash
> recovery logic to complete the incomplete deletion, same as for
> incomplete inserts. Yech --- more code than I was hoping to have to
> write, and more concurrency hit too.

ISTM we don't necessarily need the "complete the incomplete deletion"
logic. Since we're holding all the pages up to the parent of the highest
deleted page locked, we can atomically issue one big WAL record covering
all the modifications. We can't do that with page splits, because we
want to release the lock on the split pages before we move up to the
parent to insert the child pointer. (whether or not it's worth it in
page splits either is another issue)

The concurrency hit probably isn't that bad. There shouldn't be much
activity on empty pages.

What does the original research paper by Lanin & Shasha say about this?
I don't have it at hand. I do have the Lehman & Yao paper but that one
doesn't discuss removal of non-leaf pages at all.

> OTOH we might be able to get rid of the notion of "half dead", which
> would allow some simplifications.

Yep.

> Thanks for taking a close look!

No problem, I've been staring at the b-tree code lately anyway.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Nasty btree deletion bug
Date: 2006-10-26 16:31:50
Message-ID: 21633.1161880310@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

I wrote:
> [ looks at that for a bit... ] Yeah, you're right. Once the deletion
> is completed, the F lower-bound key will disappear from the grandparent,
> which would restore consistency --- but we could have already delivered
> wrong search answers, so that won't do.

On further reflection, I think I understand why we've not realized the
existence of this bug before: in fact, it *doesn't* lead to wrong search
answers. I think the only visible consequence is exactly the "failed to
re-find parent key" VACUUM error that Ed saw. The reason is that the
key misordering in the grandparent level is nearly harmless. Using your
example of

- F D D ...

* if we happen to come across the F key first during a binary search of
the grandparent page, and we are looking for something <= F, we will
descend to its left, which is at worst a little bit inefficient:
_bt_moveright will still ensure that we find what we seek.

* if we happen to visit one of the D key(s) first, and we are looking
for something > D, we will descend to the right of that key. Well,
that's not incorrect for the live data. In fact, the *only* key in the
tree that we will fail to find this way is the F bounding key for the
half-dead page itself (or one of its also-deletable parents). So that's
exactly why VACUUM can fail while trying to clean up the half-dead page,
and it's why we're not seeing reports of wrong query answers.

So that reduces the priority of the bug quite a lot in my estimation;
and makes me not want to incur a lot of additional code and locking to
fix it. I'm wondering whether we can simply adopt a modified strategy
for searching for a half-dead page's parent during _bt_pagedel.

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Nasty btree deletion bug
Date: 2006-10-26 16:59:14
Message-ID: 21980.1161881954@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

"Heikki Linnakangas" <heikki(at)enterprisedb(dot)com> writes:
> What does the original research paper by Lanin & Shasha say about this?

Nothing very useful. The connection of our code to that paper is
actually a bit tenuous: their approach to deletion is to make the target
page's key space move left not right. That doesn't work real nicely in
my opinion (for one thing, you have to replace the left sibling's high
key, which is problematic for variable-size keys).

regards, tom lane


From: "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Nasty btree deletion bug
Date: 2006-10-26 17:08:10
Message-ID: 4540EB7A.2080201@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Tom Lane wrote:
> On further reflection, I think I understand why we've not realized the
> existence of this bug before: in fact, it *doesn't* lead to wrong search
> answers. I think the only visible consequence is exactly the "failed to
> re-find parent key" VACUUM error that Ed saw. The reason is that the
> key misordering in the grandparent level is nearly harmless. Using your
> example of

Yep. It's pretty harmless.

But now that I look at the original post by Ed, I don't see how the
"failed to re-find parent key" error could result from the issue we've
been talking about. The error message is printed when _bt_getstackbuf is
unable to re-find an item in the parent of a deleted page, but
_bt_getstackbuf doesn't look at or compare the keys at all.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Nasty btree deletion bug
Date: 2006-10-26 17:21:53
Message-ID: 22328.1161883313@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

"Heikki Linnakangas" <heikki(at)enterprisedb(dot)com> writes:
> But now that I look at the original post by Ed, I don't see how the
> "failed to re-find parent key" error could result from the issue we've
> been talking about. The error message is printed when _bt_getstackbuf is
> unable to re-find an item in the parent of a deleted page, but
> _bt_getstackbuf doesn't look at or compare the keys at all.

Right, but _bt_getstackbuf is working from a search stack created by
a standard search for the victim page's high key. If that search
descended through a page to the right of the victim page's actual
parent, _bt_getstackbuf isn't able to recover.

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Nasty btree deletion bug
Date: 2006-10-26 19:07:03
Message-ID: 23890.1161889623@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

I wrote:
> Right, but _bt_getstackbuf is working from a search stack created by
> a standard search for the victim page's high key. If that search
> descended through a page to the right of the victim page's actual
> parent, _bt_getstackbuf isn't able to recover.

What I'm tempted to do, at least in the back branches, is simply adjust
_bt_pagedel to be able to recover from _bt_getstackbuf failure in this
scenario. It could use the same method that _bt_insert_parent does in
the concurrent-root-split case, ie (untested):

ItemPointerSet(&(stack->bts_btentry.t_tid), target, P_HIKEY);
pbuf = _bt_getstackbuf(rel, stack, BT_WRITE);
if (pbuf == InvalidBuffer)
+ {
+ /* Find the leftmost page at the next level up */
+ pbuf = _bt_get_endpoint(rel, opaque->btpo.level + 1, false);
+ stack->bts_blkno = BufferGetBlockNumber(pbuf);
+ stack->bts_offset = InvalidOffsetNumber;
+ _bt_relbuf(rel, pbuf);
+ /* and repeat search from there */
+ pbuf = _bt_getstackbuf(rel, stack, BT_WRITE);
+ if (pbuf == InvalidBuffer)
elog(ERROR, "failed to re-find parent key in \"%s\"",
RelationGetRelationName(rel));
+ }
parent = stack->bts_blkno;
poffset = stack->bts_offset;

The question is whether we want a cleaner answer for future development,
and if so what that answer ought to look like. It seems like the
alternatives we've been discussing may not end up any simpler/shorter
than the current code, and it seems hard to justify giving up some
concurrency in the name of a simplification that doesn't simplify much.
Thoughts?

regards, tom lane