Quick Links

Re: Failure while inserting parent tuple to B-tree is not fun

Lists:	pgsql-hackers

From:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To:	PostgreSQL-development <pgsql-hackers(at)postgreSQL(dot)org>
Subject:	Failure while inserting parent tuple to B-tree is not fun
Date:	2013-10-22 16:55:09
Message-ID:	5266ADED.9030008@vmware.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Splitting a B-tree page is a two-stage process: First, the page is
split, and then a downlink for the new right page is inserted into the
parent (which might recurse to split the parent page, too). What happens
if inserting the downlink fails for some reason? I tried that out, and
it turns out that it's not nice.

I used this to cause a failure:

> --- a/src/backend/access/nbtree/nbtinsert.c
> +++ b/src/backend/access/nbtree/nbtinsert.c
> @@ -1669,6 +1669,8 @@ _bt_insert_parent(Relation rel,
> _bt_relbuf(rel, pbuf);
> }
>
> + elog(ERROR, "fail!");
> +
> /* get high key from left page == lowest key on new right page */
> ritem = (IndexTuple) PageGetItem(page,
> PageGetItemId(page, P_HIKEY));

postgres=# create table foo (i int4 primary key);
CREATE TABLE
postgres=# insert into foo select generate_series(1, 10000);
ERROR: fail!

That's not surprising. But when I removed that elog again and restarted
the server, I still can't insert. The index is permanently broken:

postgres=# insert into foo select generate_series(1, 10000);
ERROR: failed to re-find parent key in index "foo_pkey" for split pages 4/5

In real life, you would get a failure like this e.g if you run out of
memory or disk space while inserting the downlink to the parent.
Although rare in practice, it's no fun if it happens.

I fixed the the same problem in GiST a few years back, by making it
tolerate missing downlinks, and inserting them lazily. The B-tree code
tolerates them already on scans, but gets confused on insertion, as seen
above. I propose that we use the same approach I used with GiST, and add
a flag to the page header to indicate "the downlink hasn't been inserted
yet". When insertion (or vacuum) bumps into a flagged page, it can
finish the incomplete action by inserting the downlink.

- Heikki

From:	Peter Geoghegan <pg(at)heroku(dot)com>
To:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc:	PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Failure while inserting parent tuple to B-tree is not fun
Date:	2013-10-22 17:27:38
Message-ID:	CAM3SWZSH8G48C1iWbc8gE4XXVnwRYy1m1LtUJmJLk4ZS62aUHQ@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Tue, Oct 22, 2013 at 9:55 AM, Heikki Linnakangas
<hlinnakangas(at)vmware(dot)com> wrote:
> I propose that we use the same approach I used with GiST, and add a flag to
> the page header to indicate "the downlink hasn't been inserted yet". When
> insertion (or vacuum) bumps into a flagged page, it can finish the
> incomplete action by inserting the downlink.

Sounds very reasonable, but I'm interested in how you intend to
structure things, given this sounds like what could loosely be called
btree private state. I may also need to use a flag bit for something
that is of interest to the btree code alone.

--
Peter Geoghegan

From:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To:	Peter Geoghegan <pg(at)heroku(dot)com>
Cc:	PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Failure while inserting parent tuple to B-tree is not fun
Date:	2013-10-22 17:30:35
Message-ID:	5266B63B.2010208@vmware.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 22.10.2013 20:27, Peter Geoghegan wrote:
> On Tue, Oct 22, 2013 at 9:55 AM, Heikki Linnakangas
> <hlinnakangas(at)vmware(dot)com> wrote:
>> I propose that we use the same approach I used with GiST, and add a flag to
>> the page header to indicate "the downlink hasn't been inserted yet". When
>> insertion (or vacuum) bumps into a flagged page, it can finish the
>> incomplete action by inserting the downlink.
>
> Sounds very reasonable, but I'm interested in how you intend to
> structure things, given this sounds like what could loosely be called
> btree private state. I may also need to use a flag bit for something
> that is of interest to the btree code alone.

I may be missing something, but there are already plenty of b-tree
specific flags. See BTP_* in nbtree.h. I'll just add another to that list.

- Heikki

From:	Peter Geoghegan <pg(at)heroku(dot)com>
To:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc:	PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Failure while inserting parent tuple to B-tree is not fun
Date:	2013-10-22 17:45:27
Message-ID:	CAM3SWZTo53wN5up1kv7Sa=DQC2yCBz1=8+x4LGkArte_EGgrdA@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Tue, Oct 22, 2013 at 10:30 AM, Heikki Linnakangas
<hlinnakangas(at)vmware(dot)com> wrote:
> I may be missing something, but there are already plenty of b-tree specific
> flags. See BTP_* in nbtree.h. I'll just add another to that list.

Based on your remarks, I thought that you were intent on directly
using page level bits (pd_flags), rather than the private state flag
bits. Blame it on the lack of coffee, I suppose.

--
Peter Geoghegan

From:	Andres Freund <andres(at)2ndquadrant(dot)com>
To:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc:	PostgreSQL-development <pgsql-hackers(at)postgreSQL(dot)org>
Subject:	Re: Failure while inserting parent tuple to B-tree is not fun
Date:	2013-10-22 18:25:46
Message-ID:	20131022182546.GF7435@awork2.anarazel.de
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 2013-10-22 19:55:09 +0300, Heikki Linnakangas wrote:
> Splitting a B-tree page is a two-stage process: First, the page is split,
> and then a downlink for the new right page is inserted into the parent
> (which might recurse to split the parent page, too). What happens if
> inserting the downlink fails for some reason? I tried that out, and it turns
> out that it's not nice.
>
> I used this to cause a failure:
>
> >--- a/src/backend/access/nbtree/nbtinsert.c
> >+++ b/src/backend/access/nbtree/nbtinsert.c
> >@@ -1669,6 +1669,8 @@ _bt_insert_parent(Relation rel,
> > _bt_relbuf(rel, pbuf);
> > }
> >
> >+ elog(ERROR, "fail!");
> >+
> > /* get high key from left page == lowest key on new right page */
> > ritem = (IndexTuple) PageGetItem(page,
> > PageGetItemId(page, P_HIKEY));
>
> postgres=# create table foo (i int4 primary key);
> CREATE TABLE
> postgres=# insert into foo select generate_series(1, 10000);
> ERROR: fail!
>
> That's not surprising. But when I removed that elog again and restarted the
> server, I still can't insert. The index is permanently broken:
>
> postgres=# insert into foo select generate_series(1, 10000);
> ERROR: failed to re-find parent key in index "foo_pkey" for split pages 4/5
>
> In real life, you would get a failure like this e.g if you run out of memory
> or disk space while inserting the downlink to the parent. Although rare in
> practice, it's no fun if it happens.

Why doesn't the incomplete split mechanism prevent this? Because we do
not delay checkpoints on the primary and a checkpoint happened just
befor your elog(ERROR) above?

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

From:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To:	Andres Freund <andres(at)2ndquadrant(dot)com>
Cc:	PostgreSQL-development <pgsql-hackers(at)postgreSQL(dot)org>
Subject:	Re: Failure while inserting parent tuple to B-tree is not fun
Date:	2013-10-22 18:29:13
Message-ID:	5266C3F9.4020803@vmware.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 22.10.2013 21:25, Andres Freund wrote:
> On 2013-10-22 19:55:09 +0300, Heikki Linnakangas wrote:
>> Splitting a B-tree page is a two-stage process: First, the page is split,
>> and then a downlink for the new right page is inserted into the parent
>> (which might recurse to split the parent page, too). What happens if
>> inserting the downlink fails for some reason? I tried that out, and it turns
>> out that it's not nice.
>>
>> I used this to cause a failure:
>>
>>> --- a/src/backend/access/nbtree/nbtinsert.c
>>> +++ b/src/backend/access/nbtree/nbtinsert.c
>>> @@ -1669,6 +1669,8 @@ _bt_insert_parent(Relation rel,
>>> _bt_relbuf(rel, pbuf);
>>> }
>>>
>>> + elog(ERROR, "fail!");
>>> +
>>> /* get high key from left page == lowest key on new right page */
>>> ritem = (IndexTuple) PageGetItem(page,
>>> PageGetItemId(page, P_HIKEY));
>>
>> postgres=# create table foo (i int4 primary key);
>> CREATE TABLE
>> postgres=# insert into foo select generate_series(1, 10000);
>> ERROR: fail!
>>
>> That's not surprising. But when I removed that elog again and restarted the
>> server, I still can't insert. The index is permanently broken:
>>
>> postgres=# insert into foo select generate_series(1, 10000);
>> ERROR: failed to re-find parent key in index "foo_pkey" for split pages 4/5
>>
>> In real life, you would get a failure like this e.g if you run out of memory
>> or disk space while inserting the downlink to the parent. Although rare in
>> practice, it's no fun if it happens.
>
> Why doesn't the incomplete split mechanism prevent this? Because we do
> not delay checkpoints on the primary and a checkpoint happened just
> befor your elog(ERROR) above?

Because there's no recovery involved. The failure I injected (or an
out-of-memory or out-of-disk-space in the real world) doesn't cause a
PANIC, just an ERROR that rolls back the current transaction, nothing more.

We could put a critical section around the whole recursion that inserts
the downlinks, so that you would get a PANIC and the incomplete split
mechanism would fix it at recovery. But that would hardly be an improvement.

- Heikki

From:	Andres Freund <andres(at)2ndquadrant(dot)com>
To:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc:	PostgreSQL-development <pgsql-hackers(at)postgreSQL(dot)org>
Subject:	Re: Failure while inserting parent tuple to B-tree is not fun
Date:	2013-10-22 18:34:42
Message-ID:	20131022183442.GG7435@awork2.anarazel.de
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 2013-10-22 21:29:13 +0300, Heikki Linnakangas wrote:
> On 22.10.2013 21:25, Andres Freund wrote:
> >On 2013-10-22 19:55:09 +0300, Heikki Linnakangas wrote:
> >>Splitting a B-tree page is a two-stage process: First, the page is split,
> >>and then a downlink for the new right page is inserted into the parent
> >>(which might recurse to split the parent page, too). What happens if
> >>inserting the downlink fails for some reason? I tried that out, and it turns
> >>out that it's not nice.
> >>
> >>I used this to cause a failure:
> >>
> >>>--- a/src/backend/access/nbtree/nbtinsert.c
> >>>+++ b/src/backend/access/nbtree/nbtinsert.c
> >>>@@ -1669,6 +1669,8 @@ _bt_insert_parent(Relation rel,
> >>> _bt_relbuf(rel, pbuf);
> >>> }
> >>>
> >>>+ elog(ERROR, "fail!");
> >>>+
> >>> /* get high key from left page == lowest key on new right page */
> >>> ritem = (IndexTuple) PageGetItem(page,
> >>> PageGetItemId(page, P_HIKEY));
> >>
> >>postgres=# create table foo (i int4 primary key);
> >>CREATE TABLE
> >>postgres=# insert into foo select generate_series(1, 10000);
> >>ERROR: fail!
> >>
> >>That's not surprising. But when I removed that elog again and restarted the
> >>server, I still can't insert. The index is permanently broken:
> >>
> >>postgres=# insert into foo select generate_series(1, 10000);
> >>ERROR: failed to re-find parent key in index "foo_pkey" for split pages 4/5
> >>
> >>In real life, you would get a failure like this e.g if you run out of memory
> >>or disk space while inserting the downlink to the parent. Although rare in
> >>practice, it's no fun if it happens.
> >
> >Why doesn't the incomplete split mechanism prevent this? Because we do
> >not delay checkpoints on the primary and a checkpoint happened just
> >befor your elog(ERROR) above?
>
> Because there's no recovery involved. The failure I injected (or an
> out-of-memory or out-of-disk-space in the real world) doesn't cause a PANIC,
> just an ERROR that rolls back the current transaction, nothing more.
>
> We could put a critical section around the whole recursion that inserts the
> downlinks, so that you would get a PANIC and the incomplete split mechanism
> would fix it at recovery. But that would hardly be an improvement.

You were talking about restarting the server, that's why I assumed
recovery had been involved... But you just were talking about removing
the elog() again.

For me this clearly *has* to be in a critical section with the current
code. I had always assumed all multi-part actions would be.

Do you forsee the fix with ignoring missing downlinks to be
back-patchable? FWIW, I think I might have seen real-world cases of
this.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Andres Freund <andres(at)2ndquadrant(dot)com>
Cc:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Failure while inserting parent tuple to B-tree is not fun
Date:	2013-10-22 19:24:40
Message-ID:	3673.1382469880@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Andres Freund <andres(at)2ndquadrant(dot)com> writes:
> On 2013-10-22 21:29:13 +0300, Heikki Linnakangas wrote:
>> We could put a critical section around the whole recursion that inserts the
>> downlinks, so that you would get a PANIC and the incomplete split mechanism
>> would fix it at recovery. But that would hardly be an improvement.

> For me this clearly *has* to be in a critical section with the current
> code. I had always assumed all multi-part actions would be.

No, that's hardly a good idea. As Heikki says, that would amount to
converting an entirely foreseeable situation into a PANIC.

Note also that the problem might be persistent, eg if you're out of disk
space. In that case, you'd just get the PANIC again at recovery, so now
not only have you crashed all your sessions but you have a database that
won't start up.

I wonder whether Heikki's approach could be used to remove the need for
the incomplete-split-fixup code altogether, thus eliminating a class of
recovery failure possibilities.

regards, tom lane

From:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Andres Freund <andres(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Failure while inserting parent tuple to B-tree is not fun
Date:	2013-10-22 19:37:33
Message-ID:	5266D3FD.5080501@vmware.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 22.10.2013 22:24, Tom Lane wrote:
> I wonder whether Heikki's approach could be used to remove the need for
> the incomplete-split-fixup code altogether, thus eliminating a class of
> recovery failure possibilities.

Yes. I intend to do that, too.

- Heikki

From:	Andres Freund <andres(at)2ndquadrant(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Failure while inserting parent tuple to B-tree is not fun
Date:	2013-10-22 20:26:03
Message-ID:	20131022202603.GA9370@awork2.anarazel.de
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 2013-10-22 15:24:40 -0400, Tom Lane wrote:
> Andres Freund <andres(at)2ndquadrant(dot)com> writes:
> > On 2013-10-22 21:29:13 +0300, Heikki Linnakangas wrote:
> >> We could put a critical section around the whole recursion that inserts the
> >> downlinks, so that you would get a PANIC and the incomplete split mechanism
> >> would fix it at recovery. But that would hardly be an improvement.
>
> > For me this clearly *has* to be in a critical section with the current
> > code. I had always assumed all multi-part actions would be.
>
> No, that's hardly a good idea. As Heikki says, that would amount to
> converting an entirely foreseeable situation into a PANIC.

But IIUC this can currently lead to an index giving wrong answers, not
"just" fail at further inserts? I think if we half-split (without
inserting the downlink) a page several times, via different child-pages,
we might "suceed" in splitting but we'll have corrupted left/right
links. With complete splits such things should be prevented by the
page-level locks we hold, but that's not the case anymore.
If so that could, especially in combination with unique indexes, lead to
quite nasty data corruption

> I wonder whether Heikki's approach could be used to remove the need for
> the incomplete-split-fixup code altogether, thus eliminating a class of
> recovery failure possibilities.

That'd be better...

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Andres Freund <andres(at)2ndquadrant(dot)com>
Cc:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Failure while inserting parent tuple to B-tree is not fun
Date:	2013-10-22 20:38:05
Message-ID:	5377.1382474285@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Andres Freund <andres(at)2ndquadrant(dot)com> writes:
> On 2013-10-22 15:24:40 -0400, Tom Lane wrote:
>> No, that's hardly a good idea. As Heikki says, that would amount to
>> converting an entirely foreseeable situation into a PANIC.

> But IIUC this can currently lead to an index giving wrong answers, not
> "just" fail at further inserts?

No. It's exactly the same situation as when the insert is still in
progress, ie searches have to move right when following the original
downlink. Go read the nbtree README.

regards, tom lane

From:	Andres Freund <andres(at)2ndquadrant(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Failure while inserting parent tuple to B-tree is not fun
Date:	2013-10-22 20:53:17
Message-ID:	20131022205317.GB9370@awork2.anarazel.de
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 2013-10-22 16:38:05 -0400, Tom Lane wrote:
> Andres Freund <andres(at)2ndquadrant(dot)com> writes:
> > On 2013-10-22 15:24:40 -0400, Tom Lane wrote:
> >> No, that's hardly a good idea. As Heikki says, that would amount to
> >> converting an entirely foreseeable situation into a PANIC.
>
> > But IIUC this can currently lead to an index giving wrong answers, not
> > "just" fail at further inserts?
>
> No. It's exactly the same situation as when the insert is still in
> progress, ie searches have to move right when following the original
> downlink. Go read the nbtree README.

If the parent insert is still in progress we have a write lock on the
left and right page preventing such issues, right? At least that's what
the nbtree README says...
That doesn't hold true if we split a page but fail before re-inserting
the parent since the transaction rollback will obviously have released
the locks.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

From:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To:	PostgreSQL-development <pgsql-hackers(at)postgreSQL(dot)org>
Subject:	Re: Failure while inserting parent tuple to B-tree is not fun
Date:	2013-10-25 19:13:17
Message-ID:	526AC2CD.9040307@vmware.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 22.10.2013 19:55, Heikki Linnakangas wrote:
> I fixed the the same problem in GiST a few years back, by making it
> tolerate missing downlinks, and inserting them lazily. The B-tree code
> tolerates them already on scans, but gets confused on insertion, as seen
> above. I propose that we use the same approach I used with GiST, and add
> a flag to the page header to indicate "the downlink hasn't been inserted
> yet". When insertion (or vacuum) bumps into a flagged page, it can
> finish the incomplete action by inserting the downlink.

This is what I came up with.

One thing I'm not totally happy about is the way page deletions of
incompletely split pages are handled. Basically, it just bails out and
refuses to delete a page that is part of an incomplete split. That's
probably OK in practice, as incomplete splits should be very rare
anyway, but it's a bit dissatisfying to not handle the case because at
first glance it seems like it should be even simpler than usual to
delete a page that has no downlink. Nevertheless, I decided to just skip
that for now.

After this patch, deleting the only child of a parent and the parent
itself is still a multi-WAL-record operation that needs to be tracked
during recovery, and completed at the end of recovery. I'd like to
eliminate that too, but that's another patch.

- Heikki

Attachment	Content-Type	Size
btree-incomplete-split-1.patch	text/x-diff	45.3 KB

From:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To:	PostgreSQL-development <pgsql-hackers(at)postgreSQL(dot)org>
Subject:	Re: Failure while inserting parent tuple to B-tree is not fun
Date:	2013-11-05 13:07:06
Message-ID:	5278ED7A.5050601@vmware.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 25.10.2013 22:13, Heikki Linnakangas wrote:
> On 22.10.2013 19:55, Heikki Linnakangas wrote:
>> I fixed the the same problem in GiST a few years back, by making it
>> tolerate missing downlinks, and inserting them lazily. The B-tree code
>> tolerates them already on scans, but gets confused on insertion, as seen
>> above. I propose that we use the same approach I used with GiST, and add
>> a flag to the page header to indicate "the downlink hasn't been inserted
>> yet". When insertion (or vacuum) bumps into a flagged page, it can
>> finish the incomplete action by inserting the downlink.
>
> This is what I came up with.
>
> One thing I'm not totally happy about is the way page deletions of
> incompletely split pages are handled. Basically, it just bails out and
> refuses to delete a page that is part of an incomplete split. That's
> probably OK in practice, as incomplete splits should be very rare
> anyway, but it's a bit dissatisfying to not handle the case because at
> first glance it seems like it should be even simpler than usual to
> delete a page that has no downlink. Nevertheless, I decided to just skip
> that for now.
>
> After this patch, deleting the only child of a parent and the parent
> itself is still a multi-WAL-record operation that needs to be tracked
> during recovery, and completed at the end of recovery. I'd like to
> eliminate that too, but that's another patch.

Here's a new version of this, which uses a similar technique to handle
page deletions, eliminating the "incomplete action" tracking code
altogether (from btree). When an internal page is marked as half-dead,
its right sibling is atomically marked with a
"left-sibling-is-half-dead" flag. Whenever an insertion encounters a
page with that flag set, it will finish the deletion of the left sibling
before proceeding with the insertion.

This needs a lot more testing, but I wanted to get this out for review,
in case someone sees a fundamental problem with this.

- Heikki

Attachment	Content-Type	Size
btree-incomplete-actions-2.patch	text/x-diff	82.5 KB

From:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To:	PostgreSQL-development <pgsql-hackers(at)postgreSQL(dot)org>
Subject:	Re: Failure while inserting parent tuple to B-tree is not fun
Date:	2013-11-14 17:23:39
Message-ID:	5285071B.1040100@vmware.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 05.11.2013 15:07, Heikki Linnakangas wrote:
> On 25.10.2013 22:13, Heikki Linnakangas wrote:
>> On 22.10.2013 19:55, Heikki Linnakangas wrote:
>>> I fixed the the same problem in GiST a few years back, by making it
>>> tolerate missing downlinks, and inserting them lazily. The B-tree code
>>> tolerates them already on scans, but gets confused on insertion, as seen
>>> above. I propose that we use the same approach I used with GiST, and add
>>> a flag to the page header to indicate "the downlink hasn't been inserted
>>> yet". When insertion (or vacuum) bumps into a flagged page, it can
>>> finish the incomplete action by inserting the downlink.
>>
>> This is what I came up with.
>>
>> One thing I'm not totally happy about is the way page deletions of
>> incompletely split pages are handled. Basically, it just bails out and
>> refuses to delete a page that is part of an incomplete split. That's
>> probably OK in practice, as incomplete splits should be very rare
>> anyway, but it's a bit dissatisfying to not handle the case because at
>> first glance it seems like it should be even simpler than usual to
>> delete a page that has no downlink. Nevertheless, I decided to just skip
>> that for now.
>>
>> After this patch, deleting the only child of a parent and the parent
>> itself is still a multi-WAL-record operation that needs to be tracked
>> during recovery, and completed at the end of recovery. I'd like to
>> eliminate that too, but that's another patch.
>
> Here's a new version of this, which uses a similar technique to handle
> page deletions, eliminating the "incomplete action" tracking code
> altogether (from btree). When an internal page is marked as half-dead,
> its right sibling is atomically marked with a
> "left-sibling-is-half-dead" flag. Whenever an insertion encounters a
> page with that flag set, it will finish the deletion of the left sibling
> before proceeding with the insertion.
>
> This needs a lot more testing, but I wanted to get this out for review,
> in case someone sees a fundamental problem with this.

Ok, here's a new version of the patch to handle incomplete B-tree
splits. I rejected the scheme I outlined above about handling half-dead
pages - instead, see the patch in the "Race condition in b-tree page
deletion" thread:
http://www.postgresql.org/message-id/5283A2FC.6040606@vmware.com. That
approach to handling half-dead pages makes the incomplete-split stuff a
lot easier, as half-dead pages don't need any special treatment wrt.
splits- This patch should be after that patch.

- Heikki

Attachment	Content-Type	Size
btree-incomplete-splits-2.patch	text/x-diff	56.9 KB

From:	Peter Geoghegan <pg(at)heroku(dot)com>
To:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc:	PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Failure while inserting parent tuple to B-tree is not fun
Date:	2014-01-23 21:36:24
Message-ID:	CAM3SWZR-Btfvp6zAKn_BT88TC_9BxMTF0wWshjtq4r3C==W+hA@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Thu, Nov 14, 2013 at 9:23 AM, Heikki Linnakangas
<hlinnakangas(at)vmware(dot)com> wrote:
> Ok, here's a new version of the patch to handle incomplete B-tree splits.

I finally got around to taking a look at this. Unlike with the as-yet
uncommitted "Race condition in b-tree page deletion" patch that Kevin
looked at, which there is a dependency on here, I could not really
consider your patch in isolation. I'm really reviewing both patches,
but with a particular focus on this recent set of additions
(btree-incomplete-splits-2.patch), and the issues addressed by it.
However, since the distinction between this patch and the patch that
it has a dependency on is fuzzy, expect me to be fuzzy in
differentiating the two. This patch is only very loosely an atomic
unit. This is not a criticism - I understand that it just turned out
that way.

The first thing I noticed about this patchset is that it completely
expunges btree_xlog_startup(), btree_xlog_cleanup() and
btree_safe_restartpoint(). The post-recovery cleanup that previously
occurred to address both sets of problems (the problem addressed by
this patch, incomplete page splits, and the problem addressed by the
dependency patch, incomplete page deletions) now both occur
opportunistically/lazily. Notably, this means that there are now
exactly zero entries in the resource manager list with a
'restartpoint' callback. I guess we should consider removing it
entirely for that reason. As you said in the commit message where
gin_safe_restartpoint was similarly retired (commit 631118fe):

"""
The post-recovery cleanup mechanism was never totally reliable, as insertion
to the parent could fail e.g because of running out of memory or disk space,
leaving the tree in an inconsistent state.
"""

I'm of the general impression that post-recovery cleanup is
questionable. I'm surprised that you didn't mention this commit
previously. You just mentioned the original analogous work on GiST,
but this certainly seems to be something you've been thinking about
*systematically* eliminating for a while.

So while post-recovery callbacks no longer exist for any
rmgr-managed-resource, 100% of remaining startup and cleanup callbacks
concern the simple management of memory of AM-specific recovery
contexts (for GiST, GiN and SP-GiST). I have to wonder if there isn't
a better abstraction than that, such as a generic recovery memory
context, allowing you to retire all 3 callbacks. I mean, StartupXLOG()
just calls those callbacks for each resource at exactly the same time
anyway, just as it initializes resource managers in precisely the same
manner earlier on. Plus if you look at what those AM-local memory
management routines do, it all seems very simple.

I think you should bump XLOG_PAGE_MAGIC.

Perhaps you should increase the elevel here:

+ elog(DEBUG1, "finishing incomplete split of %u/%u",
+ BufferGetBlockNumber(lbuf), BufferGetBlockNumber(rbuf));

Since this is a very rare occurrence that involves some subtle
interactions, I can't think why you wouldn't want to LOG this for
forensic purposes.

Why did you remove the local linkage of _bt_walk_left(), given that it
is called in exactly the same way as before? I guess that that is just
a vestige of some earlier revision.

I think I see some bugs in _bt_moveright(). If you examine
_bt_finish_split() in detail, you'll see that it doesn't just drop the
write buffer lock that the caller will have provided (per its
comments) - it also drops the buffer pin. It calls _bt_insert_parent()
last, which was previously only called last thing by _bt_insertonpg()
(some of the time), and _bt_insertonpg() is indeed documented to drop
both the lock and the pin. And if you look at _bt_moveright() again,
you'll see that there is a tacit assumption that the buffer pin isn't
dropped, or at least that "opaque" (the BTPageOpaque BT special page
area pointer) continues to point to the same page held in the same
buffer, even though the code doesn't set the "opaque' pointer again
and it may not point to a pinned buffer or even the appropriate
buffer. Ditto "page". So "opaque" may point to the wrong thing on
subsequent iterations - you don't do anything with the return value of
_bt_getbuf(), unlike all of the existing call sites. I believe you
should store its return value, and get the held page and the special
pointer on the page, and assign that to the "opaque" pointer before
the next iteration (an iteration in which you very probably really do
make progress not limited to completing a page split, the occurrence
of which the _bt_moveright() loop gets "stuck on"). You know, do what
is done in the non-split-handling case. It may not always be the same
buffer as before, obviously.

For a moment I thought that you might have accounted for that here:

>*************** _bt_insert_parent(Relation rel,
>*** 1685,1696 ****
> * 05/27/97
> */
> ItemPointerSet(&(stack->bts_btentry.t_tid), bknum, P_HIKEY);
>-
> pbuf = _bt_getstackbuf(rel, stack, BT_WRITE);
>
>! /* Now we can unlock the children */
> _bt_relbuf(rel, rbuf);
>- _bt_relbuf(rel, buf);
>
> /* Check for error only after writing children */
> if (pbuf == InvalidBuffer)
>--- 1767,1779 ----
> * 05/27/97
> */
> ItemPointerSet(&(stack->bts_btentry.t_tid), bknum, P_HIKEY);
> pbuf = _bt_getstackbuf(rel, stack, BT_WRITE);
>
>! /*
>! * Now we can unlock the right child. The left child will be unlocked
>! * by _bt_insertonpg().
>! */
> _bt_relbuf(rel, rbuf);
>
> /* Check for error only after writing children */
> if (pbuf == InvalidBuffer)

But this is unrelated - the buffer + pin are still dropped by a later
_bt_insertonpg(), as it says right there. Actually, to see that there
is a bug in _bt_moveright() you don't really have to look much further
than _bt_moveright(). So I apologize for taking you on that
unnecessary detour, but FWIW that was my thought process, and I like
to document that for my own benefit. :-)

Do you suppose that there are similar problems in other direct callers
of _bt_finish_split()? I'm thinking in particular of
_bt_findinsertloc().

I'm also not sure about the lock escalation that may occur within
_bt_moveright() for callers with BT_READ access - there is nothing to
prevent a caller from ending up being left with a write lock where
before they only had a read lock if they find that
!P_INCOMPLETE_SPLIT() with the write lock (after lock promotion) and
conclude that there is no split finishing to be done after all. It
looks like all callers currently won't care about that, but it seems
like that should either be prominently documented, or just not occur.
These interactions are complex enough that we ought to be very
explicit about where pins are required and dropped, or locks held
before or after calling.

I suggest you consider refactoring the loop in _bt_moveright() to
reflect these subtleties. I think the structure of that routine could
use some polishing. I'm not sure that I like the way that that and
similar loops get "stuck on" page splits, although no better design
occurs to me right now.

That's all I have for now. I've written plenty of notes, and will work
back through other points of possible concern. I don't suppose you
have any testing infrastructure that you could publish?

--
Peter Geoghegan

From:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To:	Peter Geoghegan <pg(at)heroku(dot)com>
Cc:	PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Failure while inserting parent tuple to B-tree is not fun
Date:	2014-01-27 18:27:38
Message-ID:	52E6A51A.7030006@vmware.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 01/23/2014 11:36 PM, Peter Geoghegan wrote:
> The first thing I noticed about this patchset is that it completely
> expunges btree_xlog_startup(), btree_xlog_cleanup() and
> btree_safe_restartpoint(). The post-recovery cleanup that previously
> occurred to address both sets of problems (the problem addressed by
> this patch, incomplete page splits, and the problem addressed by the
> dependency patch, incomplete page deletions) now both occur
> opportunistically/lazily. Notably, this means that there are now
> exactly zero entries in the resource manager list with a
> 'restartpoint' callback. I guess we should consider removing it
> entirely for that reason. As you said in the commit message where
> gin_safe_restartpoint was similarly retired (commit 631118fe):
>
> """
> The post-recovery cleanup mechanism was never totally reliable, as insertion
> to the parent could fail e.g because of running out of memory or disk space,
> leaving the tree in an inconsistent state.
> """
>
> I'm of the general impression that post-recovery cleanup is
> questionable. I'm surprised that you didn't mention this commit
> previously. You just mentioned the original analogous work on GiST,
> but this certainly seems to be something you've been thinking about
> *systematically* eliminating for a while.

Yes, that's correct, these b-tree patches are part of a grand plan to
eliminate post-recovery cleanup actions altogether.

> So while post-recovery callbacks no longer exist for any
> rmgr-managed-resource, 100% of remaining startup and cleanup callbacks
> concern the simple management of memory of AM-specific recovery
> contexts (for GiST, GiN and SP-GiST). I have to wonder if there isn't
> a better abstraction than that, such as a generic recovery memory
> context, allowing you to retire all 3 callbacks. I mean, StartupXLOG()
> just calls those callbacks for each resource at exactly the same time
> anyway, just as it initializes resource managers in precisely the same
> manner earlier on. Plus if you look at what those AM-local memory
> management routines do, it all seems very simple.
>
> I think you should bump XLOG_PAGE_MAGIC.
>
> Perhaps you should increase the elevel here:
>
> + elog(DEBUG1, "finishing incomplete split of %u/%u",
> + BufferGetBlockNumber(lbuf), BufferGetBlockNumber(rbuf));
>
> Since this is a very rare occurrence that involves some subtle
> interactions, I can't think why you wouldn't want to LOG this for
> forensic purposes.

Hmm. I'm not sure I agree with that line of thinking in general - what
is an admin supposed to do about the message? But there's a precedent in
vacuumlazy.c, which logs a message when it finds all-zero pages in the
heap. So I guess that should be a LOG.

> Why did you remove the local linkage of _bt_walk_left(), given that it
> is called in exactly the same way as before? I guess that that is just
> a vestige of some earlier revision.

Yeah, reverted.

> I think I see some bugs in _bt_moveright(). If you examine
> _bt_finish_split() in detail, you'll see that it doesn't just drop the
> write buffer lock that the caller will have provided (per its
> comments) - it also drops the buffer pin. It calls _bt_insert_parent()
> last, which was previously only called last thing by _bt_insertonpg()
> (some of the time), and _bt_insertonpg() is indeed documented to drop
> both the lock and the pin. And if you look at _bt_moveright() again,
> you'll see that there is a tacit assumption that the buffer pin isn't
> dropped, or at least that "opaque" (the BTPageOpaque BT special page
> area pointer) continues to point to the same page held in the same
> buffer, even though the code doesn't set the "opaque' pointer again
> and it may not point to a pinned buffer or even the appropriate
> buffer. Ditto "page". So "opaque" may point to the wrong thing on
> subsequent iterations - you don't do anything with the return value of
> _bt_getbuf(), unlike all of the existing call sites. I believe you
> should store its return value, and get the held page and the special
> pointer on the page, and assign that to the "opaque" pointer before
> the next iteration (an iteration in which you very probably really do
> make progress not limited to completing a page split, the occurrence
> of which the _bt_moveright() loop gets "stuck on"). You know, do what
> is done in the non-split-handling case. It may not always be the same
> buffer as before, obviously.

Yep, fixed.

> Do you suppose that there are similar problems in other direct callers
> of _bt_finish_split()? I'm thinking in particular of
> _bt_findinsertloc().

Hmm, no, the other callers look OK to me.

> I'm also not sure about the lock escalation that may occur within
> _bt_moveright() for callers with BT_READ access - there is nothing to
> prevent a caller from ending up being left with a write lock where
> before they only had a read lock if they find that
> !P_INCOMPLETE_SPLIT() with the write lock (after lock promotion) and
> conclude that there is no split finishing to be done after all. It
> looks like all callers currently won't care about that, but it seems
> like that should either be prominently documented, or just not occur.
> These interactions are complex enough that we ought to be very
> explicit about where pins are required and dropped, or locks held
> before or after calling.

I haven't looked into this in detail yet, but I admit I don't much like
the _bt_moveright() function signature. It's strange to pass 'access'
and 'forupdate' as separate arguments, and it's not totally clear what
combinations make sense and what they mean. Some kind of refactoring is
probably in order.

Here's a new version, rebased over the latest page-deletion patch,
fix-btree-page-deletion-3.patch
(http://www.postgresql.org/message-id/52E66E84.2050109@vmware.com). I
haven't tested this extensively, but passes "make check"...

- Heikki

Attachment	Content-Type	Size
btree-incomplete-split-3.patch	text/x-diff	45.9 KB

From:	Peter Geoghegan <pg(at)heroku(dot)com>
To:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc:	PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Failure while inserting parent tuple to B-tree is not fun
Date:	2014-01-27 18:54:13
Message-ID:	CAM3SWZRdt=od2uyLorBS-6t7kRDZbNgF7V4YXWg527y6ZRbXNw@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Mon, Jan 27, 2014 at 10:27 AM, Heikki Linnakangas
<hlinnakangas(at)vmware(dot)com> wrote:
>> I think I see some bugs in _bt_moveright(). If you examine
>> _bt_finish_split() in detail, you'll see that it doesn't just drop the
>> write buffer lock that the caller will have provided (per its
>> comments) - it also drops the buffer pin. It calls _bt_insert_parent()
>> last, which was previously only called last thing by _bt_insertonpg()
>> (some of the time), and _bt_insertonpg() is indeed documented to drop
>> both the lock and the pin. And if you look at _bt_moveright() again,
>> you'll see that there is a tacit assumption that the buffer pin isn't
>> dropped, or at least that "opaque" (the BTPageOpaque BT special page
>> area pointer) continues to point to the same page held in the same
>> buffer, even though the code doesn't set the "opaque' pointer again
>> and it may not point to a pinned buffer or even the appropriate
>> buffer. Ditto "page". So "opaque" may point to the wrong thing on
>> subsequent iterations - you don't do anything with the return value of
>> _bt_getbuf(), unlike all of the existing call sites. I believe you
>> should store its return value, and get the held page and the special
>> pointer on the page, and assign that to the "opaque" pointer before
>> the next iteration (an iteration in which you very probably really do
>> make progress not limited to completing a page split, the occurrence
>> of which the _bt_moveright() loop gets "stuck on"). You know, do what
>> is done in the non-split-handling case. It may not always be the same
>> buffer as before, obviously.
>
>
> Yep, fixed.

Can you explain what the fix was, please?

--
Peter Geoghegan

From:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To:	Peter Geoghegan <pg(at)heroku(dot)com>
Cc:	PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Failure while inserting parent tuple to B-tree is not fun
Date:	2014-01-27 18:58:57
Message-ID:	52E6AC71.9010905@vmware.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 01/23/2014 11:36 PM, Peter Geoghegan wrote:
> That's all I have for now. I've written plenty of notes, and will work
> back through other points of possible concern. I don't suppose you
> have any testing infrastructure that you could publish?

Okay, promise not to laugh. I did write a bunch of hacks, to generate
graphviz .dot files from the btree pages, and render them into pictures.
It consist of multiple parts, all in the attached tarball. Here's how to
use it:

After installing all the parts (instructions below), launch the
btree-snapshot.sh script. It will poll every seconds, and generate a
.dot graph out of an index called 'i_foo'. It compares the snapshot with
the previous one, and if it differs, it generates a new .png file from
it under /tmp.

With that, you can get a slideshow of how the index changes, when you
execute commands that modify it. However, because it only polls once per
second, you'll have to insert some sleeps into the code you're testing,
so that the script can catch the changes in action. See attached
nbtinsert-sleeps.patch.

For example:

create table foo (t text);
create index i_foo on foo (t);

-- launch btree-snapshot.sh in another terminal

insert into foo select 'aaa' || g from generate_series(1, 10000) g;

Once that finishes, the btree-snapshot.sh script should've generated a
bunch of .png files in /tmp/:

~$ ls /tmp/*.png
/tmp/g1.png /tmp/g3.png /tmp/g5.png /tmp/g7.png
/tmp/g2.png /tmp/g4.png /tmp/g6.png /tmp/g8.png

Each shows the structure of the tree, after something changed. It shows
how a page is split, and then the downlink to it is inserted into the
parent as a separate step. I've attached those files in
graphs-example.tar. I view them with "eog /tmp/g*.png", it lets you flip
through the pictures easily.

To install this hack, do the following:

1. Patch pageinspect contrib module to not lock the pages while it looks
at them. (otherwise you won't be able to snapshot transient states where
a backend is holding pages locked)

2. Install extensions pageinspect and pgstattuple:

create extension pageinspect;
create extension pgstattuple;

3. Run btree-graphviz2.sql. It creates a bunch of functions.

That's it. Have fun :-)

- Heikki

Attachment	Content-Type	Size
btree-graph-hack.tar	application/x-tar	10.0 KB
nbtinsert-sleeps.patch	text/x-diff	605 bytes
graphs-example.tar	application/x-tar	70.0 KB

From:	Peter Geoghegan <pg(at)heroku(dot)com>
To:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc:	PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Failure while inserting parent tuple to B-tree is not fun
Date:	2014-01-27 19:55:42
Message-ID:	CAM3SWZT1DjNjK14GphpPiw1QSEq+WJBqParj9be=tA+A+LN4NQ@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Mon, Jan 27, 2014 at 10:58 AM, Heikki Linnakangas
<hlinnakangas(at)vmware(dot)com> wrote:
> Okay, promise not to laugh. I did write a bunch of hacks, to generate
> graphviz .dot files from the btree pages, and render them into pictures. It
> consist of multiple parts, all in the attached tarball.

It's funny that you should say that, because I was thinking about
building something similar over the past few days. I find it very
useful to build ad-hoc tools like that, and I thought that it would be
particularly useful to have something visual for testing btree code.
Sure, you can come up with a test and keep the details straight in
your head while coordinating everything, but that won't scale as new
testing scenarios inevitably occur to you. I've done plenty of things
with contrib/pageinspect along these lines in the past, but this is
better. Anything that reduces the friction involved in building an
accurate mental model of things is a very good thing.

--
Peter Geoghegan

From:	Peter Geoghegan <pg(at)heroku(dot)com>
To:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc:	PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Failure while inserting parent tuple to B-tree is not fun
Date:	2014-01-29 23:46:08
Message-ID:	CAM3SWZRBwB=7mwwRM=OeDE1s6RGxy6m1KaHjp7YcfGwRetQOOA@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Mon, Jan 27, 2014 at 10:54 AM, Peter Geoghegan <pg(at)heroku(dot)com> wrote:
> On Mon, Jan 27, 2014 at 10:27 AM, Heikki Linnakangas
> <hlinnakangas(at)vmware(dot)com> wrote:
>>> I think I see some bugs in _bt_moveright(). If you examine
>>> _bt_finish_split() in detail, you'll see that it doesn't just drop the
>>> write buffer lock that the caller will have provided (per its
>>> comments) - it also drops the buffer pin. It calls _bt_insert_parent()
>>> last, which was previously only called last thing by _bt_insertonpg()
>>> (some of the time), and _bt_insertonpg() is indeed documented to drop
>>> both the lock and the pin. And if you look at _bt_moveright() again,
>>> you'll see that there is a tacit assumption that the buffer pin isn't
>>> dropped, or at least that "opaque" (the BTPageOpaque BT special page
>>> area pointer) continues to point to the same page held in the same
>>> buffer, even though the code doesn't set the "opaque' pointer again
>>> and it may not point to a pinned buffer or even the appropriate
>>> buffer. Ditto "page". So "opaque" may point to the wrong thing on
>>> subsequent iterations - you don't do anything with the return value of
>>> _bt_getbuf(), unlike all of the existing call sites. I believe you
>>> should store its return value, and get the held page and the special
>>> pointer on the page, and assign that to the "opaque" pointer before
>>> the next iteration (an iteration in which you very probably really do
>>> make progress not limited to completing a page split, the occurrence
>>> of which the _bt_moveright() loop gets "stuck on"). You know, do what
>>> is done in the non-split-handling case. It may not always be the same
>>> buffer as before, obviously.
>>
>>
>> Yep, fixed.
>
> Can you explain what the fix was, please?

Ping? I would like to hear some details here.

--
Peter Geoghegan

From:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To:	Peter Geoghegan <pg(at)heroku(dot)com>
Cc:	PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Failure while inserting parent tuple to B-tree is not fun
Date:	2014-01-31 17:09:00
Message-ID:	52EBD8AC.8070609@vmware.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 01/30/2014 12:46 AM, Peter Geoghegan wrote:
> On Mon, Jan 27, 2014 at 10:54 AM, Peter Geoghegan <pg(at)heroku(dot)com> wrote:
>> On Mon, Jan 27, 2014 at 10:27 AM, Heikki Linnakangas
>> <hlinnakangas(at)vmware(dot)com> wrote:
>>>> I think I see some bugs in _bt_moveright(). If you examine
>>>> _bt_finish_split() in detail, you'll see that it doesn't just drop the
>>>> write buffer lock that the caller will have provided (per its
>>>> comments) - it also drops the buffer pin. It calls _bt_insert_parent()
>>>> last, which was previously only called last thing by _bt_insertonpg()
>>>> (some of the time), and _bt_insertonpg() is indeed documented to drop
>>>> both the lock and the pin. And if you look at _bt_moveright() again,
>>>> you'll see that there is a tacit assumption that the buffer pin isn't
>>>> dropped, or at least that "opaque" (the BTPageOpaque BT special page
>>>> area pointer) continues to point to the same page held in the same
>>>> buffer, even though the code doesn't set the "opaque' pointer again
>>>> and it may not point to a pinned buffer or even the appropriate
>>>> buffer. Ditto "page". So "opaque" may point to the wrong thing on
>>>> subsequent iterations - you don't do anything with the return value of
>>>> _bt_getbuf(), unlike all of the existing call sites. I believe you
>>>> should store its return value, and get the held page and the special
>>>> pointer on the page, and assign that to the "opaque" pointer before
>>>> the next iteration (an iteration in which you very probably really do
>>>> make progress not limited to completing a page split, the occurrence
>>>> of which the _bt_moveright() loop gets "stuck on"). You know, do what
>>>> is done in the non-split-handling case. It may not always be the same
>>>> buffer as before, obviously.
>>>
>>> Yep, fixed.
>>
>> Can you explain what the fix was, please?
>
> Ping? I would like to hear some details here.

I refactored the loop in _bt_moveright to, well, not have that bug
anymore. The 'page' and 'opaque' pointers are now fetched at the
beginning of the loop. Did I miss something?

- Heikki

From:	Peter Geoghegan <pg(at)heroku(dot)com>
To:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc:	PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Failure while inserting parent tuple to B-tree is not fun
Date:	2014-02-04 00:40:33
Message-ID:	CAM3SWZTORu0L2AdUGSk9KMOtW3vCqCb14vjfDTHTwYEracYcQA@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Fri, Jan 31, 2014 at 9:09 AM, Heikki Linnakangas
<hlinnakangas(at)vmware(dot)com> wrote:
> I refactored the loop in _bt_moveright to, well, not have that bug anymore.
> The 'page' and 'opaque' pointers are now fetched at the beginning of the
> loop. Did I miss something?

I think so, yes. You still aren't assigning the value returned by
_bt_getbuf() to 'buf'. Since, as I mentioned, _bt_finish_split()
ultimately unlocks *and unpins*, it may not be the same buffer as
before, so even with the refactoring there are race conditions. A
closely related issue is that you haven't mentioned anything about
buffer pins/refcount side effects in comments above
_bt_finish_split(), even though I believe you should.

A minor stylistic concern is that I think it would be better to only
have one pair of _bt_finish_split()/_bt_getbuf() calls regardless of
the initial value of 'access'.

--
Peter Geoghegan

From:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To:	Peter Geoghegan <pg(at)heroku(dot)com>
Cc:	PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Failure while inserting parent tuple to B-tree is not fun
Date:	2014-02-05 07:56:42
Message-ID:	52F1EEBA.9070006@vmware.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 02/04/2014 02:40 AM, Peter Geoghegan wrote:
> On Fri, Jan 31, 2014 at 9:09 AM, Heikki Linnakangas
> <hlinnakangas(at)vmware(dot)com> wrote:
>> I refactored the loop in _bt_moveright to, well, not have that bug anymore.
>> The 'page' and 'opaque' pointers are now fetched at the beginning of the
>> loop. Did I miss something?
>
> I think so, yes. You still aren't assigning the value returned by
> _bt_getbuf() to 'buf'.

D'oh, you're right.

> Since, as I mentioned, _bt_finish_split() ultimately unlocks *and
> unpins*, it may not be the same buffer as before, so even with the
> refactoring there are race conditions.

Care to elaborate? Or are you just referring to the missing "buf = " ?

> A closely related issue is that you haven't mentioned anything about
> buffer pins/refcount side effects in comments above
> _bt_finish_split(), even though I believe you should.

Ok.

> A minor stylistic concern is that I think it would be better to only
> have one pair of _bt_finish_split()/_bt_getbuf() calls regardless of
> the initial value of 'access'.

Ok.

I also changed _bt_moveright to never return a write-locked buffer, when
the caller asked for a read-lock (an issue you pointed out earlier in
this thread).

Attached is a new version of the patch, with those issues fixed.
btree-incomplete-split-4.patch is a complete patch against the latest
fix-btree-page-deletion patch, and moveright-assign-fix.patch is just
the changes to _bt_moveright, if you want to review just the changes
since the previous patch I posted.

- Heikki

Attachment	Content-Type	Size
btree-incomplete-split-4.patch	text/x-diff	45.9 KB
moveright-assign-fix.patch	text/x-diff	1.8 KB

From:	Peter Geoghegan <pg(at)heroku(dot)com>
To:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc:	PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Failure while inserting parent tuple to B-tree is not fun
Date:	2014-02-05 08:06:33
Message-ID:	CAM3SWZT9+wjWwoynEApbMOCiBHBGEk+9jypu=b9xTUZOUxyO+w@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Tue, Feb 4, 2014 at 11:56 PM, Heikki Linnakangas
<hlinnakangas(at)vmware(dot)com> wrote:
>> Since, as I mentioned, _bt_finish_split() ultimately unlocks *and
>> unpins*, it may not be the same buffer as before, so even with the
>> refactoring there are race conditions.
>
> Care to elaborate? Or are you just referring to the missing "buf = " ?

Yes, that's all I meant.

> Attached is a new version of the patch, with those issues fixed.
> btree-incomplete-split-4.patch is a complete patch against the latest
> fix-btree-page-deletion patch, and moveright-assign-fix.patch is just the
> changes to _bt_moveright, if you want to review just the changes since the
> previous patch I posted.

Cool. I'll take a good look at it tomorrow morning PST.

--
Peter Geoghegan

From:	Peter Geoghegan <pg(at)heroku(dot)com>
To:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc:	PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Failure while inserting parent tuple to B-tree is not fun
Date:	2014-02-05 23:54:33
Message-ID:	CAM3SWZS-L_r2PFo1e5OwL4SYrVuha6+DrxYwipW=nNsgTrrYYA@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Thu, Jan 23, 2014 at 1:36 PM, Peter Geoghegan <pg(at)heroku(dot)com> wrote:
> So while post-recovery callbacks no longer exist for any
> rmgr-managed-resource, 100% of remaining startup and cleanup callbacks
> concern the simple management of memory of AM-specific recovery
> contexts (for GiST, GiN and SP-GiST). I have to wonder if there isn't
> a better abstraction than that, such as a generic recovery memory
> context, allowing you to retire all 3 callbacks. I mean, StartupXLOG()
> just calls those callbacks for each resource at exactly the same time
> anyway, just as it initializes resource managers in precisely the same
> manner earlier on. Plus if you look at what those AM-local memory
> management routines do, it all seems very simple.

What are your thoughts on this, as someone that has a broader
perspective here? Are you inclined to keep the startup and cleanup
callbacks in anticipation of a day when that degree of generality is
useful? That would be pretty well-precedented of course, but I would
like to hear your opinion.

--
Peter Geoghegan

From:	Peter Geoghegan <pg(at)heroku(dot)com>
To:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc:	PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Failure while inserting parent tuple to B-tree is not fun
Date:	2014-02-06 00:52:38
Message-ID:	CAM3SWZSecnGCP2WpYuEKt_0knTicGSZ8eGP_OX9YknE3Kt4JGw@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Tue, Feb 4, 2014 at 11:56 PM, Heikki Linnakangas
<hlinnakangas(at)vmware(dot)com> wrote:
> I also changed _bt_moveright to never return a write-locked buffer, when the
> caller asked for a read-lock (an issue you pointed out earlier in this
> thread).

I think that _bt_moveright() looks good now.

There is now bitrot, caused by commit ac8bc3. I rebased myself, but
that was non-trivial, surprisingly; I had to tweak by hand.

--
Peter Geoghegan

From:	Peter Geoghegan <pg(at)heroku(dot)com>
To:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc:	PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Failure while inserting parent tuple to B-tree is not fun
Date:	2014-02-06 04:42:40
Message-ID:	CAM3SWZRGBXTYNVuAx3byQ0bo8J6HsAOx8t47bt61By1GTsxk3Q@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Some more thoughts:

Please add comments above _bt_mark_page_halfdead(), a new routine from
the dependency patch. I realize that this is substantially similar to
part of how _bt_pagedel() used to work, but it's still incongruous.

> ! Our approach is to create any missing downlinks on-they-fly, when
> ! searching the tree for a new insertion. It could be done during searches,
> ! too, but it seems best not to put any extra updates in what would otherwise

s/on-they-fly/on-the-fly/

I'm not sure about this:

> *************** _bt_findinsertloc(Relation rel,
> *** 675,680 ****
> --- 701,707 ----
> static void
> _bt_insertonpg(Relation rel,
> Buffer buf,
> + Buffer cbuf,
> BTStack stack,
> IndexTuple itup,
> OffsetNumber newitemoff,

Wouldn't lcbuf be a better name for the new argument? After all, in
relevant contexts 'buf' is the parent of both of the pages post-split,
but there are two children in play here. You say:

>+ * When inserting to a non-leaf page, 'cbuf' is the left-sibling of the
>+ * page we're inserting the downlink for. This function will clear the
>+ * INCOMPLETE_SPLIT flag on it, and release the buffer.

I suggest also noting in comments at this point that cbuf/the
left-sibling is the "old half" from the split.

Even having vented about cbuf's name, I can kind of see why you didn't
call it lcbuf: we already have an "BTPageOpaque lpageop" variable for
the special 'buf' metadata within the _bt_insertonpg() function; so
there might be an ambiguity between the possibly-soon-to-be-left page
(the target page) and the *child* left page that needs to have its
flag cleared. Although I still think you should change the name of
"lpageop" (further details follow).

Consider this:

> /* release buffers */
>+ if (!P_ISLEAF(lpageop))
>+ _bt_relbuf(rel, cbuf);

(Rebased, so this looks a bit different to your original version IIRC).

This is checking that the _bt_insertonpg() target page (whose special
data lpageop points to) is not a leaf page, and releasing the *child*
left page if it isn't. This is pretty confusing. So I suggest naming
"lpageop" "tpageop" ('t' for target, a terminology already used in the
comments above the function).

Also, I suggest you change this existing code comment:

* On entry, we must have the right buffer in which to do the
* insertion, and the buffer must be pinned and write-locked. On return,
* we will have dropped both the pin and the lock on the buffer.

Change "right" to "correct" here. Minor point, but "right" is a loaded word.

But why are you doing new things while checking P_ISLEAF(lpageop) in
_bt_insertonpg() to begin with? Would you not be better off doing
things when there is a child buffer passed (e.g. "if
(BufferIsValid(cbuf))..."), and only then asserting
!P_ISLEAF(lpageop)? That seems to me to more naturally establish the
context of "_bt_insertonpg() is called to insert downlink after
split". Although maybe that conflates things within "XLOG stuff". Hmm.

Perhaps some of these observations could equally well apply to
_bt_split(), which is similarly charged with releasing a left child
page on inserting a downlink. Particularly around reconsidering the
"left vs left child vs target" terminology.

Consider what happens when _bt_split() is passed a (left) child buffer
(a 'cbuf' argument). We are:

1. Splitting a page (first phase, which may only be finished lazily
some time later).

2. Iff this is a non-leaf page split, simultaneously unmark the flag
from some *other* split which we have a child buffer to unmark. Needs
to be part of same critical section. Unlock + release cbuf, again only
iff target/right split page contains a leaf page.

So, at the risk of belaboring the point:

* Some callers to _bt_split() and _bt_insertonpg() pass a 'cbuf' that
is not invalid.

* If either of those 2 functions don't unlock + release those buffers,
no one ever will.

* Therefore, at the very least we should unlock + release those
buffers **just because they're not invalid**. That's a sufficient
condition to do so, and attaching that to "level" is unnecessarily
unclear. As I said, assert non-leafness.

Incidentally, I think that in general we could be clearer on the fact
that a root page may also be a leaf page.
--
Peter Geoghegan

From:	Peter Geoghegan <pg(at)heroku(dot)com>
To:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc:	PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Failure while inserting parent tuple to B-tree is not fun
Date:	2014-03-14 11:03:56
Message-ID:	CAM3SWZQnbyYT2OGpZFtoRDuAFjmjdBbzBwUsc_fEytkTShUT8A@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Ping?

--
Peter Geoghegan

From:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To:	Peter Geoghegan <pg(at)heroku(dot)com>
Cc:	PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Failure while inserting parent tuple to B-tree is not fun
Date:	2014-03-14 16:44:05
Message-ID:	532331D5.3020507@vmware.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 03/14/2014 01:03 PM, Peter Geoghegan wrote:
> Ping?

I committed the other patch this depends on now. I'll take another stab
at this one next.

- Heikki

From:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To:	Peter Geoghegan <pg(at)heroku(dot)com>
Cc:	PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Failure while inserting parent tuple to B-tree is not fun
Date:	2014-03-18 18:54:52
Message-ID:	5328967C.5040706@vmware.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 02/06/2014 06:42 AM, Peter Geoghegan wrote:
> I'm not sure about this:
>
>> *************** _bt_findinsertloc(Relation rel,
>> *** 675,680 ****
>> --- 701,707 ----
>> static void
>> _bt_insertonpg(Relation rel,
>> Buffer buf,
>> + Buffer cbuf,
>> BTStack stack,
>> IndexTuple itup,
>> OffsetNumber newitemoff,
>
> Wouldn't lcbuf be a better name for the new argument? After all, in
> relevant contexts 'buf' is the parent of both of the pages post-split,
> but there are two children in play here. You say:
>
>> + * When inserting to a non-leaf page, 'cbuf' is the left-sibling of the
>> + * page we're inserting the downlink for. This function will clear the
>> + * INCOMPLETE_SPLIT flag on it, and release the buffer.
>
> I suggest also noting in comments at this point that cbuf/the
> left-sibling is the "old half" from the split.
>
> Even having vented about cbuf's name, I can kind of see why you didn't
> call it lcbuf: we already have an "BTPageOpaque lpageop" variable for
> the special 'buf' metadata within the _bt_insertonpg() function; so
> there might be an ambiguity between the possibly-soon-to-be-left page
> (the target page) and the *child* left page that needs to have its
> flag cleared. Although I still think you should change the name of
> "lpageop" (further details follow).
>
> Consider this:
>
>> /* release buffers */
>> + if (!P_ISLEAF(lpageop))
>> + _bt_relbuf(rel, cbuf);
>
> (Rebased, so this looks a bit different to your original version IIRC).
>
> This is checking that the _bt_insertonpg() target page (whose special
> data lpageop points to) is not a leaf page, and releasing the *child*
> left page if it isn't. This is pretty confusing. So I suggest naming
> "lpageop" "tpageop" ('t' for target, a terminology already used in the
> comments above the function).

I don't know, the buf/page/lpageop variable names are used in many
functions in nbtinsert.c. Perhaps they should all be renamed, but I
think it's fine as it is, and not this patch's responsibility anyway.
(The lpageop name makes sense when the page has to be split, and the
page becomes the left page. Otherwise, admittedly, not so much)

> Also, I suggest you change this existing code comment:
>
> * On entry, we must have the right buffer in which to do the
> * insertion, and the buffer must be pinned and write-locked. On return,
> * we will have dropped both the pin and the lock on the buffer.
>
> Change "right" to "correct" here. Minor point, but "right" is a loaded word.

Fixed.

> But why are you doing new things while checking P_ISLEAF(lpageop) in
> _bt_insertonpg() to begin with? Would you not be better off doing
> things when there is a child buffer passed (e.g. "if
> (BufferIsValid(cbuf))..."), and only then asserting
> !P_ISLEAF(lpageop)? That seems to me to more naturally establish the
> context of "_bt_insertonpg() is called to insert downlink after
> split". Although maybe that conflates things within "XLOG stuff". Hmm.

Changed that way for the places where we deal with the incomplete-split
flag.

I committed the patch now. Thanks for the review!

Before committing, I spotted a bug in the way the new page-deletion code
interacts with this: there was a check that the page that's about to be
deleted was not marked with the INCOMPLETE_SPLIT flag, it was possible
that the page was the right half of an incomplete split, and so didn't
have a downlink. Vacuuming that failed with an "failed to re-find parent
key" error. I fixed that by checking that the left sibling is also not
marked with the flag.

It would be fairly easy to delete a page that is the right half of an
incomplete split. Such a page doesn't have a downlink pointing to it, so
just skip removing it, and instead clear the INCOMPLETE_SPLIT flag in
the left sibling. But deleting incompletely split pages isn't important
from a disk-space point of view, they should be extremely rare, so I
decided to just punt.

- Heikki

From:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To:	Peter Geoghegan <pg(at)heroku(dot)com>
Cc:	PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Failure while inserting parent tuple to B-tree is not fun
Date:	2014-03-18 20:34:50
Message-ID:	5328ADEA.2070800@vmware.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 02/06/2014 01:54 AM, Peter Geoghegan wrote:
> On Thu, Jan 23, 2014 at 1:36 PM, Peter Geoghegan <pg(at)heroku(dot)com> wrote:
>> So while post-recovery callbacks no longer exist for any
>> rmgr-managed-resource, 100% of remaining startup and cleanup callbacks
>> concern the simple management of memory of AM-specific recovery
>> contexts (for GiST, GiN and SP-GiST). I have to wonder if there isn't
>> a better abstraction than that, such as a generic recovery memory
>> context, allowing you to retire all 3 callbacks. I mean, StartupXLOG()
>> just calls those callbacks for each resource at exactly the same time
>> anyway, just as it initializes resource managers in precisely the same
>> manner earlier on. Plus if you look at what those AM-local memory
>> management routines do, it all seems very simple.
>
> What are your thoughts on this, as someone that has a broader
> perspective here? Are you inclined to keep the startup and cleanup
> callbacks in anticipation of a day when that degree of generality is
> useful? That would be pretty well-precedented of course, but I would
> like to hear your opinion.

So, I just removed the rm_safe_restartpoint callback, as that's clearly
dead now and I would complain loudly if someone tried to add a resource
manager that would need it again.

Yeah, it's a bit silly that each resource manager has to do that on
their own. It would be useful to have a memory context that was
automatically reset between each WAL record. In fact that should
probably be the default memory context you run the WAL redo routines in.

But even if we do that, I'm not in a hurry to remove rm_startup/cleanup.
They seem generally useful, even if they're not actually used for much
anymore.

- Heikki

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc:	Peter Geoghegan <pg(at)heroku(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Failure while inserting parent tuple to B-tree is not fun
Date:	2014-03-18 20:47:12
Message-ID:	27241.1395175632@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Heikki Linnakangas <hlinnakangas(at)vmware(dot)com> writes:
> Yeah, it's a bit silly that each resource manager has to do that on
> their own. It would be useful to have a memory context that was
> automatically reset between each WAL record. In fact that should
> probably be the default memory context you run the WAL redo routines in.

> But even if we do that, I'm not in a hurry to remove rm_startup/cleanup.
> They seem generally useful, even if they're not actually used for much
> anymore.

Agreed; they aren't hurting us and they could be useful.

regards, tom lane