Re: Failure while inserting parent tuple to B-tree is not fun

From: Peter Geoghegan <pg(at)heroku(dot)com>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Failure while inserting parent tuple to B-tree is not fun
Date: 2014-01-23 21:36:24
Message-ID: CAM3SWZR-Btfvp6zAKn_BT88TC_9BxMTF0wWshjtq4r3C==W+hA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Nov 14, 2013 at 9:23 AM, Heikki Linnakangas
<hlinnakangas(at)vmware(dot)com> wrote:
> Ok, here's a new version of the patch to handle incomplete B-tree splits.

I finally got around to taking a look at this. Unlike with the as-yet
uncommitted "Race condition in b-tree page deletion" patch that Kevin
looked at, which there is a dependency on here, I could not really
consider your patch in isolation. I'm really reviewing both patches,
but with a particular focus on this recent set of additions
(btree-incomplete-splits-2.patch), and the issues addressed by it.
However, since the distinction between this patch and the patch that
it has a dependency on is fuzzy, expect me to be fuzzy in
differentiating the two. This patch is only very loosely an atomic
unit. This is not a criticism - I understand that it just turned out
that way.

The first thing I noticed about this patchset is that it completely
expunges btree_xlog_startup(), btree_xlog_cleanup() and
btree_safe_restartpoint(). The post-recovery cleanup that previously
occurred to address both sets of problems (the problem addressed by
this patch, incomplete page splits, and the problem addressed by the
dependency patch, incomplete page deletions) now both occur
opportunistically/lazily. Notably, this means that there are now
exactly zero entries in the resource manager list with a
'restartpoint' callback. I guess we should consider removing it
entirely for that reason. As you said in the commit message where
gin_safe_restartpoint was similarly retired (commit 631118fe):

"""
The post-recovery cleanup mechanism was never totally reliable, as insertion
to the parent could fail e.g because of running out of memory or disk space,
leaving the tree in an inconsistent state.
"""

I'm of the general impression that post-recovery cleanup is
questionable. I'm surprised that you didn't mention this commit
previously. You just mentioned the original analogous work on GiST,
but this certainly seems to be something you've been thinking about
*systematically* eliminating for a while.

So while post-recovery callbacks no longer exist for any
rmgr-managed-resource, 100% of remaining startup and cleanup callbacks
concern the simple management of memory of AM-specific recovery
contexts (for GiST, GiN and SP-GiST). I have to wonder if there isn't
a better abstraction than that, such as a generic recovery memory
context, allowing you to retire all 3 callbacks. I mean, StartupXLOG()
just calls those callbacks for each resource at exactly the same time
anyway, just as it initializes resource managers in precisely the same
manner earlier on. Plus if you look at what those AM-local memory
management routines do, it all seems very simple.

I think you should bump XLOG_PAGE_MAGIC.

Perhaps you should increase the elevel here:

+ elog(DEBUG1, "finishing incomplete split of %u/%u",
+ BufferGetBlockNumber(lbuf), BufferGetBlockNumber(rbuf));

Since this is a very rare occurrence that involves some subtle
interactions, I can't think why you wouldn't want to LOG this for
forensic purposes.

Why did you remove the local linkage of _bt_walk_left(), given that it
is called in exactly the same way as before? I guess that that is just
a vestige of some earlier revision.

I think I see some bugs in _bt_moveright(). If you examine
_bt_finish_split() in detail, you'll see that it doesn't just drop the
write buffer lock that the caller will have provided (per its
comments) - it also drops the buffer pin. It calls _bt_insert_parent()
last, which was previously only called last thing by _bt_insertonpg()
(some of the time), and _bt_insertonpg() is indeed documented to drop
both the lock and the pin. And if you look at _bt_moveright() again,
you'll see that there is a tacit assumption that the buffer pin isn't
dropped, or at least that "opaque" (the BTPageOpaque BT special page
area pointer) continues to point to the same page held in the same
buffer, even though the code doesn't set the "opaque' pointer again
and it may not point to a pinned buffer or even the appropriate
buffer. Ditto "page". So "opaque" may point to the wrong thing on
subsequent iterations - you don't do anything with the return value of
_bt_getbuf(), unlike all of the existing call sites. I believe you
should store its return value, and get the held page and the special
pointer on the page, and assign that to the "opaque" pointer before
the next iteration (an iteration in which you very probably really do
make progress not limited to completing a page split, the occurrence
of which the _bt_moveright() loop gets "stuck on"). You know, do what
is done in the non-split-handling case. It may not always be the same
buffer as before, obviously.

For a moment I thought that you might have accounted for that here:

>*************** _bt_insert_parent(Relation rel,
>*** 1685,1696 ****
> * 05/27/97
> */
> ItemPointerSet(&(stack->bts_btentry.t_tid), bknum, P_HIKEY);
>-
> pbuf = _bt_getstackbuf(rel, stack, BT_WRITE);
>
>! /* Now we can unlock the children */
> _bt_relbuf(rel, rbuf);
>- _bt_relbuf(rel, buf);
>
> /* Check for error only after writing children */
> if (pbuf == InvalidBuffer)
>--- 1767,1779 ----
> * 05/27/97
> */
> ItemPointerSet(&(stack->bts_btentry.t_tid), bknum, P_HIKEY);
> pbuf = _bt_getstackbuf(rel, stack, BT_WRITE);
>
>! /*
>! * Now we can unlock the right child. The left child will be unlocked
>! * by _bt_insertonpg().
>! */
> _bt_relbuf(rel, rbuf);
>
> /* Check for error only after writing children */
> if (pbuf == InvalidBuffer)

But this is unrelated - the buffer + pin are still dropped by a later
_bt_insertonpg(), as it says right there. Actually, to see that there
is a bug in _bt_moveright() you don't really have to look much further
than _bt_moveright(). So I apologize for taking you on that
unnecessary detour, but FWIW that was my thought process, and I like
to document that for my own benefit. :-)

Do you suppose that there are similar problems in other direct callers
of _bt_finish_split()? I'm thinking in particular of
_bt_findinsertloc().

I'm also not sure about the lock escalation that may occur within
_bt_moveright() for callers with BT_READ access - there is nothing to
prevent a caller from ending up being left with a write lock where
before they only had a read lock if they find that
!P_INCOMPLETE_SPLIT() with the write lock (after lock promotion) and
conclude that there is no split finishing to be done after all. It
looks like all callers currently won't care about that, but it seems
like that should either be prominently documented, or just not occur.
These interactions are complex enough that we ought to be very
explicit about where pins are required and dropped, or locks held
before or after calling.

I suggest you consider refactoring the loop in _bt_moveright() to
reflect these subtleties. I think the structure of that routine could
use some polishing. I'm not sure that I like the way that that and
similar loops get "stuck on" page splits, although no better design
occurs to me right now.

That's all I have for now. I've written plenty of notes, and will work
back through other points of possible concern. I don't suppose you
have any testing infrastructure that you could publish?

--
Peter Geoghegan

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Mark Kirkwood 2014-01-23 21:41:31 Re: Why do we let autovacuum give up?
Previous Message Alvaro Herrera 2014-01-23 21:34:15 Re: [bug fix] pg_ctl always uses the same event source