Re: Race condition in b-tree page deletion

From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Race condition in b-tree page deletion
Date: 2013-11-13 16:04:12
Message-ID: 5283A2FC.6040606@vmware.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 12.11.2013 17:13, Robert Haas wrote:
> On Mon, Nov 11, 2013 at 3:03 AM, Heikki Linnakangas
> <hlinnakangas(at)vmware(dot)com> wrote:
>> All that is left is to unlink the all the lingering pages in the branch
>> we're deleting from their left and right siblings. This can be done at any
>> later time, and if we error out or crash for some reason, next vacuum that
>> comes along can finish the job. This is done one level at a time. Lock the
>> leaf page, and the internal page the leaf page points to, and the internal
>> page's left and right siblings (in the right order, not this order). Change
>> the left and right sibling's right- and left-links, mark the internal page
>> as deleted, and update the pointer in the leaf page to point to the child of
>> the deleted internal page. Then recurse to the child, until we reach the
>> leaf level.
>>
>> This has the nice extra property that we don't need the incomplete-action
>> tracking in WAL recovery. I'd like to get rid of that anyway.
>
> I am not very knowledgeable of this code, but at least with my limited
> knowledge I can't spot any problems with that approach. The only
> thing that seems a little unfortunate is that the next vacuum may need
> to finish cleaning things up; I thought at one point you were trying
> to get rid of the amvacuumcleanup stuff. But maybe I'm
> misremembering, and anyway it still seems like a step forward.

Ah no, it only requires the next vacuum to clean up, if the first vacuum
is interrupted for some reason. Normally, all of the steps are performed
one after another in the same vacuum process, and half-dead pages will
only be seen by backends that look at the affected pages at the same
instant. But in case of an error or crash, the next vacuum will pick up
where the previous one left off.

Here's a patch implementing this scheme. One noteworthy detail if you're
familiar with the old scheme: in this patch, only leaf pages are marked
as half-dead. Never internal pages. This is the opposite of what we do
currently; currently only internal pages can be marked half-dead, never
leaf pages. This also gives us a chance to tell apart half-dead pages
lingering after pg_upgrade from an older version, and pages marked as
half-dead with the new code. The incomplete-action mechanism should have
ensured that the former don't exist, but in case something went wrong
some time before the upgrade and there is such a page in the tree after
all, the patch gives a LOG message about it, advising to reindex.
Searches will work fine in any case, but insertions to unfortunate pages
might cause page splits and out-of-order keys in the upper levels, if
the pre-upgrade half-dead pages are not cleaned up by reindexing.

- Heikki

Attachment Content-Type Size
fix-btree-page-deletion-2.patch text/x-diff 69.7 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andres Freund 2013-11-13 16:04:30 Re: logical changeset generation v6.6
Previous Message Robert Haas 2013-11-13 15:59:22 Re: alter_table regression test problem