Re: Race condition in b-tree page deletion

From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Race condition in b-tree page deletion
Date: 2013-11-09 16:49:00
Message-ID: 527E677C.8040404@vmware.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 09.11.2013 18:24, Tom Lane wrote:
> Heikki Linnakangas <hlinnakangas(at)vmware(dot)com> writes:
>> 2. The second-simplest solution I see is to keep locked the whole chain
>> of pages that will be deleted, and delete all of them as one atomic
>> WAL-logged operation. Ie. the leaf page, and all the parent pages above
>> it that will become half-dead, and the parent of the last half-dead page.
>
> This would be more tenable if we could put a known limit on the number of
> pages to be changed at once. I'm not too awake at the moment, so maybe
> this is not possible, but could we simply decide in advance that we won't
> propagate page deletion up more than a-small-number of levels? It'd
> require allowing a page to remain half-dead until some other vacuum comes
> along and fixes it, but that seems OK.

I don't feel comfortable leaving pages in half-dead state. Looking back
at the archives, your original design ten years ago did exactly that
(http://www.postgresql.org/message-id/8281.1045089764@sss.pgh.pa.us),
but that turned out to be a bad idea
(http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=70ce5c908202ada7616f7afded8a91bbf2742471).
Mind you, even if we check that the half-dead page at some upper level
is eventually deletable because it's not the rightmost child, it might
become the rightmost child if we don't hold the lock so that the next
vacuum cannot delete it, and we're back to square one.

We could just punt if more than X pages would need to be changed. That
would mean that we never delete pages at the top (h - X) levels of the
tree. In practice that should be fine if X is high enough.
As a data point, GIN list page deletion holds 16 pages locked at once
(GIN_NDELETE_AT_ONCE). Normally, a 16-level deep B-tree is pretty huge.
As another data point, in the worst case the keys are so wide that only
2 keys fit on each B-tree page. With that assumption, the tree can be at
most 32 levels deep if you just insert into it with no deletions, since
MaxBlockNumber ~= 2^32 (I may be off by one in either direction, not
sure). Deletions make it more complicated, but I would be pretty
surprised if you can construct a B-tree tallers than, say 40 levels.

Things gets tricky if shared_buffers is very small; with
shared_buffers=16, you certainly can't hold more than 16 buffers pinned
at once. (in fact, I think ginfast.c already has a problem with that...)

- Heikki

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Heikki Linnakangas 2013-11-09 17:18:10 Re: Race condition in b-tree page deletion
Previous Message pansen 2013-11-09 16:39:17 Re: Darwin: make check fails with "child process exited with exit code 134"