Re: Brain dump: btree collapsing

From: "Curtis Faith" <curtis(at)galtcapital(dot)com>
To: <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Brain dump: btree collapsing
Date: 2003-02-13 09:19:25
Message-ID: 001601c2d341$01a5e720$5e043ac8@curtislaptop
Lists: pgsql-hackers

tom lane initially wrote:
> Restructuring the tree during page deletion
> -------------------------------------------
>
> We will delete only completely-empty pages. If we were to
> merge nearly-empty pages by moving data items from one page
> to an adjacent one, this would imply changing the parent's
> idea of the bounding key between them --- which is okay if we
> are just deleting an internal key in the parent, but what if
> the pages have different parent pages?

and a bit later wrote:
> My feeling is that what we need to fix now is index bloat during
> normal operation.

How about doing deletion of partial pages with reorganization among
sibling pages only (where the parent pages are the same)? This avoids
the "messiness" of propogating the deletes to differing parent pages but
gets most of the value of reorganization.

ISTM that a VACUUM that only reclaims empty pages will be helpful in
certain cases but won't help much at all in many other common "normal
operation" cases which would be helped by partial reorganization.

- Curtis


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Curtis Faith" <curtis(at)galtcapital(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Brain dump: btree collapsing
Date: 2003-02-13 15:38:42
Message-ID: 13445.1045150722@sss.pgh.pa.us
Lists: pgsql-hackers

"Curtis Faith" <curtis(at)galtcapital(dot)com> writes:
> ISTM, that a VACUUM that only reclaims empty pages will be helpful in
> certain cases but won't help much at all in many other common "normal
> operation" cases which would be helped by partial reorganization.

Got any evidence to back that up? I was relying on

[Johnson89] Johnson, T. and Shasha, D. Utilization of B-trees with
Inserts, Deletes and Modifies ACM Symp. on PODS, 235-246, 1989.

which provides a ton of math and simulations leading up to the
conclusion that collapsing btree pages before they are fully empty
doesn't really gain anything meaningful in terms of storage utilization,
in most scenarios.

regards, tom lane


From: "Curtis Faith" <curtis(at)galtcapital(dot)com>
To: "'Tom Lane'" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Brain dump: btree collapsing
Date: 2003-02-13 17:59:53
Message-ID: 001401c2d389$b6db2270$a200a8c0@curtislaptop
Lists: pgsql-hackers

tom lane wrote:
> Got any evidence to back that up? I was relying on
>
> [Johnson89] Johnson, T. and Shasha, D. Utilization of
> B-trees with Inserts, Deletes and Modifies ACM Symp. on
> PODS, 235-246, 1989.
>
> which provides a ton of math and simulations leading up to
> the conclusion that collapsing btree pages before they are
> fully empty doesn't really gain anything meaningful in terms
> of storage utilization, in most scenarios.

Well, I don't have that specific paper handy but I do have [JS93]
Theodore Johnson, Dennis Shasha, "B-trees with inserts and deletes: why
free-at-empty is better than merge-at-half" which appears to be their
later thinking on the same subject.

Note the following:

"A merge-at-half B-tree will always have a space utilization of at least
50%. When all operations are modify operations, or when the number of
insert operations is the same as the number of delete operations, then
the utilization will be about 60%. In contrast, a free-at-empty B-tree
has a 0% lower bound on its space utilization, and will have about 39%
utilization on a pure-modify instruction mix. However, the space
utilization of a free-at-empty B-tree remains high if there are just a
few more insert operations than delete operations. Thus, merge-at-half
usually buys little in terms of space utilization.

In Figure 6, we showed that the restructuring rate of a merge-at-half
B-tree is significantly larger than the restructuring rate of a
free-at-empty B-tree for all values of q * :1. For many concurrent
B-tree algorithms used in practice [4, 13], restructuring causes a
serialization bottleneck. Thus, one simple but important way to increase
concurrency in B-tree operations is to reduce the probability of
restructuring. Since merge-at-half buys little space utilization but is
expensive in terms of restructuring, we recommend that B-trees
(especially concurrently accessed ones) use free-at-empty."

I don't dispute their conclusions in that context and under the
circumstances they outline of random distribution of deletion and
insertion values for the index keys.

However, as [Jannink95] ("Implementing Deletion in B+-trees," SIGMOD
RECORD, v.24, n.1, p.33-38, 1995) points out, that assumption doesn't
hold under other, possibly common circumstances, specifically
circumstances where the deletes are taking place in significant sections
of the index at a much higher rate than inserts.

Consider from [Jannink95]:

"There has been some research on the acceptability of relaxing the
constraint of minimum node size to reduce the number of so-called unsafe
tree operations, i.e., those which contain node splitting and merging
[ZH89].

The debate has culminated in analysis of a weaker form of the deletion
algorithm which we call lazy deletion, that imposes no constraints on
the number of entries left in the nodes, allowing them to empty
completely before simply removing them. According to [GR93], most
database system implementations of B+-trees have adopted this approach.
Its most effective use is when it is vital to allow concurrent access to
the tree [JS93b], and excessive splitting and merging of nodes would
restrict concurrency. [JS89] derives some analytic solutions calculating
memory utilization for B+-trees under a mix of insertions and lazy
deletions, based on previous research which considered insertions only
[BY89]. The simulations in [JS89] support its analysis to show that in
typical situations, where deletions don't outnumber insertions in the
mix of operations, the tree nodes will contain acceptable percentages of
entries.

One of the work's assumptions [JS93a] is that the keys and tree
operations are chosen uniformly from a random distribution. This
assumption is unreasonable in certain realistic situations such as one
described below. Allowing interior nodes with only a single pointer to
exist in a B+-tree creates the possibility for arbitrarily unbalanced
trees of any height, which are virtually empty, and in which access
times have degenerated from the logarithmic bound B+-trees are meant to
guarantee to a worst case unbounded access time. Since nodes are not
removed until they are completely empty, the lazy deletion algorithm
does not regulate tree height effectively."

Jannink then illustrates an example where an index is created based on a
timestamp, where the basic assumption of Johnson and Shasha does not hold
since the keys are not a random distribution but a monotonically
increasing value. His example is an extreme one, but I believe there are
many instances where a timestamp, sequence or some other monotonically
increasing value is used in an index and where deletes are taking place
much more frequently for the older values.

Since sequences are often used as foreign keys, a significant number of
indexes fit into this category.

Consider a web site that tracked users and that deleted inactive
accounts. There are many real-world scenarios where the number of
inactive accounts is very high as a percentage of all accounts. Consider
an example where 50% of the accounts are inactive and deleted after 90
days of disuse and 90% are inactive after 180 days.

If the Account is implemented with an ID based on a sequence, any tables
that referenced the inactive accounts' sequence-based IDs via foreign
keys would have index entries that would be sparsely populated for the
older entries but might not have a high percentage of completely empty
index pages.

The older, but still active accounts would keep the index pages from
being completely empty in many cases.
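
As a back-of-the-envelope illustration (the leaf fan-out below is just an
assumption, and survivors are assumed to be scattered among the old
sequence values), even at 90% deleted the chance that any given leaf page
goes completely empty is negligible, so a VACUUM that reclaims only empty
pages gets almost nothing back while the pages sit nearly empty:

#include <math.h>
#include <stdio.h>

int
main(void)
{
    const int    entries_per_page = 200;    /* assumed leaf fan-out */
    const double deleted[] = {0.50, 0.90};  /* the 90-day / 180-day cases */
    int          i;

    for (i = 0; i < 2; i++)
    {
        /* a page is reclaimable only when every entry on it was deleted */
        double p_empty = pow(deleted[i], entries_per_page);

        printf("%.0f%% deleted: pages ~%.0f%% full, "
               "chance a page is completely empty = %.2g\n",
               deleted[i] * 100.0, (1.0 - deleted[i]) * 100.0, p_empty);
    }
    return 0;
}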

Further, Johnson and Shasha make the claim that "free-at-empty is better"
in the specific context of restructuring causing a serialization
bottleneck. I don't think this applies to the specific case I
recommended where the parent is the same for both leaves, especially
during a VACUUM process where presumably we can optimize for concurrency
rather than the speed of VACUUM itself.

- Curtis


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Curtis Faith" <curtis(at)galtcapital(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Brain dump: btree collapsing
Date: 2003-02-13 18:10:01
Message-ID: 20999.1045159801@sss.pgh.pa.us
Lists: pgsql-hackers

"Curtis Faith" <curtis(at)galtcapital(dot)com> writes:
> I don't dispute their conclusions in that context and under the
> circumstances they outline of random distribution of deletion and
> insertion values for the index keys. [But the random-distribution
> assumption doesn't always hold.]

That's a fair point. Nonetheless, we have little choice: we cannot
move keys around during concurrent operations. If we push keys to
the right, we may cause an indexscan moving left to miss them, and
vice versa. So we can only eliminate empty pages.

We could possibly allow VACUUM FULL to collapse partly-full pages,
since in that case we know there are no concurrent scans.

regards, tom lane


From: Hannu Krosing <hannu(at)tm(dot)ee>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Curtis Faith <curtis(at)galtcapital(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Brain dump: btree collapsing
Date: 2003-02-13 21:32:36
Message-ID: 1045171956.1630.4.camel@localhost.localdomain
Lists: pgsql-hackers

Tom Lane kirjutas N, 13.02.2003 kell 20:10:
> "Curtis Faith" <curtis(at)galtcapital(dot)com> writes:
> > I don't dispute their conclusions in that context and under the
> > circumstances they outline of random distribution of deletion and
> > insertion values for the index keys. [But the random-distribution
> > assumption doesn't always hold.]
>
> That's a fair point. Nonetheless, we have little choice: we cannot
> move keys around during concurrent operations. If we push keys to
> the right, we may cause an indexscan moving left to miss them, and
> vice versa. So we can only eliminate empty pages.

But if we would allow the scans to find the same keys twice without ill
effects (as was suggested earlier, for using btrees to index arrays),
then we could possibly 1) copy the key to the right 2) wait for all
right-to-left scans that have fallen between old and new values to pass
and only then 3) delete the "old left" key.

This could solve the concurrency issue as well.

> We could possibly allow VACUUM FULL to collapse partly-full pages,
> since in that case we know there are no concurrent scans.
>
> regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Hannu Krosing <hannu(at)tm(dot)ee>
Cc: Curtis Faith <curtis(at)galtcapital(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Brain dump: btree collapsing
Date: 2003-02-13 23:13:55
Message-ID: 1776.1045178035@sss.pgh.pa.us
Lists: pgsql-hackers

Hannu Krosing <hannu(at)tm(dot)ee> writes:
> But if we would allow the scans to find the same keys twice without ill
> effects (as was suggested earlier, for using btrees to index arrays),

How is returning the same data twice not an "ill effect"?

> then we could possibly 1) copy the key to the right 2) wait for all
> right-to-left scans that have fallen between old and new values to pass
> and only then 3) delete the "old left" key.

How will you wait for scans that you know nothing of to go past?
Especially when they are going to be blocked by your own write lock
on the left page?

regards, tom lane


From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Hannu Krosing <hannu(at)tm(dot)ee>, Curtis Faith <curtis(at)galtcapital(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Brain dump: btree collapsing
Date: 2003-02-13 23:18:16
Message-ID: 200302132318.h1DNIGt28322@candle.pha.pa.us
Lists: pgsql-hackers

Tom Lane wrote:
> Hannu Krosing <hannu(at)tm(dot)ee> writes:
> > But if we would allow the scans to find the same keys twice without ill
> > effects (as was suggested earlier, for using btrees to index arrays),
>
> How is returning the same data twice not an "ill effect"?
>
> > then we could possibly 1) copy the key to the right 2) wait for all
> > right-to-left scans that have fallen between old and new values to pass
> > and only then 3) delete the "old left" key.
>
> How will you wait for scans that you know nothing of to go past?
> Especially when they are going to be blocked by your own write lock
> on the left page?

I think we should skip any pages where we can't get a lock.

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073


From: Hannu Krosing <hannu(at)tm(dot)ee>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Curtis Faith <curtis(at)galtcapital(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Brain dump: btree collapsing
Date: 2003-02-14 07:07:27
Message-ID: 1045206446.1386.6.camel@fuji.krosing.net
Lists: pgsql-hackers

Tom Lane kirjutas R, 14.02.2003 kell 01:13:
> Hannu Krosing <hannu(at)tm(dot)ee> writes:
> > But if we would allow the scans to find the same keys twice without ill
> > effects (as was suggested earlier, for using btrees to index arrays),
>
> How is returning the same data twice not an "ill effect"?

From earlier discussions I understood that there had been some work done
on using btrees for indexing arrays by storing each separate element in
a leaf node. Surely that work must deal with not returning the same
tuple twice.

> > then we could possibly 1) copy the key to the right 2) wait for all
> > right-to-left scans that have fallen between old and new values to pass
> > and only then 3) delete the "old left" key.
>
> How will you wait for scans that you know nothing of to go past?
> Especially when they are going to be blocked by your own write lock
> on the left page?

could we just not lock (for more than just to ensure atomic writes) the
page but instead increment a "page version" on each write to detect
changes?
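
To make concrete the kind of counter I mean (toy C with invented names,
nothing to do with the real buffer manager), the writer bumps the version
before and after changing the page, and the reader re-checks it instead
of keeping the page locked while it sits on it:

#include <stdatomic.h>
#include <stdbool.h>

typedef struct ToyPage
{
    atomic_uint version;        /* odd while a change is in progress */
    /* ... page contents ... */
} ToyPage;

void
toy_begin_change(ToyPage *p)
{
    atomic_fetch_add(&p->version, 1);   /* now odd: readers back off */
}

void
toy_end_change(ToyPage *p)
{
    atomic_fetch_add(&p->version, 1);   /* even again: page is stable */
}

/* returns true only if the copy the reader made is known to be consistent */
bool
toy_read_page(ToyPage *p)
{
    unsigned v = atomic_load(&p->version);

    if (v & 1)
        return false;                   /* change in progress, retry */
    /* ... copy out whatever the scan needs ... */
    return atomic_load(&p->version) == v;
}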

------------
Hannu


From: "Michael Paesold" <mpaesold(at)gmx(dot)at>
To: <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Brain dump: btree collapsing
Date: 2003-02-14 08:15:23
Message-ID: 011001c2d401$391b4720$3201a8c0@beeblebrox
Lists: pgsql-hackers

Hannu Krosing <hannu(at)tm(dot)ee> wrote:

> could we just not lock (for more than just to ensure atomic writes) the
> page but instead increment a "page version" on each write to detect
> changes?

Sounds like Index MVCC..., very nice. ;-)
(Of course I have no clue about feasibility, just liked the idea)

Best Regards,
Michael Paesold


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Hannu Krosing <hannu(at)tm(dot)ee>
Cc: Curtis Faith <curtis(at)galtcapital(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Brain dump: btree collapsing
Date: 2003-02-14 14:32:07
Message-ID: 6693.1045233127@sss.pgh.pa.us
Lists: pgsql-hackers

Hannu Krosing <hannu(at)tm(dot)ee> writes:
> Tom Lane kirjutas R, 14.02.2003 kell 01:13:
>> How is returning the same data twice not an "ill effect"?

> From earlier discussions I understood that there had been some work done
> on using btrees for indexing arrays by storing each separate element in
> a leaf node. Surely that work must deal with not returning the same
> tuple twice.

The only mechanism that exists for that is to discard tuples that meet
the qualification tests of previous indexscans. This cannot prevent
returning the same tuple twice in one scan, if the index is so
ill-behaved as to return the same pointer twice. I don't know what
Vadim had in mind to support multiple index entries per tuple, but
it's certainly not in the code yet.

>> How will you wait for scans that you know nothing of to go past?
>> Especially when they are going to be blocked by your own write lock
>> on the left page?

> could we just not lock (for more than just to ensure atomic writes) the
> page but instead increment a "page version" on each write to detect
> changes?

How does that help? The left-moving indexscan still has no way to
recover. It can't go back to the page it was on before and try to
determine which entries you added there, because it has no reliable
reference point to do so. The entry it previously returned might not be
there anymore, and in a non-unique index key comparisons won't help.
And even more to the point, how would it know you've changed the left
page? It has no idea what the previous "page version" on the left page
was, because it was never there before.

regards, tom lane


From: "Curtis Faith" <curtis(at)galtcapital(dot)com>
To: "'Tom Lane'" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "'Hannu Krosing'" <hannu(at)tm(dot)ee>
Cc: <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Brain dump: btree collapsing
Date: 2003-02-14 15:44:04
Message-ID: 000701c2d43f$e8962f60$a200a8c0@curtislaptop
Lists: pgsql-hackers

tom lane wrote:

> How does that help? The left-moving indexscan still has no
> way to recover. It can't go back to the page it was on
> before and try to determine which entries you added there,
> because it has no reliable reference point to do so. The
> entry it previously returned might not be there anymore, and
> in a non-unique index key comparisons won't help. And even
> more to the point, how would it know you've changed the left
> page? It has no idea what the previous "page version" on the
> left page was, because it was never there before.

Another way this could be handled is by not merging onto one of the
existing pages but to a brand new page, a kind of special case shadow
index page. That way the sibling pointers and the leaf page pointer in
the parent could all be updated atomically to point to the new page.

In-progress index scans would still reference the merged pages, which
would not be deleted but marked as dead using a mechanism like you
proposed for marking empty pages dead with the next-transaction-ID
counter.
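
Very roughly, the swap would look like the sketch below (toy C with
made-up structures; the parent_downlink() helper is hypothetical, and all
the locking needed to make the three pointer updates appear atomic is
omitted):

#include <stdlib.h>

typedef struct LeafPage
{
    struct LeafPage *left;      /* sibling links */
    struct LeafPage *right;
    struct LeafPage *parent;
    int              dead;      /* old pages linger until no scan can see them */
    int              nitems;
    int              items[128];
} LeafPage;

/* hypothetical helper: the parent slot currently pointing at child */
extern LeafPage **parent_downlink(LeafPage *parent, LeafPage *child);

LeafPage *
shadow_merge(LeafPage *a, LeafPage *b)  /* adjacent leaves, same parent */
{
    LeafPage *merged = calloc(1, sizeof(LeafPage));
    int       i;

    for (i = 0; i < a->nitems; i++)
        merged->items[merged->nitems++] = a->items[i];
    for (i = 0; i < b->nitems; i++)
        merged->items[merged->nitems++] = b->items[i];
    merged->parent = a->parent;
    merged->left = a->left;
    merged->right = b->right;

    /* repoint the outer siblings and the parent at the new page; b's
     * downlink and separator key in the parent would also be dropped */
    if (a->left)
        a->left->right = merged;
    if (b->right)
        b->right->left = merged;
    *parent_downlink(a->parent, a) = merged;

    a->dead = b->dead = 1;      /* readable but dead, as with empty pages */
    return merged;
}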

Merging could be done after a VACUUM pass that performs deletion of
empty pages in order to provide a pool of empty pages to use for the
merge. This would keep the index from temporarily growing during the
merge process.

A similar approach could be used to reorder the index pages in the
background. An index that was reordered to fairly closely reflect the
tree as a breadth-first traversal would provide much faster index scans
if the file is not heavily fragmented.

- Curtis


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Curtis Faith" <curtis(at)galtcapital(dot)com>
Cc: "'Hannu Krosing'" <hannu(at)tm(dot)ee>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Brain dump: btree collapsing
Date: 2003-02-14 16:19:56
Message-ID: 7728.1045239596@sss.pgh.pa.us
Lists: pgsql-hackers

"Curtis Faith" <curtis(at)galtcapital(dot)com> writes:
> Another way this could be handled is by not merging onto one of the
> existing pages but to a brand new page, a kind of special case shadow
> index page.

Hmmm ... that might be made to work, but it would complicate inserts.
By the time an insert navigates to the page it should insert on, it
might find the page is dead, and then it would have no easy way to get
to the replacement page (unless you want to dedicate another link field
in every index page for that one use). I suppose it could recover by
starting the search over again.

Another problem is that the two dead pages are now outside of the btree
structure and so their left and right links won't get updated in the
face of subsequent splits and merges of the nearby pages. I spent quite
a bit of time sweating over the recovery navigation details in my
original proposal; I can assure you it's not easy to get right. You can
chain right from a dead page with some reliability, but left is another
matter. There's no good way to know, when following a left link,
whether you've arrived at the proper place or need to chain right to
make up for a subsequent split. The recovery procedure I proposed works
for removing single pages, but it won't work with substituted pages.

regards, tom lane


From: "Curtis Faith" <curtis(at)galtcapital(dot)com>
To: "'Tom Lane'" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "'Hannu Krosing'" <hannu(at)tm(dot)ee>
Cc: <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Brain dump: btree collapsing
Date: 2003-02-14 18:00:39
Message-ID: 000a01c2d452$fd3e4ac0$a200a8c0@curtislaptop
Lists: pgsql-hackers

tom lane wrote:
> How does that help? The left-moving indexscan still has no
> way to recover. It can't go back to the page it was on
> before and try to determine which entries you added there,
> because it has no reliable reference point to do so. The
> entry it previously returned might not be there anymore, and
> in a non-unique index key comparisons won't help. And even
> more to the point, how would it know you've changed the left
> page? It has no idea what the previous "page version" on the
> left page was, because it was never there before.

Revisiting the idea I proposed in a previous email after more research,
I believe it is definitely possible to implement a mutex-based scheme
that will prevent scans from being polluted by merges of index pages and
that does not result in the mutex being held for any significant
duration.

1) Current scan code does this:

bool
_bt_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
{
    ... definitions go here...

    if (ScanDirectionIsForward(dir))
    {
        if (!PageIsEmpty(page) && offnum < maxoff)
            offnum = OffsetNumberNext(offnum);
        else
        {
            /* walk right to the next page with data */
            for (;;)
            {
                /* if we're at end of scan, release the buffer and return */
                ... skip code here...

                /* step right one page */
                blkno = opaque->btpo_next;
                _bt_relbuf(rel, *bufP);
                *bufP = _bt_getbuf(rel, blkno, BT_READ);

... skip rest of code...

3) Note that the calls

_bt_relbuf(rel, *bufP);
*bufP = _bt_getbuf(rel, blkno, BT_READ);

a) appear adjacent to each other
b) relbuf calls:

LockBuffer(buf, BUFFER_LOCK_UNLOCK);
ReleaseBuffer(buf);

c) getbuf only calls:

buf = ReadBuffer(rel, blkno);
LockBuffer(buf, access);

in the case of an existing buffer, rather than a new one.

4) This could easily be reordered into:

buf = ReadBuffer(rel, blkno);            /* pin next page */
LockBuffer(*bufP, BUFFER_LOCK_UNLOCK);   /* release lock on current page */
LockBuffer(buf, BT_READ);                /* lock next page */
ReleaseBuffer(*bufP);                    /* now release pin on previously current page */

without affecting the logic of the code or causing any deadlock
problems since the release still occurs before the lock of the next
page.

5) A mutex/spinlock that was stored in the index could be acquired by
the scan code like this:

buf = ReadBuffer(rel, blkno);            /* pin next page */

SpinLockAcquire( indexSpecificMutex );   /* lock the index reorg mutex */

LockBuffer(*bufP, BUFFER_LOCK_UNLOCK);   /* release lock on current page */
LockBuffer(buf, BT_READ);                /* lock next page */

SpinLockRelease( indexSpecificMutex );   /* unlock the index reorg mutex */

ReleaseBuffer(*bufP);                    /* now release pin on previously current page */

6) The same index specific mutex/spinlock could be used for the merge
code surrounding only the acquisition of the four page locks. This would
obviate any problems with scans and page merges, since the lock
acquisition for the merge could never occur while a scan was between
pages.

Further, with the reordering, the spinlock for the scan code doesn't
seem particularly onerous since it would be surrounding only two LWLock
calls. To reduce the overhead to an absolute minimum for the scan case,
these could be pushed down into a new LWLock call (probably necessary
since the LockBuffer and ReleaseBuffer code does some error checking and
such that one wouldn't want in code guarded by a mutex).
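
To sketch the merge side of 6) (toy pthreads code with made-up names, not
the actual buffer-lock calls, and ignoring for now what happens if one of
the page locks cannot be granted immediately):

#include <pthread.h>

/* one of these would hang off the per-index metadata */
typedef struct ToyIndex
{
    pthread_mutex_t reorg_mutex;        /* the index-specific mutex of 5) */
} ToyIndex;

/*
 * The merge takes its page locks only while holding the reorg mutex, so
 * it can never slip in while a scan is crossing from one page to the
 * next (the scan holds the same mutex across its unlock/lock pair in 5).
 */
void
merge_acquire_pages(ToyIndex *idx,
                    pthread_rwlock_t *left, pthread_rwlock_t *right,
                    pthread_rwlock_t *parent, pthread_rwlock_t *target)
{
    pthread_mutex_lock(&idx->reorg_mutex);

    pthread_rwlock_wrlock(left);
    pthread_rwlock_wrlock(right);
    pthread_rwlock_wrlock(parent);
    pthread_rwlock_wrlock(target);

    pthread_mutex_unlock(&idx->reorg_mutex);

    /* ... move the entries, update the links, then release all four ... */
}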

- Curtis


From: "Curtis Faith" <curtis(at)galtcapital(dot)com>
To: "'Tom Lane'" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Brain dump: btree collapsing
Date: 2003-02-14 19:22:54
Message-ID: 000f01c2d45e$7a949f50$a200a8c0@curtislaptop
Lists: pgsql-hackers

tom lane wrote:
> Hmmm ... that might be made to work, but it would complicate
> inserts. By the time an insert navigates to the page it
> should insert on, it might find the page is dead, and then it
> would have no easy way to get to the replacement page (unless
> you want to dedicate another link field in every index page
> for that one use). I suppose it could recover by starting
> the search over again.

Inserts could reread just the parent page if they encountered a dead
leaf since the parent would have been correctly updated.

> Another problem is that the two dead pages are now outside of
> the btree structure and so their left and right links won't
> get updated in the face of subsequent splits and merges of
> the nearby pages.

That seems like a "show stopper" that just defers the problem. A split
of the left sibling would still screw up a scan that was using the old
left leaf page and wanted to move left.

Oh, well, the idea seemed worth exploring.

- Curtis


From: "Curtis Faith" <curtis(at)galtcapital(dot)com>
To: "'Curtis Faith'" <curtis(at)galtcapital(dot)com>, "'Tom Lane'" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "'Hannu Krosing'" <hannu(at)tm(dot)ee>
Cc: <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Brain dump: btree collapsing
Date: 2003-02-14 19:25:54
Message-ID: 001001c2d45e$e5db3d00$a200a8c0@curtislaptop
Lists: pgsql-hackers

I previously wrote:
> 5) A mutex/spinlock that was stored in the index could be
> acquired by the scan code like this:
>
> buf = ReadBuffer(rel, blkno);            /* pin next page */
>
> SpinLockAcquire( indexSpecificMutex );   /* lock the index reorg mutex */
>
> LockBuffer(*bufP, BUFFER_LOCK_UNLOCK);   /* release lock on current page */
> LockBuffer(buf, BT_READ);                /* lock next page */
>
> SpinLockRelease( indexSpecificMutex );   /* unlock the index reorg mutex */
>
> ReleaseBuffer(*bufP);                    /* now release pin on previously current page */
>
> 6) The same index specific mutex/spinlock could be used for
> the merge code surrounding only the acquisition of the four
> page locks. This would obviate any problems with scans and
> page merges, since the lock acquisition for the merge could
> never occur while a scan was between pages.
>
> Further, with the reordering, the spinlock for the scan code
> doesn't seem particularly onerous since it would be
> surrounding only two LWLock calls. To reduce the overhead to
> an absolute minimum for the scan case, these could be pushed
> down into a new LWLock call (probably necessary since the
> LockBuffer and ReleaseBuffer code does some error checking and
> such that one wouldn't want in code guarded by a mutex).

I forgot to mention that the mutex would have to be released in the
event the next page lock could not be immediately acquired, just after
the addition of the scan process to the lock waiters list, to avoid
blocking all scans and probably causing severe deadlock problems.

- Curtis


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Curtis Faith" <curtis(at)galtcapital(dot)com>
Cc: "'Hannu Krosing'" <hannu(at)tm(dot)ee>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Brain dump: btree collapsing
Date: 2003-02-14 20:18:57
Message-ID: 9218.1045253937@sss.pgh.pa.us
Lists: pgsql-hackers

"Curtis Faith" <curtis(at)galtcapital(dot)com> writes:
> 4) This could easily be reordered into:

> buf = ReadBuffer(rel, blkno);            /* pin next page */
> LockBuffer(*bufP, BUFFER_LOCK_UNLOCK);   /* release lock on current page */
> LockBuffer(buf, BT_READ);                /* lock next page */
> ReleaseBuffer(*bufP);                    /* now release pin on previously current page */

> without affecting the logic of the code or causing any deadlock
> problems since the release still occurs before the lock of the next
> page.

Sorry, that *does* create deadlocks. Remember the deletion process is
going to need superexclusive lock (not only a BT_WRITE buffer lock,
but no concurrent pins) in order to be sure there are no scans stopped
on the page it wants to delete. (In the above pseudocode, the fact that
you still hold a pin on the previously-current page makes you look
exactly like someone who's in the middle of scanning that page, rather
than trying to leave it.) The same would be true of both pages
if it's trying to merge.

> 5) A mutex/spinlock that was stored in the index could be acquired by
> the scan code like this:

"Stored in the index"? And how will you do that portably? But it
doesn't matter, because the approach deadlocks.

regards, tom lane


From: "Curtis Faith" <curtis(at)galtcapital(dot)com>
To: "'Tom Lane'" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "'Hannu Krosing'" <hannu(at)tm(dot)ee>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Brain dump: btree collapsing
Date: 2003-02-14 20:41:06
Message-ID: 001801c2d469$66ff09c0$a200a8c0@curtislaptop
Lists: pgsql-hackers

tom lane wrote:
> Sorry, that *does* create deadlocks. Remember the deletion
> process is going to need superexclusive lock (not only a
> BT_WRITE buffer lock, but no concurrent pins) in order to be
> sure there are no scans stopped on the page it wants to
> delete. (In the above pseudocode, the fact that you still
> hold a pin on the previously-current page makes you look
> exactly like someone who's in the middle of scanning that
> page, rather than trying to leave it.) The same would be
> true of both pages if it's trying to merge.

First, recall that under my very first proposal, the VACUUM process
would try to acquire locks but NOT WAIT. Only in the event that
superexclusive locks could be obtained on all pages would the merge
proceed, otherwise it would DROP all the locks, sleep and retry. This
would prevent the VACUUM merge from participating in deadlocks since it
would never wait while holding any lock.

I was assuming that here as well but did not explicitly restate this,
sorry.

One also needs to drop the mutex in the event the lock on the next page
could not be acquired immediately, after placing the process in the
waiter list for that page.

This entry will prevent a VACUUM that wants to merge from gaining the
superexclusive lock until after the scan has finished, since the scan's
waiting lock request will block it, and as you point out, so will the
pin.

The mutex only needs to guard the crossing of the pages, so the pin
existing outside the mutex won't cause a problem.
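
In toy form (pthreads again, made-up names), the no-wait rule is just
"try them all, and on any failure drop whatever was taken and come back
later":

#include <pthread.h>

/* returns 1 only if all page locks were obtained without blocking;
 * otherwise drops any that were taken so nothing is held while waiting */
int
try_lock_pages_for_merge(pthread_rwlock_t *pages[], int npages)
{
    int i;

    for (i = 0; i < npages; i++)
    {
        if (pthread_rwlock_trywrlock(pages[i]) != 0)
        {
            while (--i >= 0)
                pthread_rwlock_unlock(pages[i]);
            return 0;       /* caller sleeps and retries the merge later */
        }
    }
    return 1;               /* got them all: safe to merge, then unlock */
}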

> "Stored in the index"? And how will you do that portably?

Sorry for the lack of rigorous language. I meant that there would be one
mutex per index stored in the header or internal data structures
associated with each index somewhere. Probably in the same structure the
root node reference for each btree is stored.

- Curtis


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Curtis Faith" <curtis(at)galtcapital(dot)com>
Cc: "'Hannu Krosing'" <hannu(at)tm(dot)ee>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Brain dump: btree collapsing
Date: 2003-02-14 22:33:28
Message-ID: 11687.1045262008@sss.pgh.pa.us
Lists: pgsql-hackers

"Curtis Faith" <curtis(at)galtcapital(dot)com> writes:
>> "Stored in the index"? And how will you do that portably?

> Sorry for the lack of rigorous language. I meant that there would be one
> mutex per index stored in the header or internal data structures
> associated with each index somewhere. Probably in the same structure the
> root node reference for each btree is stored.

Hm. A single lock that must be grabbed for operations anywhere in the
index is a concurrency bottleneck. But maybe we could avoid that.

regards, tom lane