Unsplitting btree index leaf pages

Lists: pgsql-hackers
From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Unsplitting btree index leaf pages
Date: 2005-12-22 00:00:27
Message-ID: 1135209627.2964.351.camel@localhost.localdomain
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

When we discussed online REINDEX recently we focused on the REINDEX
command itself rather than look at alternative approaches.

One reason to REINDEX is because of index page splits getting things out
of sequence and generally bloating the index.

When we VACUUM, each index is scanned in logical order.

While we scan, if we found two adjacent pages, both of which have less
than (say) 40% rows, we could re-join or "unsplit" those pages together.
The index blocks are fully locked during the read anyway and there is no
MVCC problem with moving index rows between blocks. All we have to do is
to lock both blocks, having locked them in the correct order.

The rows would always be moved to the lowest physical block id, so that
data would naturally migrate towards the start of the index file. Blocks
would then be marked half-dead just as if they had just had their last
index row removed by the vacuum.

We could start the scan by locking block 1 AND block2, then scan forward
always holding 2 locks as we go. That way we would not need to unlock
and relock the blocks to lock two blocks. The concurrency loss would not
be that great, but we would gain the ability to unsplit the two blocks
into one.

If we do this, we would could possibly avoid the need to full REINDEX
entirely.

If this method checks out we could do one of:
- make VACUUM do this always
- add an option for REINDEX: CLEAN/COMPRESS/VACUUM etc to do this upon
command only

Best Regards, Simon Riggs


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Unsplitting btree index leaf pages
Date: 2005-12-22 00:07:23
Message-ID: 17373.1135210043@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Simon Riggs <simon(at)2ndquadrant(dot)com> writes:
> While we scan, if we found two adjacent pages, both of which have less
> than (say) 40% rows, we could re-join or "unsplit" those pages together.

Curiously enough, this has been thought of before. It is not as easy
as you think, or it would have been done the first time around.
nbtree/README explains why:

: We consider deleting an entire page from the btree only when it's become
: completely empty of items. (Merging partly-full pages would allow better
: space reuse, but it seems impractical to move existing data items left or
: right to make this happen --- a scan moving in the opposite direction
: might miss the items if so. We could do it during VACUUM FULL, though.)

regards, tom lane


From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Unsplitting btree index leaf pages
Date: 2005-12-22 10:41:33
Message-ID: 1135248093.2964.413.camel@localhost.localdomain
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, 2005-12-21 at 19:07 -0500, Tom Lane wrote:
> Simon Riggs <simon(at)2ndquadrant(dot)com> writes:
> > While we scan, if we found two adjacent pages, both of which have less
> > than (say) 40% rows, we could re-join or "unsplit" those pages together.
>
> Curiously enough, this has been thought of before. It is not as easy
> as you think, or it would have been done the first time around.
> nbtree/README explains why:
>
> : We consider deleting an entire page from the btree only when it's become
> : completely empty of items. (Merging partly-full pages would allow better
> : space reuse, but it seems impractical to move existing data items left or
> : right to make this happen --- a scan moving in the opposite direction
> : might miss the items if so. We could do it during VACUUM FULL, though.)

Sorry, I missed that.

"Seems impractical"? Complex, yes. But I think it sounds a lot simpler
than the online REINDEX scheme... which is also the reason enhancing
VACUUM FULL doesn't appeal.

During the first VACUUM index scan, we can identify pages that are
candidates for reorganisation. Then during the second physical scan that
follows we can reorg pages in the same manner we delete them.

We identify a page during VACUUM for reorg if:
- it is < 20% full and we want to write this page
- the following page is < 20% full and we want to write this page
- it has a higher physical block number than the following page
That way we aren't super aggressive about reorg, but the cases we do
pick have a tendency to keep the index clustered. (Perhaps that could be
settable via an index optional parameter). That definition also means we
have almost no additional I/O over what the VACUUM would have written
anyway.

Page deletion requires us to lock left, lock right, lock above. That is
exactly the same if we have identified the page for reorg, since once we
have locked left and right, we just move rows to one or both of the
those other blocks, then perform the marking half-dead.

I know you've considered these things deeply, but my viewpoint is that
when what you have is already very good the only way forward consists of
considering seemingly foolish thoughts: backtracking.

Best Regards, Simon Riggs


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Unsplitting btree index leaf pages
Date: 2005-12-22 15:40:24
Message-ID: 23063.1135266024@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Simon Riggs <simon(at)2ndquadrant(dot)com> writes:
> Sorry, I missed that.

And you evidently still didn't understand it. Locking both pages does
not fix the problem, because it doesn't guarantee that there's not a
concurrent indexscan in flight from one to the other. If you move items
from one page to the other in the opposite direction from the way the
scan is going, then it will miss those items. If we try to fix this by
making scans lock one page before releasing the previous, then we'll
create a bunch of deadlock cases.

regards, tom lane


From: Martijn van Oosterhout <kleptog(at)svana(dot)org>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Unsplitting btree index leaf pages
Date: 2005-12-22 15:59:16
Message-ID: 20051222155915.GG21783@svana.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Dec 22, 2005 at 10:40:24AM -0500, Tom Lane wrote:
> And you evidently still didn't understand it. Locking both pages does
> not fix the problem, because it doesn't guarantee that there's not a
> concurrent indexscan in flight from one to the other. If you move items
> from one page to the other in the opposite direction from the way the
> scan is going, then it will miss those items. If we try to fix this by
> making scans lock one page before releasing the previous, then we'll
> create a bunch of deadlock cases.

I've been thinking, maybe there's a lockless way of going about this?
Have some kind of index modification identifier that you set at the
beginning of the index scan. What you do is mark the tuples you want to
move with and IMI (I'm trying to avoid the word transaction here) and then
copy the tuples to the new page with IMI+1. Any currently running index
scan will notice the higher IMI and ignore them. When all old index
scans are done, you can remove the old ones.

It's sort of a mini-MVCC for indexes except you could probably just use
a few states: visible to all, visible to current scan, invisible to
current scan. Or use two bits to represent frozen, 1, 2 and 3. A plain
VACUUM could do the state transistions each time it moves through the
index.

The downsides are probably that it's a pain to make it work
concurrently and requires writing each index page at least twice. But
it's a thought...

Have a nice day,
--
Martijn van Oosterhout <kleptog(at)svana(dot)org> http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Martijn van Oosterhout <kleptog(at)svana(dot)org>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Unsplitting btree index leaf pages
Date: 2005-12-22 20:03:20
Message-ID: 20051222200320.GC23918@surnet.cl
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Martijn van Oosterhout wrote:

> The downsides are probably that it's a pain to make it work
> concurrently and requires writing each index page at least twice. But
> it's a thought...

We already do something similar for page deletions. Empty pages are not
deleted right away, but they are marked with BTP_DEAD, and then deleted
on a subsequent vacuum. Or something like that, I don't remember the
exact details.

--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Martijn van Oosterhout <kleptog(at)svana(dot)org>, Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Unsplitting btree index leaf pages
Date: 2005-12-22 20:11:28
Message-ID: 26260.1135282288@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Alvaro Herrera <alvherre(at)commandprompt(dot)com> writes:
> We already do something similar for page deletions. Empty pages are not
> deleted right away, but they are marked with BTP_DEAD, and then deleted
> on a subsequent vacuum. Or something like that, I don't remember the
> exact details.

Right, and the reason for that is exactly that there might be a
concurrent indexscan already "in flight" to the newly-dead page.
We must wait to recycle the page until we are certain no such scans
remain.

It doesn't matter whether a concurrent indexscan visits the dead
page or not, *because it's empty* and so there's nothing to miss.
So there's no race condition. But if you try to move valid data
across pages then there is a race condition.

regards, tom lane


From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Unsplitting btree index leaf pages
Date: 2005-12-22 22:05:15
Message-ID: 1135289115.2964.530.camel@localhost.localdomain
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, 2005-12-22 at 10:40 -0500, Tom Lane wrote:
> Simon Riggs <simon(at)2ndquadrant(dot)com> writes:
> > Sorry, I missed that.
>
> And you evidently still didn't understand it. Locking both pages does
> not fix the problem, because it doesn't guarantee that there's not a
> concurrent indexscan in flight from one to the other. If you move items
> from one page to the other in the opposite direction from the way the
> scan is going, then it will miss those items. If we try to fix this by
> making scans lock one page before releasing the previous, then we'll
> create a bunch of deadlock cases.

Thank you for explaining things; I'm sorry you had to do it twice.

I'm ever optimistic, so I try to resist the temptation to stay quiet in
case I make a mistake.

Best Regards, Simon Riggs


From: Manfred Koizar <mkoi-pg(at)aon(dot)at>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Unsplitting btree index leaf pages
Date: 2005-12-24 20:47:29
Message-ID: tu5rq1tni904qkuoo606o1bu16mdcubiep@4ax.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, 22 Dec 2005 10:40:24 -0500, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
wrote:
>If you move items
>from one page to the other in the opposite direction from the way the
>scan is going, then it will miss those items.

AFAIU the (PG implementaion of the) L&Y method is designed to make
scans immune against problems caused by items moving right within the
same page and against page splits, i.e. items moving to a *new* right
sibling. So making scans work with items moving to an *existing*
right sibling doesn't seem out of reach.

The code following this comment in _bt_restscan
/*
* The item we're looking for moved right at least one page, so
* move right. We are careful here to pin and read-lock the next
* non-dead page before releasing the current one. This ensures
* that a concurrent btbulkdelete scan cannot pass our position
* --- if it did, it might be able to reach and delete our target
* item before we can find it again.
*/
looks like it'd work for the case of page merging as well, as long as
we are careful to move items always from left to right.

BTW, if after having locked both pages we find that we have
super-exclusive locks, i.e. nobody else has pins on these pages, we
can reorganize much more agressively. It might even be safe to move
items to the left page. The parent page might need some special
handling, though.

Servus
Manfred


From: Kevin Brown <kevin(at)sysexperts(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Unsplitting btree index leaf pages
Date: 2005-12-25 00:14:19
Message-ID: 20051225001419.GA666@filer
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Tom Lane wrote:
> Alvaro Herrera <alvherre(at)commandprompt(dot)com> writes:
> > We already do something similar for page deletions. Empty pages are not
> > deleted right away, but they are marked with BTP_DEAD, and then deleted
> > on a subsequent vacuum. Or something like that, I don't remember the
> > exact details.
>
> Right, and the reason for that is exactly that there might be a
> concurrent indexscan already "in flight" to the newly-dead page.
> We must wait to recycle the page until we are certain no such scans
> remain.
>
> It doesn't matter whether a concurrent indexscan visits the dead
> page or not, *because it's empty* and so there's nothing to miss.
> So there's no race condition. But if you try to move valid data
> across pages then there is a race condition.

Hmm...

Well, REINDEX is apparently a very expensive operation right now. But
how expensive would it be to go through the entire index and perform
the index page merge operation being discussed here, and nothing else?

If it's fast enough, might it be worthwhile to implement just this
alone as a separate maintenance command (e.g., VACUUM INDEX) that
acquires the appropriate lock (AccessExclusive, I'd expect) on the
index to prevent exactly the issues you're concerned about?

If it's fast enough even on large tables, it would be a nice
alternative to REINDEX, I'd think.

--
Kevin Brown kevin(at)sysexperts(dot)com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Manfred Koizar <mkoi-pg(at)aon(dot)at>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Unsplitting btree index leaf pages
Date: 2005-12-25 01:50:34
Message-ID: 22803.1135475434@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Manfred Koizar <mkoi-pg(at)aon(dot)at> writes:
> BTW, if after having locked both pages we find that we have
> super-exclusive locks, i.e. nobody else has pins on these pages, we
> can reorganize much more agressively.

No, we cannot. I'm getting tired of explaining this.

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Kevin Brown <kevin(at)sysexperts(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Unsplitting btree index leaf pages
Date: 2005-12-25 02:01:27
Message-ID: 22873.1135476087@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Kevin Brown <kevin(at)sysexperts(dot)com> writes:
> Well, REINDEX is apparently a very expensive operation right now. But
> how expensive would it be to go through the entire index and perform
> the index page merge operation being discussed here, and nothing else?
> If it's fast enough, might it be worthwhile to implement just this
> alone as a separate maintenance command (e.g., VACUUM INDEX) that
> acquires the appropriate lock (AccessExclusive, I'd expect) on the
> index to prevent exactly the issues you're concerned about?
> If it's fast enough even on large tables, it would be a nice
> alternative to REINDEX, I'd think.

This would work, but it's hard to tell if it'd be worthwhile short
of actually doing an implementation and field-testing it ...

regards, tom lane