Block B-Tree concept

Lists: pgsql-hackers
From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Block B-Tree concept
Date: 2006-09-26 10:16:54
Message-ID: 4518FE16.1020507@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

I've been experimenting with the idea of a so-called Block B-Tree. The
basic idea is that instead of storing an index tuple for each heap
tuple, we store an index tuple for each heap block. This dramatically
reduces the size of an index, leading to savings on I/O. This idea was
briefly discussed in January:
http://archives.postgresql.org/pgsql-hackers/2006-01/msg00565.php

To make it actually work, the semantics of the B-Tree has been modified
so that every index tuple represents 1 or more heap tuples that fall
within some range of values, and are on the same heap page. The range
that an index tuple represents is from X, inclusive, to Y, exclusive,
where X is the key of the index tuple and Y is the key of the *next*
index tuple in the index. If the heap is in index order (as after
CLUSTER), we get a very compact index this way, effectively eliminating
the leaf level of the B-tree.

To locate the actual matching items on the heap page, we have to scan
the heap page because we don't have the item ids, so this is a tradeoff
between CPU and I/O. However, we could have a hybrid where we initially
store heap tuple pointers like we do now, and when there's more than X
consecutive pointers to the same page, we collapse them to just one
pointer to the whole page. This would potentially give us the best of
both worlds.

This design is more flexible and less invasive than true
index-organized-tables, because it doesn't require perfect ordering of
the heap or moving heap tuples around. You can also define than one
Block B-Tree on a table, though you wouldn't get much benefit from using
Block B-Tree instead of normal B-Tree if there's no correlation between
the index order and the heap order.

It also fits in nicely with the bitmap scans, since there's already
support for "lossy" bitmap pages. Doing normal ordered index scans
requires some coding, but can be done.

Thoughts? I'm thinking of getting this in early in the 8.3 cycle. We'll
have to take a look at the indexam API to support both bitmap indexes
and block B-trees nicely.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Block B-Tree concept
Date: 2006-09-26 11:14:46
Message-ID: 22939.1159269286@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Heikki Linnakangas <heikki(at)enterprisedb(dot)com> writes:
> I've been experimenting with the idea of a so-called Block B-Tree. The
> basic idea is that instead of storing an index tuple for each heap
> tuple, we store an index tuple for each heap block. This dramatically
> reduces the size of an index, leading to savings on I/O.

VACUUM?

regards, tom lane


From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Block B-Tree concept
Date: 2006-09-26 11:26:30
Message-ID: 45190E66.2090805@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Tom Lane wrote:
> Heikki Linnakangas <heikki(at)enterprisedb(dot)com> writes:
>
>> I've been experimenting with the idea of a so-called Block B-Tree. The
>> basic idea is that instead of storing an index tuple for each heap
>> tuple, we store an index tuple for each heap block. This dramatically
>> reduces the size of an index, leading to savings on I/O.
>>
>
> VACUUM?
>
There's a few options that I've thought of this far:

1. Whenever a tuple is found dead on page X, vacuum of the index will
have to go to that page again to see if there's any matching tuples left.

2. Have a reference counter on index tuple that's increased on insert
and decreased by vacuum.

3. Do nothing. Let index scans mark the index tuple as dead when it's
convenient. There's no correctness problem with just leaving dead index
tuples there, because you have to check the index quals on each heap
tuple anyway when you scan.

3. probably isn't an option in itself, but we might want to do some kind
of a mixture of 1 and 3.

I'm thinking that Block B-Tree is not a candidate for non-MVCC system
catalogs, but I think that's OK.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Block B-Tree concept
Date: 2006-09-26 12:13:06
Message-ID: 200609261213.k8QCD6P06043@momjian.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Heikki Linnakangas wrote:
> Tom Lane wrote:
> > Heikki Linnakangas <heikki(at)enterprisedb(dot)com> writes:
> >
> >> I've been experimenting with the idea of a so-called Block B-Tree. The
> >> basic idea is that instead of storing an index tuple for each heap
> >> tuple, we store an index tuple for each heap block. This dramatically
> >> reduces the size of an index, leading to savings on I/O.
> >>
> >
> > VACUUM?
> >
> There's a few options that I've thought of this far:
>
> 1. Whenever a tuple is found dead on page X, vacuum of the index will
> have to go to that page again to see if there's any matching tuples left.

Right now, if an index entry points to a dead tuple, we set a bit in
the index so future lookups do not access the heap. We could set a bit
for block index entries that point to a page that has no live rows, and
have vacuum remove the index entry later.

--
Bruce Momjian bruce(at)momjian(dot)us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +


From: Teodor Sigaev <teodor(at)sigaev(dot)ru>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Block B-Tree concept
Date: 2006-09-26 12:29:30
Message-ID: 45191D2A.7010401@sigaev.ru
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> Right now, if an index entry points to a dead tuple, we set a bit in
> the index so future lookups do not access the heap. We could set a bit
> for block index entries that point to a page that has no live rows, and
> have vacuum remove the index entry later.

GIN don't support this feature...
--
Teodor Sigaev E-mail: teodor(at)sigaev(dot)ru
WWW: http://www.sigaev.ru/


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Teodor Sigaev <teodor(at)sigaev(dot)ru>
Cc: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Block B-Tree concept
Date: 2006-09-26 12:32:59
Message-ID: 200609261232.k8QCWxN07210@momjian.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Teodor Sigaev wrote:
> > Right now, if an index entry points to a dead tuple, we set a bit in
> > the index so future lookups do not access the heap. We could set a bit
> > for block index entries that point to a page that has no live rows, and
> > have vacuum remove the index entry later.
>
> GIN don't support this feature...

I think block indexes would only be for btrees.

--
Bruce Momjian bruce(at)momjian(dot)us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +


From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: Teodor Sigaev <teodor(at)sigaev(dot)ru>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Block B-Tree concept
Date: 2006-09-26 12:37:01
Message-ID: 45191EED.2080502@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Teodor Sigaev wrote:
>> Right now, if an index entry points to a dead tuple, we set a bit in
>> the index so future lookups do not access the heap. We could set a
>> bit for block index entries that point to a page that has no live
>> rows, and
>> have vacuum remove the index entry later.
>
> GIN don't support this feature...
I'm only talking about B-trees at this stage. ISTM that you could do the
same thing with hash indexes, but I haven't given it much thought.

Anyway, I think you'd usually want to use bitmap scans with a Block
B-tree, unless you need sorted output. And bitmap scans don't set the
LP_DELETE flag either. We might want to do something about that.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Block B-Tree concept
Date: 2006-09-26 12:51:10
Message-ID: 23964.1159275070@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Heikki Linnakangas <heikki(at)enterprisedb(dot)com> writes:
> Tom Lane wrote:
>> VACUUM?
>>
> There's a few options that I've thought of this far:

> 1. Whenever a tuple is found dead on page X, vacuum of the index will
> have to go to that page again to see if there's any matching tuples left.

Anything that involves having VACUUM re-evaluate index expressions is a
nonstarter ... or have you already forgotten the optimizations we put
into 8.2 that assume, eg, no sub-transactions within a VACUUM?

> 2. Have a reference counter on index tuple that's increased on insert
> and decreased by vacuum.

The "increase on insert" part I understand, the "decrease by vacuum"
part seems to have the same problem as #1. How do you tell which index
entries should be changed?

> 3. Do nothing. Let index scans mark the index tuple as dead when it's
> convenient. There's no correctness problem with just leaving dead index
> tuples there, because you have to check the index quals on each heap
> tuple anyway when you scan.

And we're back to routine REINDEX I guess :-(. This doesn't seem like a
satisfactory answer.

regards, tom lane


From: Csaba Nagy <nagy(at)ecircle-ag(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>, postgres hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Block B-Tree concept
Date: 2006-09-26 13:15:35
Message-ID: 1159276535.3874.149.camel@coppola.muc.ecircle.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> And we're back to routine REINDEX I guess :-(. This doesn't seem like a
> satisfactory answer.

If the reindex works online, it could be a satisfactory solution.

Cheers,
Csaba.


From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Block B-Tree concept
Date: 2006-09-26 16:11:34
Message-ID: 45195136.8000309@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Tom Lane wrote:
> Anything that involves having VACUUM re-evaluate index expressions is a
> nonstarter ... or have you already forgotten the optimizations we put
> into 8.2 that assume, eg, no sub-transactions within a VACUUM?

Umm, I'm afraid I have. Could you give me a clue?

>> 3. Do nothing. Let index scans mark the index tuple as dead when it's
>> convenient. There's no correctness problem with just leaving dead index
>> tuples there, because you have to check the index quals on each heap
>> tuple anyway when you scan.
>
> And we're back to routine REINDEX I guess :-(. This doesn't seem like a
> satisfactory answer.

In general, it isn't.

Though there are interesting use cases where it would be fine. For
example, if you remove old data by dropping a partition, there's never
really need to vacuum. Or if all of the data is accessed during normal
operation, the index scans set the LP_DELETE flags and no additional
vacuum is really needed.

Also, now that we have concurrent CREATE INDEX, we could implement
concurrent REINDEX as well, I believe.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Block B-Tree concept
Date: 2006-09-26 16:27:01
Message-ID: 20060926162701.GB6330@alvh.no-ip.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Heikki Linnakangas wrote:
> Tom Lane wrote:
> >Anything that involves having VACUUM re-evaluate index expressions is a
> >nonstarter ... or have you already forgotten the optimizations we put
> >into 8.2 that assume, eg, no sub-transactions within a VACUUM?
>
> Umm, I'm afraid I have. Could you give me a clue?

The patch by Hannu Krossing that allows a lazy vacuum to ignore other
lazy vacuums, when computing the OldestXmin.

The point is you can only ignore other lazy vacuums if they are not
going to visit the heap. Otherwise you'd risk removing a tuple that the
other vacuum wants to see.

You could here shout that surely no two vacuums could be trying to clean
the same table at the same time, but it turns out that you can have
functional indexes that scan other tables. Sure, it's a bad idea, but ...

--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Block B-Tree concept
Date: 2006-09-26 16:27:56
Message-ID: 4519550C.9090704@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Heikki Linnakangas wrote:
> Tom Lane wrote:
>> Anything that involves having VACUUM re-evaluate index expressions is a
>> nonstarter ... or have you already forgotten the optimizations we put
>> into 8.2 that assume, eg, no sub-transactions within a VACUUM?
>
> Umm, I'm afraid I have. Could you give me a clue?
I think I found it. Is this what you're talking about (in
commands/vacuum.c):

/*
* During a lazy VACUUM we do not run any user-supplied functions,
* and so it should be safe to not create a transaction snapshot.
*
* We can furthermore set the inVacuum flag, which lets other
* concurrent VACUUMs know that they can ignore this one while
* determining their OldestXmin. (The reason we don't set inVacuum
* during a full VACUUM is exactly that we may have to run user-
* defined functions for functional indexes, and we want to make
* sure that if they use the snapshot set above, any tuples it
* requires can't get removed from other tables. An index function
* that depends on the contents of other tables is arguably broken,
* but we won't break it here by violating transaction semantics.)
*
* Note: the inVacuum flag remains set until CommitTransaction or
* AbortTransaction. We don't want to clear it until we reset
* MyProc->xid/xmin, else OldestXmin might appear to go backwards,
* which is probably Not Good.
*/
MyProc->inVacuum = true;

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Block B-Tree concept
Date: 2006-09-26 16:39:01
Message-ID: 27037.1159288741@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Heikki Linnakangas <heikki(at)enterprisedb(dot)com> writes:
> Heikki Linnakangas wrote:
>> Tom Lane wrote:
>>> Anything that involves having VACUUM re-evaluate index expressions is a
>>> nonstarter ... or have you already forgotten the optimizations we put
>>> into 8.2 that assume, eg, no sub-transactions within a VACUUM?

> I think I found it. Is this what you're talking about (in
> commands/vacuum.c):

That's part of it, but I seem to recall other things too --- in
particular, the point about subtransactions troubles me. Whatever you
think about an index function looking at other tables, it is perfectly
legitimate to have an exception block in an index function, and that
requires subtransactions (at least as plpgsql is now implemented).

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Block B-Tree concept
Date: 2006-09-26 16:43:25
Message-ID: 27160.1159289005@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Heikki Linnakangas <heikki(at)enterprisedb(dot)com> writes:
> Also, now that we have concurrent CREATE INDEX, we could implement
> concurrent REINDEX as well, I believe.

That's probably more easily said than done --- in particular, I don't
understand what the committed state after the first transaction would
look like. CREATE INDEX can get away with it because nothing need be
depending on the new index, but you can't say that for an existing index
(esp. if it's UNIQUE).

regards, tom lane


From: "Jim C(dot) Nasby" <jim(at)nasby(dot)net>
To: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Block B-Tree concept
Date: 2006-09-27 02:47:56
Message-ID: 20060927024756.GA19827@nasby.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Sep 26, 2006 at 11:16:54AM +0100, Heikki Linnakangas wrote:
> To locate the actual matching items on the heap page, we have to scan
> the heap page because we don't have the item ids, so this is a tradeoff
> between CPU and I/O. However, we could have a hybrid where we initially
> store heap tuple pointers like we do now, and when there's more than X
> consecutive pointers to the same page, we collapse them to just one
> pointer to the whole page. This would potentially give us the best of
> both worlds.
>
> This design is more flexible and less invasive than true
> index-organized-tables, because it doesn't require perfect ordering of
> the heap or moving heap tuples around. You can also define than one
> Block B-Tree on a table, though you wouldn't get much benefit from using
> Block B-Tree instead of normal B-Tree if there's no correlation between
> the index order and the heap order.

No, but I think there's scenarios where you may not have extremely high
correlation but you'd still benefit, especially with the hybrid
approach. If you have a field with rather skewed histogram, for example,
where you're likely to have a lot of tuples with one value on any given
page. Of course, you probably would want to exclude that value from the
index entirely if it's on *every* page, but anything where you see
paterns of data wouldn't be like that.
--
Jim Nasby jim(at)nasby(dot)net
EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)


From: "Jim C(dot) Nasby" <jim(at)nasby(dot)net>
To: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Block B-Tree concept
Date: 2006-09-27 05:50:57
Message-ID: 20060927055057.GB19827@nasby.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Sep 26, 2006 at 05:27:56PM +0100, Heikki Linnakangas wrote:
> Heikki Linnakangas wrote:
> >Tom Lane wrote:
> >>Anything that involves having VACUUM re-evaluate index expressions is a
> >>nonstarter ... or have you already forgotten the optimizations we put
> >>into 8.2 that assume, eg, no sub-transactions within a VACUUM?
> >
> >Umm, I'm afraid I have. Could you give me a clue?
> I think I found it. Is this what you're talking about (in
> commands/vacuum.c):
>
> /*
> * During a lazy VACUUM we do not run any user-supplied functions,
> * and so it should be safe to not create a transaction snapshot.
> *
> * We can furthermore set the inVacuum flag, which lets other
> * concurrent VACUUMs know that they can ignore this one while
> * determining their OldestXmin. (The reason we don't set inVacuum
> * during a full VACUUM is exactly that we may have to run user-
> * defined functions for functional indexes, and we want to make
> * sure that if they use the snapshot set above, any tuples it
> * requires can't get removed from other tables. An index function
> * that depends on the contents of other tables is arguably broken,
> * but we won't break it here by violating transaction semantics.)
> *
> * Note: the inVacuum flag remains set until CommitTransaction or
> * AbortTransaction. We don't want to clear it until we reset
> * MyProc->xid/xmin, else OldestXmin might appear to go backwards,
> * which is probably Not Good.
> */
> MyProc->inVacuum = true;

Do I understand that to mean that you can no longer lazy vacuum a
functional index?
--
Jim Nasby jim(at)nasby(dot)net
EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)


From: "Jim C(dot) Nasby" <jim(at)nasby(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Block B-Tree concept
Date: 2006-09-27 05:52:53
Message-ID: 20060927055253.GC19827@nasby.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Sep 26, 2006 at 08:51:10AM -0400, Tom Lane wrote:
> > 3. Do nothing. Let index scans mark the index tuple as dead when it's
> > convenient. There's no correctness problem with just leaving dead index
> > tuples there, because you have to check the index quals on each heap
> > tuple anyway when you scan.
>
> And we're back to routine REINDEX I guess :-(. This doesn't seem like a
> satisfactory answer.

Couldn't vacuum just eliminate tuples marked dead? Heck, don't we do
that anyway right now?

Granted, you'd want to periodically ensure that you scan the entire
index, but that shouldn't be horribly hard to set up.
--
Jim Nasby jim(at)nasby(dot)net
EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)


From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: "Jim C(dot) Nasby" <jim(at)nasby(dot)net>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Block B-Tree concept
Date: 2006-09-27 08:17:52
Message-ID: 451A33B0.6000305@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Jim C. Nasby wrote:
> Do I understand that to mean that you can no longer lazy vacuum a
> functional index?

With the normal B-trees we have now, there's no problem vacuuming a
functional index. But it would be a problem with the Block B-tree,
because the proposed way of vacuuming a Block B-tree involves
re-evaluating index expressions.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: "Jim C(dot) Nasby" <jim(at)nasby(dot)net>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Block B-Tree concept
Date: 2006-09-27 08:23:02
Message-ID: 451A34E6.8020705@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Jim C. Nasby wrote:
> Couldn't vacuum just eliminate tuples marked dead? Heck, don't we do
> that anyway right now?

You mean _index_ tuples marked dead? Sure, no problem there.

> Granted, you'd want to periodically ensure that you scan the entire
> index, but that shouldn't be horribly hard to set up.

Well, it seems to be. A vacuum can't evaluate index expressions because
it's not in a real transaction.

The DBA could set up a cron job to do "SELECT * FROM foo WHERE bar > 0"
etc. with enable_seqscan=false? That would work, but we can't depend on
an additional administrative task like. And we might as well just
disable the optimization that's causing us problems.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc: "Jim C(dot) Nasby" <jim(at)nasby(dot)net>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Block B-Tree concept
Date: 2006-09-27 09:38:38
Message-ID: 200609270938.k8R9ccC05084@momjian.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Heikki Linnakangas wrote:
> Jim C. Nasby wrote:
> > Couldn't vacuum just eliminate tuples marked dead? Heck, don't we do
> > that anyway right now?
>
> You mean _index_ tuples marked dead? Sure, no problem there.
>
> > Granted, you'd want to periodically ensure that you scan the entire
> > index, but that shouldn't be horribly hard to set up.
>
> Well, it seems to be. A vacuum can't evaluate index expressions because
> it's not in a real transaction.
>
> The DBA could set up a cron job to do "SELECT * FROM foo WHERE bar > 0"
> etc. with enable_seqscan=false? That would work, but we can't depend on
> an additional administrative task like. And we might as well just
> disable the optimization that's causing us problems.

Why can't the C code just do a full index scan that touches the heap,
sets those expired bits, and then do a vacuum? My point is that the
bits can be set outside the normal vacuum process, so you don't have to
be doing heap lookups from the index inside vacuum.

Assuming the heap is mostly in index order, the full index scan
shouldn't take much longer than a heap scan, and if the heap doesn't
match index order, a block index shouldn't be used anyway.

--
Bruce Momjian bruce(at)momjian(dot)us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +


From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: "Jim C(dot) Nasby" <jim(at)nasby(dot)net>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Block B-Tree concept
Date: 2006-09-27 10:11:24
Message-ID: 451A4E4C.1000402@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Bruce Momjian wrote:
> Heikki Linnakangas wrote:
>> Jim C. Nasby wrote:
>>> Granted, you'd want to periodically ensure that you scan the entire
>>> index, but that shouldn't be horribly hard to set up.
>> Well, it seems to be. A vacuum can't evaluate index expressions because
>> it's not in a real transaction.
>>
>> The DBA could set up a cron job to do "SELECT * FROM foo WHERE bar > 0"
>> etc. with enable_seqscan=false? That would work, but we can't depend on
>> an additional administrative task like. And we might as well just
>> disable the optimization that's causing us problems.
>
> Why can't the C code just do a full index scan that touches the heap,
> sets those expired bits, and then do a vacuum? My point is that the
> bits can be set outside the normal vacuum process, so you don't have to
> be doing heap lookups from the index inside vacuum.

The point of the optimization that's causing problems was to reduce the
effect of long-running vacuum transactions. If we're going to have
another long running transaction instead, we're back to square one.

AFAICS, we could disable the optimization and use a full-blown
transaction when vacuuming a table with a functional block index.
Granted, that's annoying, but not a show-stopper I think.

> Assuming the heap is mostly in index order, the full index scan
> shouldn't take much longer than a heap scan, and if the heap doesn't
> match index order, a block index shouldn't be used anyway.

It introduces one more full heap scan for each block index on a table.
That's expensive.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, "Jim C(dot) Nasby" <jim(at)nasby(dot)net>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Block B-Tree concept
Date: 2006-09-27 14:10:09
Message-ID: 8583.1159366209@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Heikki Linnakangas <heikki(at)enterprisedb(dot)com> writes:
> AFAICS, we could disable the optimization and use a full-blown
> transaction when vacuuming a table with a functional block index.

No, we couldn't, or at least it's not merely a matter of reversing a
recent optimization.

The fundamental issue with all these proposals is the assumption that
you can re-compute the index entries at all. VACUUM has never, ever,
depended on the assumption that it can re-evaluate index entries and
get the same answers as the original insertion did. Now obviously
it should theoretically be able to do that, in a theoretical bug-free
world, but given that we allow user-defined functions in index
expressions that is a very hazardous assumption: you might get a
different answer. Or an error. The current VACUUM procedure is able
to clean up index entries correctly without any recalculation of the
index values, just matching of TIDs. I think we'd be taking a serious
robustness hit if we abandon that property.

This is basically the same objection that I have to the occasional
proposals for "retail VACUUM".

BTW, it's not merely a problem for functional indexes: the
datatype-specific functions invoked while searching an index are also
hazards.

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Gregory Stark <stark(at)enterprisedb(dot)com>
Cc: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Block B-Tree concept
Date: 2006-09-27 14:30:21
Message-ID: 8809.1159367421@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Gregory Stark <stark(at)enterprisedb(dot)com> writes:
> Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:
>> That's probably more easily said than done --- in particular, I don't
>> understand what the committed state after the first transaction would
>> look like.

> I think you build a whole new index named something like ".temp-reindex" and
> then as the last step of the second transaction delete the old idnex and
> rename the new index.

That would require getting exclusive lock on the table.

regards, tom lane


From: Csaba Nagy <nagy(at)ecircle-ag(dot)com>
To: postgres hackers <pgsql-hackers(at)postgresql(dot)org>
Cc: Gregory Stark <stark(at)enterprisedb(dot)com>, Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Subject: Re: Block B-Tree concept
Date: 2006-09-27 14:40:06
Message-ID: 1159368006.3874.171.camel@coppola.muc.ecircle.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> > I think you build a whole new index named something like ".temp-reindex" and
> > then as the last step of the second transaction delete the old idnex and
> > rename the new index.
>
> That would require getting exclusive lock on the table.

Just out of curiosity, creating a new index concurrently (or online,
whatever you call it) doesn't require to set an exclusive lock on the
table ? I thought it would, at least swiftly at the end of the
operation, after all it's modifying the table...

Cheers,
Csaba.


From: "Jim C(dot) Nasby" <jim(at)nasby(dot)net>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Block B-Tree concept
Date: 2006-09-27 16:04:51
Message-ID: 20060927160451.GL19827@nasby.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Sep 27, 2006 at 05:38:38AM -0400, Bruce Momjian wrote:
> Heikki Linnakangas wrote:
> > Jim C. Nasby wrote:
> > > Couldn't vacuum just eliminate tuples marked dead? Heck, don't we do
> > > that anyway right now?
> >
> > You mean _index_ tuples marked dead? Sure, no problem there.
> >
> > > Granted, you'd want to periodically ensure that you scan the entire
> > > index, but that shouldn't be horribly hard to set up.
> >
> > Well, it seems to be. A vacuum can't evaluate index expressions because
> > it's not in a real transaction.
> >
> > The DBA could set up a cron job to do "SELECT * FROM foo WHERE bar > 0"
> > etc. with enable_seqscan=false? That would work, but we can't depend on
> > an additional administrative task like. And we might as well just
> > disable the optimization that's causing us problems.
>
> Why can't the C code just do a full index scan that touches the heap,
> sets those expired bits, and then do a vacuum? My point is that the
> bits can be set outside the normal vacuum process, so you don't have to
> be doing heap lookups from the index inside vacuum.
>
> Assuming the heap is mostly in index order, the full index scan
> shouldn't take much longer than a heap scan, and if the heap doesn't
> match index order, a block index shouldn't be used anyway.

Well, my thought was to have a backend process that would periodically
scan a small section of the index, so that you wouldn't have a
long-running transaction. That could then be followed by a vacuum of
that same section of the index, which would nuke the dead tuple entries.

Though, maybe we wouldn't even need the vacuum step since 8.2 will now
reclaim tuples marked as dead?
--
Jim Nasby jim(at)nasby(dot)net
EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)


From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Block B-Tree concept
Date: 2006-09-27 18:18:19
Message-ID: 87hcyt8978.fsf@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:

> Heikki Linnakangas <heikki(at)enterprisedb(dot)com> writes:
>> Also, now that we have concurrent CREATE INDEX, we could implement
>> concurrent REINDEX as well, I believe.
>
> That's probably more easily said than done --- in particular, I don't
> understand what the committed state after the first transaction would
> look like. CREATE INDEX can get away with it because nothing need be
> depending on the new index, but you can't say that for an existing index
> (esp. if it's UNIQUE).

I think you build a whole new index named something like ".temp-reindex" and
then as the last step of the second transaction delete the old idnex and
rename the new index.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com


From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, "Jim C(dot) Nasby" <jim(at)nasby(dot)net>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Block B-Tree concept
Date: 2006-09-28 11:55:05
Message-ID: 451BB819.70705@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Tom Lane wrote:
> Heikki Linnakangas <heikki(at)enterprisedb(dot)com> writes:
>> AFAICS, we could disable the optimization and use a full-blown
>> transaction when vacuuming a table with a functional block index.
>
> No, we couldn't, or at least it's not merely a matter of reversing a
> recent optimization.
>
> The fundamental issue with all these proposals is the assumption that
> you can re-compute the index entries at all. VACUUM has never, ever,
> depended on the assumption that it can re-evaluate index entries and
> get the same answers as the original insertion did. Now obviously
> it should theoretically be able to do that, in a theoretical bug-free
> world, but given that we allow user-defined functions in index
> expressions that is a very hazardous assumption: you might get a
> different answer. Or an error. The current VACUUM procedure is able
> to clean up index entries correctly without any recalculation of the
> index values, just matching of TIDs. I think we'd be taking a serious
> robustness hit if we abandon that property.

I'm not worried about getting different results. If a used-defined
function behaves badly, you're queries are screwed anyway. But throwing
an error would be bad, because that would abort the whole vacuum.

If we want to keep the property that VACUUM doesn't re-evaluate index
entries, any proposal that doesn't keep track of every heap tuple isn't
going to work. I'll try to modify the design I had in mind so that it
does keep track of every heap tuple in some form.

> This is basically the same objection that I have to the occasional
> proposals for "retail VACUUM".

Yeah. :-(

> BTW, it's not merely a problem for functional indexes: the
> datatype-specific functions invoked while searching an index are also
> hazards.

Good point. I didn't realize that before.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Bruce Momjian <bruce(at)momjian(dot)us>, "Jim C(dot) Nasby" <jim(at)nasby(dot)net>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Block B-Tree concept
Date: 2006-09-29 09:51:32
Message-ID: 451CECA4.90300@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Heikki Linnakangas wrote:
> If we want to keep the property that VACUUM doesn't re-evaluate index
> entries, any proposal that doesn't keep track of every heap tuple
> isn't going to work. I'll try to modify the design I had in mind so
> that it does keep track of every heap tuple in some form.
After some thought:

Imagine a normal B-tree just like what we have now. But when there is
more than one tuple on the same heap page with consecutive index keys,
we represent all of them in a single index tuple that contains the key
of the first one of them, and a (run-length encoded) bitmap of the
OffsetNumbers. We should get most of the space and I/O savings as with
the original proposal, but we can vacuum it without re-evaluating index
expressions.

It does change the format of an index tuple, unlike the original
proposal. That adds some complexity. but it's doable.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Martijn van Oosterhout <kleptog(at)svana(dot)org>
To: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Bruce Momjian <bruce(at)momjian(dot)us>, "Jim C(dot) Nasby" <jim(at)nasby(dot)net>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Block B-Tree concept
Date: 2006-09-29 10:13:05
Message-ID: 20060929101305.GC8702@svana.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Sep 29, 2006 at 10:51:32AM +0100, Heikki Linnakangas wrote:
> After some thought:
>
> Imagine a normal B-tree just like what we have now. But when there is
> more than one tuple on the same heap page with consecutive index keys,
> we represent all of them in a single index tuple that contains the key
> of the first one of them, and a (run-length encoded) bitmap of the
> OffsetNumbers. We should get most of the space and I/O savings as with
> the original proposal, but we can vacuum it without re-evaluating index
> expressions.

I think something like this has been discussed before. Where an index
tuple currently has the key values followed by a ctid, simply change
that so that it can be a list of ctid's, in order.

This works on having the same key, but doesn't care if the tuples are
on the same page. It used to be not possible because of how to handle
forward and backward scanning within an index entry during updates. I
think with the new "page at a time" index scanning, this got a lot
easier.

One issue is that a single index page could hold more than 1000 index
entries, which might cause problems for callers.

> It does change the format of an index tuple, unlike the original
> proposal. That adds some complexity. but it's doable.

This way doesn't change the current index format much.

Have a nice day,
--
Martijn van Oosterhout <kleptog(at)svana(dot)org> http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.


From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: Martijn van Oosterhout <kleptog(at)svana(dot)org>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Bruce Momjian <bruce(at)momjian(dot)us>, "Jim C(dot) Nasby" <jim(at)nasby(dot)net>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Block B-Tree concept
Date: 2006-09-29 10:33:28
Message-ID: 451CF678.1040902@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Martijn van Oosterhout wrote:
> On Fri, Sep 29, 2006 at 10:51:32AM +0100, Heikki Linnakangas wrote:
>> After some thought:
>>
>> Imagine a normal B-tree just like what we have now. But when there is
>> more than one tuple on the same heap page with consecutive index keys,
>> we represent all of them in a single index tuple that contains the key
>> of the first one of them, and a (run-length encoded) bitmap of the
>> OffsetNumbers. We should get most of the space and I/O savings as with
>> the original proposal, but we can vacuum it without re-evaluating index
>> expressions.
>
> I think something like this has been discussed before. Where an index
> tuple currently has the key values followed by a ctid, simply change
> that so that it can be a list of ctid's, in order.

Actually it's t_tid followed by t_info (which is size + flags) followed
by key values.

> This works on having the same key, but doesn't care if the tuples are
> on the same page. It used to be not possible because of how to handle
> forward and backward scanning within an index entry during updates. I
> think with the new "page at a time" index scanning, this got a lot
> easier.

I'm not very interested in the case where you have a lot of equal keys,
I think the bitmap index am is more suitable for that. The Block B-tree
could be used whenever you have a clustered table (including unique
indexes). Some DBMSs have index-organized-tables for the same use case.

When I tested a prototype of the original idea with TPC-C (DBT-2) data,
a block index on the order_line table was under 2% of the size of a
normal B-tree. That's very close to a best-case scenario; narrow rows
and a completely clustered table. I'm aiming at that order of magnitude
reduction in storage size.

> One issue is that a single index page could hold more than 1000 index
> entries, which might cause problems for callers.

I don't think that's a problem.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Bruce Momjian <bruce(at)momjian(dot)us>, "Jim C(dot) Nasby" <jim(at)nasby(dot)net>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Block B-Tree concept
Date: 2006-09-29 12:58:10
Message-ID: 1159534690.2767.348.camel@holly
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, 2006-09-29 at 10:51 +0100, Heikki Linnakangas wrote:
> Heikki Linnakangas wrote:
> > If we want to keep the property that VACUUM doesn't re-evaluate index
> > entries, any proposal that doesn't keep track of every heap tuple
> > isn't going to work. I'll try to modify the design I had in mind so
> > that it does keep track of every heap tuple in some form.

The ideal situation is that we have one index pointer per block, so we
should look for that and optimize accordingly. We mark the heap block to
indicate how many block index pointers there are to it. If we have only
a single pointer, then VACUUM will only have to touch the index pointer
when the whole heap block is removed. In that case we have no
re-evaluation of the index AFAICS.

> After some thought:
>
> Imagine a normal B-tree just like what we have now. But when there is
> more than one tuple on the same heap page with consecutive index keys,
> we represent all of them in a single index tuple that contains the key
> of the first one of them, and a (run-length encoded) bitmap of the
> OffsetNumbers. We should get most of the space and I/O savings as with
> the original proposal, but we can vacuum it without re-evaluating index
> expressions.

The benefit we're seeking with a block index is that most INSERTs don't
write to the index. With that scheme we'd need to continually update the
index tuple so that it exactly represented the heap after each inserted
tuple, which is going to cause a hot block problem.

Much of that can go away if we have a bulk_heap_insert() which puts the
index entries in at the end of the block, though that needs some heavy
thought also.

Can we have this scheme enabled *only* for functional block indexes?
Normal case is likely to be monotonically ascending keys for real world
objects like Orders, Calls, Transactions etc.. It sounds like the
original proposal is still valid for those cases.

The bitmap would allow us to access heap rows faster in some
circumstances, I suppose.

Multi-block bitmaps do make this too much like bitmap indexes and the
use-case is very different. [Is there some kind of hybrid solution of
block & bitmap indexes?]

> It does change the format of an index tuple, unlike the original
> proposal. That adds some complexity. but it's doable.

Can we use an info bit to have two index tuple formats?
- single tuple (as now)
- multiple tuple block bitmap (as you suggest)

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com


From: Gregory Stark <gsstark(at)mit(dot)edu>
To: Csaba Nagy <nagy(at)ecircle-ag(dot)com>
Cc: postgres hackers <pgsql-hackers(at)postgresql(dot)org>, Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Subject: Re: Block B-Tree concept
Date: 2006-09-29 12:59:09
Message-ID: 87ven63k2q.fsf@stark.xeocode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Csaba Nagy <nagy(at)ecircle-ag(dot)com> writes:

> > > I think you build a whole new index named something like ".temp-reindex" and
> > > then as the last step of the second transaction delete the old idnex and
> > > rename the new index.
> >
> > That would require getting exclusive lock on the table.
>
> Just out of curiosity, creating a new index concurrently (or online,
> whatever you call it) doesn't require to set an exclusive lock on the
> table ? I thought it would, at least swiftly at the end of the
> operation, after all it's modifying the table...

Nope.

As I understand it the step that fundamentally requires a table lock is
actually dropping the old index. You have to be sure nobody is actually using
it before you do anything that causes people to stop maintaining it.

We could do something like how the online index build creates the index but in
reverse. We mark the index invalid and then wait out any transactions that
could be using it. Then we can drop it safely.

But I think even that has some practical problems. Transactions that have that
index in their relcache structure for the table will try to maintain it and
get confused if it's gone.

It seems to me that taking a brief lock on the table at the end of the reindex
isn't actually much of a problem. It only needs to be held briefly and it can
be done in a separate transaction so there isn't a deadlock risk.

--
greg


From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Bruce Momjian <bruce(at)momjian(dot)us>, "Jim C(dot) Nasby" <jim(at)nasby(dot)net>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Block B-Tree concept
Date: 2006-09-29 13:54:58
Message-ID: 451D25B2.9020208@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Simon Riggs wrote:
> On Fri, 2006-09-29 at 10:51 +0100, Heikki Linnakangas wrote:
>> Heikki Linnakangas wrote:
>>> If we want to keep the property that VACUUM doesn't re-evaluate index
>>> entries, any proposal that doesn't keep track of every heap tuple
>>> isn't going to work. I'll try to modify the design I had in mind so
>>> that it does keep track of every heap tuple in some form.
>
> The ideal situation is that we have one index pointer per block, so we
> should look for that and optimize accordingly. We mark the heap block to
> indicate how many block index pointers there are to it. If we have only
> a single pointer, then VACUUM will only have to touch the index pointer
> when the whole heap block is removed. In that case we have no
> re-evaluation of the index AFAICS.

I don't see how that would work. It sounds similar to the reference
counting option I proposed, which had the same re-evaluation problem.

And in addition, it requires adding index-specific information to the
heap page format, which troubles me from a modularization viewpoint.

>> After some thought:
>>
>> Imagine a normal B-tree just like what we have now. But when there is
>> more than one tuple on the same heap page with consecutive index keys,
>> we represent all of them in a single index tuple that contains the key
>> of the first one of them, and a (run-length encoded) bitmap of the
>> OffsetNumbers. We should get most of the space and I/O savings as with
>> the original proposal, but we can vacuum it without re-evaluating index
>> expressions.
>
> The benefit we're seeking with a block index is that most INSERTs don't
> write to the index. With that scheme we'd need to continually update the
> index tuple so that it exactly represented the heap after each inserted
> tuple, which is going to cause a hot block problem.

That's just one of the benefits. I think the main benefit is dramatic
reduction in index size which means that more of the index is cached.

An INSERT will have to find the corresponding leaf page anyway. Having
to dirty it isn't a big deal assuming that the hot blocks stay in cache.

The hot block problem worries me a bit too. Any indexing scheme that
packs more items on a block is going to suffer from that. Testing will
show if that becomes a problem.

> Can we have this scheme enabled *only* for functional block indexes?

No. As Tom pointed out, data type specific functions have potentially
the same problem.

And having both versions seems like a lot of code and complexity.

> The bitmap would allow us to access heap rows faster in some
> circumstances, I suppose.

Yes, you wouldn't have to re-evaluate index quals on every tuple, when
the whole range represented by the index tuple falls within the range
you're searching for. And when there's only few tuples with consecutive
keys on a heap page (which is not a good use case for block B-trees),
you don't need to scan the whole page to find those matches.

> Multi-block bitmaps do make this too much like bitmap indexes and the
> use-case is very different. [Is there some kind of hybrid solution of
> block & bitmap indexes?]

Not that I know of, though there is different kind of bitmap indexes.
The one that didn't make it to 8.2 uses equality encoding, where you
have a bitmap for every distinct value. You can also have
range-encoding, where you have a bitmap for ranges of values, for
example one bitmap for 1-10, another for 10-15 etc. If you choose the
ranges dynamically so that you have one range for each heap page (when
it's clustered), you get something similar to the proposed Block B-tree.

The current bitmap encoding scheme is optimized for large bitmaps,
though, so the performance wouldn't be as good.

>> It does change the format of an index tuple, unlike the original
>> proposal. That adds some complexity. but it's doable.
>
> Can we use an info bit to have two index tuple formats?
> - single tuple (as now)
> - multiple tuple block bitmap (as you suggest)

Yes. There's one bit free in the index tuple header.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, "Jim C(dot) Nasby" <jim(at)nasby(dot)net>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Block B-Tree concept
Date: 2006-09-29 14:23:31
Message-ID: 15221.1159539811@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Heikki Linnakangas <heikki(at)enterprisedb(dot)com> writes:
> Imagine a normal B-tree just like what we have now. But when there is
> more than one tuple on the same heap page with consecutive index keys,
> we represent all of them in a single index tuple that contains the key
> of the first one of them, and a (run-length encoded) bitmap of the
> OffsetNumbers.

At first I thought that was a typo, and instead of "consecutive" you
meant to write "equal". I gather from the later statement

> I'm not very interested in the case where you have a lot of equal keys,
> I think the bitmap index am is more suitable for that.

that indeed you meant to write "consecutive", and I've got a problem
with that: define "consecutive". In a datatype independent fashion,
please. I also wonder how you are going to implement splitting and
merging of runs, which will certainly be necessary if this isn't to be
a constantly-requires-REINDEX thing.

regards, tom lane


From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Bruce Momjian <bruce(at)momjian(dot)us>, "Jim C(dot) Nasby" <jim(at)nasby(dot)net>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Block B-Tree concept
Date: 2006-09-29 14:39:25
Message-ID: 1159540765.2767.359.camel@holly
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, 2006-09-29 at 14:54 +0100, Heikki Linnakangas wrote:

> > The benefit we're seeking with a block index is that most INSERTs don't
> > write to the index. With that scheme we'd need to continually update the
> > index tuple so that it exactly represented the heap after each inserted
> > tuple, which is going to cause a hot block problem.
>
> That's just one of the benefits. I think the main benefit is dramatic
> reduction in index size which means that more of the index is cached.
>
> An INSERT will have to find the corresponding leaf page anyway. Having
> to dirty it isn't a big deal assuming that the hot blocks stay in cache.

The index tuple would potentially grow in length while we update it, so
that means we'd need exclusive access to write, rather than shared
access to just read the index.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com


From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, "Jim C(dot) Nasby" <jim(at)nasby(dot)net>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Block B-Tree concept
Date: 2006-09-29 14:55:20
Message-ID: 451D33D8.30708@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Tom Lane wrote:
> Heikki Linnakangas <heikki(at)enterprisedb(dot)com> writes:
>
>> I'm not very interested in the case where you have a lot of equal keys,
>> I think the bitmap index am is more suitable for that.
>>
>
> that indeed you meant to write "consecutive", and I've got a problem
> with that: define "consecutive". In a datatype independent fashion,
> please. I also wonder how you are going to implement splitting and
> merging of runs, which will certainly be necessary if this isn't to be
> a constantly-requires-REINDEX thing.
>

I don't mean consecutive as in "1, 2, 3, 4, ... without gaps" but as in
"A and B are consecutive in the index, if there's no value X in the
index so that A < X < B". Maybe there's a better word for that.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Jan de Visser <jdevisser(at)digitalfairway(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Cc: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Bruce Momjian <bruce(at)momjian(dot)us>, "Jim C(dot) Nasby" <jim(at)nasby(dot)net>
Subject: Re: Block B-Tree concept
Date: 2006-09-29 15:08:58
Message-ID: 200609291108.59303.jdevisser@digitalfairway.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Friday 29 September 2006 10:55, Heikki Linnakangas wrote:
> Tom Lane wrote:
> > Heikki Linnakangas <heikki(at)enterprisedb(dot)com> writes:
> >> I'm not very interested in the case where you have a lot of equal keys,
> >> I think the bitmap index am is more suitable for that.
> >
> > that indeed you meant to write "consecutive", and I've got a problem
> > with that: define "consecutive". In a datatype independent fashion,
> > please. I also wonder how you are going to implement splitting and
> > merging of runs, which will certainly be necessary if this isn't to be
> > a constantly-requires-REINDEX thing.
>
> I don't mean consecutive as in "1, 2, 3, 4, ... without gaps" but as in
> "A and B are consecutive in the index, if there's no value X in the
> index so that A < X < B". Maybe there's a better word for that.

http://en.wikipedia.org/wiki/Monotonic

jan

--
--------------------------------------------------------------
Jan de Visser                     jdevisser(at)digitalfairway(dot)com

                Baruk Khazad! Khazad ai-menu!
--------------------------------------------------------------


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, "Jim C(dot) Nasby" <jim(at)nasby(dot)net>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Block B-Tree concept
Date: 2006-09-29 15:57:39
Message-ID: 16785.1159545459@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Heikki Linnakangas <heikki(at)enterprisedb(dot)com> writes:
> Tom Lane wrote:
>> define "consecutive".

> I don't mean consecutive as in "1, 2, 3, 4, ... without gaps" but as in
> "A and B are consecutive in the index, if there's no value X in the
> index so that A < X < B". Maybe there's a better word for that.

Um, but how are you going to make that work without storing the keys for
both A and B? You won't be able to tell whether an incoming key C
that's just a bit bigger than A should go before or after B.

regards, tom lane


From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, "Jim C(dot) Nasby" <jim(at)nasby(dot)net>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Block B-Tree concept
Date: 2006-09-29 16:22:27
Message-ID: 451D4843.5050704@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Tom Lane wrote:
> Heikki Linnakangas <heikki(at)enterprisedb(dot)com> writes:
>
>> I don't mean consecutive as in "1, 2, 3, 4, ... without gaps" but as in
>> "A and B are consecutive in the index, if there's no value X in the
>> index so that A < X < B". Maybe there's a better word for that.
>>
>
> Um, but how are you going to make that work without storing the keys for
> both A and B? You won't be able to tell whether an incoming key C
> that's just a bit bigger than A should go before or after B.
>

Let me describe the insertion algorithm:

1. To insert a tuple with key X, we find the position in the index where
the new tuple would go, just like with a normal B-tree. Let's call the
index tuple right before the position A and the next tuple B. So
according to normal B-tree rules, X should go between A and B.

2. If A points to the same heap page as X, we set the bit representing
the offset of the new tuple in the index tuple A (this might require
enlarging the index tuple and event splitting the page), and we're done.
If it points to a different page, we need split the range A-B to A-X-B,
proceed to step 3.

3. To split the range A-B, scan the heap page to see which of the tuples
pointed to by A are >= X and which are < X . If there's no tuples >= X,
insert a new index tuple for X, and we're done. Otherwise, let Y be the
smallest tuple >= X. Insert a new index tuple for Y, containing all the
offsets with keys >= X, and remove those offsets from A. We have now
split A-B to A-Y-B so that A < X < Y < B.

4. Insert the new index tuple for X.

(I'm not sure I got the above description correct for cases with equal
keys.)

Note that we don't keep track of the ordering of tuples that are clumped
into a single index tuple. That's not important, I/O wise, because if
you're going to fetch a heap page into memory, you might as well scan
all the tuples on it and sort them if necessary. That's where the space
and I/O savings comes from.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com