Maintaining cluster order on insert

Lists: pgsql-hackers pgsql-patches
From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: pgsql-hackers(at)postgresql(dot)org, pgsql-patches(at)postgresql(dot)org
Subject: Maintaining cluster order on insert
Date: 2006-08-09 19:04:17
Message-ID: 44DA31B1.3090700@enterprisedb.com
Lists: pgsql-hackers pgsql-patches

While thinking about index-organized-tables and similar ideas, it
occurred to me that there's some low-hanging-fruit: maintaining cluster
order on inserts by trying to place new heap tuples close to other
similar tuples. That involves asking the index am where on the heap the
new tuple should go, and trying to insert it there before using the FSM.
Using the new fillfactor parameter makes it more likely that there's
room on the page. We don't worry about the order within the page.

The API I'm thinking of introduces a new optional index am function,
amsuggestblock (suggestions for a better name are welcome). It gets the
same parameters as aminsert, and returns the heap block number that
would be the optimal place to put the new tuple. It would be called from
ExecInsert before inserting the heap tuple, and the suggestion is passed
on to heap_insert and RelationGetBufferForTuple.
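
To illustrate the flow, here's a toy model of the idea (not code from the
patch; all the names below are invented stand-ins for the real executor,
index am and FSM calls):

#include <stdbool.h>
#include <stdio.h>

typedef unsigned int BlockNumber;
#define InvalidBlockNumber ((BlockNumber) 0xFFFFFFFF)

/* ~ amsuggestblock: which heap block do similar index entries point to? */
static BlockNumber
suggest_block_from_index(int key)
{
    return (BlockNumber) (key / 100);   /* pretend 100 keys fit per heap block */
}

/* ~ free-space check on the suggested page; fillfactor leaves the slack */
static bool
page_has_room(BlockNumber blk)
{
    return (blk % 2) == 0;              /* arbitrary stand-in condition */
}

/* ~ GetPageWithFreeSpace: the ordinary FSM lookup */
static BlockNumber
get_page_from_fsm(void)
{
    return 9999;
}

/* Simplified ExecInsert/heap_insert decision using the suggestion. */
static BlockNumber
choose_target_block(int key)
{
    BlockNumber suggested = suggest_block_from_index(key);

    if (suggested != InvalidBlockNumber && page_has_room(suggested))
        return suggested;               /* keep the tuple next to similar ones */
    return get_page_from_fsm();         /* fall back to the normal FSM path */
}

int
main(void)
{
    printf("key 200 -> block %u\n", choose_target_block(200));
    printf("key 150 -> block %u\n", choose_target_block(150));
    return 0;
}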

I wrote a little patch to implement this for btree, attached.

This could be optimized by changing the existing aminsert API, because
as it is, an insert will have to descend the btree twice. Once in
amsuggestblock and then in aminsert. amsuggestblock could keep the right
index page pinned so aminsert could locate it quicker. But I wanted to
keep this simple for now. Another improvement might be to allow
amsuggestblock to return a list of suggestions, but that makes it more
expensive to insert if there isn't room in the suggested pages, since
heap_insert will have to try them all before giving up.

Comments regarding the general idea or the patch? There should probably
be an index option to turn the feature on and off. You'll want to turn it
off when you first load a table, and turn it on after CLUSTER to keep it
clustered.

Since there's been discussion on keeping the TODO list more up-to-date,
I hereby officially claim the "Automatically maintain clustering on a
table" TODO item :). Feel free to bombard me with requests for status
reports. And just to be clear, I'm not trying to sneak this into 8.2
anymore, this is 8.3 stuff.

I won't be implementing a background daemon described on the TODO item,
since that would essentially be an online version of CLUSTER. Which sure
would be nice, but that's a different story.

- Heikki

Attachment Content-Type Size
clustermaintenance.diff text/x-patch 34.1 KB

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql-hackers(at)postgresql(dot)org, Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc: pgsql-patches(at)postgresql(dot)org
Subject: Re: Maintaining cluster order on insert
Date: 2006-08-09 20:17:02
Message-ID: 28914.1155154622@sss.pgh.pa.us
Lists: pgsql-hackers pgsql-patches

Heikki Linnakangas <heikki(at)enterprisedb(dot)com> writes:
> While thinking about index-organized-tables and similar ideas, it
> occurred to me that there's some low-hanging-fruit: maintaining cluster
> order on inserts by trying to place new heap tuples close to other
> similar tuples.

Doesn't this happen for free if there's enough space? UPDATE tries to
place the new tuple on the same page it's already on. In practice
people are only likely to cluster by primary keys (or other things that
seldom change) so I don't particularly agree with inventing a large wart
to support "optimally" placing things somewhere else...

regards, tom lane


From: "Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org, "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>, pgsql-patches(at)postgresql(dot)org
Subject: Re: Maintaining cluster order on insert
Date: 2006-08-09 20:47:31
Message-ID: 36e682920608091347m4e075a8cv6a04e3c4d97ed6f4@mail.gmail.com
Lists: pgsql-hackers pgsql-patches

On 8/9/06, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> UPDATE tries to place the new tuple on the same page it's already
> on.

I think he meant for INSERT.

--
Jonah H. Harris, Software Architect | phone: 732.331.1300
EnterpriseDB Corporation | fax: 732.331.1301
33 Wood Ave S, 2nd Floor | jharris(at)enterprisedb(dot)com
Iselin, New Jersey 08830 | http://www.enterprisedb.com/


From: Gene <genekhart(at)gmail(dot)com>
To: "Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>
Cc: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org, "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>, pgsql-patches(at)postgresql(dot)org
Subject: Re: [HACKERS] Maintaining cluster order on insert
Date: 2006-08-10 01:31:49
Message-ID: 430d92a20608091831w6e59da84k48a58110cb98e8d6@mail.gmail.com
Lists: pgsql-hackers pgsql-patches

I have a table that inserts lots of rows (million+ per day) int8 as primary
key, and I cluster by a timestamp which is approximately the timestamp of
the insert beforehand and is therefore in increasing order and doesn't
change. Most of the rows are updated about 3 times over time roughly within
the next 30 minutes. Should I assume that all of these updates will be
on separate pages unless I perform a cluster (which takes a long time) and
performance will suffer due to this? Is it possible to preallocate space on
the same page for future updates (whatever the average number of updates may
be per row) decreasing the number of page loads for querying?

Gene

On 8/9/06, Jonah H. Harris <jonah(dot)harris(at)gmail(dot)com> wrote:
>
> On 8/9/06, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> > UPDATE tries to place the new tuple on the same page it's already
> > on.
>
> I think he meant for INSERT.
>
> --
> Jonah H. Harris, Software Architect | phone: 732.331.1300
> EnterpriseDB Corporation | fax: 732.331.1301
> 33 Wood Ave S, 2nd Floor | jharris(at)enterprisedb(dot)com
> Iselin, New Jersey 08830 | http://www.enterprisedb.com/
>

--
Eugene Hart


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: genekhart(at)gmail(dot)com
Cc: "Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org, "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>, pgsql-patches(at)postgresql(dot)org
Subject: Re: [HACKERS] Maintaining cluster order on insert
Date: 2006-08-10 02:49:43
Message-ID: 9461.1155178183@sss.pgh.pa.us
Lists: pgsql-hackers pgsql-patches

Gene <genekhart(at)gmail(dot)com> writes:
> I have a table that inserts lots of rows (million+ per day) int8 as primary
> key, and I cluster by a timestamp which is approximately the timestamp of
> the insert beforehand and is therefore in increasing order and doesn't
> change. Most of the rows are updated about 3 times over time roughly within
> the next 30 minutes.

ISTM you should hardly need to worry about clustering that --- the data
will be in timestamp order pretty naturally.

The main problem you're going to have is the update-3-times bit. You
could keep updated rows on the same page as the original if you ran the
table at fillfactor 25% (which you'll be able to do in 8.2) ... but
while this might be sane for the leading edge of the table, you hardly
want such low storage density in the stable part.

You could reduce the fillfactor requirement if you could vacuum the
table constantly (every 10 minutes or so) but I assume the table is
large enough to make that unattractive. (Eventually we should have
a version of vacuum that understands where the dirty stuff is, which
might make this approach tenable ... but not in 8.2.)

Your best bet might be to partition the table into two subtables, one
with "stable" data and one with the fresh data, and transfer rows from
one to the other once they get stable. Storage density in the "fresh"
part would be poor, but it should be small enough you don't care.

regards, tom lane


From: Gene <genekhart(at)gmail(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org, "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>, pgsql-patches(at)postgresql(dot)org
Subject: Re: [HACKERS] Maintaining cluster order on insert
Date: 2006-08-10 05:18:00
Message-ID: 430d92a20608092218u7f44f8baob801aea8b136e5dd@mail.gmail.com
Lists: pgsql-hackers pgsql-patches

You are correct, the main part I'm worried about is the updates being so far
from the originals. FYI, I am partitioning the tables by the timestamp
column, vacuum analyzing once per hour, and creating one child partition per
day in a cron job. Because I'm using Hibernate for database abstraction
(stateless sessions), I can only have one RULE, since having more than one
insert rule will not return the correct number of updated rows, which
confuses Hibernate. The one rule just directs inserts to the latest child
partition, which seems to work well.

The reason I'm doing the clustering is I was hoping that with the "stable"
non-updating partitions I could execute a CLUSTER at night (slow...) and it
would compact the tables into their most efficient state for querying which
always involves a date range. bad idea? In this fillfactor feature, will you
be able to set it to 100% once you know that no more updates will occur? Or
will doing a cluster effectively do this? Will the fill factor only apply
for inserts?

"Your best bet might be to partition the table into two subtables, one
with "stable" data and one with the fresh data, and transfer rows from
one to the other once they get stable. Storage density in the "fresh"
part would be poor, but it should be small enough you don't care."

This sounds interesting, I could create a RULE/INSERT on the unstable table,
I will know during the update if it is ready to be put in the stable table.
What would be an efficient way to do the transfer? Since the updates occur
somewhat randomly, wouldn't the tuples in the stable table then be out of
natural timestamp order?

Thanks for all of your help and comments! It is greatly appreciated!
Gene Hart

On 8/9/06, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>
> Gene <genekhart(at)gmail(dot)com> writes:
> > I have a table that inserts lots of rows (million+ per day) int8 as
> primary
> > key, and I cluster by a timestamp which is approximately the timestamp
> of
> > the insert beforehand and is therefore in increasing order and doesn't
> > change. Most of the rows are updated about 3 times over time roughly
> within
> > the next 30 minutes.
>
> ISTM you should hardly need to worry about clustering that --- the data
> will be in timestamp order pretty naturally.
>
> The main problem you're going to have is the update-3-times bit. You
> could keep updated rows on the same page as the original if you ran the
> table at fillfactor 25% (which you'll be able to do in 8.2) ... but
> while this might be sane for the leading edge of the table, you hardly
> want such low storage density in the stable part.
>
> You could reduce the fillfactor requirement if you could vacuum the
> table constantly (every 10 minutes or so) but I assume the table is
> large enough to make that unattractive. (Eventually we should have
> a version of vacuum that understands where the dirty stuff is, which
> might make this approach tenable ... but not in 8.2.)
>
> Your best bet might be to partition the table into two subtables, one
> with "stable" data and one with the fresh data, and transfer rows from
> one to the other once they get stable. Storage density in the "fresh"
> part would be poor, but it should be small enough you don't care.
>
> regards, tom lane
>

--
Eugene Hart


From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: "Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org, pgsql-patches(at)postgresql(dot)org
Subject: Re: Maintaining cluster order on insert
Date: 2006-08-10 08:11:08
Message-ID: 44DAEA1C.5050509@enterprisedb.com
Lists: pgsql-hackers pgsql-patches

Jonah H. Harris wrote:
> On 8/9/06, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> UPDATE tries to place the new tuple on the same page it's already
>> on.
>
> I think he meant for INSERT.
>

Right. Update is indeed taken care of already.

One example where this would help would be a customer_history table that
stores all transactions of a customer. Primary key is (customer_id,
transaction_id). You would want to keep the table clustered by
customer_id to make it quick to retrieve all transactions of a customer.
In general, any table with more or less random insert/delete activity
that you want to keep in order would benefit.

- Heikki


From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: genekhart(at)gmail(dot)com
Cc: tgl(at)sss(dot)pgh(dot)pa(dot)us, jonah(dot)harris(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [PATCHES] Maintaining cluster order on insert
Date: 2006-08-10 08:23:39
Message-ID: 44DAED0B.8010007@enterprisedb.com
Lists: pgsql-hackers pgsql-patches

Gene wrote:
> You are correct the main part I'm worried about is the updates, being
> so far from the originals.

Yeah, you won't benefit from the patch at all.

> The reason I'm doing the clustering is I was hoping that with the
> "stable" non-updating partitions I could execute a CLUSTER at night
> (slow...) and it would compact the tables into their most efficient
> state for querying which always involves a date range. bad idea? In
> this fillfactor feature, will you be able to set it to 100% once you
> know that no more updates will occur? Or will doing a cluster
> effectively do this? Will the fill factor only apply for inserts?

That sounds like a good approach. CLUSTER obeys the fillfactor, so
you'll want to set it to 100 for the older partitions before you CLUSTER.

You might want to experiment with the fillfactor. You might get the best
performance if you just set it to 100 even for the latest partition, if
your queries usually have to scan most of it anyway. Fillfactor 100 will
help to keep it dense and in memory, so it won't matter so much if it's
disorganized.
> "Your best bet might be to partition the table into two subtables, one
> with "stable" data and one with the fresh data, and transfer rows from
> one to the other once they get stable. Storage density in the "fresh"
> part would be poor, but it should be small enough you don't care."
>
> This sounds interesting, I could create a RULE/INSERT on the unstable
> table, I will know during the update if it is ready to be put in the
> stable table. What would be an efficient way to do the transfer? Since
> the updates occur somewhat randomly, wouldn't the tuples in the stable
> table then be out of natural timestamp order?

I'm not sure I understand the last sentence. I thought the updates
usually occur within 30 minutes of the insert. So if you transfer the
rows to the stable table after 30 minutes, there won't be updates to the
stable table.

- Heikki


From: stark <stark(at)enterprisedb(dot)com>
To: genekhart(at)gmail(dot)com
Cc: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org, "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>, pgsql-patches(at)postgresql(dot)org
Subject: Re: [HACKERS] Maintaining cluster order on insert
Date: 2006-08-10 10:59:11
Message-ID: 87d5b87t6o.fsf@enterprisedb.com
Lists: pgsql-hackers pgsql-patches


Gene <genekhart(at)gmail(dot)com> writes:

> "Your best bet might be to partition the table into two subtables, one
> with "stable" data and one with the fresh data, and transfer rows from
> one to the other once they get stable. Storage density in the "fresh"
> part would be poor, but it should be small enough you don't care."
>
> This sounds interesting, I could create a RULE/INSERT on the unstable table,
> I will know during the update if it is ready to be put in the stable table.
> What would be an efficient way to do the transfer? Since the updates occur
> somewhat randomly, wouldn't the tuples in the stable table then be out of
> natural timestamp order?

You may find it easier to handle some of the logic in a low level application
layer or layer of stored procedures rather than trying to make it entirely
transparent with rules. If you do want it to be transparent you might also
consider whether you want triggers instead of rules.

Another direction you might want to consider is whether the columns that
you're updating would be more normalized in a separate table. You might really
want to have a record of those past states as well. So you might find that
having three records in this other table for each of your regular records in
your main table actually works out better.

Even if you only have a 1-1 relationship sometimes this kind of horizontal
partitioning (or do people consider this vertical partitioning?) is still
worthwhile. If the columns being updated are very small or often not needed at
all then it may be reasonably efficient to look them up separately and still
let you store the bulk of the data efficiently and access it in a fast
sequential scan.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com


From: Ron Mayer <rm_pg(at)cheapcomplexdevices(dot)com>
To: pgsql-patches(at)postgresql(dot)org
Subject: Re: [HACKERS] Maintaining cluster order on insert
Date: 2006-08-11 20:36:42
Message-ID: ebipoh$1538$1@news.hub.org
Lists: pgsql-hackers pgsql-patches

Tom Lane wrote:
> Gene <genekhart(at)gmail(dot)com> writes:
>> I have a table that inserts lots of rows (million+ per day) int8 as primary
>> key, and I cluster by a timestamp which is approximately the timestamp of
>> the insert...
>
> ISTM you should hardly need to worry about clustering that --- the data
> will be in timestamp order pretty naturally.

In my case my biggest/slowest tables are clustered by zip-code (which
does a reasonable job at keeping counties/cities/etc on the
same pages too). Data comes in constantly (many records per minute, as
we ramp up), pretty uniformly across the country; but most queries
are geographically bounded. The data's pretty much insert-only.

If I understand Heikki's patch, it would help for this use case.

> Your best bet might be to partition the table into two subtables, one
> with "stable" data and one with the fresh data, and transfer rows from
> one to the other once they get stable. Storage density in the "fresh"
> part would be poor, but it should be small enough you don't care.

Hmm... that should work well for me too. Not sure if the use-case
I mentioned above is still compelling anymore; since this seems like
it'd give me much of the benefit; and I don't need an excessive
fillfactor on the stable part of the table.


From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: Ron Mayer <rm_pg(at)cheapcomplexdevices(dot)com>
Cc: pgsql-patches(at)postgresql(dot)org
Subject: Re: [HACKERS] Maintaining cluster order on insert
Date: 2006-08-11 23:16:58
Message-ID: 44DD0FEA.8070002@enterprisedb.com
Lists: pgsql-hackers pgsql-patches

Ron Mayer wrote:
> In my case my biggest/slowest tables are clustered by zip-code (which
> does a reasonable job at keeping counties/cities/etc on the
> same pages too). Data comes in constantly (many records per minute, as
> we ramp up), pretty uniformly across the country; but most queries
> are geographically bounded. The data's pretty much insert-only.

No deletes? If the tables grow over time, you probably would need to run
CLUSTER every now and then to get the best performance, though the patch
would alleviate that quite a lot.

Do you have a development environment where you could test what effect
the patch would have? It would be interesting to have a real-world use
case, since I don't have one myself at the moment.

> If I understand Heikki's patch, it would help for this use case.

Yes, it would.

> > Your best bet might be to partition the table into two subtables, one
> > with "stable" data and one with the fresh data, and transfer rows from
> > one to the other once they get stable. Storage density in the "fresh"
> > part would be poor, but it should be small enough you don't care.
>
> Hmm... that should work well for me too. Not sure if the use-case
> I mentioned above is still compelling anymore; since this seems like
> it'd give me much of the benefit; and I don't need an excessive
> fillfactor on the stable part of the table.

Umm, if your inserts are uniformly distributed across the country, you
wouldn't have a stable part, right?

- Heikki


From: Ron Mayer <rm_pg(at)cheapcomplexdevices(dot)com>
To: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc: Ron Mayer <rm_pg(at)cheapcomplexdevices(dot)com>, pgsql-patches(at)postgresql(dot)org
Subject: Re: [HACKERS] Maintaining cluster order on insert
Date: 2006-08-11 23:33:26
Message-ID: 44DD13C6.2020609@cheapcomplexdevices.com
Lists: pgsql-hackers pgsql-patches

Heikki Linnakangas wrote:
> Ron Mayer wrote:
>> In my case my biggest/slowest tables are clustered by zip-code (which
>> does a reasonable job at keeping counties/cities/etc on the
>> same pages too)....
>
> No deletes? If the tables grow over time, you probably would need to run
> CLUSTER every now and then to get the best performance, though the patch
> would alleviate that quite a lot.

Yup, pretty much no deletes, since it's a historical archive of
some government documents with address info. Though the
live system may periodically expunge data, say, 10+ years old.

> Do you have a development environment where you could test what effect
> the patch would have? It would be interesting to have a real-world use
> case, since I don't have one myself at the moment.

I have a development environment, but it doesn't have the same
real-time-growing behavior, and only a small region of the country.
I suppose I could pre-load N-1 years and cluster it, and then
incrementally insert the last year of data to simulate the effect.

But sure, I'll attempt to try the patch, though I don't really have any
good benchmarking environment to give any definitive results. If
an anecdotal "this is how it feels to me" is useful, I can give one
of those.

>> > Your best bet might be to partition the table into two subtables, one
>> > with "stable" data and one with the fresh data.
>>
>> Hmm... that should work well for me too....
>
> Umm, if your inserts are uniformly distributed across the country, you
> wouldn't have a stable part, right?

Hmm. Maybe. I was thinking that when archiving to the large table,
an "order by" clause when inserting from the new partition to the
stable partition could at least make the big table "piecewise"
clustered, so most records for a zip code fit in the same few disk
pages, even though those pages would still end up lying around
far apart on the disk.

I wonder what part of "CLUSTER" gives the most benefit - that
most records of a type fit on a few blocks; or that those blocks
are next to each other so they can be read sequentially?


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc: PostgreSQL-patches <pgsql-patches(at)postgresql(dot)org>
Subject: Re: Maintaining cluster order on insert
Date: 2007-05-15 22:23:19
Message-ID: 200705152223.l4FMNJa17969@momjian.us
Lists: pgsql-hackers pgsql-patches


[ Sorry, I only found this one recently.]

Your patch has been added to the PostgreSQL unapplied patches list at:

http://momjian.postgresql.org/cgi-bin/pgpatches

It will be applied as soon as one of the PostgreSQL committers reviews
and approves it.

---------------------------------------------------------------------------

Heikki Linnakangas wrote:
> While thinking about index-organized-tables and similar ideas, it
> occurred to me that there's some low-hanging-fruit: maintaining cluster
> order on inserts by trying to place new heap tuples close to other
> similar tuples. That involves asking the index am where on the heap the
> new tuple should go, and trying to insert it there before using the FSM.
> Using the new fillfactor parameter makes it more likely that there's
> room on the page. We don't worry about the order within the page.
>
> The API I'm thinking of introduces a new optional index am function,
> amsuggestblock (suggestions for a better name are welcome). It gets the
> same parameters as aminsert, and returns the heap block number that
> would be the optimal place to put the new tuple. It would be called from
> ExecInsert before inserting the heap tuple, and the suggestion is passed
> on to heap_insert and RelationGetBufferForTuple.
>
> I wrote a little patch to implement this for btree, attached.
>
> This could be optimized by changing the existing aminsert API, because
> as it is, an insert will have to descend the btree twice. Once in
> amsuggestblock and then in aminsert. amsuggestblock could keep the right
> index page pinned so aminsert could locate it quicker. But I wanted to
> keep this simple for now. Another improvement might be to allow
> amsuggestblock to return a list of suggestions, but that makes it more
> expensive to insert if there isn't room in the suggested pages, since
> heap_insert will have to try them all before giving up.
>
> Comments regarding the general idea or the patch? There should probably
> be an index option to turn the feature on and off. You'll want to turn it
> off when you first load a table, and turn it on after CLUSTER to keep it
> clustered.
>
> Since there's been discussion on keeping the TODO list more up-to-date,
> I hereby officially claim the "Automatically maintain clustering on a
> table" TODO item :). Feel free to bombard me with requests for status
> reports. And just to be clear, I'm not trying to sneak this into 8.2
> anymore, this is 8.3 stuff.
>
> I won't be implementing a background daemon described on the TODO item,
> since that would essentially be an online version of CLUSTER. Which sure
> would be nice, but that's a different story.
>
> - Heikki
>

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +


From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: PostgreSQL-patches <pgsql-patches(at)postgresql(dot)org>
Subject: Re: Maintaining cluster order on insert
Date: 2007-05-15 22:26:51
Message-ID: 464A33AB.1050401@enterprisedb.com
Lists: pgsql-hackers pgsql-patches

Ah, thanks! I had forgotten about it as well.

Bruce Momjian wrote:
> [ Sorry, I only found this one recently.]
>
> Your patch has been added to the PostgreSQL unapplied patches list at:
>
> http://momjian.postgresql.org/cgi-bin/pgpatches
>
> It will be applied as soon as one of the PostgreSQL committers reviews
> and approves it.
>
> ---------------------------------------------------------------------------
>
>
> Heikki Linnakangas wrote:
>> While thinking about index-organized-tables and similar ideas, it
>> occurred to me that there's some low-hanging-fruit: maintaining cluster
>> order on inserts by trying to place new heap tuples close to other
>> similar tuples. That involves asking the index am where on the heap the
>> new tuple should go, and trying to insert it there before using the FSM.
>> Using the new fillfactor parameter makes it more likely that there's
>> room on the page. We don't worry about the order within the page.
>>
>> The API I'm thinking of introduces a new optional index am function,
>> amsuggestblock (suggestions for a better name are welcome). It gets the
>> same parameters as aminsert, and returns the heap block number that
>> would be the optimal place to put the new tuple. It would be called from
>> ExecInsert before inserting the heap tuple, and the suggestion is passed
>> on to heap_insert and RelationGetBufferForTuple.
>>
>> I wrote a little patch to implement this for btree, attached.
>>
>> This could be optimized by changing the existing aminsert API, because
>> as it is, an insert will have to descend the btree twice. Once in
>> amsuggestblock and then in aminsert. amsuggestblock could keep the right
>> index page pinned so aminsert could locate it quicker. But I wanted to
>> keep this simple for now. Another improvement might be to allow
>> amsuggestblock to return a list of suggestions, but that makes it more
>> expensive to insert if there isn't room in the suggested pages, since
>> heap_insert will have to try them all before giving up.
>>
>> Comments regarding the general idea or the patch? There should probably
>> be an index option to turn the feature on and off. You'll want to turn it
>> off when you first load a table, and turn it on after CLUSTER to keep it
>> clustered.
>>
>> Since there's been discussion on keeping the TODO list more up-to-date,
>> I hereby officially claim the "Automatically maintain clustering on a
>> table" TODO item :). Feel free to bombard me with requests for status
>> reports. And just to be clear, I'm not trying to sneak this into 8.2
>> anymore, this is 8.3 stuff.
>>
>> I won't be implementing a background daemon described on the TODO item,
>> since that would essentially be an online version of CLUSTER. Which sure
>> would be nice, but that's a different story.
>>
>> - Heikki
>>
>
>

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: "Jim C(dot) Nasby" <decibel(at)decibel(dot)org>
To: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, PostgreSQL-patches <pgsql-patches(at)postgresql(dot)org>
Subject: Re: Maintaining cluster order on insert
Date: 2007-05-16 00:38:10
Message-ID: 20070516003810.GI20707@nasby.net
Lists: pgsql-hackers pgsql-patches

What about adding the ability to ask the FSM for a page that's near a
given page? That way if you did have to go to the FSM you could at least
try and insert close to the page you originally wanted.
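
Something like the following toy function (purely illustrative; the FSM has
no such call today, and the names are made up) shows what "a page near a
given page" could mean:

#include <stdio.h>
#include <stdlib.h>

typedef unsigned int BlockNumber;
#define InvalidBlockNumber ((BlockNumber) 0xFFFFFFFF)

/* Pick the candidate page with free space that is closest to "target". */
static BlockNumber
fsm_nearest_page(const BlockNumber *pages_with_room, int n, BlockNumber target)
{
    BlockNumber best = InvalidBlockNumber;
    long        best_dist = -1;
    int         i;

    for (i = 0; i < n; i++)
    {
        long dist = labs((long) pages_with_room[i] - (long) target);

        if (best_dist < 0 || dist < best_dist)
        {
            best = pages_with_room[i];
            best_dist = dist;
        }
    }
    return best;
}

int
main(void)
{
    BlockNumber candidates[] = {3, 40, 42, 97};

    /* Say the index am suggested block 41, which turned out to be full. */
    printf("nearest page with room: %u\n", fsm_nearest_page(candidates, 4, 41));
    return 0;
}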

On Tue, May 15, 2007 at 11:26:51PM +0100, Heikki Linnakangas wrote:
> Ah, thanks! I had forgotten about it as well.
>
> Bruce Momjian wrote:
> >[ Sorry, I only found this one recently.]
> >
> >Your patch has been added to the PostgreSQL unapplied patches list at:
> >
> > http://momjian.postgresql.org/cgi-bin/pgpatches
> >
> >It will be applied as soon as one of the PostgreSQL committers reviews
> >and approves it.
> >
> >---------------------------------------------------------------------------
> >
> >
> >Heikki Linnakangas wrote:
> >>While thinking about index-organized-tables and similar ideas, it
> >>occurred to me that there's some low-hanging-fruit: maintaining cluster
> >>order on inserts by trying to place new heap tuples close to other
> >>similar tuples. That involves asking the index am where on the heap the
> >>new tuple should go, and trying to insert it there before using the FSM.
> >>Using the new fillfactor parameter makes it more likely that there's
> >>room on the page. We don't worry about the order within the page.
> >>
> >>The API I'm thinking of introduces a new optional index am function,
> >>amsuggestblock (suggestions for a better name are welcome). It gets the
> >>same parameters as aminsert, and returns the heap block number that
> >>would be the optimal place to put the new tuple. It would be called from
> >>ExecInsert before inserting the heap tuple, and the suggestion is passed
> >>on to heap_insert and RelationGetBufferForTuple.
> >>
> >>I wrote a little patch to implement this for btree, attached.
> >>
> >>This could be optimized by changing the existing aminsert API, because
> >>as it is, an insert will have to descend the btree twice. Once in
> >>amsuggestblock and then in aminsert. amsuggestblock could keep the right
> >>index page pinned so aminsert could locate it quicker. But I wanted to
> >>keep this simple for now. Another improvement might be to allow
> >>amsuggestblock to return a list of suggestions, but that makes it more
> >>expensive to insert if there isn't room in the suggested pages, since
> >>heap_insert will have to try them all before giving up.
> >>
> >>Comments regarding the general idea or the patch? There should probably
> >>be an index option to turn the feature on and off. You'll want to turn it
> >>off when you first load a table, and turn it on after CLUSTER to keep it
> >>clustered.
> >>
> >>Since there's been discussion on keeping the TODO list more up-to-date,
> >>I hereby officially claim the "Automatically maintain clustering on a
> >>table" TODO item :). Feel free to bombard me with requests for status
> >>reports. And just to be clear, I'm not trying to sneak this into 8.2
> >>anymore, this is 8.3 stuff.
> >>
> >>I won't be implementing a background daemon described on the TODO item,
> >>since that would essentially be an online version of CLUSTER. Which sure
> >>would be nice, but that's a different story.
> >>
> >>- Heikki
> >>
> >
> >
>
>
> --
> Heikki Linnakangas
> EnterpriseDB http://www.enterprisedb.com
>

--
Jim Nasby decibel(at)decibel(dot)org
EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)


From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: "Jim C(dot) Nasby" <decibel(at)decibel(dot)org>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, PostgreSQL-patches <pgsql-patches(at)postgresql(dot)org>
Subject: Re: Maintaining cluster order on insert
Date: 2007-05-16 14:32:15
Message-ID: 464B15EF.4000808@enterprisedb.com
Lists: pgsql-hackers pgsql-patches

Jim C. Nasby wrote:
> What about adding the ability to ask the FSM for a page that's near a
> given page? That way if you did have to go to the FSM you could at least
> try and insert close to the page you originally wanted.

Yeah, there's always room for improvement. I made the patch when I was
working on clustered indexes, and was mostly concerned about getting
inserts to the same page as other tuples with similar values so that the
clustered index stays clustered.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: "Jaime Casanova" <systemguards(at)gmail(dot)com>
To: "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>
Cc: "Jim C(dot) Nasby" <decibel(at)decibel(dot)org>, "Bruce Momjian" <bruce(at)momjian(dot)us>, PostgreSQL-patches <pgsql-patches(at)postgresql(dot)org>
Subject: Re: Maintaining cluster order on insert
Date: 2007-05-18 04:53:52
Message-ID: c2d9e70e0705172153g3207adb6kd4abdfdf3a9215ff@mail.gmail.com
Lists: pgsql-hackers pgsql-patches

On 5/16/07, Heikki Linnakangas <heikki(at)enterprisedb(dot)com> wrote:
> Jim C. Nasby wrote:
> > What about adding the ability to ask the FSM for a page that's near a
> > given page? That way if you did have to go to the FSM you could at least
> > try and insert close to the page you originally wanted.
>
> Yeah, there's always room for improvement. I made the patch when I was
> working on clustered indexes, and was mostly concerned about getting
> inserts to the same page as other tuples with similar values so that the
> clustered index stays clustered.
>

the patch doesn't apply in cvs... you'll need to update it...

--
regards,
Jaime Casanova

"Programming today is a race between software engineers striving to
build bigger and better idiot-proof programs and the universe trying
to produce bigger and better idiots.
So far, the universe is winning."
Richard Cook


From: "Jaime Casanova" <systemguards(at)gmail(dot)com>
To: "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>
Cc: "Jim C(dot) Nasby" <decibel(at)decibel(dot)org>, "Bruce Momjian" <bruce(at)momjian(dot)us>, PostgreSQL-patches <pgsql-patches(at)postgresql(dot)org>
Subject: Re: Maintaining cluster order on insert
Date: 2007-05-18 04:54:34
Message-ID: c2d9e70e0705172154y4dbe17d5r4bb38c0f887542e9@mail.gmail.com
Lists: pgsql-hackers pgsql-patches

On 5/17/07, Jaime Casanova <systemguards(at)gmail(dot)com> wrote:
> On 5/16/07, Heikki Linnakangas <heikki(at)enterprisedb(dot)com> wrote:
> > Jim C. Nasby wrote:
> > > What about adding the ability to ask the FSM for a page that's near a
> > > given page? That way if you did have to go to the FSM you could at least
> > > try and insert close to the page you originally wanted.
> >
> > Yeah, there's always room for improvement. I made the patch when I was
> > working on clustered indexes, and was mostly concerned about getting
> > inserts to the same page as other tuples with similar values so that the
> > clustered index stays clustered.
> >
>
> the patch doesn't apply in cvs... you'll need to update it...
>

cvs, I wrote? I meant HEAD... sorry, too late for me =)

--
regards,
Jaime Casanova

"Programming today is a race between software engineers striving to
build bigger and better idiot-proof programs and the universe trying
to produce bigger and better idiots.
So far, the universe is winning."
Richard Cook


From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: Jaime Casanova <systemguards(at)gmail(dot)com>
Cc: "Jim C(dot) Nasby" <decibel(at)decibel(dot)org>, Bruce Momjian <bruce(at)momjian(dot)us>, PostgreSQL-patches <pgsql-patches(at)postgresql(dot)org>
Subject: Re: Maintaining cluster order on insert
Date: 2007-05-18 12:06:46
Message-ID: 464D96D6.1030608@enterprisedb.com
Lists: pgsql-hackers pgsql-patches

Jaime Casanova wrote:
> On 5/16/07, Heikki Linnakangas <heikki(at)enterprisedb(dot)com> wrote:
>> Jim C. Nasby wrote:
>> > What about adding the ability to ask the FSM for a page that's near a
>> > given page? That way if you did have to go to the FSM you could at
>> least
>> > try and insert close to the page you originally wanted.
>>
>> Yeah, there's always room for improvement. I made the patch when I was
>> working on clustered indexes, and was mostly concerned about getting
>> inserts to the same page as other tuples with similar values so that the
>> clustered index stays clustered.
>>
>
> the patch doesn't apply in cvs... you'll need to update it...

Oh, here you are.

The implementation has changed a bit since August. I thought I had
submitted an updated version in the winter but couldn't find it. Anyway,
I updated and dusted off the source tree, tidied up the comments a
little bit, and fixed some inconsistencies in pg_proc entries that made
opr_sanity fail.

The beef of the patch is two new optional indexam API functions:
amprepareinsert and amfinishinsert. amprepareinsert is called before
inserting the heap tuple. It descends the tree and finds and pins the
right leaf page to insert to, and returns a suggestion on where the heap
tuple should be inserted. amfinishinsert is called after inserting the
heap tuple to actually insert the index tuple. Documentation for these
functions needs to be added to indexam.sgml; I noticed that's not done yet.
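
As a rough sketch of that call sequence (a toy model only; the stand-in
types and functions below are invented, and the real functions work on
buffers, index tuples and the FSM inside the backend):

#include <stdio.h>

typedef unsigned int BlockNumber;

typedef struct PreparedInsert
{
    int         pinned_leaf_page;       /* leaf kept pinned (not locked) across the heap insert */
    BlockNumber suggested_heap_block;   /* where neighbouring entries already point */
} PreparedInsert;

/* ~ amprepareinsert: descend the tree, pin the leaf, suggest a heap block */
static PreparedInsert
prepare_insert(int key)
{
    PreparedInsert p;

    p.pinned_leaf_page = key / 100;                   /* pretend one leaf per 100 keys */
    p.suggested_heap_block = (BlockNumber) (key / 100);
    return p;
}

/* ~ heap_insert: place the heap tuple, preferring the suggested block */
static BlockNumber
insert_heap_tuple(const PreparedInsert *p)
{
    return p->suggested_heap_block;                   /* would fall back to the FSM if full */
}

/* ~ amfinishinsert: re-lock the pinned leaf and insert the index tuple */
static void
finish_insert(const PreparedInsert *p, BlockNumber heap_block)
{
    printf("index entry on leaf %d points to heap block %u\n",
           p->pinned_leaf_page, heap_block);
}

int
main(void)
{
    PreparedInsert p = prepare_insert(250);
    BlockNumber    blk = insert_heap_tuple(&p);

    finish_insert(&p, blk);
    return 0;
}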

The cluster_inserts GUC option that you can use to enable/disable the
feature should be removed before committing.

The performance characteristics of this patch haven't been thoroughly
discussed yet. The reason why you want to cluster your tables is to
speed up SELECTs that return a bunch of tuples with similar values, for
example range queries. The reason for keeping them clustered on inserts
is to reduce the need to run CLUSTER as often.

It doesn't come without a cost, however. In the worst case, there is never
room for new inserts on pages, and each insert needs to do one extra
I/O to fetch the optimal heap page where the insert should go, see that
there's no room, and then insert somewhere else. Using a non-zero
fillfactor helps, but even when there is room on the page, it's often
cheaper to just append to the end of the table and run CLUSTER at
night, for example, than to do random access to insert into the "right" pages
in the heap.

So, should we have a WITH-option on the table to enable/disable this
feature, and what would be the default?

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

Attachment Content-Type Size
maintain_cluster_order_v7.patch text/x-diff 34.4 KB

From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc: Jaime Casanova <systemguards(at)gmail(dot)com>, "Jim C(dot) Nasby" <decibel(at)decibel(dot)org>, PostgreSQL-patches <pgsql-patches(at)postgresql(dot)org>
Subject: Re: Maintaining cluster order on insert
Date: 2007-05-18 13:54:59
Message-ID: 200705181354.l4IDsxp16082@momjian.us
Lists: pgsql-hackers pgsql-patches


Your patch has been added to the PostgreSQL unapplied patches list at:

http://momjian.postgresql.org/cgi-bin/pgpatches

It will be applied as soon as one of the PostgreSQL committers reviews
and approves it.

---------------------------------------------------------------------------

Heikki Linnakangas wrote:
> Jaime Casanova wrote:
> > On 5/16/07, Heikki Linnakangas <heikki(at)enterprisedb(dot)com> wrote:
> >> Jim C. Nasby wrote:
> >> > What about adding the ability to ask the FSM for a page that's near a
> >> > given page? That way if you did have to go to the FSM you could at
> >> least
> >> > try and insert close to the page you originally wanted.
> >>
> >> Yeah, there's always room for improvement. I made the patch when I was
> >> working on clustered indexes, and was mostly concerned about getting
> >> inserts to the same page as other tuples with similar values so that the
> >> clustered index stays clustered.
> >>
> >
> > the patch doesn't apply in cvs... you'll need to update it...
>
> Oh, here you are.
>
> The implementation has changed a bit since August. I thought I had
> submitted an updated version in the winter but couldn't find it. Anyway,
> I updated and dusted off the source tree, tidied up the comments a
> little bit, and fixed some inconsistencies in pg_proc entries that made
> opr_sanity fail.
>
> The beef of the patch is two new optional indexam API functions:
> amprepareinsert and amfinishinsert. amprepareinsert is called before
> inserting the heap tuple. It descends the tree and finds and pins the
> right leaf page to insert to, and returns a suggestion on where the heap
> tuple should be inserted. amfinishinsert is called after inserting the
> heap tuple to actually insert the index tuple. Documentation for these
> functions needs to be added to indexam.sgml; I noticed that's not done yet.
>
> The cluster_inserts GUC option that you can use to enable/disable the
> feature should be removed before committing.
>
> The performance characteristics of this patch haven't been thoroughly
> discussed yet. The reason why you want to cluster your tables is to
> speed up SELECTs that return a bunch of tuples with similar values, for
> example range queries. The reason for keeping them clustered on inserts
> is to reduce the need to run CLUSTER as often.
>
> It doesn't come without a cost, however. In the worst case, there never
> is room for new inserts on pages, and each insert needs to do one extra
> I/O to fetch the optimal heap page where the insert should go, see that
> there's no room, and then insert somewhere else. Using a non-zero
> fillfactor helps, but even when there is room on the page, it's often
> cheaper to just append to the end of the table and running CLUSTER at
> night for example, than do random access to insert to the "right" pages
> in the heap.
>
> So, should we have a WITH-option on the table to enable/disable this
> feature, and what would be the default?
>
> --
> Heikki Linnakangas
> EnterpriseDB http://www.enterprisedb.com


--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc: Jaime Casanova <systemguards(at)gmail(dot)com>, "Jim C(dot) Nasby" <decibel(at)decibel(dot)org>, Bruce Momjian <bruce(at)momjian(dot)us>, PostgreSQL-patches <pgsql-patches(at)postgresql(dot)org>
Subject: Re: Maintaining cluster order on insert
Date: 2007-05-18 15:08:26
Message-ID: 27526.1179500906@sss.pgh.pa.us
Lists: pgsql-hackers pgsql-patches

Heikki Linnakangas <heikki(at)enterprisedb(dot)com> writes:
> The beef of the patch is two new optional indexam API functions:
> amprepareinsert and amfinishinsert. amprepareinsert is called before
> inserting the heap tuple. It descends the tree and finds and pins the
> right leaf page to insert to, and returns a suggestion on where the heap
> tuple should be inserted. amfinishinsert is called after inserting the
> heap tuple to actually insert the index tuple. Documentation for these
> functions needs to be added to indexam.sgml; I noticed that's not done yet.

What happens when there's more than one index?

Is there a risk of deadlock during concurrent insertions (from different
processes trying to lock the same buffers in different orders)?

regards, tom lane


From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Jaime Casanova <systemguards(at)gmail(dot)com>, "Jim C(dot) Nasby" <decibel(at)decibel(dot)org>, Bruce Momjian <bruce(at)momjian(dot)us>, PostgreSQL-patches <pgsql-patches(at)postgresql(dot)org>
Subject: Re: Maintaining cluster order on insert
Date: 2007-05-18 15:19:18
Message-ID: 464DC3F6.502@enterprisedb.com
Lists: pgsql-hackers pgsql-patches

Tom Lane wrote:
> Heikki Linnakangas <heikki(at)enterprisedb(dot)com> writes:
>> The beef of the patch is two new optional indexam API functions:
>> amprepareinsert and amfinishinsert. amprepareinsert is called before
>> inserting the heap tuple. It descends the tree and finds and pins the
>> right leaf page to insert to, and returns a suggestion on where the heap
>> tuple should be inserted. amfinishinsert is called after inserting the
>> heap tuple to actually insert the index tuple. Documentation for these
>> functions needs to be added to indexam.sgml; I noticed that's not done yet.
>
> What happens when there's more than one index?

You can only cluster a table on one index. Inserts into the other indexes
will use the aminsert function, after inserting the heap tuple as usual.

> Is there a risk of deadlock during concurrent insertions (from different
> processes trying to lock the same buffers in different orders)?

No, it should be OK. The btprepareinsert function will release the locks (but
not the pin) after descending the tree, and btfinishinsert will reacquire
them. The locking order is the same as today.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: "Jaime Casanova" <systemguards(at)gmail(dot)com>
To: "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>
Cc: "Jim C(dot) Nasby" <decibel(at)decibel(dot)org>, "Bruce Momjian" <bruce(at)momjian(dot)us>, PostgreSQL-patches <pgsql-patches(at)postgresql(dot)org>
Subject: Re: Maintaining cluster order on insert
Date: 2007-05-19 05:05:14
Message-ID: c2d9e70e0705182205o44a4bbbft18a73b9de91de726@mail.gmail.com
Lists: pgsql-hackers pgsql-patches

On 5/18/07, Heikki Linnakangas <heikki(at)enterprisedb(dot)com> wrote:
> Jaime Casanova wrote:
> >
> > the patch doesn't apply in cvs... you'll need to update it...
>
> Oh, here you are.
>
> The implementation has changed a bit since August. I thought I had
> submitted an updated version in the winter but couldn't find it. Anyway,
> I updated and dusted off the source tree, tidied up the comments a
> little bit, and fixed some inconsistencies in pg_proc entries that made
> opr_sanity fail.
>

this one doesn't apply either... there are problems with nbtinsert.c and pg_am.h

--
regards,
Jaime Casanova

"Programming today is a race between software engineers striving to
build bigger and better idiot-proof programs and the universe trying
to produce bigger and better idiots.
So far, the universe is winning."
Richard Cook


From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: Jaime Casanova <systemguards(at)gmail(dot)com>
Cc: PostgreSQL-patches <pgsql-patches(at)postgresql(dot)org>
Subject: Re: Maintaining cluster order on insert
Date: 2007-05-19 18:25:16
Message-ID: 464F410C.8090900@enterprisedb.com
Lists: pgsql-hackers pgsql-patches

Jaime Casanova wrote:
> On 5/18/07, Heikki Linnakangas <heikki(at)enterprisedb(dot)com> wrote:
>> Jaime Casanova wrote:
>> >
>> > the patch doesn't apply in cvs... you'll need to update it...
>>
>> Oh, here you are.
>>
>> The implementation has changed a bit since August. I thought I had
>> submitted an updated version in the winter but couldn't find it. Anyway,
>> I updated and dusted off the source tree, tidied up the comments a
>> little bit, and fixed some inconsistencies in pg_proc entries that made
>> opr_sanity fail.
>>
>
> this one doesn't apply either... there are problems with nbtinsert.c and
> pg_am.h

Ah, sorry about that. For some reason my source tree was checked out
from the 8.2 branch, instead of CVS HEAD.

Here you are. Thanks for looking at this!

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

Attachment Content-Type Size
maintain_cluster_order_v8.patch text/x-diff 37.2 KB

From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc: Jaime Casanova <systemguards(at)gmail(dot)com>, PostgreSQL-patches <pgsql-patches(at)postgresql(dot)org>
Subject: Re: Maintaining cluster order on insert
Date: 2007-05-19 19:13:05
Message-ID: 200705191913.l4JJD5R09020@momjian.us
Lists: pgsql-hackers pgsql-patches


Your patch has been added to the PostgreSQL unapplied patches list at:

http://momjian.postgresql.org/cgi-bin/pgpatches

It will be applied as soon as one of the PostgreSQL committers reviews
and approves it.

---------------------------------------------------------------------------

Heikki Linnakangas wrote:
> Jaime Casanova wrote:
> > On 5/18/07, Heikki Linnakangas <heikki(at)enterprisedb(dot)com> wrote:
> >> Jaime Casanova wrote:
> >> >
> >> > the patch doesn't apply in cvs... you'll need to update it...
> >>
> >> Oh, here you are.
> >>
> >> The implementation has changed a bit since August. I thought I had
> >> submitted an updated version in the winter but couldn't find it. Anyway,
> >> I updated and dusted off the source tree, tidied up the comments a
> >> little bit, and fixed some inconsistencies in pg_proc entries that made
> >> opr_sanity fail.
> >>
> >
> > this one doesn't apply either... there are problems with nbtinsert.c and
> > pg_am.h
>
> Ah, sorry about that. For some reason my source tree was checked out
> from the 8.2 branch, instead of CVS HEAD.
>
> Here you are. Thanks for looking at this!
>
> --
> Heikki Linnakangas
> EnterpriseDB http://www.enterprisedb.com


--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +


From: "Pavan Deolasee" <pavan(dot)deolasee(at)gmail(dot)com>
To: "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>
Cc: "Jaime Casanova" <systemguards(at)gmail(dot)com>, PostgreSQL-patches <pgsql-patches(at)postgresql(dot)org>
Subject: Re: Maintaining cluster order on insert
Date: 2007-05-21 08:17:06
Message-ID: 2e78013d0705210117y6a65245cw426bbaafa9eb6419@mail.gmail.com
Lists: pgsql-hackers pgsql-patches

On 5/19/07, Heikki Linnakangas <heikki(at)enterprisedb(dot)com> wrote:
>
>
> Ah, sorry about that. For some reason my source tree was checked out
> from the 8.2 branch, instead of CVS HEAD.
>
>
I looked at the patch. Not that I am very comfortable with this part
of the code, but nevertheless here are my comments:

I am wondering if we can optimize the _bt_moveright() code by remembering
the PageLSN of the page before releasing the READ LOCK. If the page
has indeed changed since we last saw it, the PageLSN must be different
from what we had remembered, which gives us a hint that we might need
to move right.
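
A toy illustration of that hint (stand-in types only; the real code would
compare the buffer's PageLSN while holding the buffer lock):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint64_t XLogRecPtr;

typedef struct ToyPage
{
    XLogRecPtr  lsn;       /* bumped whenever the page is modified */
    int         high_key;  /* stand-in for the btree page's high key */
} ToyPage;

/* true if we may need to move right after coming back to the page */
static bool
might_need_moveright(const ToyPage *page, XLogRecPtr remembered_lsn, int insert_key)
{
    if (page->lsn == remembered_lsn)
        return false;                     /* page untouched since we released the lock */
    return insert_key > page->high_key;   /* changed: fall back to the key comparison */
}

int
main(void)
{
    ToyPage page = {42, 1000};

    printf("%d\n", might_need_moveright(&page, 42, 1500));  /* 0: LSN unchanged */
    printf("%d\n", might_need_moveright(&page, 41, 1500));  /* 1: page changed, key past high key */
    return 0;
}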

! /* Page was full. Release lock and pin and get another block
! * as if suggested_blk was not given.
! */
! LockBuffer(suggested_buf, BUFFER_LOCK_UNLOCK);
! ReleaseBuffer(suggested_buf);

Maybe you want to use UnlockReleaseBuffer() or _bt_relbuf() just as a
shorthand.
Btw, I wonder why _bt_relbuf() exists, and even so, why does it take a
"Relation" argument?

Thanks,
Pavan

--

EnterpriseDB http://www.enterprisedb.com


From: "Pavan Deolasee" <pavan(dot)deolasee(at)gmail(dot)com>
To: "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>
Cc: "Jaime Casanova" <systemguards(at)gmail(dot)com>, PostgreSQL-patches <pgsql-patches(at)postgresql(dot)org>
Subject: Re: Maintaining cluster order on insert
Date: 2007-05-21 09:32:18
Message-ID: 2e78013d0705210232m5d031a6fk779be6d7df5a2e3c@mail.gmail.com
Lists: pgsql-hackers pgsql-patches

On 5/21/07, Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com> wrote:
>
>
>
> On 5/19/07, Heikki Linnakangas <heikki(at)enterprisedb(dot)com > wrote:
> >
> >
> > Ah, sorry about that. For some reason my source tree was checked out
> > from the 8.2 branch, instead of CVS HEAD.
> >
> >
> I looked at the patch. Not that I am very comfortable with this part
> of the code, but nevertheless here are my comments:
>
>
Another problem that I noticed with the patch is that it disregards
the fillfactor while inserting the tuples. ISTM that this is a correct
behavior when a tuple is being inserted in the block to preserve cluster
ordering. But when the tuple is being inserted as a normal operation,
IMO we should respect the fillfactor.

The easiest way to reproduce this is to insert tuples sorted on the
cluster index key.

Thanks,
Pavan

--

EnterpriseDB http://www.enterprisedb.com


From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>
Cc: Jaime Casanova <systemguards(at)gmail(dot)com>, PostgreSQL-patches <pgsql-patches(at)postgresql(dot)org>
Subject: Re: Maintaining cluster order on insert
Date: 2007-05-21 09:48:59
Message-ID: 46516B0B.3000907@enterprisedb.com
Lists: pgsql-hackers pgsql-patches

Pavan Deolasee wrote:
> On 5/21/07, Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com> wrote:
>>
>>
>>
>> On 5/19/07, Heikki Linnakangas <heikki(at)enterprisedb(dot)com > wrote:
>> >
>> >
>> > Ah, sorry about that. For some reason my source tree was checked out
>> > from the 8.2 branch, instead of CVS HEAD.
>> >
>> >
>> I looked at the patch. Not that I am very comfortable with this part
>> of the code, but nevertheless here are my comments:
>>
>>
> Another problem that I noticed with the patch is that it disregards
> the fillfactor while inserting the tuples. ISTM that this is a correct
> behavior when a tuple is being inserted in the block to preserve cluster
> ordering. But when the tuple is being inserted as a normal operation,
> IMO we should respect the fillfactor.
>
> The easiest way to reproduce this is to insert tuples sorted on the
> cluster index key.

Actually it does respect the fillfactor when the block suggested by
indexam is full. When you insert tuples sorted on the cluster index key,
the suggested page after the first tuple on the new page is always the
last page. Try inserting in random order instead.

IOW it's working as designed. But maybe it's not the desired behavior.
Should we have a special case and always respect the fillfactor when
inserting to the last page of the heap?
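
For concreteness, a hypothetical version of the suggested-block check in
RelationGetBufferForTuple (see the hunk quoted later in the thread) that
would add such a special case:

    /*
     * Hypothetical variant, not in the patch: bypass the fillfactor for the
     * suggested block only when it is not the last page of the heap, so that
     * append-style inserts still leave the reserved space untouched.
     */
    if (targetBlock == suggestedBlock &&
        targetBlock != RelationGetNumberOfBlocks(relation) - 1 &&
        len <= pageFreeSpace)
        return buffer;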

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: "Jim C(dot) Nasby" <decibel(at)decibel(dot)org>
To: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc: Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>, Jaime Casanova <systemguards(at)gmail(dot)com>, PostgreSQL-patches <pgsql-patches(at)postgresql(dot)org>
Subject: Re: Maintaining cluster order on insert
Date: 2007-05-27 19:01:36
Message-ID: 20070527190136.GR92628@nasby.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

On Mon, May 21, 2007 at 10:48:59AM +0100, Heikki Linnakangas wrote:
> >Another problem that I noticed with the patch is that it disregards
> >the fillfactor while inserting the tuples. ISTM that this is a correct
> >behavior when a tuple is being inserted in the block to preserve cluster
> >ordering. But when the tuple is being inserted as a normal operation,
> >IMO we should respect the fillfactor.
> >
> >The easiest way to reproduce this is to insert tuples sorted on the
> >cluster index key.
>
> Actually it does respect the fillfactor when the block suggested by
> indexam is full. When you insert tuples sorted on the cluster index key,
> the suggested page after the first tuple on the new page is always the
> last page. Try inserting in random order instead.
>
> IOW it's working as designed. But maybe it's not the desired behavior.
> Should we have a special case and always respect the fillfactor when
> inserting to the last page of the heap?

I think that would follow the principle of least surprise.
--
Jim Nasby decibel(at)decibel(dot)org
EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)


From: "Jaime Casanova" <systemguards(at)gmail(dot)com>
To: "Jim C(dot) Nasby" <decibel(at)decibel(dot)org>
Cc: "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>, "Pavan Deolasee" <pavan(dot)deolasee(at)gmail(dot)com>, PostgreSQL-patches <pgsql-patches(at)postgresql(dot)org>
Subject: Re: Maintaining cluster order on insert
Date: 2007-06-16 04:48:46
Message-ID: c2d9e70e0706152148h7824eaafh499ba536eec82149@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

On 5/27/07, Jim C. Nasby <decibel(at)decibel(dot)org> wrote:
> On Mon, May 21, 2007 at 10:48:59AM +0100, Heikki Linnakangas wrote:
> >
> > IOW it's working as designed. But maybe it's not the desired behavior.
> > Should we have a special case and always respect the fillfactor when
> > inserting to the last page of the heap?
>
> I think that would follow the principle of least surprise.

What's the status of this patch? Are we waiting for an update?

AFAIU, it's not fair to say that the patch maintains cluster order...
it just tries to keep similar rows on the same page if possible (which
is not the same thing)... if it can't, then it simply inserts at the
last page as usual, but we have wasted time in the attempt...

So the real question is whether there is any performance win from this...
Do you have some numbers?

Another question: if the fillfactor is 100%, then it is a complete waste
of time to look for a suggested block. Maybe we could check for that?

--
regards,
Jaime Casanova

"Programming today is a race between software engineers striving to
build bigger and better idiot-proof programs and the universe trying
to produce bigger and better idiots.
So far, the universe is winning."
Richard Cook


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Jaime Casanova" <systemguards(at)gmail(dot)com>
Cc: "Jim C(dot) Nasby" <decibel(at)decibel(dot)org>, "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>, "Pavan Deolasee" <pavan(dot)deolasee(at)gmail(dot)com>, PostgreSQL-patches <pgsql-patches(at)postgresql(dot)org>
Subject: Re: Maintaining cluster order on insert
Date: 2007-06-16 14:56:24
Message-ID: 11028.1182005784@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

"Jaime Casanova" <systemguards(at)gmail(dot)com> writes:
> Another question: if the fillfactor is 100%, then it is a complete waste
> of time to look for a suggested block. Maybe we could check for that?

No, it isn't, since the page might have been vacuumed since it was last
filled up.

regards, tom lane


From: "Jaime Casanova" <systemguards(at)gmail(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Jim C(dot) Nasby" <decibel(at)decibel(dot)org>, "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>, "Pavan Deolasee" <pavan(dot)deolasee(at)gmail(dot)com>, PostgreSQL-patches <pgsql-patches(at)postgresql(dot)org>
Subject: Re: Maintaining cluster order on insert
Date: 2007-06-16 16:55:47
Message-ID: c2d9e70e0706160955t4851efd3hfebdb8df7b6b5468@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

On 6/16/07, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> "Jaime Casanova" <systemguards(at)gmail(dot)com> writes:
> > Another question: if the fillfactor is 100%, then it is a complete waste
> > of time to look for a suggested block. Maybe we could check for that?
>
> No, it isn't, since the page might have been vacuumed since it was last
> filled up.
>
> regards, tom lane
>

ahh... that vacuum thing!!! yeah!!! ;)

--
regards,
Jaime Casanova

"Programming today is a race between software engineers striving to
build bigger and better idiot-proof programs and the universe trying
to produce bigger and better idiots.
So far, the universe is winning."
Richard Cook


From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: Jaime Casanova <systemguards(at)gmail(dot)com>
Cc: "Jim C(dot) Nasby" <decibel(at)decibel(dot)org>, Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>, PostgreSQL-patches <pgsql-patches(at)postgresql(dot)org>
Subject: Re: Maintaining cluster order on insert
Date: 2007-06-17 06:22:19
Message-ID: 4674D31B.9050000@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

Jaime Casanova wrote:
> On 5/27/07, Jim C. Nasby <decibel(at)decibel(dot)org> wrote:
>> On Mon, May 21, 2007 at 10:48:59AM +0100, Heikki Linnakangas wrote:
>> >
>> > IOW it's working as designed. But maybe it's not the desired behavior.
>> > Should we have a special case and always respect the fillfactor when
>> > inserting to the last page of the heap?
>>
>> I think that would follow the principle of least surprise.
>
> What's the status of this patch? Are we waiting for an update?
>
> AFAIU, it's not fair to say that the patch maintains cluster order...
> it just tries to keep similar rows on the same page if possible (which
> is not the same thing)... if it can't, then it simply inserts at the
> last page as usual, but we have wasted time in the attempt...

That's right. Or more accurately, if there's no room on the same page,
it uses the FSM and if it still can't find room, it inserts at the last
page.
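
To make that order of preference concrete, here is a minimal sketch of the
logic being described (the real version is RelationGetBufferForTuple in
the patch; page_has_room is a hypothetical helper standing in for the
free-space checks, and fillfactor handling is glossed over):

    static BlockNumber
    choose_target_block(Relation rel, Size len, bool use_fsm,
                        BlockNumber suggestedBlock)
    {
        /* 1. the block suggested by the clustered index's am, if any */
        if (suggestedBlock != InvalidBlockNumber &&
            page_has_room(rel, suggestedBlock, len))
            return suggestedBlock;

        /* 2. the cached last-insertion target block */
        if (rel->rd_targblock != InvalidBlockNumber &&
            page_has_room(rel, rel->rd_targblock, len))
            return rel->rd_targblock;

        /* 3. a block found via the Free Space Map */
        if (use_fsm)
            return GetPageWithFreeSpace(rel, len);  /* may be InvalidBlockNumber */

        /* 4. no luck: the caller extends the relation, i.e. uses the last page */
        return InvalidBlockNumber;
    }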

> So the real question is whether there is any performance win from this...
> Do you have some numbers?

Hmm. I ran a small test with random inserts/deletes back in August. The
conclusion was that the table loses clustering a lot more slowly, but I can't
find the results now.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc: Jaime Casanova <systemguards(at)gmail(dot)com>, PostgreSQL-patches <pgsql-patches(at)postgresql(dot)org>
Subject: Re: Maintaining cluster order on insert
Date: 2007-06-17 20:55:58
Message-ID: 10328.1182113758@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

Heikki Linnakangas <heikki(at)enterprisedb(dot)com> writes:
> The implementation has changed a bit since August. I thought I had
> submitted an updated version in the winter but couldn't find it. Anyway,
> I updated and dusted off the source tree, tidied up the comments a
> little bit, and fixed some inconsistencies in pg_proc entries that made
> opr_sanity fail.

I started looking at this patch. My first reaction is that I liked last
August's API (an independent "suggestblock" call) a lot better. I think
trying to hold an index page lock across the heap insert is an extremely
bad idea; it will hurt concurrency and possibly cause deadlocks
(especially for unique indexes).

The other question is why is execMain involved in this? That makes the
design nonfunctional for tuples inserted in any other way than through
the main executor --- COPY for instance. Also, if this is successful,
I could see using it on system catalogs eventually. I'm inclined to
think that the right design is for heap_insert to call amsuggestblock
for itself, or maybe even push that down to RelationGetBufferForTuple.
(Note: having heap_insert contain logic that duplicates
RelationGetBufferForTuple's is another bit of bad design here, but
that's at least correctable locally.) Now the difficulty in that is
that the heapam.c routines don't have hold of any data structure
containing index knowledge ... but they do have hold of the Relation
structure for the heap. I suggest making RelationGetIndexList() cache
the OID of the clustered index (if any) in relcache entries, and add
RelationGetClusterIndex much like RelationGetOidIndex, and then
heap_insert can use that.
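
For concreteness, a minimal sketch of such an accessor (hypothetical, never
committed; it assumes a new relcache field rd_clusteridx that
RelationGetIndexList() would fill in, analogous to rd_oidindex):

    Oid
    RelationGetClusterIndex(Relation relation)
    {
        /* Building the index list also populates the cached OID */
        list_free(RelationGetIndexList(relation));

        /* hypothetical field: InvalidOid if the relation has no clustered index */
        return relation->rd_clusteridx;
    }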

regards, tom lane


From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Jaime Casanova <systemguards(at)gmail(dot)com>, PostgreSQL-patches <pgsql-patches(at)postgresql(dot)org>
Subject: Re: Maintaining cluster order on insert
Date: 2007-06-18 05:52:55
Message-ID: 46761DB7.7070505@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

Tom Lane wrote:
> Heikki Linnakangas <heikki(at)enterprisedb(dot)com> writes:
>> The implementation has changed a bit since August. I thought I had
>> submitted an updated version in the winter but couldn't find it. Anyway,
>> I updated and dusted off the source tree, tidied up the comments a
>> little bit, and fixed some inconsistencies in pg_proc entries that made
>> opr_sanity fail.
>
> I started looking at this patch. My first reaction is that I liked last
> August's API (an independent "suggestblock" call) a lot better. I think
> trying to hold an index page lock across the heap insert is an extremely
> bad idea; it will hurt concurrency and possibly cause deadlocks
> (especially for unique indexes).

The index page is not kept locked across the calls. Just pinned.

The reason for switching to the new API instead of the amsuggestblock
API is CPU overhead. It avoids constructing the IndexTuple twice and
descending the tree twice.

Clustering is mainly useful when the table doesn't fit in cache, so one
could argue that if you care about clustering you're most likely I/O
bound and don't care about the CPU overhead that much. Nevertheless,
avoiding it seems like a good idea to me.

The amsuggestblock API is simpler, though. We might choose it on those
grounds alone.

> The other question is why is execMain involved in this? That makes the
> design nonfunctional for tuples inserted in any other way than through
> the main executor --- COPY for instance. Also, if this is successful,
> I could see using it on system catalogs eventually. I'm inclined to
> think that the right design is for heap_insert to call amsuggestblock
> for itself, or maybe even push that down to RelationGetBufferForTuple.
> (Note: having heap_insert contain logic that duplicates
> RelationGetBufferForTuple's is another bit of bad design here, but
> that's at least correctable locally.) Now the difficulty in that is
> that the heapam.c routines don't have hold of any data structure
> containing index knowledge ... but they do have hold of the Relation
> structure for the heap. I suggest making RelationGetIndexList() cache
> the OID of the clustered index (if any) in relcache entries, and add
> RelationGetClusterIndex much like RelationGetOidIndex, and then
> heap_insert can use that.

Hmm. My first reaction to that is that having heap_insert reach into an
index is a modularity violation. It wouldn't be too hard to make similar
changes to COPY as I did to the executor.

I doubt it's very useful for system catalogs; they should never grow
very large, and should stay mostly cached anyway.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>
Cc: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Jaime Casanova" <systemguards(at)gmail(dot)com>, "PostgreSQL-patches" <pgsql-patches(at)postgresql(dot)org>
Subject: Re: Maintaining cluster order on insert
Date: 2007-06-18 08:49:28
Message-ID: 87wsy1hckn.fsf@oxford.xeocode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

"Heikki Linnakangas" <heikki(at)enterprisedb(dot)com> writes:

> The reason for switching to the new API instead of the amsuggestblock API is
> CPU overhead. It avoids constructing the IndexTuple twice and descending the
> tree twice.

I wonder if it's possible to finesse this. Have the suggestblock function
remember the last block it suggested and somehow recognize when it's given the
same tuple to index. Keeping the block pinned would still be the sticky point
though.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc: Jaime Casanova <systemguards(at)gmail(dot)com>, PostgreSQL-patches <pgsql-patches(at)postgresql(dot)org>
Subject: Re: Maintaining cluster order on insert
Date: 2007-06-25 23:52:22
Message-ID: 8244.1182815542@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

[ back to this thread... ]

Heikki Linnakangas <heikki(at)enterprisedb(dot)com> writes:
> Tom Lane wrote:
>> The other question is why is execMain involved in this? That makes the
>> design nonfunctional for tuples inserted in any other way than through
>> the main executor --- COPY for instance. Also, if this is successful,
>> I could see using it on system catalogs eventually. I'm inclined to
>> think that the right design is for heap_insert to call amsuggestblock
>> for itself, or maybe even push that down to RelationGetBufferForTuple.

> Hmm. My first reaction to that is that having heap_insert reach into an
> index is a modularity violation. It wouldn't be too hard to make similar
> changes to COPY as I did to the executor.

Well, this entire patch is a modularity violation by definition.
Bringing the executor into it just makes the damage worse, by involving
relatively high-level code in the communication between two relatively
low-level bits of code, and ensuring that we'll have to involve still
other bits of high-level code in future if we want to make the behavior
work in more scenarios.

There is a problem with my suggestion of relying on the relcache:
that would only give you the index relation's OID, not a handle to an
open Relation for it. I'm not sure this is a killer, however, as the
cost of an index_open/index_close cycle isn't really much more than a
dynahash search if the index is already open and locked by some upper
code level.
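
For illustration, a hypothetical sketch of what heap_insert could do with
only the clustered index's OID in hand (it assumes the
RelationGetClusterIndex accessor sketched earlier in the thread and the
patch's index_suggestblock; the lock level is illustrative, and forming
the index datums from the new heap tuple is glossed over):

    BlockNumber suggestedBlock = InvalidBlockNumber;
    Oid         clusterIndexOid = RelationGetClusterIndex(relation);

    if (OidIsValid(clusterIndexOid))
    {
        Relation    indexRel = index_open(clusterIndexOid, AccessShareLock);
        Datum       values[INDEX_MAX_KEYS];
        bool        isnull[INDEX_MAX_KEYS];

        /* ... fill values/isnull for the index columns from the new tuple ... */

        suggestedBlock = index_suggestblock(indexRel, values, isnull, relation);
        index_close(indexRel, AccessShareLock);
    }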

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc: Jaime Casanova <systemguards(at)gmail(dot)com>, "Jim C(dot) Nasby" <decibel(at)decibel(dot)org>, Bruce Momjian <bruce(at)momjian(dot)us>, PostgreSQL-patches <pgsql-patches(at)postgresql(dot)org>
Subject: Re: Maintaining cluster order on insert
Date: 2007-07-09 23:29:32
Message-ID: 457.1184023772@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

[ back to the cluster-order patch ]

Awhile back, Heikki Linnakangas <heikki(at)enterprisedb(dot)com> wrote:
> The performance characteristics of this patch haven't been thoroughly
> discussed yet. The reason why you want to cluster your tables is to
> speed up SELECTs that return a bunch of tuples with similar values, for
> example range queries. The reason for keeping them clustered on inserts
> is to reduce the need to run CLUSTER as often.

> It doesn't come without a cost, however. In the worst case, there never
> is room for new inserts on pages, and each insert needs to do one extra
> I/O to fetch the optimal heap page where the insert should go, see that
> there's no room, and then insert somewhere else. Using a non-zero
> fillfactor helps, but even when there is room on the page, it's often
> cheaper to just append to the end of the table and run CLUSTER at
> night, for example, than to do random access to insert into the "right"
> pages in the heap.

I looked through the thread and could not find any clear evidence that
anyone had done any performance testing of this patch at all, so I
hacked together a crude test that alternates between inserting/deleting
a random subset of rows and SELECTing a range of rows. The results
do not look very good: while there is detectable improvement in the
SELECT performance, it's not much, and it comes at a *very* sizable
penalty in INSERT performance.

Attached is a test program, which I ran against today's CVS HEAD with
and without the v8 version of the patch. The test program sets up
a clustered million-row table with a row width chosen to fit about
20 rows per page. It then iterates a loop of:

* delete a random 2% of the table
* vacuum to recover space
* insert a random 2% of the table
* select (about) 1000 consecutively-numbered rows
* select all the rows (this is just a cross check that the
number of rows isn't changing too much)

What you would hope to see as the benefit of the patch is that the time
for the range SELECT degrades more slowly as more of the table is
replaced. Ignoring the first SELECT as being a startup transient, it
looks like HEAD degrades from about 3 msec to 6 msec over 10 iterations
(20% replacement of the table), whereas with the patch it's about 3 msec
to about 4 and a half. However, the INSERT steps went from around 20
sec each to about twice that.

I used just the following nondefault parameter settings:

shared_buffers = 10000 # min 128kB or max_connections*16kB
maintenance_work_mem = 160MB # min 1MB
max_fsm_pages = 2048000 # min max_fsm_relations*16, 6 bytes each
wal_buffers = 640kB # min 32kB
checkpoint_segments = 30 # in logfile segments, min 1, 16MB each
enable_bitmapscan = off
enable_seqscan = off
autovacuum = off # enable autovacuum subprocess?

which for the most part are just 10x the factory defaults. The hardware
is just a Dell x86_64 workstation with crappy IDE disk, so maybe things
would look better elsewhere, but it's all I have to work with.

Considering that the patch is really severely ugly from a modularity and
layering standpoint, I'm now inclined to reject it. AFAICT this test
case is showing the patch at its best possible advantage; under
real-world conditions the benefit would be less. It doesn't look to me
like the gain is worth the loss of system understandability and
maintainability that the patch would impose.

regards, tom lane

Attachment Content-Type Size
unknown_filename text/plain 3.4 KB
unknown_filename text/plain 2.0 KB
unknown_filename text/plain 2.0 KB

From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Jaime Casanova <systemguards(at)gmail(dot)com>, "Jim C(dot) Nasby" <decibel(at)decibel(dot)org>, Bruce Momjian <bruce(at)momjian(dot)us>, PostgreSQL-patches <pgsql-patches(at)postgresql(dot)org>
Subject: Re: Maintaining cluster order on insert
Date: 2007-07-10 09:42:02
Message-ID: 4693546A.2050709@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

Tom Lane wrote:
> What you would hope to see as the benefit of the patch is that the time
> for the range SELECT degrades more slowly as more of the table is
> replaced. Ignoring the first SELECT as being a startup transient, it
> looks like HEAD degrades from about 3 msec to 6 msec over 10 iterations
> (20% replacement of the table), whereas with the patch it's about 3 msec
> to about 4 and a half. However, the INSERT steps went from around 20
> sec each to about twice that.

Clustering in general isn't very important on a table that fits in cache.

Also note that the overhead of descending the b-tree twice on INSERTs
hurts the most in a CPU bound test like that. The overhead of the
two-phase method I proposed should be lower.

Test 1:
> executed CREATE in 0.066563 sec
> executed INSERT in 40.465653 sec
> executed CREATE in 9.152698 sec
> executed CLUSTE in 20.036375 sec
> executed VACUUM in 1.440232 sec
Test 2:
> executed CREATE in 0.086862 sec
> executed INSERT in 50.746362 sec
> executed CREATE in 12.115655 sec
> executed CLUSTE in 33.656341 sec
> executed VACUUM in 4.306563 sec

Why is there such a big difference in all these initialization steps as
well? Is it just random noise?

I played a bit with that test program, results from my laptop attached.
I used a version patched with the latest patch I submitted, because
that's what I had readily available. I used the same patched binaries in
both test runs, I just didn't CLUSTER the table in the other run. The
main difference from your results is that the DELETE, VACUUM and INSERT
operations are much faster both with and without the patch. Most INSERTs
for example took < 1 s, and in your results they took > 15 s. Any idea why?

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

Attachment Content-Type Size
clustered.log text/x-log 3.2 KB
nonclustered.log text/x-log 3.2 KB
changedconf.txt text/x-patch 775 bytes

From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>, "Jaime Casanova" <systemguards(at)gmail(dot)com>, "Jim C(dot) Nasby" <decibel(at)decibel(dot)org>, "Bruce Momjian" <bruce(at)momjian(dot)us>, "PostgreSQL-patches" <pgsql-patches(at)postgresql(dot)org>
Subject: Re: Maintaining cluster order on insert
Date: 2007-07-10 11:23:23
Message-ID: 87odikzems.fsf@oxford.xeocode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches


"Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:

> * delete a random 2% of the table
> * vacuum to recover space
> * insert a random 2% of the table
> * select (about) 1000 consecutively-numbered rows
> * select all the rows (this is just a cross check that the
> number of rows isn't changing too much)
>
> What you would hope to see as the benefit of the patch is that the time
> for the range SELECT degrades more slowly as more of the table is
> replaced. Ignoring the first SELECT as being a startup transient, it
> looks like HEAD degrades from about 3 msec to 6 msec over 10 iterations
> (20% replacement of the table), whereas with the patch it's about 3 msec
> to about 4 and a half. However, the INSERT steps went from around 20
> sec each to about twice that.

I would suggest:

a) continuing the test until you have 100% replacement.

b) defining the tables with a fillfactor that leaves enough free space for
at least one extra tuple per page

c) considering this patch alongside GII where it is especially important.

It's pretty serious what you're suggesting since it means that we'll basically
never have a real cluster feature. I would sure hope we're missing something
and there's a way to make this work usefully.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com


From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: Gregory Stark <stark(at)enterprisedb(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Jaime Casanova <systemguards(at)gmail(dot)com>, "Jim C(dot) Nasby" <decibel(at)decibel(dot)org>, Bruce Momjian <bruce(at)momjian(dot)us>, PostgreSQL-patches <pgsql-patches(at)postgresql(dot)org>
Subject: Re: Maintaining cluster order on insert
Date: 2007-07-10 11:43:06
Message-ID: 469370CA.1060608@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

Gregory Stark wrote:
> It's pretty serious what you're suggesting since it means that we'll basically
> never have a real cluster feature. I would sure hope we're missing something
> and there's a way to make this work usefully.

Another approach would be an online CLUSTER command. That means there'll
be a lot of churn when tuples need to be moved back and forth, along
with updating indexes, but it'd take the overhead out of the critical path.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Greg Smith <gsmith(at)gregsmith(dot)com>
To: PostgreSQL-patches <pgsql-patches(at)postgresql(dot)org>
Subject: Re: Maintaining cluster order on insert
Date: 2007-07-10 12:52:54
Message-ID: Pine.GSO.4.64.0707100815400.22343@westnet.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

On Mon, 9 Jul 2007, Tom Lane wrote:

> The hardware is just a Dell x86_64 workstation with crappy IDE disk, so
> maybe things would look better elsewhere, but it's all I have to work
> with.

Do you have write-caching turned off on the drive so INSERTs are being
rate-limited by WAL syncs? Trying to characterize the class of setup
you're using.

The part that seemed curious to me about your results in the unpatched
version is why the first INSERT takes 2.4 seconds, while the second takes
12.2 s and later ones settle in the 17-23 s range. I could understand the
12-23 variation, but that the first one fires off in 2.4 s seems kind of
fishy; why so fast? Is there something that still fits in memory in that
case but doesn't in the patched version?

results-head:
...
executed DELETE in 14.770937 sec
executed VACUUM in 10.663301 sec
executed INSERT in 2.449248 sec (1st)
...
executed INSERT in 12.212027 sec (2nd)

results-patch:
...
executed DELETE in 18.062664 sec
executed VACUUM in 28.487570 sec
executed INSERT in 25.638022 sec (1st)
...
executed INSERT in 40.759404 sec (2nd)

--
* Greg Smith gsmith(at)gregsmith(dot)com http://www.gregsmith.com Baltimore, MD


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Gregory Stark <stark(at)enterprisedb(dot)com>
Cc: "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>, "Jaime Casanova" <systemguards(at)gmail(dot)com>, "Jim C(dot) Nasby" <decibel(at)decibel(dot)org>, "Bruce Momjian" <bruce(at)momjian(dot)us>, "PostgreSQL-patches" <pgsql-patches(at)postgresql(dot)org>
Subject: Re: Maintaining cluster order on insert
Date: 2007-07-10 13:41:01
Message-ID: 18628.1184074861@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

Gregory Stark <stark(at)enterprisedb(dot)com> writes:
> It's pretty serious what you're suggesting since it means that we'll
> basically never have a real cluster feature.

It seems entirely likely that this is not the way to go about "real
clustering".

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Greg Smith <gsmith(at)gregsmith(dot)com>
Cc: PostgreSQL-patches <pgsql-patches(at)postgresql(dot)org>
Subject: Re: Maintaining cluster order on insert
Date: 2007-07-10 14:34:39
Message-ID: 19371.1184078079@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

Greg Smith <gsmith(at)gregsmith(dot)com> writes:
> On Mon, 9 Jul 2007, Tom Lane wrote:
>> The hardware is just a Dell x86_64 workstation with crappy IDE disk, so
>> maybe things would look better elsewhere, but it's all I have to work
>> with.

> Do you have write-caching turned off on the drive so INSERTs are being
> rate-limited by WAL syncs? Trying to characterize the class of setup
> you're using.

It's just desktop-grade junk :-(. Feel free to repeat the test on
something more serious.

> The part that seemed curious to me about your results in the unpatched
> version is why the first INSERT takes 2.4 seconds, while the second takes
> 12.2 then later ones settle from 17-23 s. I could understand the 12-23
> variation, but that the first one fires off in 2.4 seems kind of fishy;

The test seemed to be mostly I/O bound according to vmstat. I supposed
that it was fast up until the kernel's available memory was full of dirty
pages, and then subsequent writes would be constrained by actual disk
write bandwidth. Anyway the numbers seemed relatively consistent after
the setup and first test cycle, so I think the part I was actually
trying to draw conclusions from was probably real enough.

regards, tom lane


From: Greg Smith <gsmith(at)gregsmith(dot)com>
To: PostgreSQL-patches <pgsql-patches(at)postgresql(dot)org>
Subject: Re: Maintaining cluster order on insert
Date: 2007-07-10 15:25:30
Message-ID: Pine.GSO.4.64.0707101045240.2388@westnet.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

On Tue, 10 Jul 2007, Tom Lane wrote:

> It's just desktop-grade junk :-(. Feel free to repeat the test on
> something more serious.

Right, but even such junk can be set up such that the disks honor commits,
just wanted to confirm you didn't go out of your way to do that--sounds
like you didn't. My mini-lab at home has two PG test systems, one with a
real write cache, the other has an IDE drive with the cache turned off so
commits actually wait to hit the platter. The painful slowness of writes
in that situation keeps me grounded as to how much data integrity actually
costs if you're not accelerating it, and the differences in benchmarks
between the two systems (which are otherwise a similar class of hardware)
can be informative.

> Anyway the numbers seemed relatively consistent after the setup and
> first test cycle, so I think the part I was actually trying to draw
> conclusions from was probably real enough.

Agreed, just figuring out the test ramp-up situation, and your explanation
for that quirk sounds reasonable.

--
* Greg Smith gsmith(at)gregsmith(dot)com http://www.gregsmith.com Baltimore, MD


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org, pgsql-patches(at)postgresql(dot)org
Subject: Re: Maintaining cluster order on insert
Date: 2008-04-11 19:38:30
Message-ID: 200804111938.m3BJcUj02170@momjian.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches


This idea has been rejected due to poor performance results reported
later in the thread.

---------------------------------------------------------------------------

Heikki Linnakangas wrote:
> While thinking about index-organized-tables and similar ideas, it
> occurred to me that there's some low-hanging-fruit: maintaining cluster
> order on inserts by trying to place new heap tuples close to other
> similar tuples. That involves asking the index am where on the heap the
> new tuple should go, and trying to insert it there before using the FSM.
> Using the new fillfactor parameter makes it more likely that there's
> room on the page. We don't worry about the order within the page.
>
> The API I'm thinking of introduces a new optional index am function,
> amsuggestblock (suggestions for a better name are welcome). It gets the
> same parameters as aminsert, and returns the heap block number that
> would be optimal place to put the new tuple. It's be called from
> ExecInsert before inserting the heap tuple, and the suggestion is passed
> on to heap_insert and RelationGetBufferForTuple.
>
> I wrote a little patch to implement this for btree, attached.
>
> This could be optimized by changing the existing aminsert API, because
> as it is, an insert will have to descend the btree twice. Once in
> amsuggestblock and then in aminsert. amsuggestblock could keep the right
> index page pinned so aminsert could locate it quicker. But I wanted to
> keep this simple for now. Another improvement might be to allow
> amsuggestblock to return a list of suggestions, but that makes it more
> expensive to insert if there isn't room in the suggested pages, since
> heap_insert will have to try them all before giving up.
>
> Comments regarding the general idea or the patch? There should probably
> be a index option to turn the feature on and off. You'll want to turn it
> off when you first load a table, and turn it on after CLUSTER to keep it
> clustered.
>
> Since there's been discussion on keeping the TODO list more up-to-date,
> I hereby officially claim the "Automatically maintain clustering on a
> table" TODO item :). Feel free to bombard me with requests for status
> reports. And just to be clear, I'm not trying to sneak this into 8.2
> anymore, this is 8.3 stuff.
>
> I won't be implementing a background daemon described on the TODO item,
> since that would essentially be an online version of CLUSTER. Which sure
> would be nice, but that's a different story.
>
> - Heikki
>

[ text/x-patch is unsupported, treating like TEXT/PLAIN ]

> Index: doc/src/sgml/catalogs.sgml
> ===================================================================
> RCS file: /home/hlinnaka/pgcvsrepository/pgsql/doc/src/sgml/catalogs.sgml,v
> retrieving revision 2.129
> diff -c -r2.129 catalogs.sgml
> *** doc/src/sgml/catalogs.sgml 31 Jul 2006 20:08:55 -0000 2.129
> --- doc/src/sgml/catalogs.sgml 8 Aug 2006 16:17:21 -0000
> ***************
> *** 499,504 ****
> --- 499,511 ----
> <entry>Function to parse and validate reloptions for an index</entry>
> </row>
>
> + <row>
> + <entry><structfield>amsuggestblock</structfield></entry>
> + <entry><type>regproc</type></entry>
> + <entry><literal><link linkend="catalog-pg-proc"><structname>pg_proc</structname></link>.oid</literal></entry>
> + <entry>Get the best place in the heap to put a new tuple</entry>
> + </row>
> +
> </tbody>
> </tgroup>
> </table>
> Index: doc/src/sgml/indexam.sgml
> ===================================================================
> RCS file: /home/hlinnaka/pgcvsrepository/pgsql/doc/src/sgml/indexam.sgml,v
> retrieving revision 2.16
> diff -c -r2.16 indexam.sgml
> *** doc/src/sgml/indexam.sgml 31 Jul 2006 20:08:59 -0000 2.16
> --- doc/src/sgml/indexam.sgml 8 Aug 2006 17:15:25 -0000
> ***************
> *** 391,396 ****
> --- 391,414 ----
> <function>amoptions</> to test validity of options settings.
> </para>
>
> + <para>
> + <programlisting>
> + BlockNumber
> + amsuggestblock (Relation indexRelation,
> + Datum *values,
> + bool *isnull,
> + Relation heapRelation);
> + </programlisting>
> + Gets the optimal place in the heap for a new tuple. The parameters
> + correspond the parameters for <literal>aminsert</literal>.
> + This function is called on the clustered index before a new tuple
> + is inserted to the heap, and it should choose the optimal insertion
> + target page on the heap in such manner that the heap stays as close
> + as possible to the index order.
> + <literal>amsuggestblock</literal> can return InvalidBlockNumber if
> + the index am doesn't have a suggestion.
> + </para>
> +
> </sect1>
>
> <sect1 id="index-scanning">
> Index: src/backend/access/heap/heapam.c
> ===================================================================
> RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/access/heap/heapam.c,v
> retrieving revision 1.218
> diff -c -r1.218 heapam.c
> *** src/backend/access/heap/heapam.c 31 Jul 2006 20:08:59 -0000 1.218
> --- src/backend/access/heap/heapam.c 8 Aug 2006 16:17:21 -0000
> ***************
> *** 1325,1330 ****
> --- 1325,1335 ----
> * use_fsm is passed directly to RelationGetBufferForTuple, which see for
> * more info.
> *
> + * suggested_blk can be set by the caller to hint heap_insert which
> + * block would be the best place to put the new tuple in. heap_insert can
> + * ignore the suggestion, if there's not enough room on that block.
> + * InvalidBlockNumber means no preference.
> + *
> * The return value is the OID assigned to the tuple (either here or by the
> * caller), or InvalidOid if no OID. The header fields of *tup are updated
> * to match the stored tuple; in particular tup->t_self receives the actual
> ***************
> *** 1333,1339 ****
> */
> Oid
> heap_insert(Relation relation, HeapTuple tup, CommandId cid,
> ! bool use_wal, bool use_fsm)
> {
> TransactionId xid = GetCurrentTransactionId();
> HeapTuple heaptup;
> --- 1338,1344 ----
> */
> Oid
> heap_insert(Relation relation, HeapTuple tup, CommandId cid,
> ! bool use_wal, bool use_fsm, BlockNumber suggested_blk)
> {
> TransactionId xid = GetCurrentTransactionId();
> HeapTuple heaptup;
> ***************
> *** 1386,1392 ****
>
> /* Find buffer to insert this tuple into */
> buffer = RelationGetBufferForTuple(relation, heaptup->t_len,
> ! InvalidBuffer, use_fsm);
>
> /* NO EREPORT(ERROR) from here till changes are logged */
> START_CRIT_SECTION();
> --- 1391,1397 ----
>
> /* Find buffer to insert this tuple into */
> buffer = RelationGetBufferForTuple(relation, heaptup->t_len,
> ! InvalidBuffer, use_fsm, suggested_blk);
>
> /* NO EREPORT(ERROR) from here till changes are logged */
> START_CRIT_SECTION();
> ***************
> *** 1494,1500 ****
> Oid
> simple_heap_insert(Relation relation, HeapTuple tup)
> {
> ! return heap_insert(relation, tup, GetCurrentCommandId(), true, true);
> }
>
> /*
> --- 1499,1506 ----
> Oid
> simple_heap_insert(Relation relation, HeapTuple tup)
> {
> ! return heap_insert(relation, tup, GetCurrentCommandId(), true,
> ! true, InvalidBlockNumber);
> }
>
> /*
> ***************
> *** 2079,2085 ****
> {
> /* Assume there's no chance to put heaptup on same page. */
> newbuf = RelationGetBufferForTuple(relation, heaptup->t_len,
> ! buffer, true);
> }
> else
> {
> --- 2085,2092 ----
> {
> /* Assume there's no chance to put heaptup on same page. */
> newbuf = RelationGetBufferForTuple(relation, heaptup->t_len,
> ! buffer, true,
> ! InvalidBlockNumber);
> }
> else
> {
> ***************
> *** 2096,2102 ****
> */
> LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
> newbuf = RelationGetBufferForTuple(relation, heaptup->t_len,
> ! buffer, true);
> }
> else
> {
> --- 2103,2110 ----
> */
> LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
> newbuf = RelationGetBufferForTuple(relation, heaptup->t_len,
> ! buffer, true,
> ! InvalidBlockNumber);
> }
> else
> {
> Index: src/backend/access/heap/hio.c
> ===================================================================
> RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/access/heap/hio.c,v
> retrieving revision 1.63
> diff -c -r1.63 hio.c
> *** src/backend/access/heap/hio.c 3 Jul 2006 22:45:37 -0000 1.63
> --- src/backend/access/heap/hio.c 9 Aug 2006 18:03:01 -0000
> ***************
> *** 93,98 ****
> --- 93,100 ----
> * any committed data of other transactions. (See heap_insert's comments
> * for additional constraints needed for safe usage of this behavior.)
> *
> + * If the caller has a suggestion, it's passed in suggestedBlock.
> + *
> * We always try to avoid filling existing pages further than the fillfactor.
> * This is OK since this routine is not consulted when updating a tuple and
> * keeping it on the same page, which is the scenario fillfactor is meant
> ***************
> *** 103,109 ****
> */
> Buffer
> RelationGetBufferForTuple(Relation relation, Size len,
> ! Buffer otherBuffer, bool use_fsm)
> {
> Buffer buffer = InvalidBuffer;
> Page pageHeader;
> --- 105,112 ----
> */
> Buffer
> RelationGetBufferForTuple(Relation relation, Size len,
> ! Buffer otherBuffer, bool use_fsm,
> ! BlockNumber suggestedBlock)
> {
> Buffer buffer = InvalidBuffer;
> Page pageHeader;
> ***************
> *** 135,142 ****
> otherBlock = InvalidBlockNumber; /* just to keep compiler quiet */
>
> /*
> ! * We first try to put the tuple on the same page we last inserted a tuple
> ! * on, as cached in the relcache entry. If that doesn't work, we ask the
> * shared Free Space Map to locate a suitable page. Since the FSM's info
> * might be out of date, we have to be prepared to loop around and retry
> * multiple times. (To insure this isn't an infinite loop, we must update
> --- 138,147 ----
> otherBlock = InvalidBlockNumber; /* just to keep compiler quiet */
>
> /*
> ! * We first try to put the tuple on the page suggested by the caller, if
> ! * any. Then we try to put the tuple on the same page we last inserted a
> ! * tuple on, as cached in the relcache entry. If that doesn't work, we
> ! * ask the
> * shared Free Space Map to locate a suitable page. Since the FSM's info
> * might be out of date, we have to be prepared to loop around and retry
> * multiple times. (To insure this isn't an infinite loop, we must update
> ***************
> *** 144,152 ****
> * not to be suitable.) If the FSM has no record of a page with enough
> * free space, we give up and extend the relation.
> *
> ! * When use_fsm is false, we either put the tuple onto the existing target
> ! * page or extend the relation.
> */
> if (len + saveFreeSpace <= MaxTupleSize)
> targetBlock = relation->rd_targblock;
> else
> --- 149,167 ----
> * not to be suitable.) If the FSM has no record of a page with enough
> * free space, we give up and extend the relation.
> *
> ! * When use_fsm is false, we skip the fsm lookup if neither the suggested
> ! * nor the cached last insertion page has enough room, and extend the
> ! * relation.
> ! *
> ! * The fillfactor is taken into account when calculating the free space
> ! * on the cached target block, and when using the FSM. The suggested page
> ! * is used whenever there's enough room in it, regardless of the fillfactor,
> ! * because that's exactly the purpose the space is reserved for in the
> ! * first place.
> */
> + if (suggestedBlock != InvalidBlockNumber)
> + targetBlock = suggestedBlock;
> + else
> if (len + saveFreeSpace <= MaxTupleSize)
> targetBlock = relation->rd_targblock;
> else
> ***************
> *** 219,224 ****
> --- 234,244 ----
> */
> pageHeader = (Page) BufferGetPage(buffer);
> pageFreeSpace = PageGetFreeSpace(pageHeader);
> +
> + /* If we're trying the suggested block, don't care about fillfactor */
> + if (targetBlock == suggestedBlock && len <= pageFreeSpace)
> + return buffer;
> +
> if (len + saveFreeSpace <= pageFreeSpace)
> {
> /* use this page as future insert target, too */
> ***************
> *** 241,246 ****
> --- 261,275 ----
> ReleaseBuffer(buffer);
> }
>
> + /* If we just tried the suggested block, try the cached target
> + * block next, before consulting the FSM. */
> + if(suggestedBlock == targetBlock)
> + {
> + targetBlock = relation->rd_targblock;
> + suggestedBlock = InvalidBlockNumber;
> + continue;
> + }
> +
> /* Without FSM, always fall out of the loop and extend */
> if (!use_fsm)
> break;
> Index: src/backend/access/index/genam.c
> ===================================================================
> RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/access/index/genam.c,v
> retrieving revision 1.58
> diff -c -r1.58 genam.c
> *** src/backend/access/index/genam.c 31 Jul 2006 20:08:59 -0000 1.58
> --- src/backend/access/index/genam.c 8 Aug 2006 16:17:21 -0000
> ***************
> *** 259,261 ****
> --- 259,275 ----
>
> pfree(sysscan);
> }
> +
> + /*
> + * This is a dummy implementation of amsuggestblock, to be used for index
> + * access methods that don't or can't support it. It just returns
> + * InvalidBlockNumber, which means "no preference".
> + *
> + * This is probably not a good best place for this function, but it doesn't
> + * fit naturally anywhere else either.
> + */
> + Datum
> + dummysuggestblock(PG_FUNCTION_ARGS)
> + {
> + PG_RETURN_UINT32(InvalidBlockNumber);
> + }
> Index: src/backend/access/index/indexam.c
> ===================================================================
> RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/access/index/indexam.c,v
> retrieving revision 1.94
> diff -c -r1.94 indexam.c
> *** src/backend/access/index/indexam.c 31 Jul 2006 20:08:59 -0000 1.94
> --- src/backend/access/index/indexam.c 8 Aug 2006 16:17:21 -0000
> ***************
> *** 18,23 ****
> --- 18,24 ----
> * index_rescan - restart a scan of an index
> * index_endscan - end a scan
> * index_insert - insert an index tuple into a relation
> + * index_suggestblock - get desired insert location for a heap tuple
> * index_markpos - mark a scan position
> * index_restrpos - restore a scan position
> * index_getnext - get the next tuple from a scan
> ***************
> *** 202,207 ****
> --- 203,237 ----
> BoolGetDatum(check_uniqueness)));
> }
>
> + /* ----------------
> + * index_suggestblock - get desired insert location for a heap tuple
> + *
> + * The returned BlockNumber is the *heap* page that is the best place
> + * to insert the given tuple to, according to the index am. The best
> + * place is usually one that maintains the cluster order.
> + * ----------------
> + */
> + BlockNumber
> + index_suggestblock(Relation indexRelation,
> + Datum *values,
> + bool *isnull,
> + Relation heapRelation)
> + {
> + FmgrInfo *procedure;
> +
> + RELATION_CHECKS;
> + GET_REL_PROCEDURE(amsuggestblock);
> +
> + /*
> + * have the am's suggestblock proc do all the work.
> + */
> + return DatumGetUInt32(FunctionCall4(procedure,
> + PointerGetDatum(indexRelation),
> + PointerGetDatum(values),
> + PointerGetDatum(isnull),
> + PointerGetDatum(heapRelation)));
> + }
> +
> /*
> * index_beginscan - start a scan of an index with amgettuple
> *
> Index: src/backend/access/nbtree/nbtinsert.c
> ===================================================================
> RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/access/nbtree/nbtinsert.c,v
> retrieving revision 1.142
> diff -c -r1.142 nbtinsert.c
> *** src/backend/access/nbtree/nbtinsert.c 25 Jul 2006 19:13:00 -0000 1.142
> --- src/backend/access/nbtree/nbtinsert.c 9 Aug 2006 17:51:33 -0000
> ***************
> *** 146,151 ****
> --- 146,221 ----
> }
>
> /*
> + * _bt_suggestblock() -- Find the heap block of the closest index tuple.
> + *
> + * The logic to find the target should match _bt_doinsert, otherwise
> + * we'll be making bad suggestions.
> + */
> + BlockNumber
> + _bt_suggestblock(Relation rel, IndexTuple itup, Relation heapRel)
> + {
> + int natts = rel->rd_rel->relnatts;
> + OffsetNumber offset;
> + Page page;
> + BTPageOpaque opaque;
> +
> + ScanKey itup_scankey;
> + BTStack stack;
> + Buffer buf;
> + IndexTuple curitup;
> + BlockNumber suggestion = InvalidBlockNumber;
> +
> + /* we need an insertion scan key to do our search, so build one */
> + itup_scankey = _bt_mkscankey(rel, itup);
> +
> + /* find the first page containing this key */
> + stack = _bt_search(rel, natts, itup_scankey, false, &buf, BT_READ);
> + if(!BufferIsValid(buf))
> + {
> + /* The index was completely empty. No suggestion then. */
> + return InvalidBlockNumber;
> + }
> + /* we don't need the stack, so free it right away */
> + _bt_freestack(stack);
> +
> + page = BufferGetPage(buf);
> + opaque = (BTPageOpaque) PageGetSpecialPointer(page);
> +
> + /* Find the location in the page where the new index tuple would go to. */
> +
> + offset = _bt_binsrch(rel, buf, natts, itup_scankey, false);
> + if (offset > PageGetMaxOffsetNumber(page))
> + {
> + /* _bt_binsrch returned pointer to end-of-page. It means that
> + * there was no equal items on the page, and the new item should
> + * be inserted as the last tuple of the page. There could be equal
> + * items on the next page, however.
> + *
> + * At the moment, we just ignore the potential equal items on the
> + * right, and pretend there isn't any. We could instead walk right
> + * to the next page to check that, but let's keep it simple for now.
> + */
> + offset = OffsetNumberPrev(offset);
> + }
> + if(offset < P_FIRSTDATAKEY(opaque))
> + {
> + /* We landed on an empty page. We could step left or right until
> + * we find some items, but let's keep it simple for now.
> + */
> + } else {
> + /* We're now positioned at the index tuple that we're interested in. */
> +
> + curitup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offset));
> + suggestion = ItemPointerGetBlockNumber(&curitup->t_tid);
> + }
> +
> + _bt_relbuf(rel, buf);
> + _bt_freeskey(itup_scankey);
> +
> + return suggestion;
> + }
> +
> + /*
> * _bt_check_unique() -- Check for violation of unique index constraint
> *
> * Returns InvalidTransactionId if there is no conflict, else an xact ID
> Index: src/backend/access/nbtree/nbtree.c
> ===================================================================
> RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/access/nbtree/nbtree.c,v
> retrieving revision 1.149
> diff -c -r1.149 nbtree.c
> *** src/backend/access/nbtree/nbtree.c 10 May 2006 23:18:39 -0000 1.149
> --- src/backend/access/nbtree/nbtree.c 9 Aug 2006 18:04:02 -0000
> ***************
> *** 228,233 ****
> --- 228,265 ----
> }
>
> /*
> + * btsuggestblock() -- find the best place in the heap to put a new tuple.
> + *
> + * This uses the same logic as btinsert to find the place where the index
> + * tuple would go if this was a btinsert call.
> + *
> + * There's room for improvement here. An insert operation will descend
> + * the tree twice, first by btsuggestblock, then by btinsert. Things
> + * might have changed in between, so that the heap tuple is actually
> + * not inserted in the optimal page, but since this is just an
> + * optimization, it's ok if it happens sometimes.
> + */
> + Datum
> + btsuggestblock(PG_FUNCTION_ARGS)
> + {
> + Relation rel = (Relation) PG_GETARG_POINTER(0);
> + Datum *values = (Datum *) PG_GETARG_POINTER(1);
> + bool *isnull = (bool *) PG_GETARG_POINTER(2);
> + Relation heapRel = (Relation) PG_GETARG_POINTER(3);
> + IndexTuple itup;
> + BlockNumber suggestion;
> +
> + /* generate an index tuple */
> + itup = index_form_tuple(RelationGetDescr(rel), values, isnull);
> +
> + suggestion =_bt_suggestblock(rel, itup, heapRel);
> +
> + pfree(itup);
> +
> + PG_RETURN_UINT32(suggestion);
> + }
> +
> + /*
> * btgettuple() -- Get the next tuple in the scan.
> */
> Datum
> Index: src/backend/executor/execMain.c
> ===================================================================
> RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/executor/execMain.c,v
> retrieving revision 1.277
> diff -c -r1.277 execMain.c
> *** src/backend/executor/execMain.c 31 Jul 2006 01:16:37 -0000 1.277
> --- src/backend/executor/execMain.c 8 Aug 2006 16:17:21 -0000
> ***************
> *** 892,897 ****
> --- 892,898 ----
> resultRelInfo->ri_RangeTableIndex = resultRelationIndex;
> resultRelInfo->ri_RelationDesc = resultRelationDesc;
> resultRelInfo->ri_NumIndices = 0;
> + resultRelInfo->ri_ClusterIndex = -1;
> resultRelInfo->ri_IndexRelationDescs = NULL;
> resultRelInfo->ri_IndexRelationInfo = NULL;
> /* make a copy so as not to depend on relcache info not changing... */
> ***************
> *** 1388,1394 ****
> heap_insert(estate->es_into_relation_descriptor, tuple,
> estate->es_snapshot->curcid,
> estate->es_into_relation_use_wal,
> ! false); /* never any point in using FSM */
> /* we know there are no indexes to update */
> heap_freetuple(tuple);
> IncrAppended();
> --- 1389,1396 ----
> heap_insert(estate->es_into_relation_descriptor, tuple,
> estate->es_snapshot->curcid,
> estate->es_into_relation_use_wal,
> ! false, /* never any point in using FSM */
> ! InvalidBlockNumber);
> /* we know there are no indexes to update */
> heap_freetuple(tuple);
> IncrAppended();
> ***************
> *** 1419,1424 ****
> --- 1421,1427 ----
> ResultRelInfo *resultRelInfo;
> Relation resultRelationDesc;
> Oid newId;
> + BlockNumber suggestedBlock;
>
> /*
> * get the heap tuple out of the tuple table slot, making sure we have a
> ***************
> *** 1467,1472 ****
> --- 1470,1479 ----
> if (resultRelationDesc->rd_att->constr)
> ExecConstraints(resultRelInfo, slot, estate);
>
> + /* Ask the index am of the clustered index for the
> + * best place to put it */
> + suggestedBlock = ExecSuggestBlock(slot, estate);
> +
> /*
> * insert the tuple
> *
> ***************
> *** 1475,1481 ****
> */
> newId = heap_insert(resultRelationDesc, tuple,
> estate->es_snapshot->curcid,
> ! true, true);
>
> IncrAppended();
> (estate->es_processed)++;
> --- 1482,1488 ----
> */
> newId = heap_insert(resultRelationDesc, tuple,
> estate->es_snapshot->curcid,
> ! true, true, suggestedBlock);
>
> IncrAppended();
> (estate->es_processed)++;
> Index: src/backend/executor/execUtils.c
> ===================================================================
> RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/executor/execUtils.c,v
> retrieving revision 1.139
> diff -c -r1.139 execUtils.c
> *** src/backend/executor/execUtils.c 4 Aug 2006 21:33:36 -0000 1.139
> --- src/backend/executor/execUtils.c 9 Aug 2006 18:05:05 -0000
> ***************
> *** 31,36 ****
> --- 31,37 ----
> * ExecOpenIndices \
> * ExecCloseIndices | referenced by InitPlan, EndPlan,
> * ExecInsertIndexTuples / ExecInsert, ExecUpdate
> + * ExecSuggestBlock Referenced by ExecInsert
> *
> * RegisterExprContextCallback Register function shutdown callback
> * UnregisterExprContextCallback Deregister function shutdown callback
> ***************
> *** 874,879 ****
> --- 875,881 ----
> IndexInfo **indexInfoArray;
>
> resultRelInfo->ri_NumIndices = 0;
> + resultRelInfo->ri_ClusterIndex = -1;
>
> /* fast path if no indexes */
> if (!RelationGetForm(resultRelation)->relhasindex)
> ***************
> *** 913,918 ****
> --- 915,925 ----
> /* extract index key information from the index's pg_index info */
> ii = BuildIndexInfo(indexDesc);
>
> + /* Remember which index is the clustered one.
> + * It's used to call the suggestblock-method on inserts */
> + if(indexDesc->rd_index->indisclustered)
> + resultRelInfo->ri_ClusterIndex = i;
> +
> relationDescs[i] = indexDesc;
> indexInfoArray[i] = ii;
> i++;
> ***************
> *** 1062,1067 ****
> --- 1069,1137 ----
> }
> }
>
> + /* ----------------------------------------------------------------
> + * ExecSuggestBlock
> + *
> + * This routine asks the index am where a new heap tuple
> + * should be placed.
> + * ----------------------------------------------------------------
> + */
> + BlockNumber
> + ExecSuggestBlock(TupleTableSlot *slot,
> + EState *estate)
> + {
> + ResultRelInfo *resultRelInfo;
> + int i;
> + Relation relationDesc;
> + Relation heapRelation;
> + ExprContext *econtext;
> + Datum values[INDEX_MAX_KEYS];
> + bool isnull[INDEX_MAX_KEYS];
> + IndexInfo *indexInfo;
> +
> + /*
> + * Get information from the result relation info structure.
> + */
> + resultRelInfo = estate->es_result_relation_info;
> + i = resultRelInfo->ri_ClusterIndex;
> + if(i == -1)
> + return InvalidBlockNumber; /* there was no clustered index */
> +
> + heapRelation = resultRelInfo->ri_RelationDesc;
> + relationDesc = resultRelInfo->ri_IndexRelationDescs[i];
> + indexInfo = resultRelInfo->ri_IndexRelationInfo[i];
> +
> + /* You can't cluster on a partial index */
> + Assert(indexInfo->ii_Predicate == NIL);
> +
> + /*
> + * We will use the EState's per-tuple context for evaluating
> + * index expressions (creating it if it's not already there).
> + */
> + econtext = GetPerTupleExprContext(estate);
> +
> + /* Arrange for econtext's scan tuple to be the tuple under test */
> + econtext->ecxt_scantuple = slot;
> +
> + /*
> + * FormIndexDatum fills in its values and isnull parameters with the
> + * appropriate values for the column(s) of the index.
> + */
> + FormIndexDatum(indexInfo,
> + slot,
> + estate,
> + values,
> + isnull);
> +
> + /*
> + * The index AM does the rest.
> + */
> + return index_suggestblock(relationDesc, /* index relation */
> + values, /* array of index Datums */
> + isnull, /* null flags */
> + heapRelation);
> + }
> +
> /*
> * UpdateChangedParamSet
> * Add changed parameters to a plan node's chgParam set
> Index: src/include/access/genam.h
> ===================================================================
> RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/access/genam.h,v
> retrieving revision 1.65
> diff -c -r1.65 genam.h
> *** src/include/access/genam.h 31 Jul 2006 20:09:05 -0000 1.65
> --- src/include/access/genam.h 9 Aug 2006 17:53:44 -0000
> ***************
> *** 93,98 ****
> --- 93,101 ----
> ItemPointer heap_t_ctid,
> Relation heapRelation,
> bool check_uniqueness);
> + extern BlockNumber index_suggestblock(Relation indexRelation,
> + Datum *values, bool *isnull,
> + Relation heapRelation);
>
> extern IndexScanDesc index_beginscan(Relation heapRelation,
> Relation indexRelation,
> ***************
> *** 123,128 ****
> --- 126,133 ----
> extern FmgrInfo *index_getprocinfo(Relation irel, AttrNumber attnum,
> uint16 procnum);
>
> + extern Datum dummysuggestblock(PG_FUNCTION_ARGS);
> +
> /*
> * index access method support routines (in genam.c)
> */
> Index: src/include/access/heapam.h
> ===================================================================
> RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/access/heapam.h,v
> retrieving revision 1.114
> diff -c -r1.114 heapam.h
> *** src/include/access/heapam.h 3 Jul 2006 22:45:39 -0000 1.114
> --- src/include/access/heapam.h 8 Aug 2006 16:17:21 -0000
> ***************
> *** 156,162 ****
> extern void setLastTid(const ItemPointer tid);
>
> extern Oid heap_insert(Relation relation, HeapTuple tup, CommandId cid,
> ! bool use_wal, bool use_fsm);
> extern HTSU_Result heap_delete(Relation relation, ItemPointer tid,
> ItemPointer ctid, TransactionId *update_xmax,
> CommandId cid, Snapshot crosscheck, bool wait);
> --- 156,162 ----
> extern void setLastTid(const ItemPointer tid);
>
> extern Oid heap_insert(Relation relation, HeapTuple tup, CommandId cid,
> ! bool use_wal, bool use_fsm, BlockNumber suggestedblk);
> extern HTSU_Result heap_delete(Relation relation, ItemPointer tid,
> ItemPointer ctid, TransactionId *update_xmax,
> CommandId cid, Snapshot crosscheck, bool wait);
> Index: src/include/access/hio.h
> ===================================================================
> RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/access/hio.h,v
> retrieving revision 1.32
> diff -c -r1.32 hio.h
> *** src/include/access/hio.h 13 Jul 2006 17:47:01 -0000 1.32
> --- src/include/access/hio.h 8 Aug 2006 16:17:21 -0000
> ***************
> *** 21,26 ****
> extern void RelationPutHeapTuple(Relation relation, Buffer buffer,
> HeapTuple tuple);
> extern Buffer RelationGetBufferForTuple(Relation relation, Size len,
> ! Buffer otherBuffer, bool use_fsm);
>
> #endif /* HIO_H */
> --- 21,26 ----
> extern void RelationPutHeapTuple(Relation relation, Buffer buffer,
> HeapTuple tuple);
> extern Buffer RelationGetBufferForTuple(Relation relation, Size len,
> ! Buffer otherBuffer, bool use_fsm, BlockNumber suggestedblk);
>
> #endif /* HIO_H */
> Index: src/include/access/nbtree.h
> ===================================================================
> RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/access/nbtree.h,v
> retrieving revision 1.103
> diff -c -r1.103 nbtree.h
> *** src/include/access/nbtree.h 7 Aug 2006 16:57:57 -0000 1.103
> --- src/include/access/nbtree.h 8 Aug 2006 16:17:21 -0000
> ***************
> *** 467,472 ****
> --- 467,473 ----
> extern Datum btbulkdelete(PG_FUNCTION_ARGS);
> extern Datum btvacuumcleanup(PG_FUNCTION_ARGS);
> extern Datum btoptions(PG_FUNCTION_ARGS);
> + extern Datum btsuggestblock(PG_FUNCTION_ARGS);
>
> /*
> * prototypes for functions in nbtinsert.c
> ***************
> *** 476,481 ****
> --- 477,484 ----
> extern Buffer _bt_getstackbuf(Relation rel, BTStack stack, int access);
> extern void _bt_insert_parent(Relation rel, Buffer buf, Buffer rbuf,
> BTStack stack, bool is_root, bool is_only);
> + extern BlockNumber _bt_suggestblock(Relation rel, IndexTuple itup,
> + Relation heapRel);
>
> /*
> * prototypes for functions in nbtpage.c
> Index: src/include/catalog/pg_am.h
> ===================================================================
> RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/catalog/pg_am.h,v
> retrieving revision 1.46
> diff -c -r1.46 pg_am.h
> *** src/include/catalog/pg_am.h 31 Jul 2006 20:09:05 -0000 1.46
> --- src/include/catalog/pg_am.h 8 Aug 2006 16:17:21 -0000
> ***************
> *** 65,70 ****
> --- 65,71 ----
> regproc amvacuumcleanup; /* post-VACUUM cleanup function */
> regproc amcostestimate; /* estimate cost of an indexscan */
> regproc amoptions; /* parse AM-specific parameters */
> + regproc amsuggestblock; /* suggest a block where to put a new heap tuple */
> } FormData_pg_am;
>
> /* ----------------
> ***************
> *** 78,84 ****
> * compiler constants for pg_am
> * ----------------
> */
> ! #define Natts_pg_am 23
> #define Anum_pg_am_amname 1
> #define Anum_pg_am_amstrategies 2
> #define Anum_pg_am_amsupport 3
> --- 79,85 ----
> * compiler constants for pg_am
> * ----------------
> */
> ! #define Natts_pg_am 24
> #define Anum_pg_am_amname 1
> #define Anum_pg_am_amstrategies 2
> #define Anum_pg_am_amsupport 3
> ***************
> *** 102,123 ****
> #define Anum_pg_am_amvacuumcleanup 21
> #define Anum_pg_am_amcostestimate 22
> #define Anum_pg_am_amoptions 23
>
> /* ----------------
> * initial contents of pg_am
> * ----------------
> */
>
> ! DATA(insert OID = 403 ( btree 5 1 1 t t t t f t btinsert btbeginscan btgettuple btgetmulti btrescan btendscan btmarkpos btrestrpos btbuild btbulkdelete btvacuumcleanup btcostestimate btoptions ));
> DESCR("b-tree index access method");
> #define BTREE_AM_OID 403
> ! DATA(insert OID = 405 ( hash 1 1 0 f f f f f f hashinsert hashbeginscan hashgettuple hashgetmulti hashrescan hashendscan hashmarkpos hashrestrpos hashbuild hashbulkdelete hashvacuumcleanup hashcostestimate hashoptions ));
> DESCR("hash index access method");
> #define HASH_AM_OID 405
> ! DATA(insert OID = 783 ( gist 100 7 0 f t t t t t gistinsert gistbeginscan gistgettuple gistgetmulti gistrescan gistendscan gistmarkpos gistrestrpos gistbuild gistbulkdelete gistvacuumcleanup gistcostestimate gistoptions ));
> DESCR("GiST index access method");
> #define GIST_AM_OID 783
> ! DATA(insert OID = 2742 ( gin 100 4 0 f f f f t f gininsert ginbeginscan gingettuple gingetmulti ginrescan ginendscan ginmarkpos ginrestrpos ginbuild ginbulkdelete ginvacuumcleanup gincostestimate ginoptions ));
> DESCR("GIN index access method");
> #define GIN_AM_OID 2742
>
> --- 103,125 ----
> #define Anum_pg_am_amvacuumcleanup 21
> #define Anum_pg_am_amcostestimate 22
> #define Anum_pg_am_amoptions 23
> + #define Anum_pg_am_amsuggestblock 24
>
> /* ----------------
> * initial contents of pg_am
> * ----------------
> */
>
> ! DATA(insert OID = 403 ( btree 5 1 1 t t t t f t btinsert btbeginscan btgettuple btgetmulti btrescan btendscan btmarkpos btrestrpos btbuild btbulkdelete btvacuumcleanup btcostestimate btoptions btsuggestblock));
> DESCR("b-tree index access method");
> #define BTREE_AM_OID 403
> ! DATA(insert OID = 405 ( hash 1 1 0 f f f f f f hashinsert hashbeginscan hashgettuple hashgetmulti hashrescan hashendscan hashmarkpos hashrestrpos hashbuild hashbulkdelete hashvacuumcleanup hashcostestimate hashoptions dummysuggestblock));
> DESCR("hash index access method");
> #define HASH_AM_OID 405
> ! DATA(insert OID = 783 ( gist 100 7 0 f t t t t t gistinsert gistbeginscan gistgettuple gistgetmulti gistrescan gistendscan gistmarkpos gistrestrpos gistbuild gistbulkdelete gistvacuumcleanup gistcostestimate gistoptions dummysuggestblock));
> DESCR("GiST index access method");
> #define GIST_AM_OID 783
> ! DATA(insert OID = 2742 ( gin 100 4 0 f f f f t f gininsert ginbeginscan gingettuple gingetmulti ginrescan ginendscan ginmarkpos ginrestrpos ginbuild ginbulkdelete ginvacuumcleanup gincostestimate ginoptions dummysuggestblock ));
> DESCR("GIN index access method");
> #define GIN_AM_OID 2742
>
> Index: src/include/catalog/pg_proc.h
> ===================================================================
> RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/catalog/pg_proc.h,v
> retrieving revision 1.420
> diff -c -r1.420 pg_proc.h
> *** src/include/catalog/pg_proc.h 6 Aug 2006 03:53:44 -0000 1.420
> --- src/include/catalog/pg_proc.h 9 Aug 2006 18:06:44 -0000
> ***************
> *** 682,687 ****
> --- 682,689 ----
> DESCR("btree(internal)");
> DATA(insert OID = 2785 ( btoptions PGNSP PGUID 12 f f t f s 2 17 "1009 16" _null_ _null_ _null_ btoptions - _null_ ));
> DESCR("btree(internal)");
> + DATA(insert OID = 2852 ( btsuggestblock PGNSP PGUID 12 f f t f v 4 23 "2281 2281 2281 2281" _null_ _null_ _null_ btsuggestblock - _null_ ));
> + DESCR("btree(internal)");
>
> DATA(insert OID = 339 ( poly_same PGNSP PGUID 12 f f t f i 2 16 "604 604" _null_ _null_ _null_ poly_same - _null_ ));
> DESCR("same as?");
> ***************
> *** 3936,3941 ****
> --- 3938,3946 ----
> DATA(insert OID = 2749 ( arraycontained PGNSP PGUID 12 f f t f i 2 16 "2277 2277" _null_ _null_ _null_ arraycontained - _null_ ));
> DESCR("anyarray contained");
>
> + DATA(insert OID = 2853 ( dummysuggestblock PGNSP PGUID 12 f f t f v 4 23 "2281 2281 2281 2281" _null_ _null_ _null_ dummysuggestblock - _null_ ));
> + DESCR("dummy amsuggestblock implementation (internal)");
> +
> /*
> * Symbolic values for provolatile column: these indicate whether the result
> * of a function is dependent *only* on the values of its explicit arguments,
> Index: src/include/executor/executor.h
> ===================================================================
> RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/executor/executor.h,v
> retrieving revision 1.128
> diff -c -r1.128 executor.h
> *** src/include/executor/executor.h 4 Aug 2006 21:33:36 -0000 1.128
> --- src/include/executor/executor.h 8 Aug 2006 16:17:21 -0000
> ***************
> *** 271,276 ****
> --- 271,277 ----
> extern void ExecCloseIndices(ResultRelInfo *resultRelInfo);
> extern void ExecInsertIndexTuples(TupleTableSlot *slot, ItemPointer tupleid,
> EState *estate, bool is_vacuum);
> + extern BlockNumber ExecSuggestBlock(TupleTableSlot *slot, EState *estate);
>
> extern void RegisterExprContextCallback(ExprContext *econtext,
> ExprContextCallbackFunction function,
> Index: src/include/nodes/execnodes.h
> ===================================================================
> RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/nodes/execnodes.h,v
> retrieving revision 1.158
> diff -c -r1.158 execnodes.h
> *** src/include/nodes/execnodes.h 4 Aug 2006 21:33:36 -0000 1.158
> --- src/include/nodes/execnodes.h 8 Aug 2006 16:17:21 -0000
> ***************
> *** 257,262 ****
> --- 257,264 ----
> * NumIndices # of indices existing on result relation
> * IndexRelationDescs array of relation descriptors for indices
> * IndexRelationInfo array of key/attr info for indices
> + * ClusterIndex index into the IndexRelationInfo array for the
> + * clustered index, or -1 if there's none
> * TrigDesc triggers to be fired, if any
> * TrigFunctions cached lookup info for trigger functions
> * TrigInstrument optional runtime measurements for triggers
> ***************
> *** 272,277 ****
> --- 274,280 ----
> int ri_NumIndices;
> RelationPtr ri_IndexRelationDescs;
> IndexInfo **ri_IndexRelationInfo;
> + int ri_ClusterIndex;
> TriggerDesc *ri_TrigDesc;
> FmgrInfo *ri_TrigFunctions;
> struct Instrumentation *ri_TrigInstrument;
> Index: src/include/utils/rel.h
> ===================================================================
> RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/utils/rel.h,v
> retrieving revision 1.91
> diff -c -r1.91 rel.h
> *** src/include/utils/rel.h 3 Jul 2006 22:45:41 -0000 1.91
> --- src/include/utils/rel.h 8 Aug 2006 16:17:21 -0000
> ***************
> *** 116,121 ****
> --- 116,122 ----
> FmgrInfo amvacuumcleanup;
> FmgrInfo amcostestimate;
> FmgrInfo amoptions;
> + FmgrInfo amsuggestblock;
> } RelationAmInfo;
>
>
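
To make the new interface easier to follow, the essential call sequence in
ExecInsert after this patch is roughly the following (a condensed sketch
pieced together from the hunks above, not a verbatim excerpt):

    BlockNumber suggestedBlock;

    /*
     * Ask the clustered index, if the table has one, which heap block
     * would be the best place for the new tuple.  Returns
     * InvalidBlockNumber when there is no clustered index.
     */
    suggestedBlock = ExecSuggestBlock(slot, estate);

    /*
     * heap_insert tries the suggested block first, and falls back to
     * normal FSM-based placement if that block has no free space.
     */
    newId = heap_insert(resultRelationDesc, tuple,
                        estate->es_snapshot->curcid,
                        true, true, suggestedBlock);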

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +