Re: Grouped Index Tuples / Clustered Indexes

Lists: pgsql-hackers
From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Grouped Index Tuples / Clustered Indexes
Date: 2007-03-07 10:32:35
Message-ID: 45EE94C3.3090906@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

I've updated the GIT patch at http://community.enterprisedb.com/git/.
Bitrot caused by the findinsertloc-patch has been fixed, making that
part of the GIT patch a little bit smaller and cleaner. I also did some
refactoring, and minor cleanup and commenting.

Any comments on the design or patch? For your convenience, I copied the
same text I added to access/nbtree/README to
http://community.enterprisedb.com/git/git-readme.txt

Should we start playing the name game at this point? I've been thinking
we should call this feature just Clustered Indexes, even though it's not
exactly the same thing as clustered indexes in other DBMSs. From user
point of view, they behave similarly enough that it may be best to use
the existing term.

As a next step, I'm hoping to get the indexam API changes from the
bitmap index patch committed soon, and in a way that supports GIT as well.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Grzegorz Jaskiewicz <gj(at)pointblue(dot)com(dot)pl>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Grouped Index Tuples / Clustered Indexes
Date: 2007-03-07 11:53:08
Message-ID: 200703071253.09046@gj-laptop
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

my only question would be.
Why isn't that in core already ?


From: "Luke Lonergan" <llonergan(at)greenplum(dot)com>
To: "Grzegorz Jaskiewicz" <gj(at)pointblue(dot)com(dot)pl>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Grouped Index Tuples / Clustered Indexes
Date: 2007-03-07 14:02:01
Message-ID: C2143009.28B5D%llonergan@greenplum.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

+1

On 3/7/07 6:53 AM, "Grzegorz Jaskiewicz" <gj(at)pointblue(dot)com(dot)pl> wrote:

> my only question would be.
> Why isn't that in core already ?
>
> ---------------------------(end of broadcast)---------------------------
> TIP 9: In versions below 8.0, the planner will ignore your desire to
> choose an index scan if your joining column's datatypes do not
> match
>


From: "Simon Riggs" <simon(at)2ndquadrant(dot)com>
To: "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>
Cc: <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Grouped Index Tuples / Clustered Indexes
Date: 2007-03-07 21:59:26
Message-ID: 1173304766.3641.14.camel@silverbirch.site
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, 2007-03-07 at 10:32 +0000, Heikki Linnakangas wrote:
> I've been thinking
> we should call this feature just Clustered Indexes

Works for me.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com


From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: "Simon Riggs" <simon(at)2ndquadrant(dot)com>
Cc: "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Grouped Index Tuples / Clustered Indexes
Date: 2007-03-08 01:01:51
Message-ID: 878xe88s5c.fsf@stark.xeocode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


> On Wed, 2007-03-07 at 10:32 +0000, Heikki Linnakangas wrote:
>> I've been thinking
>> we should call this feature just Clustered Indexes

So we would have "clustered tables" which are tables whose heap is ordered
according to an index and separately "clustered indexes" which are indexes
optimized for such tables?

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com


From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: Gregory Stark <stark(at)enterprisedb(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Grouped Index Tuples / Clustered Indexes
Date: 2007-03-11 11:22:19
Message-ID: 45F3E66B.2020100@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Gregory Stark wrote:
>> On Wed, 2007-03-07 at 10:32 +0000, Heikki Linnakangas wrote:
>>> I've been thinking
>>> we should call this feature just Clustered Indexes
>
> So we would have "clustered tables" which are tables whose heap is ordered
> according to an index and separately "clustered indexes" which are indexes
> optimized for such tables?

Yes, that's what I was thinking.

There's a third related term in use as well. When you issue CLUSTER, the
table will be clustered on an index. And that index is then the "index
the table is clustered on". That's a bit cumbersome but that's the
terminology we're using at the moment. Maybe we should to come up with a
new term for that to avoid confusion..

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: "Florian G(dot) Pflug" <fgp(at)phlo(dot)org>
To: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc: Gregory Stark <stark(at)enterprisedb(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Grouped Index Tuples / Clustered Indexes
Date: 2007-03-11 18:06:11
Message-ID: 45F44513.5090505@phlo.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Heikki Linnakangas wrote:
> There's a third related term in use as well. When you issue CLUSTER, the
> table will be clustered on an index. And that index is then the "index
> the table is clustered on". That's a bit cumbersome but that's the
> terminology we're using at the moment. Maybe we should to come up with a
> new term for that to avoid confusion..

This reminds me of something i've been wondering about for quite some
time. Why is it that one has to write "cluster <index> on <table>",
and not "cluster <table> on <index>"?

To me, the second variant would seem more logical, but then I'm
not a native english speaker...

I'm not suggesting that this should be changed, I'm just wondering
why it is the way it is.

greetings, Florian Pflug


From: "Simon Riggs" <simon(at)2ndquadrant(dot)com>
To: "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>
Cc: "Gregory Stark" <stark(at)enterprisedb(dot)com>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Grouped Index Tuples / Clustered Indexes
Date: 2007-03-11 19:54:56
Message-ID: 1173642897.3641.465.camel@silverbirch.site
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sun, 2007-03-11 at 11:22 +0000, Heikki Linnakangas wrote:
> Gregory Stark wrote:
> >> On Wed, 2007-03-07 at 10:32 +0000, Heikki Linnakangas wrote:
> >>> I've been thinking
> >>> we should call this feature just Clustered Indexes
> >
> > So we would have "clustered tables" which are tables whose heap is ordered
> > according to an index and separately "clustered indexes" which are indexes
> > optimized for such tables?
>
> Yes, that's what I was thinking.
>
> There's a third related term in use as well. When you issue CLUSTER, the
> table will be clustered on an index. And that index is then the "index
> the table is clustered on". That's a bit cumbersome but that's the
> terminology we're using at the moment. Maybe we should to come up with a
> new term for that to avoid confusion..

First thought: we can use the term "cluster*ing* index" for CLUSTER and
use the term "clustered" to refer to what has happened to the table and
the index. That will probably be confused with high availability
clustering, so perhaps not.

Better thought: say that CLUSTER requires an "order-defining index".
That better explains the point that it is the table being clustered,
using the index to define the physical order of the rows in the heap. We
then use the word "clustered" to refer to what has happened to the
table, and with this patch, for the index also.

That way we can have new syntax for CLUSTER

CLUSTER table ORDER BY indexname

which is then the preferred syntax, rather than the perverse

CLUSTER index ON table

which gives the wrong impression about what is happening, since it is
the table that is changed, not the index.

- - -

- Are you suggesting that we have an explicit new syntax

CREATE [UNIQUE] CLUSTERED INDEX [CONCURRENTLY] fooidx ON foo (....) ...

or just that we refer to this feature as Clustered Indexes?

- Do we still need the index WITH option, in either case?

- Do you think that all Primary Keys should be clustered?

- Are you thinking to rename docs, catalog etc to reflect the new
naming/meaning?

My thinking would be: CLUSTERED, no, yes, yes
but I'd like to know what you think?

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com


From: "Simon Riggs" <simon(at)2ndquadrant(dot)com>
To: "Florian G(dot) Pflug" <fgp(at)phlo(dot)org>
Cc: "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>, "Gregory Stark" <stark(at)enterprisedb(dot)com>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Grouped Index Tuples / Clustered Indexes
Date: 2007-03-11 19:56:48
Message-ID: 1173643008.3641.468.camel@silverbirch.site
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sun, 2007-03-11 at 19:06 +0100, Florian G. Pflug wrote:
> Heikki Linnakangas wrote:
> > There's a third related term in use as well. When you issue CLUSTER, the
> > table will be clustered on an index. And that index is then the "index
> > the table is clustered on". That's a bit cumbersome but that's the
> > terminology we're using at the moment. Maybe we should to come up with a
> > new term for that to avoid confusion..
>
> This reminds me of something i've been wondering about for quite some
> time. Why is it that one has to write "cluster <index> on <table>",
> and not "cluster <table> on <index>"?
>
> To me, the second variant would seem more logical, but then I'm
> not a native english speaker...
>
> I'm not suggesting that this should be changed, I'm just wondering
> why it is the way it is.

No idea, but I agree it conveys exactly the opposite view of what
happens when the command is issued.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com


From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Gregory Stark <stark(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Grouped Index Tuples / Clustered Indexes
Date: 2007-03-12 12:53:09
Message-ID: 45F54D35.7090204@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Simon Riggs wrote:
> Better thought: say that CLUSTER requires an "order-defining index".
> That better explains the point that it is the table being clustered,
> using the index to define the physical order of the rows in the heap. We
> then use the word "clustered" to refer to what has happened to the
> table, and with this patch, for the index also.
>
> That way we can have new syntax for CLUSTER
>
> CLUSTER table ORDER BY indexname
>
> which is then the preferred syntax, rather than the perverse
>
> CLUSTER index ON table
>
> which gives the wrong impression about what is happening, since it is
> the table that is changed, not the index.

I like that, "order-defining index" conveys the point pretty well.

> - Are you suggesting that we have an explicit new syntax
>
> CREATE [UNIQUE] CLUSTERED INDEX [CONCURRENTLY] fooidx ON foo (....) ...
>
> or just that we refer to this feature as Clustered Indexes?

I'm not proposing new syntax, just a WITH-parameter. Makes more sense to
me that way, the clusteredness has no user-visible effects except
performance, and it's b-tree specific (though I guess you could apply
the same concept to other indexams as well).

> - Do you think that all Primary Keys should be clustered?

No. There's a significant CPU overhead when the index and table are in
memory and you're doing simple one-row lookups. And there's no promise
that a table is physically in primary key order anyway.

There might be some interesting cases where we could enable it
automatically. I've been thinking that if you explicitly CLUSTER a
table, the order-defining index would definitely benefit from being a
clustered index. If it's small enough that it fits in memory, there's no
point in running CLUSTER in the first place. And if you run CLUSTER, we
know it's in order. That seems like a pretty safe bet.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: "Florian G(dot) Pflug" <fgp(at)phlo(dot)org>, Heikki Linnakangas <heikki(at)enterprisedb(dot)com>, Gregory Stark <stark(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Grouped Index Tuples / Clustered Indexes
Date: 2007-03-19 16:49:53
Message-ID: 200703191649.l2JGnrs24251@momjian.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Simon Riggs wrote:
> On Sun, 2007-03-11 at 19:06 +0100, Florian G. Pflug wrote:
> > Heikki Linnakangas wrote:
> > > There's a third related term in use as well. When you issue CLUSTER, the
> > > table will be clustered on an index. And that index is then the "index
> > > the table is clustered on". That's a bit cumbersome but that's the
> > > terminology we're using at the moment. Maybe we should to come up with a
> > > new term for that to avoid confusion..
> >
> > This reminds me of something i've been wondering about for quite some
> > time. Why is it that one has to write "cluster <index> on <table>",
> > and not "cluster <table> on <index>"?
> >
> > To me, the second variant would seem more logical, but then I'm
> > not a native english speaker...
> >
> > I'm not suggesting that this should be changed, I'm just wondering
> > why it is the way it is.
>
> No idea, but I agree it conveys exactly the opposite view of what
> happens when the command is issued.

We got the syntax from Berkely, and it has always seemed backwards to me
too.

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>, Gregory Stark <stark(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Grouped Index Tuples / Clustered Indexes
Date: 2007-03-24 23:12:22
Message-ID: 200703242312.l2ONCMq28736@momjian.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


Added to TODO:

o Add more logical syntax CLUSTER table ORDER BY index;
support current syntax for backward compatibility

---------------------------------------------------------------------------

Simon Riggs wrote:
> On Sun, 2007-03-11 at 11:22 +0000, Heikki Linnakangas wrote:
> > Gregory Stark wrote:
> > >> On Wed, 2007-03-07 at 10:32 +0000, Heikki Linnakangas wrote:
> > >>> I've been thinking
> > >>> we should call this feature just Clustered Indexes
> > >
> > > So we would have "clustered tables" which are tables whose heap is ordered
> > > according to an index and separately "clustered indexes" which are indexes
> > > optimized for such tables?
> >
> > Yes, that's what I was thinking.
> >
> > There's a third related term in use as well. When you issue CLUSTER, the
> > table will be clustered on an index. And that index is then the "index
> > the table is clustered on". That's a bit cumbersome but that's the
> > terminology we're using at the moment. Maybe we should to come up with a
> > new term for that to avoid confusion..
>
> First thought: we can use the term "cluster*ing* index" for CLUSTER and
> use the term "clustered" to refer to what has happened to the table and
> the index. That will probably be confused with high availability
> clustering, so perhaps not.
>
> Better thought: say that CLUSTER requires an "order-defining index".
> That better explains the point that it is the table being clustered,
> using the index to define the physical order of the rows in the heap. We
> then use the word "clustered" to refer to what has happened to the
> table, and with this patch, for the index also.
>
> That way we can have new syntax for CLUSTER
>
> CLUSTER table ORDER BY indexname
>
> which is then the preferred syntax, rather than the perverse
>
> CLUSTER index ON table
>
> which gives the wrong impression about what is happening, since it is
> the table that is changed, not the index.
>
> - - -
>
> - Are you suggesting that we have an explicit new syntax
>
> CREATE [UNIQUE] CLUSTERED INDEX [CONCURRENTLY] fooidx ON foo (....) ...
>
> or just that we refer to this feature as Clustered Indexes?
>
> - Do we still need the index WITH option, in either case?
>
> - Do you think that all Primary Keys should be clustered?
>
> - Are you thinking to rename docs, catalog etc to reflect the new
> naming/meaning?
>
> My thinking would be: CLUSTERED, no, yes, yes
> but I'd like to know what you think?
>
> --
> Simon Riggs
> EnterpriseDB http://www.enterprisedb.com
>
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 9: In versions below 8.0, the planner will ignore your desire to
> choose an index scan if your joining column's datatypes do not
> match

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Grouped Index Tuples / Clustered Indexes
Date: 2007-04-02 22:25:48
Message-ID: 200704022225.l32MPmJ10165@momjian.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


Your patch has been added to the PostgreSQL unapplied patches list at:

http://momjian.postgresql.org/cgi-bin/pgpatches

It will be applied as soon as one of the PostgreSQL committers reviews
and approves it.

---------------------------------------------------------------------------

Heikki Linnakangas wrote:
> I've updated the GIT patch at http://community.enterprisedb.com/git/.
> Bitrot caused by the findinsertloc-patch has been fixed, making that
> part of the GIT patch a little bit smaller and cleaner. I also did some
> refactoring, and minor cleanup and commenting.
>
> Any comments on the design or patch? For your convenience, I copied the
> same text I added to access/nbtree/README to
> http://community.enterprisedb.com/git/git-readme.txt
>
> Should we start playing the name game at this point? I've been thinking
> we should call this feature just Clustered Indexes, even though it's not
> exactly the same thing as clustered indexes in other DBMSs. From user
> point of view, they behave similarly enough that it may be best to use
> the existing term.
>
> As a next step, I'm hoping to get the indexam API changes from the
> bitmap index patch committed soon, and in a way that supports GIT as well.
>
> --
> Heikki Linnakangas
> EnterpriseDB http://www.enterprisedb.com
>
> ---------------------------(end of broadcast)---------------------------
> TIP 5: don't forget to increase your free space map settings

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +