Quick Links

Re: Equivalent praxis to CLUSTERED INDEX?

Lists:	pgsql-performance

From:	Mischa Sandberg <mischa(dot)sandberg(at)telus(dot)net>
To:	pgsql-performance(at)postgresql(dot)org
Subject:	Equivalent praxis to CLUSTERED INDEX?
Date:	2004-08-25 05:28:42
Message-ID:	esVWc.44164$jZ5.31613@clgrps13
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-performance

Coming from the MSSQL world, I'm used to the first step in optimization
to be, choose your clustered index and choose it well.

I see that PG has a one-shot CLUSTER command, but doesn't support
continuously-updated clustered indexes.

What I infer from newsgroup browsing is, such an index is impossible,
given the MVCC versioning of records (happy to learn I'm wrong).

I'd be curious to know what other people, who've crossed this same
bridge from MSSQL or Oracle or Sybase to PG, have devised,
faced with the same kind of desired performance gain for retrieving
blocks of rows with the same partial key.

From:	Christopher Kings-Lynne <chriskl(at)familyhealth(dot)com(dot)au>
To:	Mischa Sandberg <mischa(dot)sandberg(at)telus(dot)net>
Cc:	pgsql-performance(at)postgresql(dot)org
Subject:	Re: Equivalent praxis to CLUSTERED INDEX?
Date:	2004-08-26 01:16:17
Message-ID:	412D39E1.2040304@familyhealth.com.au
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-performance

> I see that PG has a one-shot CLUSTER command, but doesn't support
> continuously-updated clustered indexes.
>
> What I infer from newsgroup browsing is, such an index is impossible,
> given the MVCC versioning of records (happy to learn I'm wrong).
>
> I'd be curious to know what other people, who've crossed this same
> bridge from MSSQL or Oracle or Sybase to PG, have devised,
> faced with the same kind of desired performance gain for retrieving
> blocks of rows with the same partial key.

I just run clusterdb each weekend in a cron job...

Chris

From:	"J(dot) Andrew Rogers" <jrogers(at)neopolitan(dot)com>
To:	pgsql-performance(at)postgresql(dot)org
Cc:	mischa(dot)sandberg(at)telus(dot)net
Subject:	Re: Equivalent praxis to CLUSTERED INDEX?
Date:	2004-08-26 18:14:32
Message-ID:	1093544071.349.106.camel@vulture.corp.neopolitan.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-performance

On Tue, 2004-08-24 at 22:28, Mischa Sandberg wrote:
> I see that PG has a one-shot CLUSTER command, but doesn't support
> continuously-updated clustered indexes.
>
> What I infer from newsgroup browsing is, such an index is impossible,
> given the MVCC versioning of records (happy to learn I'm wrong).

It is possible to have MVCC and ordered/indexed heaps, but it isn't
something you can just tack onto the currently supported types -- I
looked into this myself. It would take substantial additional code
infrastructure to support it, basically an alternative heap system and
adding support for tables with odd properties to many parts of the
system. Pretty non-trivial.

This is probably my #1 "I wish postgres had this feature" feature. It
is a serious scalability enhancer for big systems and a pain to work
around not having.

> I'd be curious to know what other people, who've crossed this same
> bridge from MSSQL or Oracle or Sybase to PG, have devised,
> faced with the same kind of desired performance gain for retrieving
> blocks of rows with the same partial key.

The CLUSTER command is often virtually useless for precisely the kinds
of tables that need to be clustered. My databases are on-line 24x7, and
the tables that are ideal candidates for clustering are in the range of
50-100 million rows. I can afford to lock these tables up for no more
than 5-10 minutes during off-peak in the hopes that no one notices, and
CLUSTER does not work remotely in the ballpark of that fast for tables
of that size. People who can run CLUSTER in a cron job must either have
relatively small tables or regular large maintenance windows.

My solution, which may or may not work for you, was to write a table
partitioning system using the natural flexibility and programmability of
postgresql (e.g. table inheritance). From this I automatically get a
roughly ordered heap according to the index I would cluster on, with
only slightly funky SQL access. The end result works much better with
CLUSTER too, though CLUSTER is much less necessary at that point
because, at least for my particular purposes, the rows are mostly
ordered due to how the data was partitioned.

So there are ways to work around CLUSTER, but you'll have to be clever
and it will require tailoring the solution to your particular
requirements.

J. Andrew Rogers

From:	Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To:	"J(dot) Andrew Rogers" <jrogers(at)neopolitan(dot)com>
Cc:	pgsql-performance(at)postgresql(dot)org, mischa(dot)sandberg(at)telus(dot)net
Subject:	Re: Equivalent praxis to CLUSTERED INDEX?
Date:	2004-08-26 18:18:53
Message-ID:	200408261818.i7QIIrr09640@candle.pha.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-performance

How do vendors actually implement auto-clustering? I assume they move
rows around during quiet periods or have lots of empty space in each
value bucket.

---------------------------------------------------------------------------

J. Andrew Rogers wrote:
> On Tue, 2004-08-24 at 22:28, Mischa Sandberg wrote:
> > I see that PG has a one-shot CLUSTER command, but doesn't support
> > continuously-updated clustered indexes.
> >
> > What I infer from newsgroup browsing is, such an index is impossible,
> > given the MVCC versioning of records (happy to learn I'm wrong).
>
>
> It is possible to have MVCC and ordered/indexed heaps, but it isn't
> something you can just tack onto the currently supported types -- I
> looked into this myself. It would take substantial additional code
> infrastructure to support it, basically an alternative heap system and
> adding support for tables with odd properties to many parts of the
> system. Pretty non-trivial.
>
> This is probably my #1 "I wish postgres had this feature" feature. It
> is a serious scalability enhancer for big systems and a pain to work
> around not having.
>
>
> > I'd be curious to know what other people, who've crossed this same
> > bridge from MSSQL or Oracle or Sybase to PG, have devised,
> > faced with the same kind of desired performance gain for retrieving
> > blocks of rows with the same partial key.
>
>
> The CLUSTER command is often virtually useless for precisely the kinds
> of tables that need to be clustered. My databases are on-line 24x7, and
> the tables that are ideal candidates for clustering are in the range of
> 50-100 million rows. I can afford to lock these tables up for no more
> than 5-10 minutes during off-peak in the hopes that no one notices, and
> CLUSTER does not work remotely in the ballpark of that fast for tables
> of that size. People who can run CLUSTER in a cron job must either have
> relatively small tables or regular large maintenance windows.
>
>
> My solution, which may or may not work for you, was to write a table
> partitioning system using the natural flexibility and programmability of
> postgresql (e.g. table inheritance). From this I automatically get a
> roughly ordered heap according to the index I would cluster on, with
> only slightly funky SQL access. The end result works much better with
> CLUSTER too, though CLUSTER is much less necessary at that point
> because, at least for my particular purposes, the rows are mostly
> ordered due to how the data was partitioned.
>
> So there are ways to work around CLUSTER, but you'll have to be clever
> and it will require tailoring the solution to your particular
> requirements.
>
>
> J. Andrew Rogers
>
>
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 6: Have you searched our list archives?
>
> http://archives.postgresql.org
>

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073

From:	Josh Berkus <josh(at)agliodbs(dot)com>
To:	Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>, "J(dot) Andrew Rogers" <jrogers(at)neopolitan(dot)com>
Cc:	pgsql-performance(at)postgresql(dot)org, mischa(dot)sandberg(at)telus(dot)net
Subject:	Re: Equivalent praxis to CLUSTERED INDEX?
Date:	2004-08-26 18:32:30
Message-ID:	200408261132.30732.josh@agliodbs.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-performance

Bruce,

> How do vendors actually implement auto-clustering? I assume they move
> rows around during quiet periods or have lots of empty space in each
> value bucket.

That's how SQL Server does it. In old versions (6.5) you had to manually
send commands to update the cluster, same as PG. Also, when you create a
cluster (or an index or table for that matter) you can manually set an amount
of "space" to be held open on each data page for updates.

Also keep in mind that SQL Server, as a "single-user database" has a much
easier time with this. They don't have to hold several versions of an index
in memory and collapse it into a single version at commit time.

All that being said, we could do a better job of "auto-balancing" clustered
tables. I believe that someone was working on this in Hackers through what
they called "B-Tree Tables". What happened to that?

--
Josh Berkus
Aglio Database Solutions
San Francisco

From:	"J(dot) Andrew Rogers" <jrogers(at)neopolitan(dot)com>
To:	Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc:	pgsql-performance(at)postgresql(dot)org, mischa(dot)sandberg(at)telus(dot)net
Subject:	Re: Equivalent praxis to CLUSTERED INDEX?
Date:	2004-08-26 19:04:48
Message-ID:	1093547088.349.134.camel@vulture.corp.neopolitan.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-performance

On Thu, 2004-08-26 at 11:18, Bruce Momjian wrote:
> How do vendors actually implement auto-clustering? I assume they move
> rows around during quiet periods or have lots of empty space in each
> value bucket.

As far as I know, Oracle does it by having a B-Tree organized heap (a
feature introduced around v8 IIRC), basically making the primary key
index and the heap the same physical structure. Any non-index columns
are stored in the index along with the index columns. Implementing it
is slightly weird because searching the index and selecting the rows
from the heap are not separate operations.

The major caveat to having tables of this type is that you can only have
a primary key index. No other indexes are possible because the "heap"
constantly undergoes local reorganizations if you have a lot of write
traffic, the same kind of reorganization you would normally expect in a
BTree index.

The performance improvements come from two optimizations. First, you
have to touch significantly fewer blocks to get all the rows, even
compared to a CLUSTERed heap. Second, the footprint is smaller and
plays nicely with the buffer cache.

When I've used these types of heaps in Oracle 8 on heavily used tables
with tens of millions of rows, we frequently got a 10x or better
performance improvement on queries against those tables. It is only
really useful for tables with vast quantities of relatively small rows,
but it can be a lifesaver in those cases.

J. Andrew Rogers

From:	Mischa Sandberg <ischamay(dot)andbergsay(at)activestateway(dot)com>
To:	pgsql-performance(at)postgresql(dot)org
Subject:	Re: Equivalent praxis to CLUSTERED INDEX?
Date:	2004-08-26 21:00:47
Message-ID:	3csXc.56326$X12.25148@edtnps84
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-performance

Ummm ... not quite. In MSSQL/Sybase/Oracle, a clustered index maintains
its space saturation as part of each update operation. High activity
does indeed result in less-full pages (typically 60-80% full for tables
with heavy deletions or rowsize changes). To bring the percentage back
up, you run DBCC INDEXDEFRAG, which also does what you'd expect of a
normal file defragmenter -- put related disk pages together on the platter.

But the performance difference is hardly as severe as I gather it can be
if you neglect to vacuum.

As for SQL Server being a 'single-user database' ... ummm ... no, I
don't think so. I'm REALLY happy to be shut of the Microsoft world, but
MSSQL 7/2000/2005 is a serious big DB engine. It also has some serious
bright heads behind it. They hired Goetz Graefe and Paul (aka Per-Ake)
Larsen away from academia, and it shows, in the join and aggregate
processing. I'll be a happy camper if I manage to contribute something
to PG that honks the way their stuff does. Happy to discuss, too.

Josh Berkus wrote:
> Bruce,
>
>
>>How do vendors actually implement auto-clustering? I assume they move
>>rows around during quiet periods or have lots of empty space in each
>>value bucket.
>
>
> That's how SQL Server does it. In old versions (6.5) you had to manually
> send commands to update the cluster, same as PG. Also, when you create a
> cluster (or an index or table for that matter) you can manually set an amount
> of "space" to be held open on each data page for updates.
>
> Also keep in mind that SQL Server, as a "single-user database" has a much
> easier time with this. They don't have to hold several versions of an index
> in memory and collapse it into a single version at commit time.
>
> All that being said, we could do a better job of "auto-balancing" clustered
> tables. I believe that someone was working on this in Hackers through what
> they called "B-Tree Tables". What happened to that?
>

From:	Gaetano Mendola <mendola(at)bigfoot(dot)com>
To:	Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Subject:	Re: Equivalent praxis to CLUSTERED INDEX?
Date:	2004-08-26 21:09:35
Message-ID:	412E518F.8090805@bigfoot.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-performance

Bruce Momjian wrote:
> How do vendors actually implement auto-clustering? I assume they move
> rows around during quiet periods or have lots of empty space in each
> value bucket.
>
> ---------------------------------------------------------------------------

IIRC informix doesn't have it, and you have to recluster periodically
the table. After having clustered the table with an index in order to
recluster the table with another index you have to release the previous
one ( ALTER index TO NOT CLUSTER ), the CLUSTER is an index attribute and
each table can have only one index with that attribute ON.

Regards
Gaetano Mendola

From:	Mischa Sandberg <ischamay(dot)andbergsay(at)activestateway(dot)com>
To:	pgsql-performance(at)postgresql(dot)org
Subject:	Re: Equivalent praxis to CLUSTERED INDEX?
Date:	2004-08-26 21:46:13
Message-ID:	FSsXc.56332$X12.9135@edtnps84
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-performance

Sheer nitpick here...

A B-tree is where the records (data) live at all levels of the tree;
B+ tree is where the records are only at the leaf level.
That's what Knuth calls them, anyway.

Clustered indexes for all known dbs are true B+ trees.
Nonclustered indexes could be B-trees (probably aren't),
since there's no big fanout penalty for storing the little
(heap) row locators everywhere at all levels.

J. Andrew Rogers wrote:
> As far as I know, Oracle does it by having a B-Tree organized heap (a
> feature introduced around v8 IIRC), basically making the primary key
> index and the heap the same physical structure.
...

From:	Josh Berkus <josh(at)agliodbs(dot)com>
To:	ischamay(dot)andbergsay(at)activestateway(dot)com
Cc:	pgsql-performance(at)postgresql(dot)org
Subject:	Re: Equivalent praxis to CLUSTERED INDEX?
Date:	2004-08-29 18:59:01
Message-ID:	200408291159.01205.josh@agliodbs.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-performance

Mishca,

> Ummm ... not quite. In MSSQL/Sybase/Oracle, a clustered index maintains
> its space saturation as part of each update operation. High activity
> does indeed result in less-full pages (typically 60-80% full for tables
> with heavy deletions or rowsize changes). To bring the percentage back
> up, you run DBCC INDEXDEFRAG, which also does what you'd expect of a
> normal file defragmenter -- put related disk pages together on the platter.

Sure, it does now, which is a nice thing. It didn't in the first version
(6.5) where this cluster maint needed to be done manually and asynchronously,
as I recall.

> As for SQL Server being a 'single-user database' ... ummm ... no, I
> don't think so.

Hmmm ... perhaps it would be better if I said "serial-only database". MSSQL
(like earlier versions of Sybase) handles transactions by spooling everything
out to a serial log, effectively making all transcations SERIAL isolation.
This has some significant benefits in performance for OLAP and data
warehousing, but really kills you on high-concurrency transaction processing.

> I'm REALLY happy to be shut of the Microsoft world, but
> MSSQL 7/2000/2005 is a serious big DB engine. It also has some serious
> bright heads behind it. They hired Goetz Graefe and Paul (aka Per-Ake)
> Larsen away from academia, and it shows, in the join and aggregate
> processing. I'll be a happy camper if I manage to contribute something
> to PG that honks the way their stuff does. Happy to discuss, too.

Yeah, they also have very speedy cursor handling. I can do things with
cursors in MSSQL that I wouldn't consider with other DBs. Not that that
makes up for the lack of other functionality, but it is nice when you need
it.

--
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco

From:	"Jim C(dot) Nasby" <decibel(at)decibel(dot)org>
To:	"J(dot) Andrew Rogers" <jrogers(at)neopolitan(dot)com>
Cc:	Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>, pgsql-performance(at)postgresql(dot)org, mischa(dot)sandberg(at)telus(dot)net
Subject:	Re: Equivalent praxis to CLUSTERED INDEX?
Date:	2004-08-31 21:50:13
Message-ID:	20040831215013.GJ78395@decibel.org
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-performance

On Thu, Aug 26, 2004 at 12:04:48PM -0700, J. Andrew Rogers wrote:
> The major caveat to having tables of this type is that you can only have
> a primary key index. No other indexes are possible because the "heap"
> constantly undergoes local reorganizations if you have a lot of write
> traffic, the same kind of reorganization you would normally expect in a
> BTree index.

This isn't true, at least in 9i. You can create whatever indexes you
want on an index-organized table. I believe that the index stores the PK
value instead of the ROWID.
--
Jim C. Nasby, Database Consultant decibel(at)decibel(dot)org
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"

From:	Mischa Sandberg <ischamay(dot)andbergsay(at)activestateway(dot)com>
To:	pgsql-performance(at)postgresql(dot)org
Subject:	Re: Equivalent praxis to CLUSTERED INDEX?
Date:	2004-09-08 15:38:51
Message-ID:	fIF%c.159841$X12.48929@edtnps84
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-performance

Mischa Sandberg wrote:
> Coming from the MSSQL world, I'm used to the first step in optimization
> to be, choose your clustered index and choose it well.
> I see that PG has a one-shot CLUSTER command, but doesn't support
> continuously-updated clustered indexes.
> What I infer from newsgroup browsing is, such an index is impossible,
> given the MVCC versioning of records (happy to learn I'm wrong).
> I'd be curious to know what other people, who've crossed this same
> bridge from MSSQL or Oracle or Sybase to PG, have devised,
> faced with the same kind of desired performance gain for retrieving
> blocks of rows with the same partial key.

Just to let people know, after trying various options, this looks the
most promising:

- segment the original table into four tables (call them A,B,C,D)

- all insertions go into A.
- longterm data lives in B.

- primary keys of all requests to delete rows from (B) go into D -- no
actual deletions are done against B. Deletions against A happen as normal.

- all queries are made against a view: a union of A and B and (not
exists) D.

- daily merge A,B and (where not exists...) D, into C
- run cluster on C, then swap names on B and C, truncate A and D.

Not rocket science, but it seems to give the payback of normal
clustering without locking the table for long periods of time. It also
saves on VACUUM FULL time.

At present, we're only at 1M rows in B on this. More when I know it.
Advance warning on any gotchas with this approach would be much
appreciated. Making a complete copy of (B) is a bit of an ouch.