Skip site navigation (1) Skip section navigation (2)

Peripheral Links

Header And Logo

PostgreSQL
| The world's most advanced open source database.

Site Navigation

Search for
  Advanced Search

Re: Feature suggestion : FAST CLUSTER



On May 27, 2007, at 12:34 PM, PFC wrote:
On Sun, 27 May 2007 17:53:38 +0200, Jim C. Nasby <decibel(at)decibel(dot)org> wrote:
On Tue, May 22, 2007 at 09:29:00AM +0200, PFC wrote:
This does not run a complete sort on the table. It would be about as fast as your seq scan disk throughput. Obviously, the end result is not as good as a real CLUSTER since the table will be made up of several ordered chunks and a range lookup. Therefore, a range lookup on the clustered columns would need at most N seeks, versus 1 for a really clustered table. But it only scans the table once and writes it once, even counting index
rebuild.

Do you have any data that indicates such an arrangement would be
substantially better than less-clustered data?
While the little benchmark that will answer your question is running, I'll add a few comments :

I have been creating a new benchmark for PostgreSQL and MySQL, that I will call the Forum Benchmark. It mimics the activity of a forum. So far, I have got interesting results about Postgres and InnoDB and will publish an extensive report with lots of nasty stuff in it, in, say, 2 weeks, since I'm doing this in spare time.

Anyway, forums like clustered tables, specifically clusteriing posts on (topic_id, post_id), in order to be able to display a page with one disk seek, instead of one seek per post. PostgreSQL humiliates InnoDB on CPU-bound workloads (about 2x faster since I run it on dual core ; InnoDB uses only one core). However, InnoDB can automatically cluster tables without maintenance. This means InnoDB will, even though it sucks and is awfully bloated, run a lot faster than postgres if things become IO- bound, ie. if the dataset is larger than RAM. Postgres needs to cluster the posts table in order to keep going. CLUSTER is very slow. I tried inserting into a new posts table, ordering by (post_id, topic_id), then renaming the new table in place of the old. It is faster, but still slow when handling lots of data. I am trying other approaches, some quite hack-ish, and will report my findings.

I assume you meant topic_id, post_id. :)

The problem with your proposal is that it does nothing to ensure that posts for a topic stay together as soon as the table is large enough that you can't sort it in a single pass. If you've got a long-running thread, it's still going to get spread out throughout the table.

What you really want is CLUSTER CONCURRENTLY, which I believe is on the TODO list. BUT... there's another caveat here: for any post where the row ends up being larger than 2k, the text is going to get TOASTed anyway, which means it's going to be in a separate table, in a different ordering. I don't know of a good way to address that; you can cluster the toast table, but you'll be clustering on an OID, which isn't going to help you.
--
Jim Nasby                                            jim(at)nasby(dot)net
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)





Home | Main Index | Thread Index

Privacy Policy | PostgreSQL Archives hosted by Command Prompt, Inc. | Designed by tinysofa
Copyright © 1996 – 2008 PostgreSQL Global Development Group