Re: The need for clustered indexes to boost TPC-V performance

From: Claudio Freire <klaussfreire(at)gmail(dot)com>
To: Reza Taheri <rtaheri(at)vmware(dot)com>
Cc: "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>, "Andy Bond (abond(at)redhat(dot)com)" <abond(at)redhat(dot)com>, Greg Kopczynski <gregwk(at)vmware(dot)com>, Jignesh Shah <jshah(at)vmware(dot)com>
Subject: Re: The need for clustered indexes to boost TPC-V performance
Date: 2012-07-04 14:26:36
Message-ID: CAGTBQpY4J2on5=gN5-ykSpZkxVeGHAfiZwu0mXpqsSXwoxxnPg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general pgsql-performance

On Tue, Jul 3, 2012 at 8:13 PM, Reza Taheri <rtaheri(at)vmware(dot)com> wrote:
> So I looked more closely at the indexes. I chose the CASH_TRANSACTION
> table since it has a single index, and we can compare it more directly to the
> Dell data. If you look at page 34 of http://bit.ly/QeWXhE, the index size of CT
> is 1,278,720KB for 6,120,529,488 rows. That’s less than one byte of index
> per data row! How could that be? Well, MS SQL used a “clustered index”
> for CT, i.e., the data is held in the leaf pages of the index B-Tree.
> The data and index are in one data structure. Once you lookup the index,
> you also have the data at zero additional cost. For PGSQL, we had to create
> a regular index, which took up 55GB. Once you do the math, this works out
> to around 30 bytes per row. I imagine we have the 15-byte key along with a
> couple of 4-byte or 8-byte pointers.
...
> So MS SQL beats PGSQL by a) having a lower I/O rate due to no competition
> for the buffer pool from indexes (except for secondary indexes); and b) by
> getting the data with a free lookup, whereas we have to work our way down
> both the index and the data trees.

15-byte key?

What about not storing the keys, but a hash, for leaf nodes?

Assuming it can be made to work for both "range" and "equality" scans,
holding only hashes on leaf nodes would reduce index size, but how
much?

I think it's doable, and I could come up with a spec if it's worth it.
It would have to scan the heap for only two extra index pages (the
extremes that cannot be ruled out) and hash collisions, which doesn't
seem like a big loss versus the reduced index.

In response to

Browse pgsql-general by date

  From Date Subject
Next Message Josh Kupershmidt 2012-07-04 18:33:04 Re: TG_COLUMNS_UPDATED
Previous Message Bartosz Dmytrak 2012-07-04 14:20:01 Re: View parsing

Browse pgsql-performance by date

  From Date Subject
Next Message Reza Taheri 2012-07-04 18:24:08 Re: Introducing the TPC-V benchmark, and its relationship to PostgreSQL
Previous Message Daniel Farina 2012-07-04 13:39:38 Re: The need for clustered indexes to boost TPC-V performance