Re: multi-column index

From: Manfred Koizar <mkoi-pg(at)aon(dot)at>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: David Brown <time(at)bigpond(dot)net(dot)au>, pgsql-performance(at)postgresql(dot)org
Subject: Re: multi-column index
Date: 2005-03-18 10:42:23
Message-ID: 968l3192dup7lukgkeaggoj1nlbdq8q3up@email.aon.at
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

On Thu, 17 Mar 2005 13:15:32 -0500, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>I am coming around to the view that we really do need to calculate
>index-specific correlation numbers,

Correlation is a first step. We might also want distribution
information like number of distinct index tuples and histograms.

>Now, as to the actual mechanics of getting the numbers: the above link
>seems to imply reading the whole index in index order.

That turned out to be surprisingly easy (no need to look at data values,
no operator lookup, etc.) to implement as a proof of concept. As it's
good enough for my use cases I never bothered to change it.

> Which is a
>hugely expensive proposition for a big index,

Just a thought: Could the gathering of the sample be integrated into
the bulk delete phase of VACUUM? (I know, ANALYSE is not always
performed as an option to VACUUM, and VACUUM might not even have to
delete any index tuples.)

> We need a way
>to get the number from a small sample of pages.

I had better (or at least different) ideas at that time, like walking
down the tree, but somehow lost impetus :-(

>The idea I was toying with was to recalculate the index keys for the
>sample rows that ANALYZE already acquires, and then compare/sort
>those.

This seems to be the approach that perfectly fits into what we have now.

> This is moderately expensive CPU-wise though, and it's also not
>clear what "compare/sort" means for non-btree indexes.

Nothing. We'd need some notion of "clusteredness" instead of
correlation. C.f. my answer to Ron in this thread.

BTW, the more I think about it, the more I come to the conclusion that
when the planner starts to account for "clusteredness", random page cost
has to be raised.

Servus
Manfred

In response to

Browse pgsql-performance by date

  From Date Subject
Next Message Brian O'Reilly 2005-03-18 23:21:02 BUG #1552: massive performance hit between 7.4 and 8.0.1
Previous Message Manfred Koizar 2005-03-18 10:34:03 Re: multi-column index