Re: Optimizer improvements: to do or not to do?

From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Say42 <andrews42(at)yandex(dot)ru>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Optimizer improvements: to do or not to do?
Date: 2006-09-12 09:53:51
Message-ID: 877j09v2a8.fsf@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Simon Riggs <simon(at)2ndquadrant(dot)com> writes:

>> That's what I want to do:
>> 1. Replace not very useful indexCorrelation with indexClustering.
>
> An opinion such as "not very useful" isn't considered sufficient
> explanation or justification for a change around here.

There's been some previous discussion about how "correlation" was not really
what we wanted to be measuring. But that discussion was in regards to
cross-column "correlation". In that case we're trying to predict how selective
a clause will be. If we read x% of the table due to a restriction on X what
percentage of the values of Y will be represented?

In this case I think we do need to know correlation or something like it.
That's because what we're trying to predict is how close to sequential the i/o
accesses will be. If there's no correlation between index order and disk order
then they'll be random. If they're highly correlated then accesses will be
close to sequential.

It's possible there's some sort of "block-wise correlated" measure which would
be even better for our needs. We don't care if all the high values are towards
the start and low values towards the end as long as each section is in order,
for example.

It's also possible that we could use something like what you describe to
predict how many physical i/os will happen altogether. If the table is highly
clustered but disordered then the io will be random access but the cache will
be more effective than if the table is highly correlated but not clustered
(though it would take a large table to make that possible I think).

In short I think what's needed is someone to review a lot of different stats
metrics for correlation and clustering and do some analysis of how each would
be useful for cost modelling.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Gregory Stark 2006-09-12 10:10:02 Re: Fixed length data types issue
Previous Message Heikki Linnakangas 2006-09-12 09:43:48 Bitmap index status