Re: proposal : cross-column stats

From: Tomas Vondra <tv(at)fuzzy(dot)cz>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: proposal : cross-column stats
Date: 2010-12-13 22:45:36
Message-ID: 4D06A210.8070009@fuzzy.cz
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Dne 13.12.2010 22:50, Josh Berkus napsal(a):
> Tomas,
>
>> (a) find out what statistics do we need to collect and how to use it
>> (b) implement a really stupid inefficient solution
>> (c) optimize in iterations, i.e. making it faster, consuming less
>> space etc.
>
> I'll suggest again how to decide *which* columns to cross: whichever
> columns are combined in composite indexes. In version 2, allow the DBA
> to specify combinations.
>
> In the unlikely event that correlation could be reduced to a single
> float number, it would be conceivable for each column to have an array
> of correlation stats for every other column where correlation was
> non-random; on most tables (i.e. ones with less than 100 columns) we're
> not talking about that much storage space.
>
> The main cost would be the time spent collecting that info ...

I think this is a bit early to discuss this, given the fact that we
don't have a working solution yet. But OK, let's discuss these options
anyway

1) collecting the automatically for composite indexes

I don't think this is wise idea. The first versions definitely won't
be very efficient, and collecting the data for each composite
index means everyone will be hit by this inefficiency, even if he
actually does not need that (e.g. the columns are independent so the
current estimates are quite accurate or he's not using those columns
very often in the same WHERE clause).

Another reason against this is that many DBAs don't actually use
composed indexes - they simply create indexes on each column and let
the bitmap index scan to work it out. And this would not work for
this case.

And actually it's not very complicated to allow the DBA to do this,
this can be a quite simple PL/pgSQL procedure.

2) collecting correlation for each pair of columns

Again, you're effectively forcing everyone to pay the price even
though he may not need the feature. Maybe we'll get there one day,
but it's not a good idea to do that from the beginning.

And the correlation itself has a very limited use in real life, as
it's not possible to compute it for character columns and is not
very useful in case of some numerical columns (e.g. ZIP codes).

regards
Tomas

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Pavel Stehule 2010-12-13 22:48:15 Re: proposal : cross-column stats
Previous Message Pavel Stehule 2010-12-13 22:42:54 Re: hstores in pl/python