Re: proposal : cross-column stats

From: Tomas Vondra <tv(at)fuzzy(dot)cz>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: proposal : cross-column stats
Date: 2010-12-13 19:22:38
Message-ID: 4D06727E.7030009@fuzzy.cz
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Dne 13.12.2010 18:59, Joshua Tolley napsal(a):
> On Sun, Dec 12, 2010 at 07:10:44PM -0800, Nathan Boley wrote:
>> Another quick note: I think that storing the full contingency table is
>> wasteful since the marginals are already stored in the single column
>> statistics. Look at copulas [2] ( FWIW I think that Josh Tolley was
>> looking at this a couple years back ).
>
> Josh Tolley still looks at it occasionally, though time hasn't permitted any
> sort of significant work for quite some time. The multicolstat branch on my
> git.postgresql.org repository will create an empirical copula each
> multi-column index, and stick it in pg_statistic. It doesn't yet do anything
> useful with that information, nor am I convinced it's remotely bug-free. In a
> brief PGCon discussion with Tom a while back, it was suggested a good place
> for the planner to use these stats would be clausesel.c, which is responsible
> for handling code such as "...WHERE foo > 4 AND foo > 5".

Well, that's good news ;-)

I've done a bit of research today, and I've found some quite interesting
papers on this topic (probably, I did not have time to read them, in
most cases I've read just the title and abstract).

[1] Selectivity estimators for multidimensional range queries over real
attributes
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.122.914

This seems like a very good starting point. AFAIK it precisely
describes what data need to be collected, how to do the math etc.

[2] Selectivity Estimation Without the Attribute Value Independence
Assumption
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.105.8126

This obviously deals with the independence problem. Haven't
investigated it further, but it seems worth to read.

[3] On Analyzing the Cost of Queries with Multi-Attribute Restrictions
and Sort Operations (A Cost Function for Uniformly Partitioned
UB-Trees)
http://mistral.in.tum.de/results/publications/MB00_ideas.pdf

Describes something called UB-Tree, and shows how it may be used to
do estimates. Might be interesting as an alternative to the
traditional histograms.

There are more details about UB-Trees at

http://mistral.in.tum.de/results/publications/

[4] http://www.dbnet.ece.ntua.gr/~nikos/edith/qopt_bibl/

A rather nice collection of papers related to estimation (including
some of the papers listed above).

Hm, I planned to finally read the "Understanding MySQL Internals" over
the Xmas ... that obviously won't happen.

regards
Tomas

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2010-12-13 19:49:34 Re: CommitFest wrap-up
Previous Message David Fetter 2010-12-13 19:19:07 Re: CommitFest wrap-up