Quick Links

Re: proposal : cross-column stats

From:	Nathan Boley <npboley(at)gmail(dot)com>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Tomas Vondra <tv(at)fuzzy(dot)cz>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Martijn van Oosterhout <kleptog(at)svana(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: proposal : cross-column stats
Date:	2010-12-13 03:10:44
Message-ID:	AANLkTinuY9PKxX_f1zHcifjbTL3ZPhyQOTQ+tLqs2iMx@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

>> The proposed solution is based on contingency tables, built for selected
>> groups of columns (not for each possible group). And the contingency
>> table gives you the ability to estimate the probabilities needed to
>> compute the selectivity. Or am I missing something?
>
> Well, I'm not real familiar with contingency tables, but it seems like
> you could end up needing to store a huge amount of data to get any
> benefit out of it, in some cases. For example, in the United States,
> there are over 40,000 postal codes, and some even larger number of
> city names, and doesn't the number of entries go as O(m*n)?

Not to mention that the model parameters will be impossible to estimate well.

( I've only scanned the thread, so sorry if Im repeating something
that's already been said )

My intuition is that storing the covariance structure will be
unworkable both technically and statistically for all but the simplest
of cases. That being said, I dont think the problem is a lack of ways
to parameterize the covariance structure ( there are several papers on
mutli-dim histogram estimators, at least one of whichtalks about
databases explicitly, not to mention approaches like CART[1] ) , but a
complete lack of infrastructure to do anything with the estimates. So
keep working on it ;-) ( but try to make the framework general enough
to allow better estimators ).

I wonder if a good first step might be to focus on the AND and OR
operators since they seem like the most common cases and union and
intersection commute ( although it's important to remember that the
estimate variances do NOT ) That is, what about estimating the
selectivity of the condition

WHERE X=A and Y=B

f(A,B) = x_1*(selectivity(X = A) + selectivity( Y = B )) +
x_2*selectivity(X = A)*selectivity( Y = B ) + x_3

where x_{1,2,3} are parameters to be estimated from the data.

Another quick note: I think that storing the full contingency table is
wasteful since the marginals are already stored in the single column
statistics. Look at copulas [2] ( FWIW I think that Josh Tolley was
looking at this a couple years back ).

Best,
Nathan

[1] http://en.wikipedia.org/wiki/Classification_and_regression_tree
[2] http://en.wikipedia.org/wiki/Copula_%28statistics%29

In response to

Re: proposal : cross-column stats at 2010-12-13 02:28:48 from Robert Haas

Responses

Re: proposal : cross-column stats at 2010-12-13 17:59:01 from Joshua Tolley

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Robert Haas	2010-12-13 03:20:52	Re: Instrument checkpoint sync calls
Previous Message	Fujii Masao	2010-12-13 03:06:01	Re: PS display and standby query conflict