Re: estimating # of distinct values

From:	Tomas Vondra <tv(at)fuzzy(dot)cz>
To:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: estimating # of distinct values
Date:	2011-01-20 21:51:01
Message-ID:	4D38AE45.5070101@fuzzy.cz
Lists:	pgsql-hackers

On 20.1.2011 09:10, Heikki Linnakangas wrote:
> It seems that the suggested multi-column selectivity estimator would be
> more sensitive to ndistinct of the individual columns. Is that correct?
> How is it biased? If we routinely under-estimate ndistinct of individual
> columns, for example, does the bias accumulate or cancel itself in the
> multi-column estimate?
>
> I'd like to see some testing of the suggested selectivity estimator with
> the ndistinct estimates we have. Who knows, maybe it works fine in
> practice.

The estimator for two columns and query 'A=a AND B=b' is about

0.5 * (dist(A)/dist(A,B) * Prob(A=a) + dist(B)/dist(A,B) * Prob(B=b))

so it's quite simple. It's not that sensitive to errors in the ndistinct
estimates for individual columns; the problem is in the multi-column
ndistinct estimates. It's very likely that with dependent columns (e.g.
with the ZIP codes / cities) the distribution is so pathological that
the sampling-based estimate will be far off.
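
As a rough illustration of the formula above, here is a minimal Python
sketch. It is not code from PostgreSQL or from any patch; the function
name, the parameter names and the ZIP/city numbers are all made up for
the example. It assumes 1000 ZIP codes and 100 cities, each ZIP code
belonging to exactly one city, and uniform per-value probabilities.

```python
def estimate_selectivity(prob_a, prob_b, ndistinct_a, ndistinct_b, ndistinct_ab):
    """Estimate the selectivity of 'A=a AND B=b' using
    0.5 * (dist(A)/dist(A,B) * Prob(A=a) + dist(B)/dist(A,B) * Prob(B=b)).

    Hypothetical helper: names and signature are illustrative only.
    """
    if ndistinct_ab <= 0:
        raise ValueError("ndistinct_ab must be positive")
    term_a = ndistinct_a / ndistinct_ab * prob_a
    term_b = ndistinct_b / ndistinct_ab * prob_b
    return 0.5 * (term_a + term_b)


# Made-up example: A = ZIP code (1000 values), B = city (100 values).
# Each ZIP code maps to one city, so the true dist(A,B) is 1000.
good = estimate_selectivity(1 / 1000, 1 / 100, 1000, 100, 1000)

# Same per-column statistics, but dist(A,B) over-estimated as
# 1000 * 100, i.e. as if the two columns were independent.
bad = estimate_selectivity(1 / 1000, 1 / 100, 1000, 100, 1000 * 100)

print(good)        # about 0.001
print(bad)         # about 0.00001
print(good / bad)  # about 100
```

Both terms are divided by dist(A,B), so with the per-column inputs held
fixed, an error of factor k in the multi-column ndistinct changes the
result by the same factor k. In this made-up case the over-estimate makes
the selectivity 100 times too low.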

I guess this analysis was far too short, but if you can provide more
details about the tests you expect, I'll be happy to provide the results.

regards
Tomas

In response to

Re: estimating # of distinct values at 2011-01-20 08:10:22 from Heikki Linnakangas
