Skip site navigation (1) Skip section navigation (2)

Peripheral Links

Header And Logo

PostgreSQL
| The world's most advanced open source database.

Site Navigation

Search for
  Advanced Search

Re: [PERFORM] Bad n_distinct estimation; hacks suggested?


  • From: Josh Berkus <josh(at)agliodbs(dot)com>
  • To: Andrew Dunstan <andrew(at)dunslane(dot)net>
  • Cc: gsql-perform <pgsql-performance(at)postgresql(dot)org>, pgsql-hackers(at)postgresql(dot)org
  • Subject: Re: [PERFORM] Bad n_distinct estimation; hacks suggested?
  • Date: Sun, 24 Apr 2005 11:30:50 -0700
  • Message-id: <200504241130(dot)50218(dot)josh(at)agliodbs(dot)com>

Andrew,

> The math in the paper does not seem to look at very low levels of q (=
> sample to pop ratio).

Yes, I think that's the failing.   Mind you, I did more testing and found out 
that for D/N ratios of 0.1 to 0.3, the formula only works within 5x accuracy 
(which I would consider acceptable) with a sample size of 25% or more (which 
is infeasable in any large table).    The formula does work for populations 
where D/N is much lower, say 0.01.  So overall it seems to only work for 1/4 
of cases; those where n/N is large and D/N is low.   And, annoyingly, that's 
probably the population where accurate estimation is least crucial, as it 
consists mostly of small tables.

I've just developed (not original, probably, but original to *me*) a formula 
that works on populations where n/N is very small and D/N is moderate (i.e. 
0.1 to 0.4):

N * (d/n)^(sqrt(N/n))

However, I've tested it only on (n/N < 0.005 and D/N > 0.1 and D/N < 0.4) 
populations, and only 3 of them to boot.   I'd appreciate other people trying 
it on their own data populations, particularly very different ones, like D/N 
> 0.7 or D/N < 0.01.

Further, as Andrew points out we presumably do page sampling rather than 
purely random sampling so I should probably read the paper he referenced.  
Working on it now ....

-- 
Josh Berkus
Aglio Database Solutions
San Francisco



Home | Main Index | Thread Index

Privacy Policy | PostgreSQL Archives hosted by Command Prompt, Inc. | Designed by tinysofa
Copyright © 1996 – 2008 PostgreSQL Global Development Group