Re: ANALYZE sampling is too good

From: Florian Pflug <fgp(at)phlo(dot)org>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Greg Stark <stark(at)mit(dot)edu>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: ANALYZE sampling is too good
Date: 2013-12-12 14:39:36
Message-ID: 34A68E9C-6716-4A01-BFCF-2815F0B738D8@phlo.org
Lists: pgsql-hackers

Here's an analysis of Jeff Janes' simple example of a table where our
n_distinct estimate is way off.

On Dec11, 2013, at 00:03 , Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:
> create table baz as select floor(random()*10000000), md5(random()::text) from generate_series(1,100000000);
> create table baz2 as select * from baz order by floor;
> create table baz3 as select * from baz order by md5(floor::text);
>
> baz unclustered, baz2 is clustered with perfect correlation, baz3 is clustered but without correlation.
>
> After analyzing all of them:
>
> select tablename, n_distinct,correlation from pg_stats where tablename like 'baz%' and attname='floor' ;
> tablename | n_distinct | correlation
> -----------+-------------+-------------
> baz | 8.56006e+06 | 0.00497713
> baz2 | 333774 | 1
> baz3 | 361048 | -0.0118147

I think I understand what's going on here. I worked with a reduced test case
of 1e7 rows containing random values between 0 and 5e5, and instrumented
analyze to print the raw ndistinct and nmultiple values of the sample
population (i.e. the number of distinct values in the sample, and the number
of distinct values which appeared more than once). I've also considered only
baz and baz2, and thus removed the then-unnecessary md5 column. To account for
the reduced table sizes, I adjusted default_statistics_target to 10 instead of
100. The resulting estimates are then

 tablename | n_distinct (est) | n_distinct (act)
-----------+------------------+------------------
 baz       |           391685 |           500000
 baz2      |            36001 |           500000
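
The reduced test case boils down to something like this (a sketch - I'm
reusing Jeff's table names for the smaller tables):

  -- sketch: approximates the reduced setup described above
  create table baz as
    select floor(random()*500000) from generate_series(1,10000000);
  create table baz2 as
    select * from baz order by floor;

  set default_statistics_target = 10;
  analyze baz;
  analyze baz2;

  select tablename, n_distinct from pg_stats
   where tablename in ('baz', 'baz2') and attname = 'floor';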

ANALYZE assumes that both tables contain 10000048 rows and samples 3000 of
those.

The sample of baz contains 2989 distinct values, 11 of which appear more than
once. The sample of baz2 contains 2878 distinct values, 117 (!) of which
appear more than once.

The very different results stem from the Duj1 estimator we use. It estimates
n_distinct by computing n*d/(n - f1 + f1*n/N), where n is the number of
sampled rows, N the total number of rows, d the number of distinct values in
the sample, and f1 the number of those values that occur exactly once in the
sample. If all samples are unique (i.e. n = d = f1) this yields N. But as soon
as f1 drops below d, the result falls quickly - sampling baz2 yields 117
non-unique values out of 2878 distinct ones (roughly 4%), and the estimate is
already less than a tenth of the actual number of distinct values.
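
Plugging the sampled numbers back into that formula reproduces the estimates
from the table above (N = 10000048 is the row count ANALYZE assumed, and f1 is
d minus the number of multiply-occurring values):

  -- baz:  n = 3000, d = 2989, f1 = 2989 - 11  = 2978
  select 3000 * 2989 / (3000 - 2978 + 2978 * 3000.0 / 10000048);  -- ~391685

  -- baz2: n = 3000, d = 2878, f1 = 2878 - 117 = 2761
  select 3000 * 2878 / (3000 - 2761 + 2761 * 3000.0 / 10000048);  -- ~36001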

Which leaves us with the question of why sampling baz2 produces more duplicate
values than sampling baz does. Now, if we used block sampling, that behaviour
would be unsurprising - baz2 is sorted, so each block very likely contains
each of its values more than once, since the row count exceeds the number of
possible values by more than an order of magnitude. In fact, with block
sampling, we'd probably see f1 values close to 0, and thus our estimate of
n_distinct would be roughly equal to the number of distinct values in the
*sample* population, i.e. about 300 or so.
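
To illustrate: with f1 = 0 the formula degenerates to n*d/n = d, so the
estimate just echoes whatever the sample happened to contain (using the
ballpark figure of 300 distinct sampled values from above):

  select 3000 * 300 / (3000 - 0 + 0 * 3000.0 / 10000048);  -- = 300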

So why does Vitter's algorithm fail here? Given that we see inflated numbers
of duplicate values in our sample, yet still far fewer than block-based
sampling would yield, my guess is that it's the initial reservoir filling that
bites us here. After that initial filling step, the reservoir contains a lot of
consecutive rows, i.e. a block-based sample taken from just a few blocks. If
the replacement phase that follows somehow uses a slightly smaller replacement
probability than it should, more of these rows will survive replacement,
resulting in exactly the kind of inflated numbers of duplicate values we're
seeing. I've yet to validate this by looking at the reservoir before and after
the replacement stage, though.
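
A crude way to check the direction of this effect (just a simulation of the
suspected behaviour, not what ANALYZE actually does - the 300/2700 split is
an arbitrary guess) is to mix a chunk of physically consecutive baz2 rows
with an otherwise random sample and look at the duplicate counts:

  with sample as (
      -- stand-in for reservoir rows taken from a few consecutive blocks
      -- (relies on a seq scan returning rows in physical order)
      (select floor from baz2 limit 300)
      union all
      -- stand-in for rows chosen during the replacement phase
      (select floor from baz2 order by random() limit 2700)
  ),
  counts as (
      select floor, count(*) as cnt from sample group by floor
  )
  select count(*)                                 as d,
         sum(case when cnt > 1 then 1 else 0 end) as nmultiple
  from counts;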

So at least for the purpose of estimating n_distinct, our current sampling
method seems to exhibit the worst rather than the best properties of block-
and row-based sampling. What conclusions to draw from that, I'm not sure yet -
other than that if we move to block-based sampling, we'll certainly have to
change the way we estimate n_distinct.

best regards,
Florian Pflug
