Re: ANALYZE sampling is too good

From: Gavin Flower <GavinFlower(at)archidevsys(dot)co(dot)nz>
To: Greg Stark <stark(at)mit(dot)edu>, Josh Berkus <josh(at)agliodbs(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: ANALYZE sampling is too good
Date: 2013-12-08 02:31:33
Message-ID: 52A3DA05.7040506@archidevsys.co.nz
Lists: pgsql-hackers

On 08/12/13 10:27, Greg Stark wrote:
> On Sat, Dec 7, 2013 at 8:25 PM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
>> The only approach which makes sense is to base it on a % of the table.
>> In fact, pretty much every paper which has examined statistics
>> estimation for database tables has determined that any estimate based on
>> a less-than-5% sample is going to be wildly inaccurate. Not that 5%
>> samples are 100% accurate, but at least they fit the 80/20 rule.
> This is nonsense. The math behind the deductions the planner makes is
> well understood and we know how to estimate the precision based on the
> sample size. There are some questions that really need a proportional
> sample such as n_distinct (and as Robert points out the MCV list) but
> the main motivation of the sample size is the histograms and those are
> used to answer questions which very clearly do not need a proportional
> sample. The statistics is very clear there. Using a proportional
> sample for the histograms would just be silly. It would be
> substituting a gut feeling for real statistics.
>
> The problems with using a proportional sample for things like
> n_distinct or the MCV list is that you're talking about sorting or
> hashing an unboundedly large set of values and storing an unboundedly
> large array in the statistics table. I don't think that would be
> tenable without dramatically changing the way we process and store that
> data to be more scalable. Using a lossy counting algorithm and
> something more scalable than toasted arrays would be prerequisites I
> think.
>
> And as Robert mentions even if we solved those problems it wouldn't
> help n_distinct. You really need to read all the values to deal with
> n_distinct.
>
>> This is the reason why implementing block-based sampling is critical;
>> using our current "take one row out of every page" method, sampling 5%
>> of the table means scanning the whole thing in most tables. We also
>> need to decouple the number of MCVs we keep from the sample size.
>> Certainly our existing sampling algo seems designed to maximize IO for
>> the sample size.
> This would be helpful if you could specify what you mean by
> "black-based sampling". The reason these previous pleas didn't go
> anywhere is not because we can't do math. It's because of the lack of
> specifics here. What we do *today* is called "block-based sampling" by
> the literature.
>
> What I'm saying is we need to change the "block-based sampling" that
> we do to extract more rows per block. We currently extract the
> "correct" number of rows to get a strong guarantee of uniformity. If
> we could get a weaker guarantee of being "close enough" to uniform
> samples for the deductions we want to make and get many more rows per
> block then we could read a lot fewer blocks.
>
> Or to put it another way people could increase the statistics target
> dramatically and still be reading the same number of blocks as today.
> In an ideal world perhaps we could have people reading 1% of the
> blocks they read today and get statistics targets 10x better than
> today.
>
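
To make that sample-size point concrete, here is a rough simulation (mine,
not from the thread; the lognormal column and the numbers are only
illustrative) showing that the error of a sampled quantile estimate tracks
the absolute sample size, not the fraction of the table sampled:

# Rough illustration: quantile-estimate error vs. table size for a
# fixed absolute sample size (synthetic data, numbers illustrative).
import numpy as np

rng = np.random.default_rng(0)

def quantile_error(table_rows, sample_rows, q=0.9, trials=50):
    """Mean absolute error of the q-quantile estimated from a random sample."""
    column = rng.lognormal(size=table_rows)       # stand-in for a real column
    true_q = np.quantile(column, q)
    errs = [abs(np.quantile(rng.choice(column, sample_rows, replace=False), q)
                - true_q)
            for _ in range(trials)]
    return float(np.mean(errs))

for n_rows in (10**5, 10**6, 10**7):
    err = quantile_error(n_rows, sample_rows=30_000)
    print(f"{n_rows:>10,} rows, 30,000 sampled ({30_000 / n_rows:.2%}): "
          f"mean |error| = {err:.4f}")

The error stays roughly constant even as the sampled fraction shrinks from
30% to 0.3%, which is why a fixed-size sample is enough for histograms.
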
I suspect that the number of rows to sample should be something of the form:

    { N                    where N <= a * 100
n = { (a * N) / 100        where a * 100 < N <= 10000
    { a * SQRT(N)          where N > 10000

n: the number of rows to sample
N: total number of rows
a: configurable coefficient (1 <= a <= 100)

For very large numbers of rows, like 10^8, taking a constant fraction
will oversample for most purposes, hence the square root.

  a          N          n    sampled%
  1      10000        100    1%
100      10000      10000    100%
  1       10^8      10000    0.01%
100       10^8       10^6    1%
  1      10^10       10^5    0.001%
100      10^10       10^7    0.1%

For very large values of N, you can still get a representative sample
with a very small fraction of the rows sampled.
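
For checking the table values, here is a quick Python sketch of the rule
above (the function and variable names are mine, not anything in PostgreSQL):

# Minimal sketch of the proposed piecewise sampling rule.
import math

def sample_size(N, a):
    """Rows to sample (n) for a table of N rows with coefficient a (1..100)."""
    if N <= a * 100:
        return N                      # small table: read everything
    if N <= 10_000:
        return (a * N) // 100         # proportional region
    return int(a * math.sqrt(N))      # square-root growth for large tables

for a, N in [(1, 10_000), (100, 10_000), (1, 10**8), (100, 10**8),
             (1, 10**10), (100, 10**10)]:
    n = sample_size(N, a)
    print(f"a={a:>3}  N={N:>13,}  n={n:>10,}  sampled={n / N:.4%}")
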

Hmm... Yes, I can see some issues once I tried it out with test values!
However, it might inspire a more useful and practical approach!

Cheers,
Gavin
