Re: ANALYZE sampling is too good

From: Greg Stark <stark(at)mit(dot)edu>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: ANALYZE sampling is too good
Date: 2013-12-07 21:27:39
Message-ID: CAM-w4HOQfAVvUMYCXOfmkZwg=8ye9t7jpqYBdyE7+ny-hZtZqA@mail.gmail.com
Lists: pgsql-hackers

On Sat, Dec 7, 2013 at 8:25 PM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
> The only approach which makes sense is to base it on a % of the table.
> In fact, pretty much every paper which has examined statistics
> estimation for database tables has determined that any estimate based on
> a less-than-5% sample is going to be wildly inaccurate. Not that 5%
> samples are 100% accurate, but at least they fit the 80/20 rule.

This is nonsense. The math behind the deductions the planner makes is
well understood, and we know how to estimate the precision based on the
sample size. There are some questions that really do need a proportional
sample, such as n_distinct (and, as Robert points out, the MCV list), but
the main motivation for the sample size is the histograms, and those are
used to answer questions which very clearly do not need a proportional
sample. The statistics are very clear there. Using a proportional
sample for the histograms would just be silly; it would be
substituting a gut feeling for real statistics.
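
For anyone who wants to convince themselves, here's a toy simulation
(Python, nothing to do with the actual ANALYZE code) showing that the
rank error of sample-based decile boundaries depends on the absolute
sample size, not on what fraction of the table was sampled:

    # Toy simulation: estimate decile boundaries from a 30,000-row sample
    # of tables of different sizes and measure the worst rank error.
    import bisect
    import random

    def worst_decile_rank_error(table_size, sample_size):
        table = sorted(random.gauss(0, 1) for _ in range(table_size))
        sample = sorted(random.sample(table, sample_size))
        worst = 0.0
        for d in range(1, 10):
            est = sample[d * sample_size // 10]
            true_rank = bisect.bisect_left(table, est) / table_size
            worst = max(worst, abs(true_rank - d / 10))
        return worst

    random.seed(1)
    for n in (100_000, 1_000_000):
        # a 30% sample and a 3% sample give roughly the same error,
        # and pushing the table size higher doesn't change it either
        print(n, worst_decile_rank_error(n, 30_000))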

The problem with using a proportional sample for things like
n_distinct or the MCV list is that you're talking about sorting or
hashing an unboundedly large set of values and storing an unboundedly
large array in the statistics table. I don't think that would be
tenable without dramatically changing the way we process and store that
data to be more scalable. Using a lossy counting algorithm and
something more scalable than toasted arrays would be prerequisites, I
think.
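
For concreteness, the kind of lossy counting I mean is Manku and
Motwani's algorithm, roughly this shape (a from-memory Python sketch,
not proposed code):

    # One pass, bounded memory: every value with true frequency >= s*N is
    # guaranteed to be reported, and reported counts undercount by at most
    # eps*N.  Memory stays around O((1/eps) * log(eps*N)) entries.
    def lossy_count(stream, s, eps):
        counts = {}                  # value -> (count, possible undercount)
        width = int(1 / eps)         # bucket width
        n = 0
        for item in stream:
            n += 1
            if item in counts:
                c, delta = counts[item]
                counts[item] = (c + 1, delta)
            else:
                counts[item] = (1, (n - 1) // width)
            if n % width == 0:
                # at each bucket boundary, prune entries that can no
                # longer reach the reporting threshold
                bucket = n // width
                for v in list(counts):
                    c, delta = counts[v]
                    if c + delta <= bucket:
                        del counts[v]
        return [(v, c) for v, (c, _) in counts.items() if c >= (s - eps) * n]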

And, as Robert mentions, even if we solved those problems it wouldn't
help n_distinct. You really need to read all the values to deal with
n_distinct.
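
To illustrate (toy numbers, and nothing like the estimator we actually
use): a table where every value appears exactly 20 times looks, in a 3%
sample, much like a table of near-unique values, so any sample-based
answer is a guess:

    import random

    random.seed(1)
    N = 1_000_000
    table = [i // 20 for i in range(N)]      # 50,000 distinct values, 20 rows each
    sample = random.sample(table, 30_000)

    d_true = len(set(table))                  # 50,000
    d_seen = len(set(sample))                 # ~23,000 -- too low
    d_scaled = d_seen * N // len(sample)      # ~760,000 -- far too high
    print(d_true, d_seen, d_scaled)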

> This is the reason why implementing block-based sampling is critical;
> using our current "take one row out of every page" method, sampling 5%
> of the table means scanning the whole thing in most tables. We also
> need to decouple the number of MCVs we keep from the sample size.
> Certainly our existing sampling algo seems designed to maximize IO for
> the sample size.

It would be helpful if you could specify what you mean by
"block-based sampling". The reason these previous pleas didn't go
anywhere is not because we can't do math. It's because of the lack of
specifics here. What we do *today* is called "block-based sampling" by
the literature.
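
Roughly, today's acquire_sample_rows() picks a set of blocks and then
reservoir-samples rows out of them, ending up keeping about one row per
block read on a large table. A from-memory sketch (not the real code,
which uses fancier block-selection and reservoir algorithms):

    import random

    def sample_rows(blocks, targrows):
        # stage 1: pick which blocks to read (targrows of them, if available)
        chosen = random.sample(range(len(blocks)), min(targrows, len(blocks)))
        # stage 2: reservoir-sample targrows rows from the rows in those blocks
        reservoir, seen = [], 0
        for b in chosen:
            for row in blocks[b]:
                seen += 1
                if len(reservoir) < targrows:
                    reservoir.append(row)
                else:
                    j = random.randrange(seen)
                    if j < targrows:
                        reservoir[j] = row
        return reservoir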

What I'm saying is that we need to change the "block-based sampling" we
do so that it extracts more rows per block. We currently extract the
"correct" number of rows to get a strong guarantee of uniformity. If
we could settle for a weaker guarantee of being "close enough" to a
uniform sample for the deductions we want to make, and get many more
rows per block, then we could read a lot fewer blocks.

Or, to put it another way, people could increase the statistics target
dramatically and still read the same number of blocks as today.
In an ideal world, perhaps we could have people reading 1% of the
blocks they read today and getting statistics targets 10x better than
today.
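
To put numbers on that (the 300-rows-per-target-point constant is from
memory, so treat these as illustrative):

    def blocks_read(stats_target, rows_per_block):
        sample_rows = 300 * stats_target      # ANALYZE's sample-size heuristic
        return sample_rows // rows_per_block

    print(blocks_read(100, 1))      # today, default target: ~30,000 blocks
    print(blocks_read(100, 100))    # same target, 100 rows/block: ~300 blocks
    print(blocks_read(1000, 100))   # 10x the target, still only ~3,000 blocks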

--
greg
