Re: ANALYZE sampling is too good

From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: "Sergey E(dot) Koposov" <math(at)sai(dot)msu(dot)ru>
Cc: Peter Geoghegan <pg(at)heroku(dot)com>, Jim Nasby <jim(at)nasby(dot)net>, Greg Stark <stark(at)mit(dot)edu>, Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: ANALYZE sampling is too good
Date: 2013-12-11 06:34:52
Message-ID: CA+U5nMJS8CLED0FLif+5cj_XGm8Q4e7h6qTeDJu=rRsnojF0wQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 11 December 2013 01:27, Sergey E. Koposov <math(at)sai(dot)msu(dot)ru> wrote:
> For what it's worth.
>
> I'll quote Chaudhuri et al. first line from the abstract about the block
> sampling.
> "Block-level sampling is far more efficient than true uniform-random
> sampling over a large database, but prone to significant errors if used to
> create database statistics."

This glosses over the point that both SQLServer and Oracle use this technique.

> And after briefly glancing through the paper, my opinion is why it works is
> because after making one version of statistics they cross-validate, see how
> well it goes and then collect more if the cross-validation error is large
> (for example because the data is clustered). Without this bit, as far as I
> can a simply block based sampler will be bound to make catastrophic mistakes
> depending on the distribution

I don't think its true that a block based sampler will be *bound* to
make "catastrophic mistakes". They can clearly happen, just as they
can with random samples, hence the need for a parameter to control the
sample with a parameter.

Realistically, I never heard of an Oracle DBA doing advanced
statistical mathematics before setting the sample size on ANALYZE. You
use the default and bump it up if the sample is insufficient for the
data.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message KONDO Mitsumasa 2013-12-11 06:36:37 Re: Time-Delayed Standbys
Previous Message Dev Kumkar 2013-12-11 06:31:51 Case sensitivity