Re: ANALYZE sampling is too good

From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Greg Stark <stark(at)mit(dot)edu>
Cc: Gavin Flower <GavinFlower(at)archidevsys(dot)co(dot)nz>, Peter Geoghegan <pg(at)heroku(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Kevin Grittner <kgrittn(at)ymail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Jim Nasby <jim(at)nasby(dot)net>
Subject: Re: ANALYZE sampling is too good
Date: 2013-12-12 17:07:31
Message-ID: CAMkU=1ykPnfJUh=S_6GDAfgY9eF=odmkOLYUWwuhgBZeVzrvnA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Dec 11, 2013 at 2:33 PM, Greg Stark <stark(at)mit(dot)edu> wrote:

>
> I think we're all wet here. I don't see any bias towards larger or smaller
> rows. Larger tied will be on a larger number of pages but there will be
> fewer of them on any one page. The average effect should be the same.
>
> Smaller values might have a higher variance with block based sampling than
> larger values. But that actually *is* the kind of thing that Simon's
> approach of just compensating with later samples can deal with.
>

I think that looking at all rows in randomly-chosen blocks will not bias
size, or histograms. But it will bias n_distinct and MCV for some data
distributions of data, unless we find some way to compensate for it.

But even for avg size and histograms, what does block sampling get us? We
get larger samples sizes for the same IO, but those samples are less
independent (assuming data is no randomly scattered over the table), so the
"effective sample size" is less than the true sample size. So we can't
just sample 100 time fewer blocks because there are about 100 rows per
block--doing so would not bias our avg size or histogram boundaries, but it
would certainly make them noisier.

Cheers,

Jeff

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Pavel Stehule 2013-12-12 17:11:48 patch: make_timestamp function
Previous Message Tom Lane 2013-12-12 16:55:51 Re: should we add a XLogRecPtr/LSN SQL type?