Re: ANALYZE sampling is too good

From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Greg Stark <stark(at)mit(dot)edu>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Peter Geoghegan <pg(at)heroku(dot)com>, Jim Nasby <jim(at)nasby(dot)net>, Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: ANALYZE sampling is too good
Date: 2013-12-11 13:02:30
Message-ID: 52A86266.9020700@vmware.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 12/11/2013 02:08 PM, Greg Stark wrote:
> On Wed, Dec 11, 2013 at 11:01 AM, Greg Stark <stark(at)mit(dot)edu> wrote:
>> I'm not actually sure there is any systemic bias here. The larger
>> number of rows per block generate less precise results but from my
>> thought experiments they seem to still be accurate?
>
> So I've done some empirical tests for a table generated by:
> create table sizeskew as (select i,j,repeat('i',i) from
> generate_series(1,1000) as i, generate_series(1,1000) as j);
>
> I find that using the whole block doesn't cause any problem with the
> avg_width field for the "repeat" column.That does reinforce my belief
> that we might not need any particularly black magic here.

How large a sample did you use? Remember that the point of doing
block-level sampling instead of the current approach would be to allow
using a significantly smaller sample (in # of blocks), and still achieve
the same sampling error. If the sample is "large enough", it will mask
any systemic bias caused by block-sampling, but the point is to reduce
the number of sampled blocks.

The practical question here is this: What happens to the quality of the
statistics if you only read 1/2 the number of blocks than you normally
would, but included all the rows in the blocks we read in the sample?
How about 1/10 ?

Or to put it another way: could we achieve more accurate statistics by
including all rows from the sampled rows, while reading the same number
of blocks? In particular, I wonder if it would help with estimating
ndistinct. It generally helps to have a larger sample for ndistinct
estimation, so it might be beneficial.

- Heikki

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2013-12-11 13:13:18 Re: should we add a XLogRecPtr/LSN SQL type?
Previous Message Robert Haas 2013-12-11 12:53:28 Re: Extra functionality to createuser