Re: ANALYZE sampling is too good

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Greg Stark <stark(at)mit(dot)edu>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Peter Geoghegan <pg(at)heroku(dot)com>, Jim Nasby <jim(at)nasby(dot)net>, Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: ANALYZE sampling is too good
Date: 2013-12-11 17:22:51
Message-ID: 18246.1386782571@sss.pgh.pa.us
Lists: pgsql-hackers

I wrote:
> Hm. You can only take N rows from a block if there actually are at least
> N rows in the block. So the sampling rule I suppose you are using is
> "select up to N rows from each sampled block" --- and that is going to
> favor the contents of blocks containing narrower-than-average rows.

Oh, no, wait: that's backwards. (I plead insufficient caffeine.)
Actually, this sampling rule discriminates *against* blocks with
narrower rows. You previously argued, correctly I think, that
sampling all rows on each page introduces no new bias because row
width cancels out across all sampled pages. However, if you just
include up to N rows from each page, then rows on pages with more
than N rows have a lower probability of being selected, but there's
no such bias against wider rows. This explains why you saw smaller
values of "i" being undersampled.
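
To put rough numbers on the argument above, here is a minimal sketch, assuming pages are sampled uniformly and that up to N rows are then picked uniformly from each sampled page. The function name, the 10% page fraction, and the 200 and 50 rows-per-page figures are hypothetical illustrations, not taken from the test under discussion or from the ANALYZE code.

```python
# Illustrative model only: names and numbers are assumptions,
# not taken from the ANALYZE implementation.

def row_inclusion_probability(rows_on_page, page_fraction, max_rows_per_page=None):
    """Chance that one particular row lands in the sample.

    page_fraction: probability that the row's page is sampled at all.
    max_rows_per_page: the cap N, or None to take every row on a sampled page.
    """
    if max_rows_per_page is None or rows_on_page <= max_rows_per_page:
        # Every row on a sampled page is taken.
        within_page = 1.0
    else:
        # Only N of the page's rows are taken, chosen uniformly.
        within_page = max_rows_per_page / rows_on_page
    return page_fraction * within_page


# Narrow rows: 200 per page. Wide rows: 50 per page. Cap N = 100.
capped_narrow = row_inclusion_probability(200, 0.1, max_rows_per_page=100)
capped_wide = row_inclusion_probability(50, 0.1, max_rows_per_page=100)

# Same pages, but taking all rows from each sampled page.
uncapped_narrow = row_inclusion_probability(200, 0.1)
uncapped_wide = row_inclusion_probability(50, 0.1)

print(capped_narrow, capped_wide, uncapped_narrow, uncapped_wide)
```

Under these assumptions the capped rule gives a narrow row half the chance of a wide row, while taking all rows per sampled page gives both the same chance.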

Had you run the test series all the way up to the max number of
tuples per block, which is probably a couple hundred in this test,
I think you'd have seen the bias go away again. But the takeaway
point is that we have to sample all tuples per page, not just a
limited number of them, if we want to change it like this.
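
The claim that the bias disappears once N reaches the maximum tuples per block can be checked with a small sketch under the same simplified model (uniform page sampling, then up to N rows per sampled page). The function name and the 200 and 50 rows-per-page figures are hypothetical; the page-sampling probability cancels out of the ratio, so it does not appear.

```python
# Illustrative model only: names and numbers are assumptions.

def narrow_to_wide_ratio(cap, narrow_rows_per_page, wide_rows_per_page):
    """Selection chance of a narrow row relative to a wide row, given cap N.

    A value of 1.0 means the two kinds of rows are sampled equally often.
    """
    narrow = min(1.0, cap / narrow_rows_per_page)
    wide = min(1.0, cap / wide_rows_per_page)
    return narrow / wide


# Sweep the cap up to the densest page's row count (200 here).
for cap in (10, 50, 100, 200):
    print(cap, narrow_to_wide_ratio(cap, 200, 50))
```

In this model the ratio stays below 1.0 for every cap smaller than the row count of the densest page, and reaches 1.0 only when the cap no longer truncates any page.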

regards, tom lane
