Re: ANALYZE sampling is too good

From: Gavin Flower <GavinFlower(at)archidevsys(dot)co(dot)nz>
To: Kevin Grittner <kgrittn(at)ymail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Greg Stark <stark(at)mit(dot)edu>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Peter Geoghegan <pg(at)heroku(dot)com>, Jim Nasby <jim(at)nasby(dot)net>, Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: ANALYZE sampling is too good
Date: 2013-12-11 21:14:24
Message-ID: 52A8D5B0.6080704@archidevsys.co.nz
Lists: pgsql-hackers

On 12/12/13 09:12, Gavin Flower wrote:
> On 12/12/13 08:39, Gavin Flower wrote:
>> On 12/12/13 08:31, Kevin Grittner wrote:
>>> Gavin Flower <GavinFlower(at)archidevsys(dot)co(dot)nz> wrote:
>>>
>>>> For example, assume 1000 rows of 200 bytes and 1000 rows of 20 bytes,
>>>> using 400 byte pages. In the pathologically worst case, assuming
>>>> maximum packing density and no page has both types: the large rows
>>>> would
>>>> occupy 500 pages and the smaller rows 50 pages. So if one selected 11
>>>> pages at random, you get about 10 pages of large rows and about one
>>>> for
>>>> small rows!
>>> With 10 * 2 = 20 large rows, and 1 * 20 = 20 small rows.
>>>
>>> --
>>> Kevin Grittner
>>> EDB: http://www.enterprisedb.com
>>> The Enterprise PostgreSQL Company
>> Sorry, I've simply come up with well argued nonsense!
>>
>> Kevin, you're dead right.
>>
>>
>> Cheers,
>> Gavin
>>
>>
> I looked at:
> http://www.postgresql.org/docs/current/interactive/storage-page-layout.html
>
> this says that each row has an overhead, which suggests there should
> be a bias towards small rows.
>
> There must be a lot of things going on, that I'm simply not aware of,
> that affect sampling bias...
>
>
> Cheers,
> Gavin
>
>
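Kevin's arithmetic is easy to check with a quick simulation (Python;
the page and row sizes are just the hypothetical ones from my example
above):

import random

# Layout from the example above: 400-byte pages, 1000 rows of
# 200 bytes (2 per page -> 500 pages) and 1000 rows of 20 bytes
# (20 per page -> 50 pages).
pages = [2] * 500 + [20] * 50  # rows held by each page

trials = 100_000
large = small = 0
for _ in range(trials):
    for rows in random.sample(pages, 11):  # pick 11 pages at random
        if rows == 2:
            large += rows  # take every row on the page
        else:
            small += rows

print("large rows per sample:", large / trials)  # ~20
print("small rows per sample:", small / trials)  # ~20

Both counts come out at about 20, so in this contrived case the page
sample is not biased by row size at all - provided every row on each
sampled page is kept.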
Ignoring overheads per row and other things... There will be a biasing
effect when the distribution of sizes is not symmetric. For example:
when the majority of rows have sizes greater than the arithmetic mean,
then most samples will be biased towards larger rows. Similarly, there
could be a bias towards smaller rows when most rows are smaller than the
arithmetic mean. Yes, I did think about this in depth - but it is way
too complicated to attempt to quantify the bias, because it depends on
too many factors (even just limiting it to the distribution of row sizes).
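To make one such mechanism concrete: here is a toy two-stage scheme -
sample pages uniformly, then a single row uniformly from each sampled
page (an assumption for illustration only, not what ANALYZE actually
does) - where a row on a sparse page becomes much more likely to be
chosen than a row on a dense page:

import random
from collections import Counter

# Toy two-stage sampler - NOT ANALYZE's algorithm: 11 pages chosen
# uniformly, then one row chosen uniformly within each page.
# Same hypothetical layout as the earlier example.
pages = [("large", 2)] * 500 + [("small", 20)] * 50

trials = 100_000
picked = Counter()
for _ in range(trials):
    for kind, _ in random.sample(pages, 11):
        picked[kind] += 1  # exactly one row drawn from this page

# 1000 rows of each kind exist, but the per-row selection
# probability differs by how many rows share the page:
print("large rows per sample:", picked["large"] / trials)  # ~10
print("small rows per sample:", picked["small"] / trials)  # ~1

Each large row ends up roughly ten times as likely to be sampled as
each small row, simply because fewer rows share its page.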

So apart from the volatility of the table, and the pattern of
insertions/updates/deletes - there will be a bias depending on the
distribution of values in the table.

So I despair of a simple, elegant & practical algorithm ever being
found.

Therefore, I expect the best answer is probably some kind of empirical
adaptive approach - which I think has already been suggested.

Cheers,
Gavin
