Re: TABLESAMPLE patch

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Petr Jelinek <petr(at)2ndquadrant(dot)com>, Peter Eisentraut <peter_e(at)gmx(dot)net>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Jaime Casanova <jaime(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Tomas Vondra <tv(at)fuzzy(dot)cz>
Subject: Re: TABLESAMPLE patch
Date: 2015-04-10 20:16:16
Message-ID: 55282F90.5070106@2ndquadrant.com
Lists: pgsql-hackers

On 04/10/15 21:57, Petr Jelinek wrote:
> On 10/04/15 21:26, Peter Eisentraut wrote:
>
> But this was not really my point. By definition, BERNOULLI just does
> not work well with a row limit: it applies the probability to each
> individual row. You can get the probability from a percentage very
> easily (just divide by 100), but to get it for a specific target
> number of rows you have to know the total number of source rows, and
> we can't do that very accurately - so you won't get 500 rows, only
> approximately 500 rows.

It's actually even trickier. Even if you happen to know the exact number
of rows in the table, you can't simply convert the target row count into
a percentage and use it for BERNOULLI sampling. Because each row is
sampled independently, the number of result rows will vary from run to
run.
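
To illustrate, here's a toy sketch in C (not code from the patch; the
table size, the target of 500 rows, the helper name and the use of
rand() are all just assumptions for the example). Each row is an
independent coin flip with probability p, so the number of kept rows is
binomially distributed - it only averages n*p:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* keep each of nrows rows independently with probability p, return count */
static int
bernoulli_sample_count(int nrows, double p)
{
    int kept = 0;

    for (int i = 0; i < nrows; i++)
        if ((double) rand() / RAND_MAX < p)
            kept++;

    return kept;
}

int
main(void)
{
    srand((unsigned) time(NULL));

    /* "500 of 100000 rows" expressed as p = 500/100000 = 0.005 */
    for (int run = 0; run < 5; run++)
        printf("run %d: kept %d rows\n", run,
               bernoulli_sample_count(100000, 0.005));

    return 0;
}

Each run prints a count near 500, but rarely exactly 500.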

That is why we have Vitter's algorithm for reservoir sampling, which
works very differently from BERNOULLI.
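
For comparison, a minimal sketch of the basic reservoir-sampling idea
(Algorithm R) - again toy C, not the patch's code, with hypothetical
names; Vitter's algorithm refines this by computing how many rows to
skip instead of drawing a random number for every row. It hands back
exactly k rows whenever the source has at least k, which is why a
row-count limit maps naturally onto it:

#include <stdio.h>
#include <stdlib.h>

/* pick exactly k of nrows values uniformly at random into sample[] */
static void
reservoir_sample(const int *rows, int nrows, int *sample, int k)
{
    /* fill the reservoir with the first k rows */
    for (int i = 0; i < k && i < nrows; i++)
        sample[i] = rows[i];

    /* each later row replaces a random reservoir slot with prob k/(i+1) */
    for (int i = k; i < nrows; i++)
    {
        int j = rand() % (i + 1);

        if (j < k)
            sample[j] = rows[i];
    }
}

int
main(void)
{
    int rows[1000], sample[10];

    for (int i = 0; i < 1000; i++)
        rows[i] = i;

    reservoir_sample(rows, 1000, sample, 10);

    for (int i = 0; i < 10; i++)
        printf("%d ", sample[i]);
    printf("\n");

    return 0;
}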

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
