Re: TABLESAMPLE patch

From: Petr Jelinek <petr(at)2ndquadrant(dot)com>
To: Peter Eisentraut <peter_e(at)gmx(dot)net>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Jaime Casanova <jaime(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Tomas Vondra <tv(at)fuzzy(dot)cz>
Subject: Re: TABLESAMPLE patch
Date: 2015-04-10 19:57:58
Message-ID: 55282B46.5010506@2ndquadrant.com
Lists: pgsql-hackers

On 10/04/15 21:26, Peter Eisentraut wrote:
> On 4/9/15 8:58 PM, Petr Jelinek wrote:
>> Well, you can take two approaches to this: either allow some specific
>> set of keywords that can be used to specify the limit, or let sampling
>> methods interpret their parameters; I believe the latter is more
>> flexible. There is nothing stopping somebody from writing a sampling
>> method which takes the limit as a number of rows, or anything else.
>>
>> Also, for example, for BERNOULLI to work correctly you'd need to
>> convert the number of rows to a fraction of the table anyway (that's
>> exactly what the one database which has this feature does internally),
>> and then it's no different from passing (SELECT
>> 100/reltuples*number_of_rows FROM tablename) as a parameter.
>
> What is your intended use case for this feature? I know that "give me
> 100 random rows from this table quickly" is a common use case, but
> that's kind of cumbersome if you need to apply formulas like that. I'm
> not sure what the use of a percentage is. Presumably, the main use of
> this feature is on large tables. But then you might not even know how
> large it really is, and even saying 0.1% might be more than you wanted
> to handle.
>

My main intended use case is analytics on very big tables. The mapping
between sampled percentage of the population and confidence level is
pretty well understood there, and you can get quite big speedups if you
are fine with results at a slightly lower confidence (i.e. you care
about ballpark figures).
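
For illustration, a minimal sketch of such a ballpark query, assuming a
hypothetical "orders" table and the percentage form of the syntax:

  -- Estimate an average from a 1% Bernoulli sample instead of scanning
  -- the whole table; the result is approximate but much cheaper.
  SELECT avg(amount) AS approx_avg_amount
  FROM orders TABLESAMPLE BERNOULLI (1);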

But that was not really my point. BERNOULLI just does not work well
with a row limit by definition: it applies a probability to each
individual row. You can get that probability from a percentage very
easily (just divide by 100), but to get it for a specific target number
of rows you have to know the total number of source rows, and we can't
estimate that very accurately, so you won't get 500 rows but
approximately 500 rows.
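
To make that concrete, here is a rough sketch of deriving the BERNOULLI
percentage for a 500-row target from the reltuples estimate, following
the subselect-as-parameter idea quoted above (the table name is
hypothetical, and since reltuples is only an estimate you get roughly
500 rows back, not exactly 500):

  -- reltuples is the planner's row-count estimate for the table, so the
  -- computed percentage, and hence the sample size, is approximate.
  SELECT * FROM bigtable
  TABLESAMPLE BERNOULLI (
    (SELECT 100.0 * 500 / reltuples FROM pg_class
     WHERE relname = 'bigtable')
  );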

In any case, for "give me 500 somewhat random rows from a table
quickly" you probably want SYSTEM sampling anyway, as it will be orders
of magnitude faster on big tables, and yes, even 0.1% might be more
than you wanted in that case. I am not against having a row-limit input
for methods which can work with it, like SYSTEM, but that is easily
done by adding a separate sampling method which accepts a row count
(even if the sampling algorithm itself is the same). With the current
approach, all you'd have to do is write a different init function for
the sampling method and register it under a new name (it won't be named
SYSTEM, but could be, for example, SYSTEM_ROWLIMIT).
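
Usage of such a method would look something like this (SYSTEM_ROWLIMIT
is just the hypothetical name from above, not something in the posted
patch):

  -- Hypothetical row-count variant of SYSTEM sampling: same block-level
  -- algorithm, but its init function interprets the parameter as a
  -- target number of rows rather than a percentage.
  SELECT * FROM bigtable TABLESAMPLE SYSTEM_ROWLIMIT (500);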

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
