Re: TABLESAMPLE patch

From: Petr Jelinek <petr(at)2ndquadrant(dot)com>
To: Peter Eisentraut <peter_e(at)gmx(dot)net>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Jaime Casanova <jaime(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Tomas Vondra <tv(at)fuzzy(dot)cz>
Subject: Re: TABLESAMPLE patch
Date: 2015-04-10 00:58:56
Message-ID: 55272050.8080607@2ndquadrant.com
Lists: pgsql-hackers

On 09/04/15 21:30, Peter Eisentraut wrote:
>
> In the SQL standard, the TABLESAMPLE clause is attached to a table
> expression (<table primary>), which includes table functions,
> subqueries, CTEs, etc. In the proposed patch, it is attached to a table
> name, allowing only an ONLY clause. So this is a significant deviation.
>

I wouldn't call something that implements a subset of the standard a
deviation, especially since other major databases have chosen the same
approach (afaik the only db that supports sampling over something
besides physical relations is Oracle, but their sampling works slightly
differently from what the standard specifies).

> Obviously, doing block sampling on a physical table is a significant use
> case, but we should be clear about which restrictions and tradeoffs we
> are making now and in the future, especially if we are going to present
> extension interfaces. The fact that physical tables are interchangeable
> with other relation types, at least in data-reading contexts, is a
> feature worth preserving.

Yes, but I don't think there is anything that prevents us from adding
this in the future. The sampling scan could be made to read both
directly from the heap and from an executor subnode, which is doable
even if it won't be extremely pretty (though it should be easy to
encapsulate in two internal interfaces, since the heap reading is
already encapsulated in one internal interface). Another approach would
be having two different executor nodes - SamplingScan and
SamplingFilter - and letting the planner pick one depending on the
source of the TABLESAMPLE clause.

The extension API currently consists mainly of:
nextblock - gets the next block id to read from the heap
nexttuple - gets the next tuple id to read from the current block
examinetuple - lets the extension decide whether a tuple should indeed
be returned (this one is optional)
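
To illustrate, here is a minimal sketch of what such a method
descriptor could look like (the struct name, signatures and headers are
made up for this email, not taken from the patch):

#include "postgres.h"
#include "storage/block.h"      /* BlockNumber, InvalidBlockNumber */
#include "storage/off.h"        /* OffsetNumber, InvalidOffsetNumber */
#include "access/htup.h"        /* HeapTuple */

/* Hypothetical descriptor for a tablesample method. */
typedef struct SampleMethodRoutine
{
    /* next block number to read, or InvalidBlockNumber when done */
    BlockNumber  (*nextblock) (void *state);

    /* next tuple offset within the current block, or
     * InvalidOffsetNumber when the block is exhausted */
    OffsetNumber (*nexttuple) (void *state, BlockNumber blockno,
                               OffsetNumber maxoffset);

    /* optional; decides whether a fetched tuple is actually
     * returned, may be NULL */
    bool         (*examinetuple) (void *state, HeapTuple tuple);
} SampleMethodRoutine;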

For the executor-node reading we'd probably just call examinetuple, as
there are no block ids or tuple ids there. This makes the API look
slightly schizophrenic, but on the other hand it gives the plugins
control over how the physical relation is read if that's indeed the
source. And I guess we could let the plugin specify whether it supports
heap access (nextblock/nexttuple); if it doesn't, the planner would
always choose a SamplingFilter on top of a SequentialScan for a
physical relation, instead of a SamplingScan.
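
Roughly, the planner-side decision could then look like this (again
just a sketch; is_physical_heap, make_sampling_scan and
make_sampling_filter are made-up names):

/* Hypothetical planner logic: methods without heap support get a
 * SamplingFilter above an ordinary scan instead of a SamplingScan. */
if (method->nextblock != NULL && is_physical_heap(rel))
    plan = make_sampling_scan(rel, method);       /* drives the heap itself */
else
    plan = make_sampling_filter(subplan, method); /* filters subnode output */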

All of this is possible to add later without breaking compatibility
with what is currently proposed for commit.

The reasons why we need the nextblock and nexttuple interfaces and the
ability to read the heap directly are: a) block sampling can't be done
by reading from another executor node, and b) performance.

>
> It may be worth thinking about some examples of other sampling methods,
> in order to get a better feeling for whether the interfaces are appropriate.
>

There is one additional method which exists purely for testing the
interface; it uses a column value to determine whether the tuple should
be returned or not (which is obviously useless in practice, as you can
do that using WHERE - it just shows how to use the interface).
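
The heart of such a test method is just the examinetuple callback;
something roughly like this (a simplified sketch with the attribute
number hardcoded, not the actual test module's code):

#include "postgres.h"
#include "access/htup_details.h"   /* heap_getattr */

/* Illustrative examinetuple: keep the tuple iff the value of the
 * first attribute is even. */
static bool
test_examinetuple(TupleDesc tupdesc, HeapTuple tuple)
{
    bool    isnull;
    Datum   d = heap_getattr(tuple, 1, tupdesc, &isnull);

    return !isnull && (DatumGetInt32(d) % 2 == 0);
}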

I would like to eventually have something that is time limited rather
than size limited, for example. I haven't thought much about other
sampling algorithms, but Simon proposed some and they should work with
the current API.
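
For the time-limited idea, nextblock could simply stop handing out
blocks once a deadline passes; a rough sketch (assuming the method
records its start time when the scan begins):

#include "postgres.h"
#include "storage/block.h"
#include "utils/timestamp.h"

/* Illustrative nextblock: advance block by block, but return
 * InvalidBlockNumber once time_limit_ms has elapsed. */
static BlockNumber
timed_nextblock(TimestampTz start_time, int time_limit_ms,
                BlockNumber current, BlockNumber nblocks)
{
    if (TimestampDifferenceExceeds(start_time, GetCurrentTimestamp(),
                                   time_limit_ms))
        return InvalidBlockNumber;      /* deadline hit, end the scan */

    return (current + 1 < nblocks) ? current + 1 : InvalidBlockNumber;
}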

> Earlier in the thread, someone asked about supporting specifying a
> number of rows instead of percents. While not essential, that seems
> pretty useful, but I wonder how that could be implemented later on if we
> take the approach that the argument to the sampling method can be an
> arbitrary quantity that is interpreted only by the method.
>

Well, you can have two approaches to this: either allow some specific
set of keywords that can be used to specify the limit, or let the
sampling methods interpret their parameters themselves; I believe the
latter is more flexible. There is nothing stopping somebody from
writing a sampling method which takes its limit as a number of rows, or
anything else.

Also, for example, for BERNOULLI to work correctly you'd need to
convert the number of rows to a fraction of the table anyway (and
that's exactly what the one database which has this feature does
internally), and then it's no different from passing something like
(SELECT 100 / reltuples * number_of_rows FROM pg_class
 WHERE relname = 'tablename') as a parameter.
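
For instance, with reltuples = 1,000,000 and number_of_rows = 1,000,
that expression yields 100 / 1000000 * 1000 = 0.1, i.e. a 0.1 percent
BERNOULLI sample.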

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
