inconsistent behaviour of set-returning functions in sub-query with random()

Lists: pgsql-generalpgsql-hackers
From: Tom van Tilburg <tom(dot)van(dot)tilburg(at)gmail(dot)com>
To: pgsql-general(at)postgresql(dot)org
Subject: inconsistent behaviour of set-returning functions in sub-query with random()
Date: 2016-09-23 20:34:32
Message-ID: CAP3PPDiucxYCNev52=YPVkrQAPVF1C5PFWnrQPT7iMzO1fiKFQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

Hi List,

Note beforehand: this question is a result of a stack-exchange that can be
seen here:
http://stackoverflow.com/questions/39624241/inconsistent-behaviour-of-set-returning-functions-in-sub-query-with-random

I'm often using the WHERE clause random() > 0.5 to pick a random subset of
my data. Now I noticed that when using a set-returning function in a
sub-query, I either get the whole set or none (meaning that the WHERE
random() > 0.5 clause is interpreted *before* the set is being generated).
e.g.:

SELECT num FROM (
SELECT unnest(Array[1,2,3,4,5,6,7,8,9,10]) num) AS foo WHERE random() > 0.5;

This seems inconsistent because the following query *does* take the whole
set into account:

SELECT num FROM (
SELECT * FROM unnest(Array[1,2,3,4,5,6,7,8,9,10]) num) AS foo
WHERE random() > 0.5;

So does this one:

WITH foo AS (
SELECT unnest(Array[1,2,3,4,5,6,7,8,9,10]) num)
SELECT num FROM foo WHERE random() > 0.5;

Could anyone reflect on the seeming inconsistency here? I do understand
that the planner sees the queries quite different (as can be seen from an
EXPLAIN) but I don't understand the rationale behind it.

Notes:

-

couldn't find another function to test apart from random(), but likely
there is some
-

I tested with generate_series and as well
-

My real use case works with postgis and pgpointcloud where a range of
set-returning functions is used in this manner

Thanks,
Tom


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Tom van Tilburg <tom(dot)van(dot)tilburg(at)gmail(dot)com>
Cc: pgsql-general(at)postgresql(dot)org, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [GENERAL] inconsistent behaviour of set-returning functions in sub-query with random()
Date: 2016-09-26 19:38:03
Message-ID: 1359.1474918683@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

Tom van Tilburg <tom(dot)van(dot)tilburg(at)gmail(dot)com> writes:
> I'm often using the WHERE clause random() > 0.5 to pick a random subset of
> my data. Now I noticed that when using a set-returning function in a
> sub-query, I either get the whole set or none (meaning that the WHERE
> random() > 0.5 clause is interpreted *before* the set is being generated).
> e.g.:
>
> SELECT num FROM (
> SELECT unnest(Array[1,2,3,4,5,6,7,8,9,10]) num) AS foo WHERE random() > 0.5;

Hmm, I think this is an optimizer bug. There are two legitimate behaviors
here:

SELECT * FROM unnest(ARRAY[1,2,3,4,5,6,7,8,9,10]) WHERE random() > 0.5;

should (and does) re-evaluate the WHERE for every row output by unnest().

SELECT unnest(ARRAY[1,2,3,4,5,6,7,8,9,10]) WHERE random() > 0.5;

should evaluate WHERE only once, since that happens before expansion of the
set-returning function in the targetlist. (If you're an Oracle user and
you imagine this query as having an implicit "FROM dual", the WHERE should
be evaluated for the single row coming out of the FROM clause.)

In the case you've got here, given the placement of the WHERE in the outer
query, you'd certainly expect it to be evaluated for each row coming out
of the inner query. But the optimizer is deciding it can push the WHERE
clause down to become a WHERE of the sub-select. That is legitimate in a
lot of cases, but not when there are SRF(s) in the sub-select's
targetlist, because that pushes the WHERE to occur before the SRF(s),
analogously to the change between the two queries I wrote.

I'm a bit hesitant to change this in existing releases. Given the lack
of previous complaints, it seems more likely to break queries that were
behaving as-expected than to make people happy. But we could change it
in v10 and up, especially since some other corner-case changes in
SRF-in-tlist behavior are afoot.

In the meantime, you could force it to work as you wish by inserting the
all-purpose optimization fence "OFFSET 0" in the sub-select:

=# SELECT num FROM (
SELECT unnest(Array[1,2,3,4,5,6,7,8,9,10]) num OFFSET 0) AS foo WHERE random() > 0.5;
num
-----
1
4
7
9
(4 rows)

regards, tom lane


From: Tom van Tilburg <tom(dot)van(dot)tilburg(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-general(at)postgresql(dot)org, pgsql-hackers(at)postgresql(dot)org
Subject: Re: inconsistent behaviour of set-returning functions in sub-query with random()
Date: 2016-09-27 09:19:17
Message-ID: CAP3PPDj3_a1fOM8EPteDj++BYivB9GS_NB8UY6Y_XBEhbP5odQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

Good to know and I agree that it is not an urgent case.
I think this practice might be more common in the POSTGIS community where
there are plenty of set-returning-functions used in this way. My use was
taking a random sample of a pointcloud distrubution.

I took the liberty to post your answer at stackexchange.

thanks,
Tom

On Mon, 26 Sep 2016 at 21:38 Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> Tom van Tilburg <tom(dot)van(dot)tilburg(at)gmail(dot)com> writes:
> > I'm often using the WHERE clause random() > 0.5 to pick a random subset
> of
> > my data. Now I noticed that when using a set-returning function in a
> > sub-query, I either get the whole set or none (meaning that the WHERE
> > random() > 0.5 clause is interpreted *before* the set is being
> generated).
> > e.g.:
> >
> > SELECT num FROM (
> > SELECT unnest(Array[1,2,3,4,5,6,7,8,9,10]) num) AS foo WHERE
> random() > 0.5;
>
> Hmm, I think this is an optimizer bug. There are two legitimate behaviors
> here:
>
> SELECT * FROM unnest(ARRAY[1,2,3,4,5,6,7,8,9,10]) WHERE random() > 0.5;
>
> should (and does) re-evaluate the WHERE for every row output by unnest().
>
> SELECT unnest(ARRAY[1,2,3,4,5,6,7,8,9,10]) WHERE random() > 0.5;
>
> should evaluate WHERE only once, since that happens before expansion of the
> set-returning function in the targetlist. (If you're an Oracle user and
> you imagine this query as having an implicit "FROM dual", the WHERE should
> be evaluated for the single row coming out of the FROM clause.)
>
> In the case you've got here, given the placement of the WHERE in the outer
> query, you'd certainly expect it to be evaluated for each row coming out
> of the inner query. But the optimizer is deciding it can push the WHERE
> clause down to become a WHERE of the sub-select. That is legitimate in a
> lot of cases, but not when there are SRF(s) in the sub-select's
> targetlist, because that pushes the WHERE to occur before the SRF(s),
> analogously to the change between the two queries I wrote.
>
> I'm a bit hesitant to change this in existing releases. Given the lack
> of previous complaints, it seems more likely to break queries that were
> behaving as-expected than to make people happy. But we could change it
> in v10 and up, especially since some other corner-case changes in
> SRF-in-tlist behavior are afoot.
>
> In the meantime, you could force it to work as you wish by inserting the
> all-purpose optimization fence "OFFSET 0" in the sub-select:
>
> =# SELECT num FROM (
> SELECT unnest(Array[1,2,3,4,5,6,7,8,9,10]) num OFFSET 0) AS foo WHERE
> random() > 0.5;
> num
> -----
> 1
> 4
> 7
> 9
> (4 rows)
>
>
> regards, tom lane
>


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Tom van Tilburg <tom(dot)van(dot)tilburg(at)gmail(dot)com>
Cc: pgsql-general(at)postgresql(dot)org, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Re: [GENERAL] inconsistent behaviour of set-returning functions in sub-query with random()
Date: 2016-09-27 22:48:19
Message-ID: 10246.1475016499@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

Tom van Tilburg <tom(dot)van(dot)tilburg(at)gmail(dot)com> writes:
> Good to know and I agree that it is not an urgent case.
> I think this practice might be more common in the POSTGIS community where
> there are plenty of set-returning-functions used in this way. My use was
> taking a random sample of a pointcloud distrubution.

Fix pushed to HEAD only. Thanks for the report!

regards, tom lane