Re: Planner producing 100% duplicate subplans when unneeded

Lists: pgsql-bugs
From: Daniel Grace <dgrace(at)wingsnw(dot)com>
To: pgsql-bugs(at)postgresql(dot)org
Subject: Planner producing 100% duplicate subplans when unneeded
Date: 2010-09-27 21:09:14
Message-ID: AANLkTikU6k1BgcMBcgkgOGNbSxq-juHtKMPRJP5BxS9g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs

This is all on Postgres 9.0.0:

I'm working on the definition for a view that, as part of its output,
includes three columns that each contain a sum of values in a table
that are in one of a few different states -- essentially: for each
parent, give me the sum of children in the foo state, a second sum of
the children in the bar state, and a third sum of the children in the
baz state.

Thinking I'd be clever and avoid multiple almost-identical subqueries
to for each SUM, I decided to do it in a single subquery that returns
an array of all 3 sums and then split it out into its component parts
higher up (see the example below if you don't understand what I mean).
Unfortunately, the query planner doesn't quite handle this case: For
each reference to the subquery value, it duplicates the subquery and
thus its plan:

CREATE TABLE parent (
id INT PRIMARY KEY
);
INSERT INTO parent(id) SELECT GENERATE_SERIES(1, 1000);

CREATE TABLE child (
parent_id INT,
v1 INT,
v2 INT,
PRIMARY KEY(parent_id, v1)
);

INSERT INTO child(parent_id, v1, v2)
SELECT p.id, v1.id, RANDOM() * 500
FROM
GENERATE_SERIES(1, 1000) AS p(id)
CROSS JOIN GENERATE_SERIES(1, 20) AS v1(id)
;

EXPLAIN SELECT *, child_data[1] AS foo, child_data[2] AS bar,
child_data[3] AS baz FROM (
SELECT p.id,
(
SELECT
ARRAY[
SUM(c.v2),
SUM(CASE WHEN c.v1 > 15 THEN c.v2 ELSE 0 END),
SUM(CASE WHEN c.v1 > 5 THEN c.v2 ELSE 0 END)
]
FROM child AS c
WHERE c.parent_id=p.id
) AS child_data
FROM parent AS p
) AS t;

produces:

Seq Scan on wings_sky.parent p (cost=0.00..161113.12 rows=1000 width=4)
Output: p.id, (SubPlan 1), ((SubPlan 2))[1], ((SubPlan 3))[2], ((SubPlan 4))[3]
SubPlan 1
-> Aggregate (cost=40.26..40.27 rows=1 width=8)
Output: ARRAY[sum(c.v2), sum(CASE WHEN (c.v1 > 15) THEN c.v2 ELSE 0
END), sum(CASE WHEN (c.v1 > 5) THEN c.v2 ELSE 0 END)]
-> Index Scan using child_pkey on wings_sky.child c (cost=0.00..40.10
rows=20 width=8)
Output: c.parent_id, c.v1, c.v2
Index Cond: (c.parent_id = $0)
SubPlan 2
-> Aggregate (cost=40.26..40.27 rows=1 width=8)
Output: ARRAY[sum(c.v2), sum(CASE WHEN (c.v1 > 15) THEN c.v2 ELSE 0
END), sum(CASE WHEN (c.v1 > 5) THEN c.v2 ELSE 0 END)]
-> Index Scan using child_pkey on wings_sky.child c (cost=0.00..40.10
rows=20 width=8)
Output: c.parent_id, c.v1, c.v2
Index Cond: (c.parent_id = $0)
SubPlan 3
-> Aggregate (cost=40.26..40.27 rows=1 width=8)
Output: ARRAY[sum(c.v2), sum(CASE WHEN (c.v1 > 15) THEN c.v2 ELSE 0
END), sum(CASE WHEN (c.v1 > 5) THEN c.v2 ELSE 0 END)]
-> Index Scan using child_pkey on wings_sky.child c (cost=0.00..40.10
rows=20 width=8)
Output: c.parent_id, c.v1, c.v2
Index Cond: (c.parent_id = $0)
SubPlan 4
-> Aggregate (cost=40.26..40.27 rows=1 width=8)
Output: ARRAY[sum(c.v2), sum(CASE WHEN (c.v1 > 15) THEN c.v2 ELSE 0
END), sum(CASE WHEN (c.v1 > 5) THEN c.v2 ELSE 0 END)]
-> Index Scan using child_pkey on wings_sky.child c (cost=0.00..40.10
rows=20 width=8)
Output: c.parent_id, c.v1, c.v2
Index Cond: (c.parent_id = $0)

There's another method of writing this that is more efficient in this
PARTICULAR case:

EXPLAIN VERBOSE SELECT parent.id, t.foo, t.bar, t.baz
FROM
parent
INNER JOIN (
SELECT
child.parent_id,
SUM(child.v2) AS foo,
SUM(CASE WHEN child.v1 > 15 THEN child.v2 ELSE 0 END) AS bar,
SUM(CASE WHEN child.v1 > 5 THEN child.v2 ELSE 0 END) AS baz
FROM
child
GROUP BY
child.parent_id
) AS t ON t.parent_id=parent.id

Hash Join (cost=547.12..556.62 rows=200 width=28)
Output: parent.id, (sum(child.v2)), (sum(CASE WHEN (child.v1 > 15)
THEN child.v2 ELSE 0 END)), (sum(CASE WHEN (child.v1 > 5) THEN
child.v2 ELSE 0 END))
Hash Cond: (child.parent_id = parent.id)
-> HashAggregate (cost=483.12..487.62 rows=200 width=12)
Output: child.parent_id, sum(child.v2), sum(CASE WHEN (child.v1 > 15)
THEN child.v2 ELSE 0 END), sum(CASE WHEN (child.v1 > 5) THEN child.v2
ELSE 0 END)
-> Seq Scan on wings_sky.child (cost=0.00..291.06 rows=19206 width=12)
Output: child.parent_id, child.v1, child.v2
-> Hash (cost=34.00..34.00 rows=2400 width=4)
Output: parent.id
-> Seq Scan on wings_sky.parent (cost=0.00..34.00 rows=2400 width=4)
Output: parent.id

However, the second plan performs poorly in cases in the case where
parent_id can be more than one possible value (i.e. there is no WHERE
parent_id=1 clause) -- it takes nearly as long in that instance as it
does with no WHERE clause altogether. Both plans run the same speed
with one parent_id. The first plan starts losing speed gradually as
the number of parents increase; the second plan is either
all-or-nothing.

In the first case, it seems inefficient to duplicate the subplan for
each reference -- I'd think the (corrected) plan should look something
like this:

Seq Scan on wings_sky.parent p (cost=0.00..161113.12 rows=1000 width=4)
Output: p.id, (SubPlan 1), ((SubPlan 1))[1], ((SubPlan 1))[2], ((SubPlan 1))[3]
SubPlan 1
-> Aggregate (cost=40.26..40.27 rows=1 width=8)
Output: ARRAY[sum(c.v2), sum(CASE WHEN (c.v1 > 15) THEN c.v2 ELSE 0
END), sum(CASE WHEN (c.v1 > 5) THEN c.v2 ELSE 0 END)]
-> Index Scan using child_pkey on wings_sky.child c (cost=0.00..40.10
rows=20 width=8)
Output: c.parent_id, c.v1, c.v2
Index Cond: (c.parent_id = $0)

Is there any chance this might be looked at in a future release?

--
Daniel Grace
AGE, LLC
System Administrator and Software Developer
dgrace(at)wingsnw(dot)com // www.wingsnw.com


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Daniel Grace <dgrace(at)wingsnw(dot)com>
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: Planner producing 100% duplicate subplans when unneeded
Date: 2010-10-04 18:47:17
Message-ID: AANLkTik6SzZRAULC0OajqdK_Yo+SSEdqPyQwUdP84db=@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs

On Mon, Sep 27, 2010 at 5:09 PM, Daniel Grace <dgrace(at)wingsnw(dot)com> wrote:
> Is there any chance this might be looked at in a future release?

This is another interesting example of a case where an inlining-type
optimization (which is effectively what's happening here, I think)
turns out to be a negative. We had one a while back that involved
actual function inlining, which is not quite what's happening here,
but it's close. It doesn't seem too hard to figure out whether or not
inlining is a win (non-trivial subexpressions should probably never
get duplicated), but nobody's gotten around to writing the logic to
make it work yet. One useful technique is to stick "OFFSET 0" into
the subquery; that prevents it from being inlined and gives you
something more like the plan you were hoping for.

Whether or not this will get looked at in a future release is a tough
question to answer. It's possible that someone (most likely, Tom)
will get motivated to fix this out of sheer annoyance with the current
behavior, or will notice a way to improve it incidentally while making
some other change. But of course the only way to make sure it gets
fixed is to do it yourself (or pay someone to do it).

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company


From: Daniel Grace <dgrace(at)wingsnw(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: Planner producing 100% duplicate subplans when unneeded
Date: 2010-10-04 20:15:48
Message-ID: AANLkTinyfs60cVNh6X8rT+63oQ2Q93njG28uCbZTbh+A@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs

As a theoretical question (I'm not savvy on Postgres's code but might
be intrigued enough to beat on it anyways), is it feasible to do an
additional pass on the query plan that essentially goes:

- Are these two subplans identical?
- Are they at the same part of the tree?

and if both of these conditions are true, discard one subplan and
rewrite all references to point to the other one?

Assuming it IS possible, are there any particular situations where it
wouldn't work?

On Mon, Oct 4, 2010 at 11:47 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Mon, Sep 27, 2010 at 5:09 PM, Daniel Grace <dgrace(at)wingsnw(dot)com> wrote:
>> Is there any chance this might be looked at in a future release?
>
> This is another interesting example of a case where an inlining-type
> optimization (which is effectively what's happening here, I think)
> turns out to be a negative.  We had one a while back that involved
> actual function inlining, which is not quite what's happening here,
> but it's close.  It doesn't seem too hard to figure out whether or not
> inlining is a win (non-trivial subexpressions should probably never
> get duplicated), but nobody's gotten around to writing the logic to
> make it work yet.  One useful technique is to stick "OFFSET 0" into
> the subquery; that prevents it from being inlined and gives you
> something more like the plan you were hoping for.
>
> Whether or not this will get looked at in a future release is a tough
> question to answer.  It's possible that someone (most likely, Tom)
> will get motivated to fix this out of sheer annoyance with the current
> behavior, or will notice a way to improve it incidentally while making
> some other change.  But of course the only way to make sure it gets
> fixed is to do it yourself (or pay someone to do it).
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise Postgres Company
>

--
Daniel Grace
AGE, LLC
System Administrator and Software Developer
dgrace(at)wingsnw(dot)com // (425)327-0079 // www.wingsnw.com


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Daniel Grace <dgrace(at)wingsnw(dot)com>
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: Planner producing 100% duplicate subplans when unneeded
Date: 2010-10-04 21:50:47
Message-ID: AANLkTinMrQ7gB0wEs3SKM-R4UnrN-gLuOAAW0tOH_sv4@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs

On Mon, Oct 4, 2010 at 4:15 PM, Daniel Grace <dgrace(at)wingsnw(dot)com> wrote:
> As a theoretical question (I'm not savvy on Postgres's code but might
> be intrigued enough to beat on it anyways), is it feasible to do an
> additional pass on the query plan that essentially goes:
>
> - Are these two subplans identical?
> - Are they at the same part of the tree?
>
> and if both of these conditions are true, discard one subplan and
> rewrite all references to point to the other one?
>
> Assuming it IS possible, are there any particular situations where it
> wouldn't work?

Well, volatile functions would be a problem, but I don't think that's
really the way to go anyway. I think that deciding whether or not to
collapse subqueries into the main query (which is what's happening
here) seems like it could be done by counting the number of times any
given subexpression is referenced, which seems like it would be a lot
cheaper than checking things against each other pairwise to see if
they match up. Slowing down the planner to detect cases like this
wouldn't be very appealing:

SELECT (SELECT SUM(a) FROM bob, SELECT SUM(b) FROM bob, SELECT SUM(c) FROM bob);

...because very few people are going to write that query that way
anyway, and unless it's almost free to notice the duplication, it
doesn't make sense to spend planning time on it.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Daniel Grace <dgrace(at)wingsnw(dot)com>, pgsql-bugs(at)postgresql(dot)org
Subject: Re: Planner producing 100% duplicate subplans when unneeded
Date: 2010-10-05 00:55:34
Message-ID: 11094.1286240134@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> On Mon, Sep 27, 2010 at 5:09 PM, Daniel Grace <dgrace(at)wingsnw(dot)com> wrote:
>> Is there any chance this might be looked at in a future release?

> This is another interesting example of a case where an inlining-type
> optimization (which is effectively what's happening here, I think)
> turns out to be a negative. We had one a while back that involved
> actual function inlining, which is not quite what's happening here,
> but it's close. It doesn't seem too hard to figure out whether or not
> inlining is a win (non-trivial subexpressions should probably never
> get duplicated), but nobody's gotten around to writing the logic to
> make it work yet.

Actually this case has some history behind it. The reason that the
sub-sub-query gets duplicated is that we flatten the sub-query, resulting
in duplicating its output expressions wherever they're referenced.
Now before you say that that's stupid, please notice that this case
got disabled in 7.3 as a bug workaround, and that almost immediately
produced howls of pain:
http://archives.postgresql.org/pgsql-general/2002-12/msg00410.php
http://archives.postgresql.org/pgsql-performance/2002-12/msg00214.php
so I undid it as soon as I could:
http://git.postgresql.org/gitweb?p=postgresql.git;a=commitdiff;h=de97072e3c88e104a55b0d5c67477f1b0097c003

While just shutting off the pull-up unconditionally would merely require
reintroducing a contains_subplans() restriction in is_simple_subquery(),
I'm unwilling to do that because of the earlier complaints. And
"figuring out whether it's a win" is a lot harder than you opine above:
at the point where this decision has to be taken, we have no cost
information whatsoever, and not even any clear idea whether the subquery
outputs are referenced at all let alone multiply referenced.

My thought about the current shortest path to a solution would be to
wrap subquery-containing output expressions in PlaceHolderVars.
Per this comment elsewhere in is_simple_subquery(),

/*
* Don't pull up a subquery that has any volatile functions in its
* targetlist. Otherwise we might introduce multiple evaluations of these
* functions, if they get copied to multiple places in the upper query,
* leading to surprising results. (Note: the PlaceHolderVar mechanism
* doesn't quite guarantee single evaluation; else we could pull up anyway
* and just wrap such items in PlaceHolderVars ...)
*/

that doesn't entirely work at the moment, but it might work if we were
to rejigger the PHV mechanism a bit. This would be attractive because
the added overhead of a PHV would be nearly nil, so we could afford to
do this without making any predictions about relative costs.

As a workaround until somebody gets around to looking at that, I'd
suggest that Daniel plaster his sub-query with the good old all-purpose
optimization fence, "OFFSET 0". That will prevent flattening and thus
prevent the sub-sub-query from getting duplicated.

regards, tom lane