Terrible plan for join to nested union

Lists: pgsql-performance
From: Nate Allan <nallan(at)ancestry(dot)com>
To: "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Terrible plan for join to nested union
Date: 2012-07-07 22:35:06
Message-ID: 9B2D6747F4AB8A47BE45216B06DEDAF92ABE667B@PREXMB01.myfamily.int
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

I have a query which joins to a nested union and I'm getting a plan which never returns. Here is the query simplified as much as possible:

select 'anything' as result
from "Attribute" as A1
inner join
(
select R."TargetID" as "SourceID"
from "Relationship" as R
union
select A2."PersonID" as "SourceID"
from "Attribute" as A2
) as X on (A1."PersonID" = X."SourceID")
where (A1."ID" = 124791200)

(this seems like a strange query, but it is simplified to eliminate everything I could)

Here is the execution plan I am seeing:
http://explain.depesz.com/s/BwUd

Merge Join (cost=229235406.73..244862067.56 rows=727 width=0)
Output: 'anything'
Merge Cond: (r."TargetID" = a1."PersonID")
-> Unique (cost=229235336.51..233700093.63 rows=892951424 width=8)
Output: r."TargetID"
-> Sort (cost=229235336.51..231467715.07 rows=892951424 width=8)
Output: r."TargetID"
Sort Key: r."TargetID"
-> Append (cost=0.00..23230287.48 rows=892951424 width=8)
-> Seq Scan on public."Relationship" r (cost=0.00..5055084.88 rows=328137088 width=8)
Output: r."TargetID"
-> Seq Scan on public."Attribute" a2 (cost=0.00..9245688.36 rows=564814336 width=8)
Output: a2."PersonID"
-> Materialize (cost=70.22..70.23 rows=1 width=8)
Output: a1."PersonID"
-> Sort (cost=70.22..70.23 rows=1 width=8)
Output: a1."PersonID"
Sort Key: a1."PersonID"
-> Index Scan using "UIDX_Attribute_ID" on public."Attribute" a1 (cost=0.00..70.21 rows=1 width=8)
Output: a1."PersonID"
Index Cond: (a1."ID" = 124791200)

As you can see, the Relationship table has ~300 million rows and Attribute has ~500 million rows. I could not include the explain analyze because the query never completes. Going to "union all" fixes it, nesting the restriction fixes it, making the restriction limit X rather than A1 fixes it. Unfortunately, none of these "fixes" are acceptable within the context of the complete query this was simplified from.

Version string: PostgreSQL 9.1.4 on x86_64-unknown-linux-gnu, compiled by gcc (GCC) 4.1.2 20080704 (Red Hat 4.1.2-52), 64-bit
OS: CentOS 5
RAM: 128GB
Processor: AMD Opteron(tm) 6174, 24 cores

I've not changed any configuration settings from the based EnterpriseDB installer besides shared_buffers. Presently the DB is static, and I have executed analyze to update the stats since loading it.

Relevant schema:

CREATE TABLE "Attribute"
(
"ID" bigint NOT NULL,
"PersonID" bigint NOT NULL,
"Type" character varying(5) NOT NULL
)
WITH ( OIDS=FALSE);

CREATE INDEX "IDX_Attribute_PersonID_Type" ON "Attribute" USING btree
("PersonID" , "Type" COLLATE pg_catalog."default" );

CREATE UNIQUE INDEX "UIDX_Attribute_ID"
ON "Attribute" USING btree ("ID" );

CREATE TABLE "Relationship"
(
"ID" bigint NOT NULL,
"TargetID" bigint NOT NULL
) WITH ( OIDS=FALSE);

CREATE INDEX "IDX_Relationship_TargetID"
ON "Relationship" USING btree ("TargetID" );

CREATE UNIQUE INDEX "UIDX_Relationship_ID"
ON "Relationship" USING btree ("ID" );

Thanks,

-Nate


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Nate Allan <nallan(at)ancestry(dot)com>
Cc: "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Terrible plan for join to nested union
Date: 2012-07-08 00:08:10
Message-ID: 19124.1341706090@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Nate Allan <nallan(at)ancestry(dot)com> writes:
> I have a query which joins to a nested union and I'm getting a plan which never returns. Here is the query simplified as much as possible:
> select 'anything' as result
> from "Attribute" as A1
> inner join
> (
> select R."TargetID" as "SourceID"
> from "Relationship" as R
> union
> select A2."PersonID" as "SourceID"
> from "Attribute" as A2
> ) as X on (A1."PersonID" = X."SourceID")
> where (A1."ID" = 124791200)

What exactly are you trying to accomplish here? AFAICS, the UNION
result must include every possible value of Attribute.PersonID, which
means the inner join cannot eliminate any rows of A1 (except those with
null PersonID), which seems a tad silly.

Anyway, I wonder whether you'd get better results with an EXISTS over
a correlated UNION ALL subquery, ie, something like

select 'anything' as result
from "Attribute" as A1
where (A1."ID" = 124791200)
and exists (
select 1 from "Relationship" as R
where R."TargetID" = A1."PersonID"
union all
select 1 from "Attribute" as A2
where A2."PersonID" = A1."PersonID"
)

since you're evidently hoping that the EXISTS won't need to be evaluated
for very many rows of A1. Or you could use an OR of two EXISTS to skip
the UNION altogether.

regards, tom lane


From: Nate Allan <nallan(at)ancestry(dot)com>
To: "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Terrible plan for join to nested union
Date: 2012-07-08 05:50:01
Message-ID: 9B2D6747F4AB8A47BE45216B06DEDAF92ABE66BC@PREXMB01.myfamily.int
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Thanks for your reply Tom.

>> I have a query which joins to a nested union and I'm getting a plan which never returns. Here is the query simplified as much as possible:
>> select 'anything' as result
>> from "Attribute" as A1
>> inner join
>> (
>> select R."TargetID" as "SourceID"
>> from "Relationship" as R
>> union
>> select A2."PersonID" as "SourceID"
>> from "Attribute" as A2
>> ) as X on (A1."PersonID" = X."SourceID")
>> where (A1."ID" = 124791200)
>
> AFAICS, the UNION result must include every possible value of Attribute.PersonID, which means the inner join cannot
>eliminate any rows of A1 (except those with null PersonID), which seems a tad silly.

It seems to me that the join condition (and hence the restriction) should be pushed down into both sides of the union to bring the cardinality limit from millions to 1. I'm imagining a rewrite like this:
R(a) J (b U c) -> (b J R(a)) U (c J R(a))
...where R = Restrict, J = Join, U = Union

This is the kind of rewrite I would make as a sentient being and it's one that at least one other DBMS I know of makes.

As an aside, even though not as good as pushing down the restriction, the plan that the "union all" produces is decent performance-wise:
http://explain.depesz.com/s/OZq
It seems to me that a similar alternative could be applied for a distinct union by using two Index Scans followed by a Merge Join.

>What exactly are you trying to accomplish here?

I state in my post that there are several ways to rewrite the query to work-around the issue; I'm not really asking for a work-around but a) wondering why the plan is so bad; and b) asking if it could be fixed if possible. Unfortunately rewriting the query isn't a trivial matter in our case because the X (union) part of the query is represented logically as a view, which is expected to be restricted and/or joined so as not to actually materialize the actual union. Unfortunately the PostgreSQL planner seems to want to actually materialize that view. Working around this would basically entail not using the view, which is used all over the place, and instead duplicating the view's logic except pushing the restrictions and/or joins down into both sides of the union in each case. I could do that, but doing so would be: a) against the spirit of the Relational Model; b) against the spirit of "fix the planner rather than add optimizer hints"; c) a royal pain because it causes a rewrite of application logic; d) a point for at least one other DBMS's optimizer. :-)

>Anyway, I wonder whether you'd get better results with an EXISTS over a correlated UNION ALL subquery, ie, something like
> ...

Thanks for the work-arounds, but again, that's not quite what I'm after.

Best,

-Nate


From: Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
To: Nate Allan <nallan(at)ancestry(dot)com>
Cc: "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Terrible plan for join to nested union
Date: 2012-07-08 06:03:26
Message-ID: CAFj8pRD5Hn92L_uJfr0DKcHCOW72VJ7Y0T7VhT6_kP3BUSgNjg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

2012/7/8 Nate Allan <nallan(at)ancestry(dot)com>:
> Thanks for your reply Tom.
>
>>> I have a query which joins to a nested union and I'm getting a plan which never returns. Here is the query simplified as much as possible:
>>> select 'anything' as result
>>> from "Attribute" as A1
>>> inner join
>>> (
>>> select R."TargetID" as "SourceID"
>>> from "Relationship" as R
>>> union
>>> select A2."PersonID" as "SourceID"
>>> from "Attribute" as A2
>>> ) as X on (A1."PersonID" = X."SourceID")
>>> where (A1."ID" = 124791200)
>>
>> AFAICS, the UNION result must include every possible value of Attribute.PersonID, which means the inner join cannot
>>eliminate any rows of A1 (except those with null PersonID), which seems a tad silly.
>
> It seems to me that the join condition (and hence the restriction) should be pushed down into both sides of the union to bring the cardinality limit from millions to 1. I'm imagining a rewrite like this:
> R(a) J (b U c) -> (b J R(a)) U (c J R(a))
> ...where R = Restrict, J = Join, U = Union
>
> This is the kind of rewrite I would make as a sentient being and it's one that at least one other DBMS I know of makes.
>
> As an aside, even though not as good as pushing down the restriction, the plan that the "union all" produces is decent performance-wise:
> http://explain.depesz.com/s/OZq
> It seems to me that a similar alternative could be applied for a distinct union by using two Index Scans followed by a Merge Join.
>
>>What exactly are you trying to accomplish here?
>
> I state in my post that there are several ways to rewrite the query to work-around the issue; I'm not really asking for a work-around but a) wondering why the plan is so bad; and b) asking if it could be fixed if possible. Unfortunately rewriting the query isn't a trivial matter in our case because the X (union) part of the query is represented logically as a view, which is expected to be restricted and/or joined so as not to actually materialize the actual union. Unfortunately the PostgreSQL planner seems to want to actually materialize that view. Working around this would basically entail not using the view, which is used all over the place, and instead duplicating the view's logic except pushing the restrictions and/or joins down into both sides of the union in each case. I could do that, but doing so would be: a) against the spirit of the Relational Model; b) against the spirit of "fix the planner rather than add optimizer hints"; c) a royal pain because it causes a rewrite of application logic; d) a point for at least one other DBMS's optimizer. :-)

you are using EAV schema - it is against to relation model enough :)

this schema has the most terrible performance for large datasets -
looks on hstore instead

Regards

Pavel

>
>>Anyway, I wonder whether you'd get better results with an EXISTS over a correlated UNION ALL subquery, ie, something like
>> ...
>
> Thanks for the work-arounds, but again, that's not quite what I'm after.
>
> Best,
>
> -Nate
>
>
>
> --
> Sent via pgsql-performance mailing list (pgsql-performance(at)postgresql(dot)org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-performance


From: Nate Allan <nallan(at)ancestry(dot)com>
To: "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Terrible plan for join to nested union
Date: 2012-07-08 09:43:58
Message-ID: 9B2D6747F4AB8A47BE45216B06DEDAF92ABE66E3@PREXMB01.myfamily.int
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

>you are using EAV schema - it is against to relation model enough :)
>this schema has the most terrible performance for large datasets - looks on hstore instead

>Pavel

Actually despite the table named Attribute, I am not doing EAV though I can see why you'd think that. Attributes are part of the conceptual domain I'm modeling and I assure you there are first class columns in the schema for everything. Regardless, that has nothing to do with my performance problem with joining to a nested union.

-Nate


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Nate Allan <nallan(at)ancestry(dot)com>
Cc: "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Terrible plan for join to nested union
Date: 2012-07-08 15:56:35
Message-ID: 3857.1341762995@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Nate Allan <nallan(at)ancestry(dot)com> writes:
> It seems to me that the join condition (and hence the restriction) should be pushed down into both sides of the union to bring the cardinality limit from millions to 1. I'm imagining a rewrite like this:
> R(a) J (b U c) -> (b J R(a)) U (c J R(a))
> ...where R = Restrict, J = Join, U = Union

[ eyes that suspiciously ... ] I'm not convinced that such a
transformation is either correct in general (you seem to be assuming
at least that A's join column is unique, and what is the UNION operator
supposed to do with A's other columns?) or likely to lead to a
performance improvement in general.

We possibly could push down a join condition on the inner side of a
nestloop, similarly to what's done in the UNION ALL case ... but that
would require a complete refactoring of what the planner does with
UNIONs. By and large, very little optimization effort has been put
into non-ALL UNION (or INTERSECT or EXCEPT). You should not expect
that to change on a time scale of less than years.

regards, tom lane


From: Nate Allan <nallan(at)ancestry(dot)com>
To: "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Terrible plan for join to nested union
Date: 2012-07-09 04:02:23
Message-ID: 9B2D6747F4AB8A47BE45216B06DEDAF92ABE6739@PREXMB01.myfamily.int
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

>>Nate Allan <nallan(at)ancestry(dot)com> writes:
>> It seems to me that the join condition (and hence the restriction) should be pushed down into both sides of the union to bring the cardinality limit from millions to 1. I'm imagining a rewrite like this:
>> R(a) J (b U c) -> (b J R(a)) U (c J R(a)) ...where R = Restrict, J
>> = Join, U = Union

>[ eyes that suspiciously ... ] I'm not convinced that such a transformation is either correct in general (you seem to be assuming at least that A's join column is unique, and >what is the UNION operator supposed to do with A's other columns?) or likely to lead to a performance improvement in general.

If there are more columns, you are correct that you might have to project off any additional columns within the union, and leave the join outside of the union intact to bring in the extra columns. Those are essentially the same considerations as when making other rewrites though. As for this optimization making unions faster in general, I would argue that it is rather easy to produce a plan superior to complete materialization of the union.

>We possibly could push down a join condition on the inner side of a nestloop, similarly to what's done in the UNION ALL case ... but that would require a complete >refactoring of what the planner does with UNIONs. By and large, very little optimization effort has been put into non-ALL UNION (or INTERSECT or EXCEPT). You should >not expect that to change on a time scale of less than years.

I hate to come across as contrary, but I'm pretty shocked by this answer for a couple reasons:
1) This is a clear-cut case of an untenable execution plan, essentially a bug in the planner. This response contradicts the widely broadcast assertion that the PG community fixes planner bugs quickly and will not introduce hints because they would rather address these kinds of issues "correctly".
2) Why would more effort go into Union All rather than Union? Are people using Union All more than Union, and if so is this because they actually want duplicates or is it because they've been trained to due to the performance problems with Union? Union All, in many people's opinions, shouldn't even exist in a true relational sense.

Again, sorry if I'm coming off as abrasive, I've spent political capital pushing to get PG in on this project, and now I'm a little worried about whether it is going to work for this kind of scale and complexity, so I'm a little stressed. I do appreciate your responses.

Best,

-Nate


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Nate Allan <nallan(at)ancestry(dot)com>
Cc: "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Terrible plan for join to nested union
Date: 2012-07-09 05:49:52
Message-ID: 29374.1341812992@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Nate Allan <nallan(at)ancestry(dot)com> writes:
> 2) Why would more effort go into Union All rather than Union?

The UNION ALL case matches up with, and shares planning and execution
code with, table-inheritance and partitioning scenarios. So yes, it
really is more interesting to more people than UNION DISTINCT.
(IIRC, the code that does that stuff was originally meant to support the
inheritance case, and we hacked UNION ALL to be able to share the logic,
not vice versa.)

Right now, UNION DISTINCT, along with INTERSECT and EXCEPT, have
basically no optimization support whatsoever: all of them go through a
code path that just evaluates both input relations and performs the
set-combination operation. All of that code dates from a period about
a dozen years ago when we were more interested in getting the right
answer at all than how fast it was. Rewriting it all to have some
optimization capability is certainly on the wish-list ... but the fact
that it hasn't risen to the top of anybody's to-do list in that time
indicates to me that it probably isn't going to get done in the next
little while either. And even if someone were to start working on it
right now, it's not a small project.

Sorry to be the bearer of bad news, but this isn't going to change
just because you try to label it a bug.

regards, tom lane


From: Nate Allan <nallan(at)ancestry(dot)com>
To: "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Terrible plan for join to nested union
Date: 2012-07-09 06:20:41
Message-ID: 9B2D6747F4AB8A47BE45216B06DEDAF92ABE6791@PREXMB01.myfamily.int
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

>Right now, UNION DISTINCT, along with INTERSECT and EXCEPT, have basically no optimization support whatsoever...
> Sorry to be the bearer of bad news, but this isn't going to change just because you try to label it a bug.

Given the medium, I'll try not to read that in a snarky tone, after all, surely it's not unreasonable to label it a defect for a system not to optimize one of the basic relational primitives. That said, I know well the annoyance when a user cries bug when the system is working as-designed. In any case, I'm at least glad to have resolution; I know that there is no choice but to work around it.

For a maximally general work-around given that the union is the essence of a reused view, perhaps a reasonable approach is to switch to Union All and nest it within a Distinct outer query. That seems to produce workable plans in my tests so far. Maybe that could even form the basis of a planner enhancement that wouldn't require a complete refactor.

Thanks again,

-Nate