Re: Planner not using column limit specified for one column for another column equal to first

Lists: pgsql-performance
From: Віталій Тимчишин <tivv00(at)gmail(dot)com>
To: pgsql-performance(at)postgresql(dot)org
Subject: Planner not using column limit specified for one column for another column equal to first
Date: 2010-04-16 08:02:06
Message-ID: m2y331e40661004160102tff445f9eg165a04efe3b409ed@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Hello.

I have a query that performs very poor because there is a limit on join
column that is not applied to other columns:

select * from company this_ left outer join company_tag this_1_ on
this_.id=this_1_.company_id left outer join company_measures companymea2_ on
this_.id=companymea2_.company_id left outer join company_descr ces3_ on
this_.id=ces3_.company_id where this_1_.tag_id = 7 and this_.id>50000000
and this_1_.company_id>50000000
order by this_.id asc limit 1000;

(plan1.txt)
Total runtime: 7794.692 ms

At the same time if I apply the limit (>50000000) to other columns in query
itself it works like a charm:

select * from company this_ left outer join company_tag this_1_ on
this_.id=this_1_.company_id left outer join company_measures companymea2_ on
this_.id=companymea2_.company_id left outer join company_descr ces3_ on
this_.id=ces3_.company_id where this_1_.tag_id = 7 and this_.id>50000000
and this_1_.company_id>50000000
and companymea2_.company_id>50000000 and ces3_.company_id>50000000
order by this_.id asc limit 1000;

(plan2.txt)
Total runtime: 27.547 ms

I've thought and someone in this list've told me that this should be done
automatically. But I have pretty recent server:
PostgreSQL 8.4.2 on amd64-portbld-freebsd8.0, compiled by GCC cc (GCC) 4.2.1
20070719 [FreeBSD], 64-bit
and it still do not work

Do I misunderstand something or this feature don't work in such a query?

Best regards, Vitalii Tymchyshyn

Attachment Content-Type Size
plan1.txt text/plain 1.6 KB
plan2.txt text/plain 1.7 KB

From: Hannu Krosing <hannu(at)2ndquadrant(dot)com>
To: Віталій Тимчишин <tivv00(at)gmail(dot)com>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: Planner not using column limit specified for one column for another column equal to first
Date: 2010-04-16 08:25:25
Message-ID: 1271406325.27474.24.camel@hvost
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

On Fri, 2010-04-16 at 11:02 +0300, Віталій Тимчишин wrote:
> Hello.
>
>
> I have a query that performs very poor because there is a limit on
> join column that is not applied to other columns:
>
>
> select * from company this_ left outer join company_tag this_1_ on
> this_.id=this_1_.company_id left outer join company_measures
> companymea2_ on this_.id=companymea2_.company_id left outer join
> company_descr ces3_ on this_.id=ces3_.company_id where this_1_.tag_id
> = 7 and this_.id>50000000
> and this_1_.company_id>50000000
> order by this_.id asc limit 1000;
>
>
> (plan1.txt)
> Total runtime: 7794.692 ms
>
>
> At the same time if I apply the limit (>50000000) to other columns in
> query itself it works like a charm:
>
>
> select * from company this_ left outer join company_tag this_1_ on
> this_.id=this_1_.company_id left outer join company_measures
> companymea2_ on this_.id=companymea2_.company_id left outer join
> company_descr ces3_ on this_.id=ces3_.company_id where this_1_.tag_id
> = 7 and this_.id>50000000
> and this_1_.company_id>50000000
> and companymea2_.company_id>50000000 and ces3_.company_id>50000000
> order by this_.id asc limit 1000;

The queries are not the same.

2nd variant will not return the rows where there are no matching rows
inthis_1_ , companymea2_ or ces3_.company_id

A query equivalent to first one would be:

select * from company this_
left outer join company_tag this_1_
on (this_.id=this_1_.company_id
and this_1_.company_id>50000000)
left outer join company_measures companymea2_
on (this_.id=companymea2_.company_id
and companymea2_.company_id>50000000)
left outer join company_descr ces3_
on (this_.id=ces3_.company_id
and ces3_.company_id>50000000)
where this_1_.tag_id = 7
and this_.id>50000000
order by this_.id asc
limit 1000;

I'm not sure that planner considers the above form of plan rewrite, nor
that it would make much sense to do so unless there was a really small
number of rows where x_.company_id>50000000

--
Hannu Krosing http://www.2ndQuadrant.com
PostgreSQL Scalability and Availability
Services, Consulting and Training


From: Yeb Havinga <yebhavinga(at)gmail(dot)com>
To: Віталій Тимчишин <tivv00(at)gmail(dot)com>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: Planner not using column limit specified for one column for another column equal to first
Date: 2010-04-16 08:31:06
Message-ID: 4BC8204A.7050904@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Віталій Тимчишин wrote:
> Hello.
>
> I have a query that performs very poor because there is a limit on
> join column that is not applied to other columns:
>
> select * from company this_ left outer join company_tag this_1_ on
> this_.id=this_1_.company_id left outer join company_measures
> companymea2_ on this_.id=companymea2_.company_id left outer join
> company_descr ces3_ on this_.id=ces3_.company_id where this_1_.tag_id
> = 7 and this_.id>50000000
> and this_1_.company_id>50000000
> order by this_.id asc limit 1000;
>
> (plan1.txt)
> Total runtime: 7794.692 ms
>
> At the same time if I apply the limit (>50000000) to other columns in
> query itself it works like a charm:
>
> select * from company this_ left outer join company_tag this_1_ on
> this_.id=this_1_.company_id left outer join company_measures
> companymea2_ on this_.id=companymea2_.company_id left outer join
> company_descr ces3_ on this_.id=ces3_.company_id where this_1_.tag_id
> = 7 and this_.id>50000000
> and this_1_.company_id>50000000
> and companymea2_.company_id>50000000 and ces3_.company_id>50000000
> order by this_.id asc limit 1000;
>
> (plan2.txt)
> Total runtime: 27.547 ms
>
> I've thought and someone in this list've told me that this should be
> done automatically.
Yes, if you have in a query a=b and b=c, then the optimizer figures out
that a=c as well. (a,b and c are then member of the same equivalence class).

However both queries are not the same, since the joins you're using are
outer joins. In the first it's possible that records are returned for
company records with no matching ces3_ records, the ces3_ records is
null in that case. In the second query no NULL ces3_ information may be
returned.

Another thing is it seems that the number of rows guessed is far off
from the actual number of rows, is the number 5000000 artificial or are
you're statistics old or too small histogram/mcv's?

regards,
Yeb Havinga


From: Віталій Тимчишин <tivv00(at)gmail(dot)com>
To: Yeb Havinga <yebhavinga(at)gmail(dot)com>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: Planner not using column limit specified for one column for another column equal to first
Date: 2010-04-16 12:49:45
Message-ID: k2u331e40661004160549td6bf86d5pb6fde70e6d1ab5d@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

16 квітня 2010 р. 11:31 Yeb Havinga <yebhavinga(at)gmail(dot)com> написав:

> Віталій Тимчишин wrote:
>
>> Hello.
>>
>> I have a query that performs very poor because there is a limit on join
>> column that is not applied to other columns:
>>
>> select * from company this_ left outer join company_tag this_1_ on
>> this_.id=this_1_.company_id left outer join company_measures companymea2_ on
>> this_.id=companymea2_.company_id left outer join company_descr ces3_ on
>> this_.id=ces3_.company_id where this_1_.tag_id = 7 and this_.id>50000000 and
>> this_1_.company_id>50000000
>> order by this_.id asc limit 1000;
>>
>> (plan1.txt)
>> Total runtime: 7794.692 ms
>>
>> At the same time if I apply the limit (>50000000) to other columns in
>> query itself it works like a charm:
>>
>> select * from company this_ left outer join company_tag this_1_ on
>> this_.id=this_1_.company_id left outer join company_measures companymea2_ on
>> this_.id=companymea2_.company_id left outer join company_descr ces3_ on
>> this_.id=ces3_.company_id where this_1_.tag_id = 7 and this_.id>50000000 and
>> this_1_.company_id>50000000
>> and companymea2_.company_id>50000000 and ces3_.company_id>50000000
>> order by this_.id asc limit 1000;
>>
>> (plan2.txt)
>> Total runtime: 27.547 ms
>>
>> I've thought and someone in this list've told me that this should be done
>> automatically.
>>
> Yes, if you have in a query a=b and b=c, then the optimizer figures out
> that a=c as well. (a,b and c are then member of the same equivalence class).
>
> However both queries are not the same, since the joins you're using are
> outer joins. In the first it's possible that records are returned for
> company records with no matching ces3_ records, the ces3_ records is null in
> that case. In the second query no NULL ces3_ information may be returned.
>

OK, but when I move limit to join condition the query is still fast:

select * from company this_ left outer join company_tag this_1_ on
this_.id=this_1_.company_id
left outer join company_measures companymea2_ on
this_.id=companymea2_.company_id and companymea2_.company_id>50000000
left outer join company_descr ces3_ on this_.id=ces3_.company_id and
ces3_.company_id>50000000
where this_1_.tag_id = 7 and this_.id>50000000
and this_1_.company_id>50000000
order by this_.id asc limit 1000;

(plan3.txt),
Total runtime: 26.327 ms
BTW: Changing slow query to inner joins do not make it fast

>
> Another thing is it seems that the number of rows guessed is far off from
> the actual number of rows, is the number 5000000 artificial or are you're
> statistics old or too small histogram/mcv's?
>

Nope, I suppose this is because of limit. If I remove the limit, the
estimations are quite correct. There are ~6 millions of row in each table.

Attachment Content-Type Size
plan3.txt text/plain 1.7 KB

From: Віталій Тимчишин <tivv00(at)gmail(dot)com>
To:
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: Planner not using column limit specified for one column for another column equal to first
Date: 2010-04-16 12:59:50
Message-ID: k2y331e40661004160559zb22b8ac2i3d1909c0c8ff0704@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

16 квітня 2010 р. 11:25 Hannu Krosing <hannu(at)2ndquadrant(dot)com> написав:

> On Fri, 2010-04-16 at 11:02 +0300, Віталій Тимчишин wrote:
> > Hello.
> >
> >
> > I have a query that performs very poor because there is a limit on
> > join column that is not applied to other columns:
> >
> >
> > select * from company this_ left outer join company_tag this_1_ on
> > this_.id=this_1_.company_id left outer join company_measures
> > companymea2_ on this_.id=companymea2_.company_id left outer join
> > company_descr ces3_ on this_.id=ces3_.company_id where this_1_.tag_id
> > = 7 and this_.id>50000000
> > and this_1_.company_id>50000000
> > order by this_.id asc limit 1000;
> >
> >
> > (plan1.txt)
> > Total runtime: 7794.692 ms
> >
> >
> > At the same time if I apply the limit (>50000000) to other columns in
> > query itself it works like a charm:
> >
> >
> > select * from company this_ left outer join company_tag this_1_ on
> > this_.id=this_1_.company_id left outer join company_measures
> > companymea2_ on this_.id=companymea2_.company_id left outer join
> > company_descr ces3_ on this_.id=ces3_.company_id where this_1_.tag_id
> > = 7 and this_.id>50000000
> > and this_1_.company_id>50000000
> > and companymea2_.company_id>50000000 and ces3_.company_id>50000000
> > order by this_.id asc limit 1000;
>
> The queries are not the same.
>
> 2nd variant will not return the rows where there are no matching rows
> inthis_1_ , companymea2_ or ces3_.company_id
>
> A query equivalent to first one would be:
>
>
> select * from company this_
> left outer join company_tag this_1_
> on (this_.id=this_1_.company_id
> and this_1_.company_id>50000000)
> left outer join company_measures companymea2_
> on (this_.id=companymea2_.company_id
> and companymea2_.company_id>50000000)
> left outer join company_descr ces3_
> on (this_.id=ces3_.company_id
> and ces3_.company_id>50000000)
> where this_1_.tag_id = 7
> and this_.id>50000000
> order by this_.id asc
> limit 1000;
>

And it's still fast (see plan in another mail), while "inner join" variant
of original query is still slow.

>
>
> I'm not sure that planner considers the above form of plan rewrite, nor
> that it would make much sense to do so unless there was a really small
> number of rows where x_.company_id>50000000
>
> Actually no,
select id > 50000000, count(*) from company group by 1
f,1096042
t,5725630

I don't know why the planner wishes to perform few merges of 1000 to a
million of records (and the merges is the thing that takes time) instead of
taking a 1000 of records from main table and then doing a nested loop. And
it must read all the records that DO NOT match the criteria for secondary
tables before getting to correct records if it do not filter secondary
tables with index on retrieve.

set enable_mergejoin=false helps original query, but this is another problem
and first solution is simpler and can be used by planner automatically,
while second requires rethinking/rewrite of LIMIT estimation logic
(Plan of nested loop attached)

Attachment Content-Type Size
nestplan.txt text/plain 1.4 KB

From: Yeb Havinga <yebhavinga(at)gmail(dot)com>
To: Віталій Тимчишин <tivv00(at)gmail(dot)com>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: Planner not using column limit specified for one column for another column equal to first
Date: 2010-04-16 13:21:30
Message-ID: 4BC8645A.2060700@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Віталій Тимчишин wrote:
>
> BTW: Changing slow query to inner joins do not make it fast
I'm interested to see the query andplan of the slow query with inner joins.

>
>
> Another thing is it seems that the number of rows guessed is far
> off from the actual number of rows, is the number 5000000
> artificial or are you're statistics old or too small histogram/mcv's?
>
>
> Nope, I suppose this is because of limit. If I remove the limit, the
> estimations are quite correct. There are ~6 millions of row in each table.
Yes, that makes sense.

regards,
Yeb Havinga


From: Віталій Тимчишин <tivv00(at)gmail(dot)com>
To: Yeb Havinga <yebhavinga(at)gmail(dot)com>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: Planner not using column limit specified for one column for another column equal to first
Date: 2010-04-16 13:58:22
Message-ID: u2s331e40661004160658s842c95d4x22d212dc8cdbbb4e@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

16 квітня 2010 р. 16:21 Yeb Havinga <yebhavinga(at)gmail(dot)com> написав:

> Віталій Тимчишин wrote:
>
>>
>> BTW: Changing slow query to inner joins do not make it fast
>>
> I'm interested to see the query andplan of the slow query with inner joins.
>
>
> Here you are. The query:

select * from company this_ inner join company_tag this_1_ on
this_.id=this_1_.company_id
inner join company_measures companymea2_ on
this_.id=companymea2_.company_id
inner join company_descr ces3_ on this_.id=ces3_.company_id
where this_1_.tag_id = 7 and this_.id>50000000
order by this_.id asc
limit 1000
;
Total runtime: 14088.942 ms
(plan is attached)

Best regards, Vitalii Tymchyshyn

Attachment Content-Type Size
innerplan.txt text/plain 1.5 KB

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Віталій Тимчишин <tivv00(at)gmail(dot)com>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: Planner not using column limit specified for one column for another column equal to first
Date: 2010-04-16 14:19:47
Message-ID: 8534.1271427587@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

=?KOI8-U?B?96bUwcymyiD0yc3eydvJzg==?= <tivv00(at)gmail(dot)com> writes:
> I've thought and someone in this list've told me that this should be done
> automatically.

No, that's not true. We do make deductions about transitive equalities,
ie, given WHERE a=b AND b=c the planner will infer a=c and use that if
it's helpful. We don't make deductions about inequalities such as a>c.
In theory there's enough information available to do so, but overall
trying to do that would probably waste more cycles than it would save.
You'd need a lot of expensive new planner infrastructure, and in the
vast majority of queries it wouldn't produce anything very helpful.

As was pointed out, even if we had such logic it wouldn't apply in this
example, because the equality conditions aren't real equalities but
OUTER JOIN conditions.

regards, tom lane


From: Yeb Havinga <yebhavinga(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Віталій Тимчишин <tivv00(at)gmail(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Planner not using column limit specified for one column for another column equal to first
Date: 2010-04-16 15:33:56
Message-ID: 4BC88364.9050404@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Tom Lane wrote:
> =?KOI8-U?B?96bUwcymyiD0yc3eydvJzg==?= <tivv00(at)gmail(dot)com> writes:
>
>> I've thought and someone in this list've told me that this should be done
>> automatically.
>>
>
> No, that's not true. We do make deductions about transitive equalities,
> ie, given WHERE a=b AND b=c the planner will infer a=c and use that if
> it's helpful. We don't make deductions about inequalities such as a>c.
> In theory there's enough information available to do so, but overall
> trying to do that would probably waste more cycles than it would save.
> You'd need a lot of expensive new planner infrastructure, and in the
> vast majority of queries it wouldn't produce anything very helpful.
>
New expensive planner infrastructure to support from a>b and b>c infer
a>c, yes.

But I wonder if something like Leibniz's principle of identity holds for
members of the same equivalence class, e.g. like if x,y are both members
of the same EC, then for every predicate P, P(x) iff P(y). Probably not
for every predicate (like varno = 2 or attname='x'), but for the query
evaluation, the object denoted by the variables are the same, since that
is the standard meaning of the = operator. I cannot think of any
standard (btree) operator where 'Leibniz' would fail in this case.

regards,
Yeb Havinga


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Yeb Havinga <yebhavinga(at)gmail(dot)com>
Cc: Віталій Тимчишин <tivv00(at)gmail(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Planner not using column limit specified for one column for another column equal to first
Date: 2010-04-16 15:45:48
Message-ID: 10157.1271432748@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Yeb Havinga <yebhavinga(at)gmail(dot)com> writes:
> New expensive planner infrastructure to support from a>b and b>c infer
> a>c, yes.

> But I wonder if something like Leibniz's principle of identity holds for
> members of the same equivalence class, e.g. like if x,y are both members
> of the same EC, then for every predicate P, P(x) iff P(y).

This could only be assumed to apply for predicates constructed from
operators that are in the equivalence operator's btree opfamily.
Now, that would certainly be a large enough set of cases to sometimes
give useful results --- but I stand by the opinion that it wouldn't
win often enough to justify the added planner overhead.

regards, tom lane


From: Віталій Тимчишин <tivv00(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: Planner not using column limit specified for one column for another column equal to first
Date: 2010-04-17 04:12:18
Message-ID: n2o331e40661004162112u52e89c27i77dbade091133df4@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

16 квітня 2010 р. 17:19 Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> написав:

> =?KOI8-U?B?96bUwcymyiD0yc3eydvJzg==?= <tivv00(at)gmail(dot)com> writes:
> > I've thought and someone in this list've told me that this should be done
> > automatically.
>
> As was pointed out, even if we had such logic it wouldn't apply in this
> example, because the equality conditions aren't real equalities but
> OUTER JOIN conditions.
>
>
In this case you can copy condition to "ON" condition, not to where cause
and this would work correct, e.g. "select something from a join b on a.x=b.y
where a.x > n" <=> "select something from a join b on a.x=b.y and b.y > n
where a.x > n".

As of making planner more clever, may be it is possible to introduce
division on "fast queries" and "long queries", so that if after fast
planning cost is greater then some configurable threshold, advanced planning
techniques (or settings) are used. As far as I have seen in this list, many
techniques are not used simply because they are too complex and could make
planning take too much time for really fast queries, but they are vital for
long ones.
Also same (or similar) threshold could be used to enable replanning for each
run of prepared query - also an often complaint is that planned query is not
that fast as is could be.

--
Best regards,
Vitalii Tymchyshyn


From: Matthew Wakeling <matthew(at)flymine(dot)org>
To: Віталій Тимчишин <tivv00(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Planner not using column limit specified for one column for another column equal to first
Date: 2010-04-19 09:47:37
Message-ID: alpine.DEB.2.00.1004191043280.27544@aragorn.flymine.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

On Sat, 17 Apr 2010, Віталій Тимчишин wrote:
> As of making planner more clever, may be it is possible to introduce
> division on "fast queries" and "long queries", so that if after fast
> planning cost is greater then some configurable threshold, advanced planning
> techniques (or settings) are used. As far as I have seen in this list, many
> techniques are not used simply because they are too complex and could make
> planning take too much time for really fast queries, but they are vital for
> long ones.

+1. That's definitely a good idea in my view. The query optimiser I wrote
(which sits on top of Postgres and makes use of materialised views to
speed up queries) uses a similar approach - it expends effort proportional
to the estimated cost of the query, as reported by EXPLAIN.

Matthew

--
To most people, solutions mean finding the answers. But to chemists,
solutions are things that are still all mixed up.