Re: Query optimization problem

Lists:	pgsql-hackers

From:	Zotov <zotov(at)oe-it(dot)ru>
To:	pgsql-hackers(at)postgresql(dot)org
Subject:	Query optimization problem
Date:	2010-07-20 05:57:06
Message-ID:	4C453AB2.1000108@oe-it.ru

I wrote to
pgsql-bugs(at)postgresql(dot)org,
they told me to write to
pgsql-performance(at)postgresql(dot)org,
and they told me to write here.

I don't want to know how to optimize the query myself (I know how); I think
the planner should do it.

I have a query:

SELECT d1.ID, d2.ID
FROM DocPrimary d1
JOIN DocPrimary d2 ON d2.BasedOn=d1.ID
WHERE (d1.ID=234409763) or (d2.ID=234409763)

I think the QO (Query Optimizer) could make it faster (right now it does a
seq scan, and on a million records it takes 7 sec).
This query is very fast (it uses indexes) and is easy to derive from the first one:

SELECT d1.ID, d2.ID
FROM DocPrimary d1
JOIN DocPrimary d2 ON d2.BasedOn=d1.ID
WHERE (d2.BasedOn=234409763) or (d2.ID=234409763)

The plans below were created on a table without the million rows of data, so
don't look at the execution time.

----------------------
Slow Query
----------------------
test=# EXPLAIN (ANALYZE on, VERBOSE on, COSTS on, BUFFERS off) SELECT d1.ID, d2.ID
test-# FROM DocPrimary d1
test-# JOIN DocPrimary d2 ON d2.BasedOn=d1.ID
test-# WHERE (d1.ID=234409763) or (d2.ID=234409763);
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=58.15..132.35 rows=2 width=8) (actual time=0.007..0.007 rows=0 loops=1)
   Output: d1.id, d2.id
   Hash Cond: (d2.basedon = d1.id)
   Join Filter: ((d1.id = 234409763) OR (d2.id = 234409763))
   ->  Seq Scan on public.docprimary d2  (cost=0.00..31.40 rows=2140 width=8) (actual time=0.002..0.002 rows=0 loops=1)
         Output: d2.id, d2.basedon
   ->  Hash  (cost=31.40..31.40 rows=2140 width=4) (never executed)
         Output: d1.id
         ->  Seq Scan on public.docprimary d1  (cost=0.00..31.40 rows=2140 width=4) (never executed)
               Output: d1.id

------------------
Fast Query
------------------
test=# EXPLAIN (ANALYZE on, VERBOSE on, COSTS on, BUFFERS off) SELECT d1.ID, d2.ID
test-# FROM DocPrimary d1
test-# JOIN DocPrimary d2 ON d2.BasedOn=d1.ID
test-# WHERE (d2.BasedOn=234409763) or (d2.ID=234409763);
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=8.60..58.67 rows=12 width=8) (actual time=0.026..0.026 rows=0 loops=1)
   Output: d1.id, d2.id
   ->  Bitmap Heap Scan on public.docprimary d2  (cost=8.60..19.31 rows=12 width=8) (actual time=0.023..0.023 rows=0 loops=1)
         Output: d2.id, d2.basedon
         Recheck Cond: ((d2.basedon = 234409763) OR (d2.id = 234409763))
         ->  BitmapOr  (cost=8.60..8.60 rows=12 width=0) (actual time=0.018..0.018 rows=0 loops=1)
               ->  Bitmap Index Scan on basedon_idx  (cost=0.00..4.33 rows=11 width=0) (actual time=0.008..0.008 rows=0 loops=1)
                     Index Cond: (d2.basedon = 234409763)
               ->  Bitmap Index Scan on id_pk  (cost=0.00..4.26 rows=1 width=0) (actual time=0.003..0.003 rows=0 loops=1)
                     Index Cond: (d2.id = 234409763)
   ->  Index Scan using id_pk on public.docprimary d1  (cost=0.00..3.27 rows=1 width=4) (never executed)
         Output: d1.id, d1.basedon
         Index Cond: (d1.id = d2.basedon)

--------------------------------------------
PGver: PostgreSQL 9.0b x86
OS: Win7 x64

---------------------
Create table query:
---------------------

CREATE TABLE docprimary
(
id integer NOT NULL,
basedon integer,
CONSTRAINT id_pk PRIMARY KEY (id)
);
CREATE INDEX basedon_idx
ON docprimary
USING btree
(basedon);

--
Best regards,
Roman Vladimirovich Zotov
Head of the Tooling Department
ZAO "NPO Konsultant"
10 Palekhskaya St., Ivanovo
tel./fax: (4932) 41-01-21
mailto: zotov(at)oe-it(dot)ru

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Zotov <zotov(at)oe-it(dot)ru>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Query optimization problem
Date:	2010-07-20 14:31:01
Message-ID:	AANLkTi=S6GjWxeSnoHOeL4ciBA2LcbR6eEZAyEUMFGLM@mail.gmail.com

On Tue, Jul 20, 2010 at 1:57 AM, Zotov <zotov(at)oe-it(dot)ru> wrote:
> I wrote to
> pgsql-bugs(at)postgresql(dot)org,
> they told me to write to
> pgsql-performance(at)postgresql(dot)org,
> and they told me to write here.
>
> I don't want to know how to optimize the query myself (I know how); I think
> the planner should do it.

According to the EXPLAIN ANALYZE output, your "slow" query is
executing in 0.007 ms, and your "fast" query is executing in 0.026 ms
(i.e. not as quickly as the slow query). Since you mention that it
takes 7 s further down, I suspect this is not the real EXPLAIN ANALYZE
output on the real data that you're having a problem with. You might
have better luck if you post the actual EXPLAIN ANALYZE output here.
Incidentally, sorry for not responding sooner to your private email -
I was on vacation last week. But please do keep all replies on-list
so that everyone can comment.

All that having been said, I think the issue here is that the query
planner isn't inferring that d1.ID=<some constant> implies
d2.BasedOn=<some constant>, even though there's a join clause
d2.BasedOn=d1.ID. I'm not really sure why it isn't doing that... I suspect
Tom Lane is the only person who can comment intelligently on that, and
he's away this week (but if anyone else has an idea, feel free to jump in...).

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

From:	Dimitri Fontaine <dfontaine(at)hi-media(dot)com>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Zotov <zotov(at)oe-it(dot)ru>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Query optimization problem
Date:	2010-07-20 15:23:07
Message-ID:	87wrsq9nz8.fsf@hi-media-techno.com

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> All that having been said, I think the issue here is that the query
> planner isn't inferring that d1.ID=<some constant> implies
> d2.BasedOn=<some constant>, even though there's a join clause
> d2.BasedOn=d1.ID.

I think that's what the Equivalence Classes are for. Or at least that's
what they do in my head, not necessarily in the code.

The specific diff between the two queries is:

JOIN DocPrimary d2 ON d2.BasedOn=d1.ID
- WHERE (d1.ID=234409763) or (d2.ID=234409763)
+ WHERE (d2.BasedOn=234409763) or (d2.ID=234409763)

So the OP would appreciate that the planner is able to consider applying
the restriction on d2.BasedOn rather than d1.ID given that d2.BasedOn is
the same thing as d1.ID, from the JOIN.

I have no idea if Equivalence Classes are where to look for this, and if
they're meant to extend up to there, and if that's something possible or
wise to implement, though.

Regards,
--
dim

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Dimitri Fontaine <dfontaine(at)hi-media(dot)com>
Cc:	Zotov <zotov(at)oe-it(dot)ru>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Query optimization problem
Date:	2010-07-20 15:51:34
Message-ID:	AANLkTi=v9-y78g1mP6AG2iQCcwK0vkjZCcWE5fgr+shP@mail.gmail.com

On Tue, Jul 20, 2010 at 11:23 AM, Dimitri Fontaine
<dfontaine(at)hi-media(dot)com> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> All that having been said, I think the issue here is that the query
>> planner isn't inferring that d1.ID=<some constant> implies
>> d2.BasedOn=<some constant>, even though there's a join clause
>> d2.BasedOn=d1.ID.
>
> I think that's what the Equivalence Classes are for. Or at least that's
> what they do in my head, not forcibly in the code.
>
> The specific diff between the two queries is :
>
> JOIN DocPrimary d2 ON d2.BasedOn=d1.ID
> - WHERE (d1.ID=234409763) or (d2.ID=234409763)
> + WHERE (d2.BasedOn=234409763) or (d2.ID=234409763)
>
> So the OP would appreciate that the planner is able to consider applying
> the restriction on d2.BasedOn rather than d1.ID given that d2.BasedOn is
> the same thing as d1.ID, from the JOIN.
>
> I have no idea if Equivalence Classes are where to look for this, and if
> they're meant to extend up to there, and if that's something possible or
> wise to implement, though.

I was thinking of the equivalence class machinery as well. I think
the OR clause may be the problem. If you just had d1.ID=constant, I
think it would infer that d1.ID, d2.BasedOn, and the constant formed
an equivalence class. But here you obviously can't smash the constant
into the equivalence class, and I think the planner's not smart enough
to consider other ways of applying an equivalent qual. In fact, I
have some recollection that Tom has explicitly rejected adding support
for this in the past, on the grounds that the computation would be too
expensive for the number of queries it would help. Still, it seems to
keep coming up.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

From:	Dimitri Fontaine <dfontaine(at)hi-media(dot)com>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Zotov <zotov(at)oe-it(dot)ru>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Query optimization problem
Date:	2010-07-20 19:33:51
Message-ID:	m2iq4a0wyo.fsf@hi-media.com

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> On Tue, Jul 20, 2010 at 11:23 AM, Dimitri Fontaine
> <dfontaine(at)hi-media(dot)com> wrote:
>> JOIN DocPrimary d2 ON d2.BasedOn=d1.ID
>> - WHERE (d1.ID=234409763) or (d2.ID=234409763)
>> + WHERE (d2.BasedOn=234409763) or (d2.ID=234409763)
>
> I was thinking of the equivalence class machinery as well. I think
> the OR clause may be the problem. If you just had d1.ID=constant, I
> think it would infer that d1.ID, d2.BasedOn, and the constant formed
> an equivalence class. But here you obviously can't smash the constant
> into the equivalence class, and I think the planner's not smart enough
> to consider other ways of applying an equivalent qual. In fact, I
> have some recollection that Tom has explicitly rejected adding support
> for this in the past, on the grounds that the computation would be too
> expensive for the number of queries it would help. Still, it seems to
> keep coming up.

Well what I'm thinking now could have nothing to do with how the code
works. I'd have to check, but well, it's easier to write this mail and
get the chance to have you wonder :)

So, the JOIN condition teaches us that d2.BasedOn=d1.ID, and the OP
would want the planner to derive that (d1.ID=234409763) is the same
thing as (d2.BasedOn=234409763). I guess it would make sense to produce
plans with both formulations and pick one based on the costs.

Now, whether it makes sense to generate that many more plans to analyze in
the general case, I have no idea. But given only one join and only
one WHERE clause where the equivalence applies…

Regards,
--
dim

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Dimitri Fontaine <dfontaine(at)hi-media(dot)com>
Cc:	Zotov <zotov(at)oe-it(dot)ru>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Query optimization problem
Date:	2010-07-21 01:00:00
Message-ID:	AANLkTi=bP=UuhEO8=m2ACKHrgDHMVGyHkMH1eS4xtpr9@mail.gmail.com

On Tue, Jul 20, 2010 at 3:33 PM, Dimitri Fontaine
<dfontaine(at)hi-media(dot)com> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> On Tue, Jul 20, 2010 at 11:23 AM, Dimitri Fontaine
>> <dfontaine(at)hi-media(dot)com> wrote:
>>> JOIN DocPrimary d2 ON d2.BasedOn=d1.ID
>>> - WHERE (d1.ID=234409763) or (d2.ID=234409763)
>>> + WHERE (d2.BasedOn=234409763) or (d2.ID=234409763)
>>
>> I was thinking of the equivalence class machinery as well. I think
>> the OR clause may be the problem. If you just had d1.ID=constant, I
>> think it would infer that d1.ID, d2.BasedOn, and the constant formed
>> an equivalence class. But here you obviously can't smash the constant
>> into the equivalence class, and I think the planner's not smart enough
>> to consider other ways of applying an equivalent qual. In fact, I
>> have some recollection that Tom has explicitly rejected adding support
>> for this in the past, on the grounds that the computation would be too
>> expensive for the number of queries it would help. Still, it seems to
>> keep coming up.
>
> Well what I'm thinking now could have nothing to do with how the code
> works. I'd have to check, but well, it's easier to write this mail and
> get the chance to have you wonder :)
>
> So, the JOIN condition teaches us that d2.BasedOn=d1.ID, and the OP
> would want the planner to derive that (d1.ID=234409763) is the same
> thing as (d2.BasedOn=234409763). I guess it would make sense to produce
> plans with both the writings and pick one based on the costs.
>
> Now, does it make sense to generate this many more plans to analyze in
> the general case, I have no idea about. But given only one join and only
> one WHERE clause where the Equivalent applies…

It seems like deciding which rel to apply the filter condition to
would be a fairly expensive optimization. Perhaps we could recognize
the special case where substituting another member of the equivalence
class allows the qual to be pushed down where it otherwise couldn't
be.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

From:	Sam Mason <sam(at)samason(dot)me(dot)uk>
To:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Query optimization problem
Date:	2010-07-21 13:40:47
Message-ID:	20100721134047.GE7584@samason.me.uk

On Tue, Jul 20, 2010 at 09:57:06AM +0400, Zotov wrote:
> SELECT d1.ID, d2.ID
> FROM DocPrimary d1
> JOIN DocPrimary d2 ON d2.BasedOn=d1.ID
> WHERE (d1.ID=234409763) or (d2.ID=234409763)

You could try rewriting it to:

SELECT d1.ID, d2.ID
FROM DocPrimary d1
JOIN DocPrimary d2 ON d2.BasedOn=d1.ID
WHERE d1.ID=234409763
UNION
SELECT d1.ID, d2.ID
FROM DocPrimary d1
JOIN DocPrimary d2 ON d2.BasedOn=d1.ID
WHERE d2.ID=234409763

This should have the same semantics as the original query. I don't
believe PG knows how to do a rewrite like this at the moment.
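
A minimal sketch for checking that the UNION rewrite returns the same rows as the OR query. It assumes SQLite (via Python's sqlite3 module) as a stand-in engine and a handful of made-up rows, so it says nothing about PostgreSQL plans or timings. UNION removes duplicate rows, which is harmless here because id is the primary key and every joined row therefore carries a distinct d2.id.

```python
import sqlite3

# In-memory stand-in for the OP's table; the six rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE docprimary (id INTEGER PRIMARY KEY, basedon INTEGER);
CREATE INDEX basedon_idx ON docprimary (basedon);
INSERT INTO docprimary (id, basedon) VALUES
  (1, NULL), (2, 1), (3, 1), (4, 2), (5, 4), (6, NULL);
""")

OR_QUERY = """
SELECT d1.id, d2.id FROM docprimary d1
JOIN docprimary d2 ON d2.basedon = d1.id
WHERE d1.id = :k OR d2.id = :k
"""

UNION_QUERY = """
SELECT d1.id, d2.id FROM docprimary d1
JOIN docprimary d2 ON d2.basedon = d1.id
WHERE d1.id = :k
UNION
SELECT d1.id, d2.id FROM docprimary d1
JOIN docprimary d2 ON d2.basedon = d1.id
WHERE d2.id = :k
"""

K = 2  # hypothetical document id to look up
or_rows = sorted(conn.execute(OR_QUERY, {"k": K}).fetchall())
union_rows = sorted(conn.execute(UNION_QUERY, {"k": K}).fetchall())
print(or_rows == union_rows)
```

On a real installation the same comparison could be run against PostgreSQL itself, with EXPLAIN ANALYZE on both forms to see whether the UNION branches use the indexes.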

--
Sam http://samason.me.uk/

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Dimitri Fontaine <dfontaine(at)hi-media(dot)com>, Zotov <zotov(at)oe-it(dot)ru>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Query optimization problem
Date:	2010-07-27 17:37:02
Message-ID:	9993.1280252222@sss.pgh.pa.us

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> On Tue, Jul 20, 2010 at 11:23 AM, Dimitri Fontaine
> <dfontaine(at)hi-media(dot)com> wrote:
>> The specific diff between the two queries is :
>>
>> JOIN DocPrimary d2 ON d2.BasedOn=d1.ID
>> - WHERE (d1.ID=234409763) or (d2.ID=234409763)
>> + WHERE (d2.BasedOn=234409763) or (d2.ID=234409763)
>>
>> So the OP would appreciate that the planner is able to consider applying
>> the restriction on d2.BasedOn rather than d1.ID given that d2.BasedOn is
>> the same thing as d1.ID, from the JOIN.
>>
>> I have no idea if Equivalence Classes are where to look for this, and if
>> they're meant to extend up to there, and if that's something possible or
>> wise to implement, though.

> I was thinking of the equivalence class machinery as well. I think
> the OR clause may be the problem. If you just had d1.ID=constant, I
> think it would infer that d1.ID, d2.BasedOn, and the constant formed
> an equivalence class.

Right. Because of the OR, it is *not* possible to conclude that
d2.basedon is always equal to 234409763, which is the implication of
putting them into an equivalence class.

In the example, we do have d1.id and d2.basedon grouped in an
equivalence class. So in principle you could substitute d1.id into the
WHERE clause in place of d2.basedon, once you'd checked that it was
being used with an operator that's compatible with the specific
equivalence class (ie it's in one of the eclass's opfamilies, I think).
The problem is to recognize that such a rewrite would be a win --- it
could just as easily be a big loss.

Even if we understood how to direct the rewriting process, I'm really
dubious that it would win often enough to justify the added planning
time. The particular problem here seems narrow enough that solving it
on the client side is probably a whole lot easier and cheaper than
trying to get the planner to do it.

regards, tom lane

From:	Dimitri Fontaine <dfontaine(at)hi-media(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Robert Haas <robertmhaas(at)gmail(dot)com>, Zotov <zotov(at)oe-it(dot)ru>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Query optimization problem
Date:	2010-07-28 07:45:28
Message-ID:	87aapcm4mf.fsf@hi-media-techno.com

Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:
> In the example, we do have d1.id and d2.basedon grouped in an
> equivalence class. So in principle you could substitute d1.id into the
> WHERE clause in place of d2.basedon, once you'd checked that it was
> being used with an operator that's compatible with the specific
> equivalence class (ie it's in one of the eclass's opfamilies, I think).
> The problem is to recognize that such a rewrite would be a win --- it
> could just as easily be a big loss.

Ok, that was my feeling too.

> Even if we understood how to direct the rewriting process, I'm really
> dubious that it would win often enough to justify the added planning
> time. The particular problem here seems narrow enough that solving it
> on the client side is probably a whole lot easier and cheaper than
> trying to get the planner to do it.

My overly naive, uneducated idea here would be to produce both plans
and let the planner evaluate their respective costs. Maybe that's what
you mean here by "how to direct the rewriting process". Then we don't
want to generate too many useless plans when there are lots of eclasses
around.

This brings back the idea of somehow weighting the optimiser effort
put into "solving" a query plan. It's like gcc, where we can use different
effort targets and don't know for sure beforehand whether -O3 will
produce faster code than -O2; all we know is that it will try harder.

Is it possible to imagine having a plan_eclass_permutations setting,
defaulting to false, that would activate the behavior discussed here? Ok,
I'm not sure what form such a setting should take, but clearly there's a
need to be able to influence the optimiser effort.

Regards,
--
dim

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Dimitri Fontaine <dfontaine(at)hi-media(dot)com>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Zotov <zotov(at)oe-it(dot)ru>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Query optimization problem
Date:	2010-07-28 10:44:06
Message-ID:	AANLkTi=5pNrtgMRvOywQLd4Jg-Pya89s22pycZxUYvZo@mail.gmail.com

On Wed, Jul 28, 2010 at 3:45 AM, Dimitri Fontaine
<dfontaine(at)hi-media(dot)com> wrote:
>> Even if we understood how to direct the rewriting process, I'm really
>> dubious that it would win often enough to justify the added planning
>> time. The particular problem here seems narrow enough that solving it
>> on the client side is probably a whole lot easier and cheaper than
>> trying to get the planner to do it.
>
> My overly naive uneducated idea here would be to produce both the plans
> and let the planner evaluate their respective costs. Maybe that's what
> you mean here by "how to direct the rewriting process". Then we don't
> want to generate too many useless plans when you have lots of eclass
> around.

The way the planner is set up, you'd have to plan with qual A, then
repeat the entire process with qual B, and then just for good measure
repeat the process with both quals A and B. ISTM you'd triple the
planning time if there were even just one case of this in a particular
query. If you have different ways of generating the same output for a
given rel, you can just throw them all into a bucket and let the
planner work it out. But here you want to have different paths for
the same relation that generate *different output*, and the planner
doesn't understand that concept.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

From:	Dimitri Fontaine <dfontaine(at)hi-media(dot)com>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Zotov <zotov(at)oe-it(dot)ru>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Query optimization problem
Date:	2010-07-28 10:55:56
Message-ID:	878w4vlvsz.fsf@hi-media-techno.com

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> But here you want to have different paths for
> the same relation that generate *different output*, and the planner
> doesn't understand that concept.

Sorry? I thought what the Equivalence Class provides is the proof that
using this qualification or another will *not* affect the output.
--
dim

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Dimitri Fontaine <dfontaine(at)hi-media(dot)com>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Zotov <zotov(at)oe-it(dot)ru>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Query optimization problem
Date:	2010-07-28 11:13:44
Message-ID:	AANLkTikGSh7jaEF3EoozKb56wV2uEt=Df0CFjfrABP3r@mail.gmail.com

On Wed, Jul 28, 2010 at 6:55 AM, Dimitri Fontaine
<dfontaine(at)hi-media(dot)com> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> But here you want to have different paths for
>> the same relation that generate *different output*, and the planner
>> doesn't understand that concept.
>
> Sorry? I thought what Equivalence Class provides is the "proving" that
> using this qualification or another will *not* affect the output.

In a query like...

SELECT d1.ID, d2.ID
FROM DocPrimary d1
JOIN DocPrimary d2 ON d2.BasedOn=d1.ID
WHERE (d1.ID=234409763) or (d2.ID=234409763)

...you're going to scan d1, scan d2, and then join the results. The
scan of d1 is going to produce different results depending on whether
or not you evaluate d1.ID=234409763, and the scan of d2 is going to
produce different results depending on whether or not you evaluate
d2.BasedOn=234409763.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

From:	Yeb Havinga <yebhavinga(at)gmail(dot)com>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Dimitri Fontaine <dfontaine(at)hi-media(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Zotov <zotov(at)oe-it(dot)ru>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Query optimization problem
Date:	2010-07-28 11:24:56
Message-ID:	4C501388.3010301@gmail.com

Robert Haas wrote:
> On Wed, Jul 28, 2010 at 6:55 AM, Dimitri Fontaine
> <dfontaine(at)hi-media(dot)com> wrote:
>
>> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>>
>>> But here you want to have different paths for
>>> the same relation that generate *different output*, and the planner
>>> doesn't understand that concept.
>>>
>> Sorry? I thought what Equivalence Class provides is the "proving" that
>> using this qualification or another will *not* affect the output.
>>
>
> In a query like...
>
> SELECT d1.ID, d2.ID
> FROM DocPrimary d1
> JOIN DocPrimary d2 ON d2.BasedOn=d1.ID
> WHERE (d1.ID=234409763) or (d2.ID=234409763)
>
> ...you're going to scan d1, scan d2, and then join the results. The
> scan of d1 is going to produce different results depending on whether
> you evaluate or not d1.ID=234409763, and the scan of d2 is going to
> produce different results depending on whether or not you evaluate
> d2.BasedOn=234409763.
>
Wouldn't it be relatively easy to rewrite the filter expression by
adding expressions, instead of replacing constants, in the disjunctive
case, so the example at hand would become:

WHERE (d1.ID=234409763) or (d2.ID=234409763)
AND (d2.BasedOn=234409763) or (d2.ID=234409763)

regards,
Yeb Havinga

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Yeb Havinga <yebhavinga(at)gmail(dot)com>
Cc:	Dimitri Fontaine <dfontaine(at)hi-media(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Zotov <zotov(at)oe-it(dot)ru>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Query optimization problem
Date:	2010-07-28 11:30:48
Message-ID:	AANLkTi=UTjEsnjcfz=C2kkf65ufq7-6+fQ97yMKeMWyk@mail.gmail.com

On Wed, Jul 28, 2010 at 7:24 AM, Yeb Havinga <yebhavinga(at)gmail(dot)com> wrote:
>>> Sorry? I thought what Equivalence Class provides is the "proving" that
>>> using this qualification or another will *not* affect the output.
>>
>> In a query like...
>>
>> SELECT d1.ID, d2.ID
>> FROM DocPrimary d1
>> JOIN DocPrimary d2 ON d2.BasedOn=d1.ID
>> WHERE (d1.ID=234409763) or (d2.ID=234409763)
>>
>> ...you're going to scan d1, scan d2, and then join the results. The
>> scan of d1 is going to produce different results depending on whether
>> you evaluate or not d1.ID=234409763, and the scan of d2 is going to
>> produce different results depending on whether or not you evaluate
>> d2.BasedOn=234409763.
>
> Wouldn't it be relatively easy, to rewrite the filter expression by adding
> expressions, instead of replacing constants, in the disjunctive case, so the
> example at hand would become:
>
> WHERE (d1.ID=234409763) or (d2.ID=234409763)
> AND (d2.BasedOn=234409763) or (d2.ID=234409763)

Yeah, that could be done, but it's not necessarily a win from a
performance standpoint.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

From:	Dimitri Fontaine <dfontaine(at)hi-media(dot)com>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Zotov <zotov(at)oe-it(dot)ru>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Query optimization problem
Date:	2010-07-28 11:31:39
Message-ID:	871vanlu5g.fsf@hi-media-techno.com

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> SELECT d1.ID, d2.ID
> FROM DocPrimary d1
> JOIN DocPrimary d2 ON d2.BasedOn=d1.ID
> WHERE (d1.ID=234409763) or (d2.ID=234409763)
>
> ...you're going to scan d1, scan d2, and then join the results. The
> scan of d1 is going to produce different results depending on whether
> you evaluate or not d1.ID=234409763, and the scan of d2 is going to
> produce different results depending on whether or not you evaluate
> d2.BasedOn=234409763.

Well, I just realised you can't use d2.BasedOn in scanning d1 here. I
don't know what exactly I had in mind previously, but in any case, sorry
for the noise.

I hope the optimiser effort control idea still holds water nonetheless…
--
dim

From:	Yeb Havinga <yebhavinga(at)gmail(dot)com>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Dimitri Fontaine <dfontaine(at)hi-media(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Zotov <zotov(at)oe-it(dot)ru>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Query optimization problem
Date:	2010-07-28 12:02:54
Message-ID:	4C501C6E.6030108@gmail.com

Robert Haas wrote:
> On Wed, Jul 28, 2010 at 7:24 AM, Yeb Havinga <yebhavinga(at)gmail(dot)com> wrote:
>
>>>> Sorry? I thought what Equivalence Class provides is the "proving" that
>>>> using this qualification or another will *not* affect the output.
>>>>
>>> In a query like...
>>>
>>> SELECT d1.ID, d2.ID
>>> FROM DocPrimary d1
>>> JOIN DocPrimary d2 ON d2.BasedOn=d1.ID
>>> WHERE (d1.ID=234409763) or (d2.ID=234409763)
>>>
>>> ...you're going to scan d1, scan d2, and then join the results. The
>>> scan of d1 is going to produce different results depending on whether
>>> you evaluate or not d1.ID=234409763, and the scan of d2 is going to
>>> produce different results depending on whether or not you evaluate
>>> d2.BasedOn=234409763.
>>>
>> Wouldn't it be relatively easy, to rewrite the filter expression by adding
>> expressions, instead of replacing constants, in the disjunctive case, so the
>> example at hand would become:
>>
>> WHERE (d1.ID=234409763) or (d2.ID=234409763)
>> AND (d2.BasedOn=234409763) or (d2.ID=234409763)
>>
>
> Yeah, that could be done, but it's not necessarily a win from a
> performance standpoint.
>
Not necessarily a win, but on the test case there is no significant increase
in planning time. It somehow feels like a good idea to give the planner as
much information as possible, i.e. as many baserestrictinfos as possible
for each rel.

I forgot the parentheses earlier; the correct query is

SELECT d1.ID, d2.ID
FROM DocPrimary d1
JOIN DocPrimary d2 ON d2.BasedOn=d1.ID
WHERE ((d1.ID=234409763) or (d2.ID=234409763))
AND ((d2.BasedOn=234409763) or (d2.ID=234409763));

By doing this in the rewrite step, triple planning would be avoided. I
suspect that the cost of a copyObject of the expression plus an expression
tree mutator call during rewrite is negligible compared to planning time,
assuming this is minimal; in this particular case there doesn't seem to be
much difference in planning time between the three variants.
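
A minimal sketch for checking that the added conjunct is redundant, meaning that the original, the substituted, and the expanded WHERE clauses select the same rows once the join condition d2.BasedOn=d1.ID holds. It assumes SQLite (via Python's sqlite3 module) as a stand-in engine and made-up rows; it checks result sets only and says nothing about PostgreSQL planning or execution time.

```python
import sqlite3

# In-memory stand-in for the OP's table; the six rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE docprimary (id INTEGER PRIMARY KEY, basedon INTEGER);
CREATE INDEX basedon_idx ON docprimary (basedon);
INSERT INTO docprimary (id, basedon) VALUES
  (1, NULL), (2, 1), (3, 1), (4, 2), (5, 4), (6, NULL);
""")

BASE = """
SELECT d1.id, d2.id FROM docprimary d1
JOIN docprimary d2 ON d2.basedon = d1.id
WHERE {where}
"""

# The three WHERE clauses discussed in the thread.
VARIANTS = {
    "original": "(d1.id = :k) OR (d2.id = :k)",
    "substituted": "(d2.basedon = :k) OR (d2.id = :k)",
    "expanded": "((d1.id = :k) OR (d2.id = :k)) "
                "AND ((d2.basedon = :k) OR (d2.id = :k))",
}


def run(where, k):
    """Return the sorted result rows for one WHERE variant and one constant."""
    return sorted(conn.execute(BASE.format(where=where), {"k": k}).fetchall())


def variants_agree(k):
    """True if all three variants return identical rows for constant k."""
    results = [run(where, k) for where in VARIANTS.values()]
    return all(rows == results[0] for rows in results)


# Try every id in the sample data plus two values that match nothing.
all_agree = all(variants_agree(k) for k in range(0, 8))
print(all_agree)
```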

I ran the script below a number of times; in each group of three timings,
the third is the one with the expanded expression:

Time: 0.820 ms
Time: 0.859 ms
Time: 0.877 ms
---
Time: 0.617 ms
Time: 0.662 ms
Time: 0.737 ms
---
Time: 0.817 ms
Time: 0.766 ms
Time: 0.826 ms
---
Time: 0.638 ms
Time: 0.700 ms
Time: 0.706 ms
---
Time: 0.463 ms
Time: 0.847 ms
Time: 0.793 ms
---
Time: 0.629 ms
Time: 0.671 ms
Time: 0.703 ms

This was the script (on the relation and index supplied by the OP):

-- warm catalog
explain SELECT d1.ID, d2.ID
FROM DocPrimary d1
JOIN DocPrimary d2 ON d2.BasedOn=d1.ID
WHERE (d1.ID=234409763) or (d2.ID=234409763);

\timing

explain SELECT d1.ID, d2.ID
FROM DocPrimary d1
JOIN DocPrimary d2 ON d2.BasedOn=d1.ID
WHERE (d1.ID=234409763) or (d2.ID=234409763);

explain SELECT d1.ID, d2.ID
FROM DocPrimary d1
JOIN DocPrimary d2 ON d2.BasedOn=d1.ID
WHERE (d2.BasedOn=234409763) or (d2.ID=234409763);

explain SELECT d1.ID, d2.ID
FROM DocPrimary d1
JOIN DocPrimary d2 ON d2.BasedOn=d1.ID
WHERE ((d1.ID=234409763) or (d2.ID=234409763))
AND ((d2.BasedOn=234409763) or (d2.ID=234409763));

From:	Zotov <zotov(at)oe-it(dot)ru>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: *[SPAM](8.2)** Re: Query optimization problem
Date:	2010-07-28 12:19:29
Message-ID:	4C502051.8020001@oe-it.ru

On 27.07.2010 21:37, Tom Lane wrote:
> Right. Because of the OR, it is *not* possible to conclude that
> d2.basedon is always equal to 234409763, which is the implication of
> putting them into an equivalence class.
>
> In the example, we do have d1.id and d2.basedon grouped in an
> equivalence class. So in principle you could substitute d1.id into the
> WHERE clause in place of d2.basedon, once you'd checked that it was
> being used with an operator that's compatible with the specific
> equivalence class (ie it's in one of the eclass's opfamilies, I think).
> The problem is to recognize that such a rewrite would be a win --- it
> could just as easily be a big loss.
>
> Even if we understood how to direct the rewriting process, I'm really
> dubious that it would win often enough to justify the added planning
> time. The particular problem here seems narrow enough that solving it
> on the client side is probably a whole lot easier and cheaper than
> trying to get the planner to do it.
>
> regards, tom lane
>
So sorry, Tom. As I understand it, you are not going to do anything about
this. I think this problem illustrates a whole class of optimization problems.
Take this query:

SLOW

SELECT d1.ID, d2.ID
FROM DocPrimary d1
JOIN DocPrimary d2 ON d2.BasedOn=d1.ID
WHERE (d1.ID=234409763 and d2.BasedOn=d1.id) OR (d2.ID=234409763);

FAST

SELECT d1.ID, d2.ID
FROM DocPrimary d1
JOIN DocPrimary d2 ON d2.BasedOn=d1.ID
WHERE (d1.ID=234409763 and d2.BasedOn=234409763) OR (d2.ID=234409763);

If I write the constant explicitly, the fast plan is used. I think the query
optimizer can do this itself.
I hope you will do something to make this query faster.
Thank you.

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Yeb Havinga <yebhavinga(at)gmail(dot)com>
Cc:	Robert Haas <robertmhaas(at)gmail(dot)com>, Dimitri Fontaine <dfontaine(at)hi-media(dot)com>, Zotov <zotov(at)oe-it(dot)ru>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Query optimization problem
Date:	2010-07-28 14:34:56
Message-ID:	13738.1280327696@sss.pgh.pa.us

Yeb Havinga <yebhavinga(at)gmail(dot)com> writes:
> Robert Haas wrote:
>> On Wed, Jul 28, 2010 at 7:24 AM, Yeb Havinga <yebhavinga(at)gmail(dot)com> wrote:
>>> Wouldn't it be relatively easy, to rewrite the filter expression by adding
>>> expressions, instead of replacing constants, in the disjunctive case, so the
>>> example at hand would become:
>>>
>>> WHERE (d1.ID=234409763) or (d2.ID=234409763)
>>> AND (d2.BasedOn=234409763) or (d2.ID=234409763)

>> Yeah, that could be done, but it's not necessarily a win from a
>> performance standpoint.

> Not necessarily a win, but on the test case no significant increase in
> planning time.

The problem is that it could cost you a lot in execution time, because
of the useless extra filter conditions that will be applied. The
planner isn't going to notice that those conditions are redundant.
An even worse problem is that because it doesn't know that, it's going
to underestimate the combined selectivity of the two WHERE conditions,
resulting in drastic underestimates of the numbers of rows emitted,
possibly resulting in horribly bad plan choices that kill whatever
performance improvement you got at the bottom level.
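
A minimal sketch of the arithmetic behind this underestimate. It assumes the simple model in which the selectivities of ANDed clauses are multiplied as if the clauses were independent, and it uses a made-up selectivity of 0.001; the function name and the numbers are illustrative, not taken from the planner.

```python
def redundant_clause_estimate(total_rows, selectivity, copies):
    """Row estimate when `copies` redundant clauses are treated as independent.

    With copies=1 this is the true row count; each extra redundant copy
    multiplies the estimate by the selectivity once more.
    """
    return total_rows * selectivity ** copies


TOTAL_ROWS = 1_000_000  # table size mentioned by the OP
SELECTIVITY = 0.001     # hypothetical selectivity of one OR clause

true_rows = redundant_clause_estimate(TOTAL_ROWS, SELECTIVITY, 1)
estimated_rows = redundant_clause_estimate(TOTAL_ROWS, SELECTIVITY, 2)
print(f"true: about {true_rows:.0f} rows, estimated: about {estimated_rows:.0f} rows")
```

Under these assumptions one redundant clause shrinks the estimate by a factor of a thousand, which is the kind of error that can push the plan levels above toward a join method suited to far fewer rows than will actually arrive.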

What the EquivalenceClass machinery actually buys us is the ability to
deal with a set of partially-redundant possible filter conditions and
apply only enough of them to get a correct plan. As an example, if the
query has A=B and B=C, we could deduce A=C, but we don't want to apply
all three equality conditions at runtime. Instead we put all three
variables into an EC, and then there is logic to determine which of the
equality clauses implied by the EC should actually get applied where.
This avoids both the useless-checks-at-runtime problem and the problem
of wrong selectivity estimates.
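
A minimal sketch of the A=B, B=C example, using a union-find structure as a stand-in for equivalence classes. This is an illustration only, not PostgreSQL's EquivalenceClass implementation; the class and method names are hypothetical.

```python
class EquivalenceClasses:
    """Toy equivalence-class tracker built on union-find."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        """Return the representative of x's class, registering x if new."""
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def add_equality(self, a, b):
        """Record an equality that holds for every output row (a = b)."""
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self.parent[root_b] = root_a

    def classes(self):
        """Return each class as a sorted list of its members."""
        groups = {}
        for member in list(self.parent):
            groups.setdefault(self.find(member), []).append(member)
        return [sorted(members) for members in groups.values()]

    def minimal_clauses(self):
        """Return n-1 equalities per class, enough to imply all the others."""
        clauses = []
        for members in self.classes():
            clauses.extend(zip(members, members[1:]))
        return clauses


ec = EquivalenceClasses()
ec.add_equality("A", "B")
ec.add_equality("B", "C")
print(ec.classes())          # one class containing A, B and C
print(ec.minimal_clauses())  # two clauses instead of all three pairs
```

An equality that appears only under an OR, such as d1.ID=234409763 in the query from this thread, would not be passed to add_equality in this model, because it does not hold for every row.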

To do something like this without generating stupid plans, we'd need
some sort of generalized EC mechanism that could figure out which
variants of the clause made the most sense in different contexts.
Or maybe something else entirely --- but just generating a lot of
variants of a clause and throwing them all into the existing mechanism
is not workable.

regards, tom lane