Quick Links

Re: Improve OR conditions on joined columns.

Lists:	pgsql-hackers

From:	Jim Nasby <Jim(dot)Nasby(at)BlueTreble(dot)com>
To:	Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Improve OR conditions on joined columns.
Date:	2017-02-08 22:07:40
Message-ID:	7f70bd5a-5d16-e05c-f0b4-2fdfc8873489@BlueTreble.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

I've a client interested enough in $SUBJECT that they're willing to
offer a bounty on it. An example of the pain is (working example attached):

create temp view denorm as
select f.*, d1.t t1, d2.t t2
from fact f
left join dim d1 on f1=d1.id
left join dim d2 on f2=d2.id
;

-- Fast
explain analyze select count(*) from denorm where 1 in (f1,f2);
explain analyze select count(*) from denorm where '1' in (t1);

-- Slow
explain analyze select count(*) from denorm where '1' in (t1,t2);

They currently work around this by doing a union:

select ... from denorm where t1 = '1'
union
select ... from denorm where t2 = '1'

... or depending on needs using IN instead of =.

AFAICT this can be transformed into a UNION (not all) if dim.id is
unique. Does the upper planner pathification make this any easier?
There's another transform using arrays that's possible as well (see
attached example); I believe that would work regardless of uniqueness.

Just to be clear; the OR by itself is not a problem (as shown by the
first fast query); it's the OR with the JOIN that's a problem.
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)

Attachment	Content-Type	Size
test.sql	text/plain	1.5 KB

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Jim Nasby <Jim(dot)Nasby(at)BlueTreble(dot)com>
Cc:	Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Improve OR conditions on joined columns.
Date:	2017-02-08 23:54:43
Message-ID:	1397.1486598083@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Jim Nasby <Jim(dot)Nasby(at)BlueTreble(dot)com> writes:
> AFAICT this can be transformed into a UNION (not all) if dim.id is
> unique. Does the upper planner pathification make this any easier?

What I did in 9.6 is a first step. The next step, I think, is to
replace prepunion.c with something that can consider more than one
implementation path for a union.

Although ... actually, that may not be the bottleneck for what you're
after. The issue here is not so much discovering a clever plan for
a union as realizing that the query could be cast as a union in the
first place.

Maybe it'd be better to imagine this as something closer to planagg.c,
that is it knows how to apply a specific high-level optimization that
might or might not be a win, so it builds a path describing that and sees
if it looks cheaper than a path done the normal way. The fact that
we could even build a path describing a union is something that wasn't
there before 9.6, but maybe there's enough infrastructure for that now.

> There's another transform using arrays that's possible as well (see
> attached example); I believe that would work regardless of uniqueness.

That doesn't look terribly promising for automated application.
And I think it's really dependent on the exact shape of the OR
clause, which is an unpleasant limitation. Considering the amount
of work this will take to do at all, you'd want it to be pretty
general I think. I'm imagining something that would look for an
OR in which each clause referenced only one rel, then if it can
identify a way to re-unique-ify the result, split into a UNION
with an arm for each rel used in the OR. The nature of the
conditions in each OR arm don't seem to matter ... though probably
you'd have to insist on there not being volatile conditions anywhere
in the query, because they'd get evaluated too many times.

regards, tom lane

From:	Jim Nasby <Jim(dot)Nasby(at)BlueTreble(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Improve OR conditions on joined columns (common star schema problem)
Date:	2017-02-10 00:50:02
Message-ID:	dfa004d8-6b58-d006-bf90-f2d854838e2c@BlueTreble.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 2/8/17 5:54 PM, Tom Lane wrote:
> Although ... actually, that may not be the bottleneck for what you're
> after. The issue here is not so much discovering a clever plan for
> a union as realizing that the query could be cast as a union in the
> first place.

Right; their current workaround is to manually write a UNION. That's
*significantly* faster than the OR.

> Maybe it'd be better to imagine this as something closer to planagg.c,
> that is it knows how to apply a specific high-level optimization that
> might or might not be a win, so it builds a path describing that and sees
> if it looks cheaper than a path done the normal way. The fact that
> we could even build a path describing a union is something that wasn't
> there before 9.6, but maybe there's enough infrastructure for that now.

It's encouraging that there's at least a template to follow there... If
there is missing infrastructure, is it likely to be major? My uninformed
guess would be that the pathification was the major hurdle.

>> There's another transform using arrays that's possible as well (see
>> attached example); I believe that would work regardless of uniqueness.
>
> That doesn't look terribly promising for automated application.
> And I think it's really dependent on the exact shape of the OR
> clause, which is an unpleasant limitation. Considering the amount

Not sure what you mean by shape; do you mean whether the OR conditions
are rels vs Consts or Vars?

> of work this will take to do at all, you'd want it to be pretty
> general I think. I'm imagining something that would look for an
> OR in which each clause referenced only one rel, then if it can
> identify a way to re-unique-ify the result, split into a UNION
> with an arm for each rel used in the OR. The nature of the
> conditions in each OR arm don't seem to matter ... though probably
> you'd have to insist on there not being volatile conditions anywhere
> in the query, because they'd get evaluated too many times.

In my experience, the common use case here is a large fact table that
joins to a number of dimension tables. If you can filter the large fact
table down to a fairly small size, those joins don't matter much. But if
a big chunk of your filter comes from the joins, you're in trouble.

UNION isn't really a great way to solve this problem because you still
end up doing a full join just to run the filter, but really all that you
need are the values from the dimension table(s) that you need for the
filter. IOW, change

WHERE t1 IN ('a','b') OR t2 IN ('c','d')

into

WHERE f1 IN (1,2) OR f2 IN (3,4)

(assuming a,b,c,d maps to 1,2,3,4)

BTW, there's an important caveat here: users generally do NOT want
duplicate rows from the fact table if the dimension table results aren't
unique. I thought my array solution was equivalent to what the JOINs
would do in that case but that's actually wrong. The array solution does
provide the behavior users generally want here though. JOIN is the
easiest tool to pick up for this, so it's what people gravitate to, but
I suspect most users would be happier with a construct that worked like
the array trick does, but was easier to accomplish.

I wonder if any other databases have come up with non-standard syntax to
do this.
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)

From:	Claudio Freire <klaussfreire(at)gmail(dot)com>
To:	Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Improve OR conditions on joined columns (common star schema problem)
Date:	2017-02-10 06:18:27
Message-ID:	CAGTBQpZ0JYHqsLikPB=ZB+Qzvqu5pjRcpeY97Zmf+g8dMKEmug@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Thu, Feb 9, 2017 at 9:50 PM, Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com> wrote:
> WHERE t1 IN ('a','b') OR t2 IN ('c','d')
>
> into
>
> WHERE f1 IN (1,2) OR f2 IN (3,4)
>
> (assuming a,b,c,d maps to 1,2,3,4)
>
> BTW, there's an important caveat here: users generally do NOT want duplicate
> rows from the fact table if the dimension table results aren't unique. I
> thought my array solution was equivalent to what the JOINs would do in that
> case but that's actually wrong. The array solution does provide the behavior
> users generally want here though. JOIN is the easiest tool to pick up for
> this, so it's what people gravitate to, but I suspect most users would be
> happier with a construct that worked like the array trick does, but was
> easier to accomplish.
>
> I wonder if any other databases have come up with non-standard syntax to do
> this.

What I've been doing is do those transforms (tn -> fn) in application
code. While it's a chore, the improvement in plans is usually well
worth the trouble.

IF there's a FK between fact and dimension tables, you can be certain
the transform will yield equivalent results, becuase you'll be certain
the joins don't duplicate rows.

So the transform should be rather general and useful.

If you have a join of the form:

a join b on a.f1 = b.id

Where a.f1 has an FK referencing b.id, and a filter on b X of any
form, you can turn the plan into:

with b_ids as (select id from b where X)
...
a join b on a.f1 = b.id and a.f1 in (select id from b_ids)

In order to be useful, the expected row count from b_ids should be rather small.

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Jim Nasby <Jim(dot)Nasby(at)BlueTreble(dot)com>
Cc:	Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Improve OR conditions on joined columns (common star schema problem)
Date:	2017-02-12 00:30:35
Message-ID:	4343.1486859435@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Jim Nasby <Jim(dot)Nasby(at)BlueTreble(dot)com> writes:
> On 2/8/17 5:54 PM, Tom Lane wrote:
>> Maybe it'd be better to imagine this as something closer to planagg.c,
>> that is it knows how to apply a specific high-level optimization that
>> might or might not be a win, so it builds a path describing that and sees
>> if it looks cheaper than a path done the normal way. The fact that
>> we could even build a path describing a union is something that wasn't
>> there before 9.6, but maybe there's enough infrastructure for that now.

> It's encouraging that there's at least a template to follow there... If
> there is missing infrastructure, is it likely to be major? My uninformed
> guess would be that the pathification was the major hurdle.

I wrote a POC patch for this on a long airplane ride. It's not complete,
and I'm sure there are bugs as well, but it makes your example case
better. What I did about the de-duplication issue is to de-dup using
the CTIDs of all baserels in the query. This means the optimization
is only applicable to cases where all the rels have CTIDs ... but other
methods such as inspecting unique keys ain't gonna work for non-table
rels either, so I think this is about the best we can hope for.
However, I did not understand your point about:

> BTW, there's an important caveat here: users generally do NOT want
> duplicate rows from the fact table if the dimension table results aren't
> unique.

so maybe we should be thinking about some other way entirely?

Anyway, what's attached below produces this on your example query:

Aggregate (cost=38.12..38.13 rows=1 width=8) (actual time=0.158..0.158 rows=1 loops=1)
-> Result (cost=38.03..38.11 rows=4 width=0) (actual time=0.151..0.155 rows=3 loops=1)
-> Unique (cost=38.03..38.07 rows=4 width=18) (actual time=0.150..0.154 rows=3 loops=1)
-> Sort (cost=38.03..38.04 rows=4 width=18) (actual time=0.150..0.151 rows=4 loops=1)
Sort Key: f.ctid, d1.ctid, d2.ctid
Sort Method: quicksort Memory: 25kB
-> Append (cost=4.85..37.99 rows=4 width=18) (actual time=0.070..0.106 rows=4 loops=1)
-> Nested Loop Left Join (cost=4.85..19.00 rows=2 width=18) (actual time=0.069..0.075 rows=2 loops=1)
-> Nested Loop (cost=4.57..18.37 rows=2 width=16) (actual time=0.035..0.038 rows=2 loops=1)
-> Index Scan using dim_t_key on dim d1 (cost=0.28..8.29 rows=1 width=10) (actual time=0.009..0.009 rows=1 loops=1)
Index Cond: ('1'::text = t)
-> Bitmap Heap Scan on fact f (cost=4.30..10.05 rows=2 width=14) (actual time=0.023..0.025 rows=2 loops=1)
Recheck Cond: (f1 = d1.s)
Heap Blocks: exact=2
-> Bitmap Index Scan on f_f1 (cost=0.00..4.29 rows=2 width=0) (actual time=0.016..0.016 rows=2 loops=1)
Index Cond: (f1 = d1.s)
-> Index Scan using dim_pkey on dim d2 (cost=0.28..0.31 rows=1 width=10) (actual time=0.016..0.017 rows=0 loops=2)
Index Cond: (f.f2 = s)
-> Nested Loop Left Join (cost=4.85..19.00 rows=2 width=18) (actual time=0.025..0.029 rows=2 loops=1)
-> Nested Loop (cost=4.57..18.37 rows=2 width=16) (actual time=0.022..0.025 rows=2 loops=1)
-> Index Scan using dim_t_key on dim d2 (cost=0.28..8.29 rows=1 width=10) (actual time=0.003..0.003 rows=1 loops=1)
Index Cond: ('1'::text = t)
-> Bitmap Heap Scan on fact f (cost=4.30..10.05 rows=2 width=14) (actual time=0.017..0.019 rows=2 loops=1)
Recheck Cond: (f2 = d2.s)
Heap Blocks: exact=2
-> Bitmap Index Scan on f_f2 (cost=0.00..4.29 rows=2 width=0) (actual time=0.014..0.014 rows=2 loops=1)
Index Cond: (f2 = d2.s)
-> Index Scan using dim_pkey on dim d1 (cost=0.28..0.31 rows=1 width=10) (actual time=0.001..0.001 rows=0 loops=2)
Index Cond: (f.f1 = s)

The main remaining piece of work here is that, as you can see from the
above, it fails to eliminate joins to tables that we don't actually need
in a particular UNION arm. This is because the references to those
tables' ctid columns prevent analyzejoins.c from removing the joins.
I've thought about ways to deal with that but haven't come up with
anything that wasn't pretty ugly and/or invasive.

regards, tom lane

Attachment	Content-Type	Size
join-or-to-union-1.patch	text/x-diff	23.2 KB

From:	David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Improve OR conditions on joined columns (common star schema problem)
Date:	2017-02-12 11:51:26
Message-ID:	CAKJS1f9OPfZ4n9ozTeDMLpq6y10Ms3dVgQnepZ8xC0LyEcLNdQ@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 12 February 2017 at 13:30, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> I wrote a POC patch for this on a long airplane ride. It's not complete,
> and I'm sure there are bugs as well, but it makes your example case
> better. What I did about the de-duplication issue is to de-dup using
> the CTIDs of all baserels in the query. This means the optimization
> is only applicable to cases where all the rels have CTIDs ... but other
> methods such as inspecting unique keys ain't gonna work for non-table
> rels either, so I think this is about the best we can hope for.
> However, I did not understand your point about:

This is very interesting. Couldn't this be even more generic and
instead of looking at just the jointree quals, also look at the join
conditions too, as I think you can use this to also transforms queries
such as:

select * from t1 inner join t2 on t1.a = t2.a or t1.b = t2.b;

I imagine you'd also want an enable_unionor GUC which can be used to
disable this for when the planner makes a poor choice.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>
Cc:	Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Improve OR conditions on joined columns (common star schema problem)
Date:	2017-02-12 17:32:36
Message-ID:	10353.1486920756@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

David Rowley <david(dot)rowley(at)2ndquadrant(dot)com> writes:
> This is very interesting. Couldn't this be even more generic and
> instead of looking at just the jointree quals, also look at the join
> conditions too, as I think you can use this to also transforms queries
> such as:

The hard part of that is to remember exactly where you need to do surgery
on the querytree to replace the OR clause, keeping in mind that a tree
copy step is going to happen in between first searching for the OR and
doing the replacement. I'm sure it's soluble but it would've involved
more complexity than I felt justified for a POC.

> I imagine you'd also want an enable_unionor GUC which can be used to
> disable this for when the planner makes a poor choice.

It's not so much poor choices as the cost of the optimization attempt ---
if there's a K-relation OR clause, this will increase the cost of planning
by a factor approaching K+1, whether or not you get a better plan out of
it. I ran the regression tests with some instrumentation and determined
that this logic fires a dozen or two times, and fails to produce a plan
that looks cheaper than the standard plan in any of those cases. So if we
go down this road, not only do we need a GUC but I suspect it had better
default to off; only people using star schemas are really likely to get a
win out of it.

Maybe we could improve that situation by applying some heuristics that
prevent the optimization from being attempted unless it's more likely to
produce a win. But I'm not sure what that would look like. We have to
make the decision pretty early and without a lot of information. As an
example, Jim's case fails completely if we don't do the transformation
before reduce_outer_joins, because one of the things that *has* to happen
to get a desirable plan is to recognize that the extracted restriction
clause allows the left join to the dimension table to be simplified to an
inner join.

Having written the POC, I find myself not terribly satisfied with it.
It seems brute-force and not at all integrated with the rest of the
planner. Still, asking for integration with the existing UNION planning
is probably silly, since that really needs to be thrown away and rewritten.
Maybe this is a reasonable way forward until we get a clearer view of
what that area needs to look like.

regards, tom lane

From:	David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Improve OR conditions on joined columns (common star schema problem)
Date:	2017-02-12 23:06:31
Message-ID:	CAKJS1f8u=+x0W30PeGZY9P+JEW9=Q7nQ8Y2K1eiyWqTRMBAGoQ@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 13 February 2017 at 06:32, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> It's not so much poor choices as the cost of the optimization attempt ---
> if there's a K-relation OR clause, this will increase the cost of planning
> by a factor approaching K+1, whether or not you get a better plan out of
> it. I ran the regression tests with some instrumentation and determined
> that this logic fires a dozen or two times, and fails to produce a plan
> that looks cheaper than the standard plan in any of those cases. So if we
> go down this road, not only do we need a GUC but I suspect it had better
> default to off; only people using star schemas are really likely to get a
> win out of it.

I always try to shy away from assuming that the regression test suite
is a good reflection of a real world set of queries. It's full of
tests for edge cases that are rarely seen in reality. FWIW I did a
similar experiment with unique joins and was disappointed to see that
it didn't apply in more cases. Yet I've worked with OLTP applications
since 2005, and I struggle to recall any many:many joins at all.

Perhaps this optimisation is a candidate for only being applied when
some sort of planner_strength GUC (as mentioned in FOSDEM developer
meeting in 2016) reaches some threshold. There's certainly already
some planner smarts that can be skipped when such a GUC is set to a
lower level (e.g join removal). We could likely save many cycles if we
had the ability to re-plan queries where total_cost > X with more
smarts enabled.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

From:	Jim Nasby <Jim(dot)Nasby(at)BlueTreble(dot)com>
To:	David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Improve OR conditions on joined columns (common star schema problem)
Date:	2017-02-13 01:15:35
Message-ID:	a99a4034-02df-b6c5-765d-53ccb5efcff0@BlueTreble.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 2/12/17 5:06 PM, David Rowley wrote:
> Yet I've worked with OLTP applications
> since 2005, and I struggle to recall any many:many joins at all.

Interesting... I've run across it numerous times. In any case, for OLTP
there's other things you can do fairly easily.

> Perhaps this optimisation is a candidate for only being applied when
> some sort of planner_strength GUC (as mentioned in FOSDEM developer
> meeting in 2016) reaches some threshold. There's certainly already
> some planner smarts that can be skipped when such a GUC is set to a
> lower level (e.g join removal). We could likely save many cycles if we
> had the ability to re-plan queries where total_cost > X with more
> smarts enabled.

Yeah, I strongly suspect some kind of "multi-stage" planner would be a
big win.

As for the POC, that's the same kind of plan I'm seeing IRL: a nested
loop gets used essentially to do the lookup of dimension text to
dimension ID.

I'm wondering if there's any tricks that could be applied on the sort
since it's dealing with CTIDs.

I do think it'd be even better if we had the ability to do that lookup
as part of planning, so you could discover the exact stats for the
relevant ID values, but that's even more involved.
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Jim Nasby <Jim(dot)Nasby(at)BlueTreble(dot)com>
Cc:	Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Improve OR conditions on joined columns (common star schema problem)
Date:	2017-02-14 19:18:54
Message-ID:	22002.1487099934@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

I wrote:
> The main remaining piece of work here is that, as you can see from the
> above, it fails to eliminate joins to tables that we don't actually need
> in a particular UNION arm. This is because the references to those
> tables' ctid columns prevent analyzejoins.c from removing the joins.
> I've thought about ways to deal with that but haven't come up with
> anything that wasn't pretty ugly and/or invasive.

I thought of a way that wasn't too awful, which was to inject the requests
for CTID columns only after we've done join removal. The attached v2
patch produces this for your test case:

QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=36.84..36.85 rows=1 width=8) (actual time=0.075..0.075 rows=1 loops=1)
-> Result (cost=36.77..36.83 rows=4 width=0) (actual time=0.069..0.073 rows=3 loops=1)
-> Unique (cost=36.77..36.79 rows=4 width=6) (actual time=0.069..0.072 rows=3 loops=1)
-> Sort (cost=36.77..36.78 rows=4 width=6) (actual time=0.068..0.069 rows=4 loops=1)
Sort Key: f.ctid
Sort Method: quicksort Memory: 25kB
-> Append (cost=4.57..36.73 rows=4 width=6) (actual time=0.018..0.033 rows=4 loops=1)
-> Nested Loop (cost=4.57..18.37 rows=2 width=6) (actual time=0.018..0.020 rows=2 loops=1)
-> Index Scan using dim_t_key on dim d1 (cost=0.28..8.29 rows=1 width=10) (actual time=0.005..0.005 rows=1 loops=1)
Index Cond: ('1'::text = t)
-> Bitmap Heap Scan on fact f (cost=4.30..10.05 rows=2 width=14) (actual time=0.010..0.012 rows=2 loops=1)
Recheck Cond: (f1 = d1.s)
Heap Blocks: exact=2
-> Bitmap Index Scan on f_f1 (cost=0.00..4.29 rows=2 width=0) (actual time=0.006..0.006 rows=2 loops=1)
Index Cond: (f1 = d1.s)
-> Nested Loop (cost=4.57..18.37 rows=2 width=6) (actual time=0.009..0.011 rows=2 loops=1)
-> Index Scan using dim_t_key on dim d2 (cost=0.28..8.29 rows=1 width=10) (actual time=0.003..0.003 rows=1 loops=1)
Index Cond: ('1'::text = t)
-> Bitmap Heap Scan on fact f (cost=4.30..10.05 rows=2 width=14) (actual time=0.005..0.006 rows=2 loops=1)
Recheck Cond: (f2 = d2.s)
Heap Blocks: exact=2
-> Bitmap Index Scan on f_f2 (cost=0.00..4.29 rows=2 width=0) (actual time=0.003..0.003 rows=2 loops=1)
Index Cond: (f2 = d2.s)
Planning time: 0.732 ms
Execution time: 0.156 ms
(25 rows)

I think this might be code-complete, modulo the question of whether we
want an enabling GUC for it. I'm still concerned about the question
of whether it adds more planning time than it's worth for most users.
(Obviously it needs some regression test cases too, and a lot more
real-world testing than I've given it.)

One point that could use further review is whether the de-duplication
algorithm is actually correct. I'm only about 95% convinced by the
argument I wrote in planunionor.c's header comment.

Potential future work includes finding join ORs in upper-level INNER JOIN
ON clauses, and improving matters so we can do the unique-ification with
hashing as well as sorting. I don't think either of those things has to
be in the first commit, though.

regards, tom lane

Attachment	Content-Type	Size
join-or-to-union-2.patch	text/x-diff	29.8 KB

From:	Jim Nasby <Jim(dot)Nasby(at)BlueTreble(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Improve OR conditions on joined columns (common star schema problem)
Date:	2017-02-14 20:47:37
Message-ID:	d1f6fffd-4c53-2e4d-c5b2-037ddb7a2fe4@BlueTreble.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 2/14/17 1:18 PM, Tom Lane wrote:
> I think this might be code-complete, modulo the question of whether we
> want an enabling GUC for it. I'm still concerned about the question
> of whether it adds more planning time than it's worth for most users.
> (Obviously it needs some regression test cases too, and a lot more
> real-world testing than I've given it.)

Don't we normally provide a GUC for this level of query manipulation? We
can always remove it later if evidence shows there's not sufficient
downside (perhaps after gaining the ability for the planner to do extra
work on queries that appear to be large enough to warrant it).

> One point that could use further review is whether the de-duplication
> algorithm is actually correct. I'm only about 95% convinced by the
> argument I wrote in planunionor.c's header comment.

Another reason for a GUC...

I'll put some thought into it and see if I can find any holes. Are you
only worried about the removal of "useless" rels or is there more?
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Jim Nasby <Jim(dot)Nasby(at)BlueTreble(dot)com>
Cc:	Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Improve OR conditions on joined columns (common star schema problem)
Date:	2017-02-14 21:03:24
Message-ID:	26006.1487106204@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Jim Nasby <Jim(dot)Nasby(at)BlueTreble(dot)com> writes:
> On 2/14/17 1:18 PM, Tom Lane wrote:
>> One point that could use further review is whether the de-duplication
>> algorithm is actually correct. I'm only about 95% convinced by the
>> argument I wrote in planunionor.c's header comment.

> I'll put some thought into it and see if I can find any holes. Are you
> only worried about the removal of "useless" rels or is there more?

Well, the key point is whether it's really OK to de-dup on the basis
of only the CTIDs that are not eliminated in any UNION arm. I was
feeling fairly good about that until I thought of the full-join-to-
left-join-to-no-join conversion issue mentioned in the comment.
Now I'm wondering if there are other holes; or maybe I'm wrong about
that one and it's not necessary to be afraid of full joins.

regards, tom lane

From:	David Steele <david(at)pgmasters(dot)net>
To:	Jim Nasby <Jim(dot)Nasby(at)BlueTreble(dot)com>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Improve OR conditions on joined columns (common star schema problem)
Date:	2017-03-16 16:54:37
Message-ID:	c7703b14-cb6a-f9e0-2d10-c370bdc6ad7f@pgmasters.net
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 2/14/17 4:03 PM, Tom Lane wrote:
> Jim Nasby <Jim(dot)Nasby(at)BlueTreble(dot)com> writes:
>> On 2/14/17 1:18 PM, Tom Lane wrote:
>>> One point that could use further review is whether the de-duplication
>>> algorithm is actually correct. I'm only about 95% convinced by the
>>> argument I wrote in planunionor.c's header comment.
>
>> I'll put some thought into it and see if I can find any holes. Are you
>> only worried about the removal of "useless" rels or is there more?
>
> Well, the key point is whether it's really OK to de-dup on the basis
> of only the CTIDs that are not eliminated in any UNION arm. I was
> feeling fairly good about that until I thought of the full-join-to-
> left-join-to-no-join conversion issue mentioned in the comment.
> Now I'm wondering if there are other holes; or maybe I'm wrong about
> that one and it's not necessary to be afraid of full joins.

This patch applies cleanly (with offsets) and compiles at cccbdde.

Jim, have you had time to think about this? Any insights?

--
-David
david(at)pgmasters(dot)net

From:	Jim Nasby <jim(dot)nasby(at)openscg(dot)com>
To:	David Steele <david(at)pgmasters(dot)net>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Improve OR conditions on joined columns (common star schema problem)
Date:	2017-03-19 00:29:34
Message-ID:	3605fcb2-6622-6520-b678-83171fe4c9d8@openscg.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 3/16/17 11:54 AM, David Steele wrote:
> On 2/14/17 4:03 PM, Tom Lane wrote:
>> Jim Nasby <Jim(dot)Nasby(at)BlueTreble(dot)com> writes:
>>> On 2/14/17 1:18 PM, Tom Lane wrote:
>>>> One point that could use further review is whether the de-duplication
>>>> algorithm is actually correct. I'm only about 95% convinced by the
>>>> argument I wrote in planunionor.c's header comment.
>>
>>> I'll put some thought into it and see if I can find any holes. Are you
>>> only worried about the removal of "useless" rels or is there more?
>>
>> Well, the key point is whether it's really OK to de-dup on the basis
>> of only the CTIDs that are not eliminated in any UNION arm. I was
>> feeling fairly good about that until I thought of the full-join-to-
>> left-join-to-no-join conversion issue mentioned in the comment.
>> Now I'm wondering if there are other holes; or maybe I'm wrong about
>> that one and it's not necessary to be afraid of full joins.
>
> This patch applies cleanly (with offsets) and compiles at cccbdde.
>
> Jim, have you had time to think about this? Any insights?

Having thought about it, I share Tom's concerns. And now I'm worried
about what happens if there are multiple separate OR clauses. I guess
those would be handled by separate UNIONs?

I'm also finding it a bit hard to follow the comment Tom mentioned. I'm
pretty sure I understand what's going on until this part:

> The identical proof can be expected to apply
> + * in other arms, except in an arm that references that rel in its version
> + * of the OR clause. But in such an arm, we have effectively added a
> + * restriction clause to what is known in other arms, which means that the
> + * set of rows output by that rel can't increase compared to other arms.

AIUI, this is describing a case something like this:

SELECT child.blah FROM child LEFT JOIN parent USING(parent_id)
WHERE child.foo AND (child.baz=1 or child.baz=2)

given that parent.parent_id is unique. Except for these concerns, there
would need to be a complex OR somewhere in here that sometimes
referenced parent and sometimes didn't, such as

WHERE child.foo AND (child.baz=1 OR parent.foo=3)

But I'm not following the logic here (very possibly because I'm wrong
about what I said above):

> + * Therefore the situation in such an arm must be that including the rel
> + * could result in either zero or one output row, rather than exactly one
> + * output row as in other arms. So we still don't need to consider it for
> + * de-duplication.

I'm definitely not certain about the rest of it.

Tom, could you expand the description some, especially with some examples?
--
Jim Nasby, Chief Data Architect, Austin TX
OpenSCG http://OpenSCG.com

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Jim Nasby <jim(dot)nasby(at)openscg(dot)com>
Cc:	David Steele <david(at)pgmasters(dot)net>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Re: Improve OR conditions on joined columns (common star schema problem)
Date:	2017-03-19 19:32:57
Message-ID:	30756.1489951977@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Jim Nasby <jim(dot)nasby(at)openscg(dot)com> writes:
> On 3/16/17 11:54 AM, David Steele wrote:
>> On 2/14/17 4:03 PM, Tom Lane wrote:
>>> Well, the key point is whether it's really OK to de-dup on the basis
>>> of only the CTIDs that are not eliminated in any UNION arm. I was
>>> feeling fairly good about that until I thought of the full-join-to-
>>> left-join-to-no-join conversion issue mentioned in the comment.

>> Jim, have you had time to think about this? Any insights?

> Having thought about it, I share Tom's concerns. And now I'm worried
> about what happens if there are multiple separate OR clauses. I guess
> those would be handled by separate UNIONs?

As proposed, the patch would try to optimize by splitting each OR clause
independently, and would choose whichever way gave the best cost estimate.
I'm not sure it's possible to do better than that, and even if it is,
I think improving it could be left for later.

> Tom, could you expand the description some, especially with some examples?

Here's a counterexample --- admittedly rather artificial, but still a
counterexample --- to applying the optimization when there are FULL joins.
Consider

SELECT count(*)
FROM a FULL JOIN b ON (a.id = b.id)
WHERE (a.x = 42 OR b.y = 43);

and suppose that a and b have mutual FK constraints guaranteeing that
every non-null a.id value has exactly one match in b and vice versa. (For
the purposes of this example, I think it doesn't matter whether or not we
allow there to be null id values.) Also suppose that there are some join
rows satisfying both a.x = 42 and b.y = 43, while other join rows satisfy
only one of the OR arms. The patch would try to implement this query as,
approximately,

SELECT count(*)
FROM
( SELECT FROM a FULL JOIN b ON (a.id = b.id) WHERE a.x = 42
UNION-using-ctids
SELECT FROM a FULL JOIN b ON (a.id = b.id) WHERE b.y = 43 )

Now in the first UNION arm, we'd observe that the strict a.x = 42
condition allows reduction of the join strength to LEFT JOIN, and then
we'd deduce from the FK on b that the join to b can be dropped altogether.
In the second arm, we'd similarly conclude that a can be dropped
altogether, leaving

SELECT count(*)
FROM
( SELECT FROM a WHERE a.x = 42
UNION-using-ctids
SELECT FROM b WHERE b.y = 43 )

But at this point there are no rels that are not eliminated in any UNION
arm, so that no de-duping would occur in the UNION, meaning that we'd
double-count any join rows in which both a.x = 42 and b.y = 43.

I'd also considered an approach of de-duping on the basis of all relation
ctids, while allowing a rel's ctid to be returned as NULL from a UNION arm
in which the rel was eliminated entirely. But that doesn't fix it,
because in this example the first arm would return (a.ctid, NULL) while
the second arm would return (NULL, b.ctid), so that the UNION step would
still fail to detect any duplication. To make it work, we'd have to not
eliminate the joins, which would pretty much defeat the usefulness of
the optimization for your original example case.

So full joins definitely break this whole optimization. I think it's okay
with left joins though, because when starting from "a left join b" it will
never be possible to remove "a" so we'll always include a.ctid in the
UNION de-duping step. If b is removed in some arm, then it must be true
that we get exactly one left-join output row per a row, regardless of the
contents of b, in that arm. The argument for the patch being okay is
essentially that we must get exactly one left-join output row per a row,
regardless of the contents of b, in *every* arm, because the various
modified versions of the OR clause can't affect that conclusion. In some
of the arms we might not remove b, and we might even be able to reduce the
left join to an inner join, but there should still be no more than one
join output row per a row. That being the case, it should be sufficient
to de-dup using a.ctid while ignoring b.ctid.

Any clearer yet?

regards, tom lane

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Jim Nasby <jim(dot)nasby(at)openscg(dot)com>
Cc:	David Steele <david(at)pgmasters(dot)net>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Re: Improve OR conditions on joined columns (common star schema problem)
Date:	2017-03-19 19:54:11
Message-ID:	31463.1489953251@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

I wrote:
> Consider
> SELECT count(*)
> FROM a FULL JOIN b ON (a.id = b.id)
> WHERE (a.x = 42 OR b.y = 43);
> and suppose that a and b have mutual FK constraints guaranteeing that
> every non-null a.id value has exactly one match in b and vice versa.

Oh, that was sloppy of me. Join removal depends on unique constraints
not FK constraints. So actually, this example just requires assuming
that both a.id and b.id are known unique, which is much less far-fetched
than assuming circular FK constraints.

regards, tom lane

From:	Jim Nasby <jim(dot)nasby(at)openscg(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	David Steele <david(at)pgmasters(dot)net>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Re: Improve OR conditions on joined columns (common star schema problem)
Date:	2017-03-20 02:54:53
Message-ID:	6cf3ba25-7dbc-20fe-4ff2-4f60a369a577@openscg.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 3/19/17 2:32 PM, Tom Lane wrote:
> Jim Nasby <jim(dot)nasby(at)openscg(dot)com> writes:
>> Having thought about it, I share Tom's concerns. And now I'm worried
>> about what happens if there are multiple separate OR clauses. I guess
>> those would be handled by separate UNIONs?
>
> As proposed, the patch would try to optimize by splitting each OR clause
> independently, and would choose whichever way gave the best cost estimate.
> I'm not sure it's possible to do better than that, and even if it is,
> I think improving it could be left for later.

Agreed.

> I'd also considered an approach of de-duping on the basis of all relation
> ctids, while allowing a rel's ctid to be returned as NULL from a UNION arm
> in which the rel was eliminated entirely. But that doesn't fix it,
> because in this example the first arm would return (a.ctid, NULL) while
> the second arm would return (NULL, b.ctid), so that the UNION step would
> still fail to detect any duplication. To make it work, we'd have to not
> eliminate the joins, which would pretty much defeat the usefulness of
> the optimization for your original example case.

It might still be worth-while in some circumstances. In your example, if
there were these indexes:

a__id ON a(id), a__x ON a(x)
b__id ON b(id), b__y ON b(y)

then it might be faster to nested loop a__x=42 to b__id=a.id and union
that with b__y=43 nested to a__id=b.id.

That said, now isn't the time to be adding more complexity.

> So full joins definitely break this whole optimization. I think it's okay
> with left joins though, because when starting from "a left join b" it will
> never be possible to remove "a" so we'll always include a.ctid in the
> UNION de-duping step. If b is removed in some arm, then it must be true
> that we get exactly one left-join output row per a row, regardless of the
> contents of b, in that arm. The argument for the patch being okay is
> essentially that we must get exactly one left-join output row per a row,
> regardless of the contents of b, in *every* arm, because the various
> modified versions of the OR clause can't affect that conclusion. In some
> of the arms we might not remove b, and we might even be able to reduce the
> left join to an inner join, but there should still be no more than one
> join output row per a row. That being the case, it should be sufficient
> to de-dup using a.ctid while ignoring b.ctid.

The only case I can think of is: would it be possible (perhaps not
today; maybe in the future) for other parts of the query to affect join
elimination? I can't conceive of how that might happen, but if it did
then it's possible that the elimination would work differently with the
UNION than it would with an OR.

The comment on join_is_removable() does mention that there's other
potentially interesting cases that we can't handle now; it's maybe worth
mentioning

Other than that, I can't see any issues with your logic.

> Any clearer yet?

Absolutely. I think it would be very valuable to include that with the
initial comment in planunionor.c. Join reduction and removal is already
tricky enough to wrap your head around.

Other comments:

> + * is retty mechanical, but we can't do it until we have a RelOptInfo for the

s/retty/pretty/

I suspect that in many systems single-table queries are far more common
than CTEs, so maybe it's worth reversing those two tests in
split_join_or_clauses().

For the Unique path, it would be nice if the code did what would be
necessary to consider a TID hash join, since that means a user could
create the appropriate operator and it would just be picked up.
Certainly not worth much effort at this point though.

> + /*
> + * Must not have any volatile functions in FROM or WHERE (see notes at
> + * head of file).
> + */
> + if (contain_volatile_functions((Node *) parse->jointree))

Is there by chance anywhere else that needs to check that? Maybe worth
adding the info to the Query struct if so.

> + * We insist that all baserels used in the query be plain relations, so

Dumb question... views have already be decomposed at this point, right?

Perhaps less dumb question... what happens if the original query already
had setOps? AIUI setOps work has already been done by the time
split_join_or_clauses() happens; I just want to check that that's OK.

I'm not sure a GUC is worth it... I suspect that any query with multiple
rels and an OR condition is going to be expensive enough that whatever
additional plan time there is won't be noticeable.

I've verified that the patch still applies and make check-world is clean.
--
Jim Nasby, Chief Data Architect, Austin TX
OpenSCG http://OpenSCG.com

From:	Andres Freund <andres(at)anarazel(dot)de>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Jim Nasby <Jim(dot)Nasby(at)BlueTreble(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Improve OR conditions on joined columns (common star schema problem)
Date:	2017-04-03 18:44:20
Message-ID:	20170403184420.ea3knbbap5acttnv@alap3.anarazel.de
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Hi,

On 2017-02-14 14:18:54 -0500, Tom Lane wrote:
> I think this might be code-complete, modulo the question of whether we
> want an enabling GUC for it. I'm still concerned about the question
> of whether it adds more planning time than it's worth for most users.
> (Obviously it needs some regression test cases too, and a lot more
> real-world testing than I've given it.)

> One point that could use further review is whether the de-duplication
> algorithm is actually correct. I'm only about 95% convinced by the
> argument I wrote in planunionor.c's header comment.
>
> Potential future work includes finding join ORs in upper-level INNER JOIN
> ON clauses, and improving matters so we can do the unique-ification with
> hashing as well as sorting. I don't think either of those things has to
> be in the first commit, though.

Are you planning to push this into v10? Given the looming freeze I'm
inclined to bump this to the next CF.

- Andres

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Andres Freund <andres(at)anarazel(dot)de>
Cc:	Jim Nasby <Jim(dot)Nasby(at)BlueTreble(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Improve OR conditions on joined columns (common star schema problem)
Date:	2017-04-03 18:49:05
Message-ID:	29861.1491245345@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Andres Freund <andres(at)anarazel(dot)de> writes:
> On 2017-02-14 14:18:54 -0500, Tom Lane wrote:
>> One point that could use further review is whether the de-duplication
>> algorithm is actually correct. I'm only about 95% convinced by the
>> argument I wrote in planunionor.c's header comment.

> Are you planning to push this into v10? Given the looming freeze I'm
> inclined to bump this to the next CF.

I'm still not fully convinced the patch is correct, so I have no problem
with bumping it out.

regards, tom lane

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Jim Nasby <jim(dot)nasby(at)openscg(dot)com>
Cc:	Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Re: Improve OR conditions on joined columns (common star schema problem)
Date:	2017-09-12 12:48:14
Message-ID:	12648.1505220494@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Jim Nasby <jim(dot)nasby(at)openscg(dot)com> writes:
> I've verified that the patch still applies and make check-world is clean.

Not any more :-(. Here's a v3 rebased over HEAD. No substantive
change from v2.

regards, tom lane

Attachment	Content-Type	Size
join-or-to-union-3.patch	text/x-diff	29.8 KB

From:	Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Jim Nasby <jim(dot)nasby(at)openscg(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: [HACKERS] Re: Improve OR conditions on joined columns (common star schema problem)
Date:	2017-11-30 01:29:39
Message-ID:	CAB7nPqQQGk2NpfExZDJiqQNZZh7dhpw37u5wno0h0LRAWEamLA@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Tue, Sep 12, 2017 at 9:48 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Jim Nasby <jim(dot)nasby(at)openscg(dot)com> writes:
>> I've verified that the patch still applies and make check-world is clean.
>
> Not any more :-(. Here's a v3 rebased over HEAD. No substantive
> change from v2.

Patch applies and compiles, and it got no reviews. Moved to CF 2018-01.
--
Michael

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Jim Nasby <jim(dot)nasby(at)openscg(dot)com>
Cc:	pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject:	Re: [HACKERS] Re: Improve OR conditions on joined columns (common star schema problem)
Date:	2018-01-04 22:50:48
Message-ID:	11588.1515106248@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

I wrote:
> Jim Nasby <jim(dot)nasby(at)openscg(dot)com> writes:
>> I've verified that the patch still applies and make check-world is clean.

> Not any more :-(. Here's a v3 rebased over HEAD. No substantive
> change from v2.

Again rebased up to HEAD; still no substantive changes.

regards, tom lane

Attachment	Content-Type	Size
join-or-to-union-4.patch	text/x-diff	29.8 KB

From:	Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Jim Nasby <jim(dot)nasby(at)openscg(dot)com>
Cc:	pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject:	Re: [HACKERS] Re: Improve OR conditions on joined columns (common star schema problem)
Date:	2018-02-02 12:41:37
Message-ID:	8accc0ed-a86a-4b03-46b1-af1216b37ecd@2ndquadrant.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 01/04/2018 11:50 PM, Tom Lane wrote:
> I wrote:
>> Jim Nasby <jim(dot)nasby(at)openscg(dot)com> writes:
>>> I've verified that the patch still applies and make check-world is clean.
>
>> Not any more :-(. Here's a v3 rebased over HEAD. No substantive
>> change from v2.
>
> Again rebased up to HEAD; still no substantive changes.
>

ISTM this patch got somewhat stuck as we're not quite sure the
transformation is correct in all cases. Is my impression correct? If
yes, how to we convince ourselves? Would some sort of automated testing
(generating data and queries) help? I'm willing to spend some cycles on
that, if considered helpful.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc:	Jim Nasby <jim(dot)nasby(at)openscg(dot)com>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject:	Re: [HACKERS] Re: Improve OR conditions on joined columns (common star schema problem)
Date:	2018-02-02 14:26:54
Message-ID:	14593.1517581614@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> writes:
> ISTM this patch got somewhat stuck as we're not quite sure the
> transformation is correct in all cases. Is my impression correct?

Yeah, that's the core issue.

> If yes, how to we convince ourselves? Would some sort of automated testing
> (generating data and queries) help? I'm willing to spend some cycles on
> that, if considered helpful.

I'm not sure if that would be enough to convince doubters. On the other
hand, if it found problems, that would definitely be useful.

regards, tom lane

From:	Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Jim Nasby <jim(dot)nasby(at)openscg(dot)com>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject:	Re: [HACKERS] Re: Improve OR conditions on joined columns (common star schema problem)
Date:	2018-02-03 00:50:35
Message-ID:	62132903-20a9-2e2a-13ed-d91da32cfb8f@2ndquadrant.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 02/02/2018 03:26 PM, Tom Lane wrote:
> Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> writes:
>> ISTM this patch got somewhat stuck as we're not quite sure the
>> transformation is correct in all cases. Is my impression correct?
>
> Yeah, that's the core issue.
>
>> If yes, how to we convince ourselves? Would some sort of automated testing
>> (generating data and queries) help? I'm willing to spend some cycles on
>> that, if considered helpful.
>
> I'm not sure if that would be enough to convince doubters. On the other
> hand, if it found problems, that would definitely be useful.
>

OK, I'll take a stab at this, then. I wasn't really suggesting it might
prove definitely prove the transformation is correct - clearly, it can
only find counter-examples. But I think it's worth it anyway.

BTW wouldn't it be possible to derive "traditional" proof in relational
algebra, similarly to other transforms?

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc:	Jim Nasby <jim(dot)nasby(at)openscg(dot)com>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject:	Re: [HACKERS] Re: Improve OR conditions on joined columns (common star schema problem)
Date:	2018-02-03 00:53:13
Message-ID:	5356.1517619193@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> writes:
> BTW wouldn't it be possible to derive "traditional" proof in relational
> algebra, similarly to other transforms?

Perhaps. The patch depends on CTID, but you could probably model that
as a primary key in a proof.

regards, tom lane

From:	Peter Geoghegan <pg(at)bowt(dot)ie>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Jim Nasby <jim(dot)nasby(at)openscg(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject:	Re: [HACKERS] Re: Improve OR conditions on joined columns (common star schema problem)
Date:	2018-02-24 21:00:32
Message-ID:	CAH2-Wz=QkWBHG8dvY-phFChDB0uYav8X8qqp2K8FcTpovX42Pg@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Fri, Feb 2, 2018 at 4:53 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> writes:
>> BTW wouldn't it be possible to derive "traditional" proof in relational
>> algebra, similarly to other transforms?
>
> Perhaps. The patch depends on CTID, but you could probably model that
> as a primary key in a proof.

I'll remind you that commit b648b703 made the TID sort performed by
CREATE INDEX CONCURRENTLY over 3 times faster in cases where the sort
completes in memory. The simplest way to get a significant portion of
that benefit for your approach with sort+unique on CTID would be to
add sortsupport with abbreviated keys to the TID btree opclass.

--
Peter Geoghegan

From:	Andres Freund <andres(at)anarazel(dot)de>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Jim Nasby <jim(dot)nasby(at)openscg(dot)com>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject:	Re: [HACKERS] Re: Improve OR conditions on joined columns (common star schema problem)
Date:	2018-03-30 02:05:27
Message-ID:	20180330020527.hllsqse2vwqldfa5@alap3.anarazel.de
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Hi,

I've only skimmed the thread, looking at the patch on its own.

On 2018-01-04 17:50:48 -0500, Tom Lane wrote:
> diff --git a/src/backend/optimizer/plan/plaindex ...dd11e72 .
> --- a/src/backend/optimizer/plan/planunionor.c
> +++ b/src/backend/optimizer/plan/planunionor.c
> @@ -0,0 +1,667 @@
> +/*-------------------------------------------------------------------------
> + *
> + * planunionor.c
> + * Consider whether join OR clauses can be converted to UNION queries.
> + *
> + * The current implementation of the UNION step is to de-duplicate using
> + * row CTIDs.

Could we skip using the ctid if there's a DISTINCT (or something to that
effect) above? We do not need to avoid removing rows that are identical
if that's done anyway.

> A big limitation is that this only works on plain relations,
> + * and not for instance on foreign tables. Another problem is that we can
> + * only de-duplicate by sort/unique, not hashing; but that could be fixed
> + * if we write a hash opclass for TID.

I wonder if an alternative could be some sort of rowid that we invent.
It'd not be that hard to introduce an executor node (or do it in
projection) that simply counts row and returns that as a
column. Together with e.g. range table id that'd be unique. But for that
we would need to guarantee that foreign tables / subqueries /
... returned the same result in two scans. We could do so by pushing
the data gathering into a CTE, but that'd make this exercise moot.

Why can't we ask at least FDWs to return something ctid like?

> + * To allow join removal to happen, we can't reference the CTID column
> + * of an otherwise-removable relation.

A brief hint why wouldn't hurt.

> +/*
> + * Is query as a whole safe to apply union OR transformation to?
> + * This checks relatively-expensive conditions that we don't want to
> + * worry about until we've found a candidate OR clause.
> + */
> +static bool
> +is_query_safe_for_union_or_transform(PlannerInfo *root)
> +{
> + Query *parse = root->parse;
> + Relids allbaserels;
> + ListCell *lc;
> + int relid;
> +
> + /*
> + * Must not have any volatile functions in FROM or WHERE (see notes at
> + * head of file).
> + */
> + if (contain_volatile_functions((Node *) parse->jointree))
> + return false;

Hm, are there any SRFs that could be in relevant places? I think we
reject them everywhere were they'd be problematic (as targetlist is
processed above)?

Do you have any plans for this patch at this moment?

Greetings,

Andres Freund

From:	David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Jim Nasby <jim(dot)nasby(at)openscg(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject:	Re: [HACKERS] Re: Improve OR conditions on joined columns (common star schema problem)
Date:	2018-04-01 14:51:22
Message-ID:	CAKJS1f-rdJ4XoQ1D32zCmxei3soKrK_hpuDRz4yFwnm0tmAwdg@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 3 February 2018 at 03:26, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> writes:
>> ISTM this patch got somewhat stuck as we're not quite sure the
>> transformation is correct in all cases. Is my impression correct?
>
> Yeah, that's the core issue.
>
>> If yes, how to we convince ourselves? Would some sort of automated testing
>> (generating data and queries) help? I'm willing to spend some cycles on
>> that, if considered helpful.
>
> I'm not sure if that would be enough to convince doubters. On the other
> hand, if it found problems, that would definitely be useful.

I've read over this thread and as far as I can see there is concern
that the UNION on the ctids might not re-uniquify the rows again. Tom
mentioned a problem with FULL JOINs, but the concern appears to have
been invalidated due to wrong thinking about how join removals work.

As of now, I'm not quite sure who exactly is concerned. Tom thought
there was an issue but quickly corrected himself.

As far as I see it, there are a few cases we'd need to disable the optimization:

1. Query contains target SRFs (the same row might get duplicated, we
don't want to UNION these duplicates out again, they're meant to be
there)
2. Query has setops (ditto)
3. Any base rels are not RELKIND_RELATION (we need the ctid to
uniquely identify rows)
4. Query has volatile functions (don't want to evaluate volatile
functions more times than requested)

As far as the DISTINCT clause doing the right thing for joins, I see
no issues, even with FULL JOINs. In both branches of the UNION the
join condition will be the same so each side of the join always has
the same candidate row to join to. I don't think the optimization is
possible if there are OR clauses in the join condition, but that's not
being proposed.

FULL JOINS appear to be fine as the row is never duplicated on a
non-match, so there will only be one version of t1.ctid, NULL::tid or
NULL::tid, t1.ctid and all ctids in the distinctClauses cannot all be
NULL at once.

I used the following SQL to help my brain think through this. There
are two versions of each query, one with DISTINCT and one without. If
the DISTINCT returns fewer rows than the one without then we need to
disable this optimization for that case. I've written queries for 3 of
the above 4 cases. I saw from reading the thread that case #4 is
already disabled:

drop table if exists t1,t2;
create table t1 (a int);
create table t2 (a int);

insert into t1 values(1),(1),(2),(4);
insert into t2 values(1),(1),(3),(3),(4),(4);

select t1.ctid,t2.ctid from t1 full join t2 on t1.a = t2.a;

select distinct t1.ctid,t2.ctid from t1 full join t2 on t1.a = t2.a;

-- case 1: must disable in face of tSRFs

select ctid from (select ctid,generate_Series(1,2) from t1) t;

select distinct ctid from (select ctid,generate_Series(1,2) from t1) t;

-- case 2: must disable optimization with setops.

select ctid from (select ctid from t1 union all select ctid from t1) t;

select distinct ctid from (select ctid from t1 union all select ctid from t1) t;

-- case 3: must disable if we join to anything other than a
RELKIND_RELATION (no ctid)

select ctid from (select t1.ctid from t1 inner join (values(1),(1))
x(x) on t1.a = x.x) t;

select distinct ctid from (select t1.ctid from t1 inner join
(values(1),(1)) x(x) on t1.a = x.x) t;

I've not read the patch yet, but I understand what it's trying to
achieve. My feelings about the patch are that it would be useful to
have. I think if someone needs this then they'll be very happythat
we've added it. I also think there should be a GUC to disable it, and
it should be enabled through the entire alpha/beta period, and we
should consider what the final value for it should be just before RC1.
It's a bit sad to exclude foreign tables, and I'm not too sure what
hurdles this leaves out for pluggable storage. No doubt we'll need to
disable the optimization for those too unless they can provide us with
some row identifier.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

From:	David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>
To:	Andres Freund <andres(at)anarazel(dot)de>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Jim Nasby <jim(dot)nasby(at)openscg(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject:	Re: [HACKERS] Re: Improve OR conditions on joined columns (common star schema problem)
Date:	2018-04-01 14:54:34
Message-ID:	CAKJS1f9ee7MSf30i=a8REuuDkQdkeSV+WERu0VYkUQjxH9w1qw@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 30 March 2018 at 15:05, Andres Freund <andres(at)anarazel(dot)de> wrote:
>> + * To allow join removal to happen, we can't reference the CTID column
>> + * of an otherwise-removable relation.
>
> A brief hint why wouldn't hurt.

Maybe something like:

/*
* Join removal is only ever possible when no columns of the
to-be-removed relation
* are referenced. If we added the CTID here then we could
inadvertently disable join
* removal. We'll need to delay adding the CTID until after join
removal takes place.
*/

(I've not read the code, just the thread)

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Jim Nasby <jim(dot)nasby(at)openscg(dot)com>
Cc:	pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject:	Re: [HACKERS] Re: Improve OR conditions on joined columns (common star schema problem)
Date:	2018-08-23 18:10:46
Message-ID:	740.1535047846@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

I wrote:
> [ join-or-to-union-4.patch ]

Rebased up to HEAD, per cfbot nagging. Still no substantive change from
v2.

regards, tom lane

Attachment	Content-Type	Size
join-or-to-union-5.patch	text/x-diff	29.9 KB

From:	Peter Geoghegan <pg(at)bowt(dot)ie>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Jim Nasby <jim(dot)nasby(at)openscg(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject:	Re: [HACKERS] Re: Improve OR conditions on joined columns (common star schema problem)
Date:	2018-08-24 00:20:31
Message-ID:	CAH2-WzkFm4KBWAkDBiN2nOaXLDJ7hLt=+dY3bc10fe266E+=1w@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Thu, Aug 23, 2018 at 11:10 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Rebased up to HEAD, per cfbot nagging. Still no substantive change from
> v2.

I happened to have the opportunity to talk to Tom about this patch in
person. I expressed some very general concerns that are worth
repeating publicly.

This patch adds an enhancement that is an example of a broader class
of optimizer enhancement primarily aimed at making star-schema queries
have more efficient plans, by arranging to use several independent
nested loop joins based on a common pattern. Each nestloop join has
one particular dimension table on the outer side, and the fact table
on the inner side. The query plan is not so much a tree as it is a DAG
(directed acyclic graph), because the fact table is visited multiple
times. (There are already cases in Postgres in which the query plan is
technically a DAG, actually, but it could be taken much further.)

Aside from being inherently more efficient, DAG-like star schema plans
are also *ideal* targets for parallel query. The executor can execute
each nested loop join in a parallel worker with minimal contention --
the inner side of each nestloop join all probe a different fact table
index to the others. It's almost like executing several different
simple queries concurrently, with some serial processing at the end.
Even that serial processing can sometimes be minimized by having some
of the parallel workers use a Bloom filter in shared memory.

Tom is already concerned that the optimization added by this patch may
be too much of a special case, which is understandable. It may be that
we're failing to identify some greater opportunity to add DAG-like
plans for star schema queries.

--
Peter Geoghegan

From:	Peter Geoghegan <pg(at)bowt(dot)ie>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Jim Nasby <jim(dot)nasby(at)openscg(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject:	Re: [HACKERS] Re: Improve OR conditions on joined columns (common star schema problem)
Date:	2018-08-24 01:27:54
Message-ID:	CAH2-WzmWSDO3KTMQqMVH6emCz-Y0G7C_WZwLF40pocfj7SBxFw@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Thu, Aug 23, 2018 at 5:20 PM, Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
> This patch adds an enhancement that is an example of a broader class
> of optimizer enhancement primarily aimed at making star-schema queries
> have more efficient plans, by arranging to use several independent
> nested loop joins based on a common pattern. Each nestloop join has
> one particular dimension table on the outer side, and the fact table
> on the inner side.

Correction: I meant that for each join, the outer side is a scan of
some particular fact table index, while the inner side probes some
particular dimension table's primary key, and evaluates the
dimension's qual.

--
Peter Geoghegan

From:	Noah Misch <noah(at)leadboat(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Jim Nasby <jim(dot)nasby(at)openscg(dot)com>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject:	Re: [HACKERS] Re: Improve OR conditions on joined columns (common star schema problem)
Date:	2018-09-04 01:59:10
Message-ID:	20180904015910.GA1797012@rfd.leadboat.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Sun, Feb 12, 2017 at 09:32:36AM -0800, Tom Lane wrote:
> It's not so much poor choices as the cost of the optimization attempt ---
> if there's a K-relation OR clause, this will increase the cost of planning
> by a factor approaching K+1, whether or not you get a better plan out of
> it. I ran the regression tests with some instrumentation and determined
> that this logic fires a dozen or two times, and fails to produce a plan
> that looks cheaper than the standard plan in any of those cases. So if we
> go down this road, not only do we need a GUC but I suspect it had better
> default to off; only people using star schemas are really likely to get a
> win out of it.

I benchmarked using the attached script. Highlights:

$ perl mk-bench-union-or.pl --dimensions=11 --tests=const | psql -X
# TRAP: FailedAssertion("!(uniq_exprs != ((List *) ((void *)0)))", File: "planunionor.c", Line: 335)

$ perl mk-bench-union-or.pl --dimensions=11 --tests=var --exhaustive --values-per-dimension=0 | psql -X
# TRAP: FailedAssertion("!(list_length(dest_tlist) == list_length(src_tlist))", File: "tlist.c", Line: 344)

$ perl mk-bench-union-or.pl --dimensions=7 --or-clauses=20 --exhaustive --tests=const,rowmark | psql -X
... (with planunionor.c plan)
Planning Time: 1902.013 ms
Execution Time: 291.629 ms
... (without planunionor.c plan)
Planning Time: 11.100 ms
Execution Time: 64.452 ms
# planning 170x slower, chosen plan 4.5x slower

I agree this warrants a GUC, but I propose a goal of making it fitting to
enable by default. The is_suitable_join_or_clause() test is a good indicator
of promising queries, and I suspect it's cheap enough to run all the time. As
a DBA, I'd struggle to predict when is_suitable_join_or_clause() will pass
while the optimization as a whole will lose; I would resort to testing each
important query both ways. In other words, the GUC would boil down to a
planner hint, not to a declaration about the table schema.

On Thu, Aug 23, 2018 at 02:10:46PM -0400, Tom Lane wrote:
> Rebased up to HEAD, per cfbot nagging. Still no substantive change from
> v2.

> + * is retty mechanical, but we can't do it until we have a RelOptInfo for the

Jim Nasby had pointed out this s/retty/pretty/ typo.

> + void
> + finish_union_or_paths(PlannerInfo *root, RelOptInfo *joinrel,
> + List *union_or_subpaths)
> + {
...
> + pathnode = makeNode(UniquePath);
...
> + /* Estimate number of output rows */
> + pathnode->path.rows = appendpath->rows;

If you're going to keep this highly-simplified estimate, please expand the
comment to say why it doesn't matter or what makes it hard to do better. The
non-planunionor.c path for the same query computes its own estimate of the
same underlying quantity. Though it may be too difficult in today's system,
one could copy the competing path's row count estimate here. Perhaps it
doesn't matter because higher-level processing already assumes equal row count
among competitors?

Attachment	Content-Type	Size
mk-bench-union-or.pl	application/x-perl	2.9 KB

From:	Michael Paquier <michael(at)paquier(dot)xyz>
To:	Noah Misch <noah(at)leadboat(dot)com>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Jim Nasby <jim(dot)nasby(at)openscg(dot)com>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject:	Re: [HACKERS] Re: Improve OR conditions on joined columns (common star schema problem)
Date:	2018-10-02 01:11:18
Message-ID:	20181002011118.GT11712@paquier.xyz
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Mon, Sep 03, 2018 at 06:59:10PM -0700, Noah Misch wrote:
> If you're going to keep this highly-simplified estimate, please expand the
> comment to say why it doesn't matter or what makes it hard to do better. The
> non-planunionor.c path for the same query computes its own estimate of the
> same underlying quantity. Though it may be too difficult in today's system,
> one could copy the competing path's row count estimate here. Perhaps it
> doesn't matter because higher-level processing already assumes equal row count
> among competitors?

As there has been no replies to Noah's review for one month, I am
marking this patch as returned with feedback for now.
--
Michael

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Michael Paquier <michael(at)paquier(dot)xyz>
Cc:	Noah Misch <noah(at)leadboat(dot)com>, Jim Nasby <jim(dot)nasby(at)openscg(dot)com>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject:	Re: [HACKERS] Re: Improve OR conditions on joined columns (common star schema problem)
Date:	2018-10-02 01:32:10
Message-ID:	7872.1538443930@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Michael Paquier <michael(at)paquier(dot)xyz> writes:
> On Mon, Sep 03, 2018 at 06:59:10PM -0700, Noah Misch wrote:
>> If you're going to keep this highly-simplified estimate, please expand the
>> comment to say why it doesn't matter or what makes it hard to do better. The
>> non-planunionor.c path for the same query computes its own estimate of the
>> same underlying quantity. Though it may be too difficult in today's system,
>> one could copy the competing path's row count estimate here. Perhaps it
>> doesn't matter because higher-level processing already assumes equal row count
>> among competitors?

> As there has been no replies to Noah's review for one month, I am
> marking this patch as returned with feedback for now.

FWIW, my problem with this patch is that I remain unconvinced of the basic
correctness of the transform (specifically the unique-ification approach).
Noah's points would be important to address if we were moving the patch
towards commit, but I don't see much reason to put effort into it until
we can think of a way to prove whether that works.

regards, tom lane

From:	Noah Misch <noah(at)leadboat(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Michael Paquier <michael(at)paquier(dot)xyz>, Jim Nasby <jim(dot)nasby(at)openscg(dot)com>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject:	Re: [HACKERS] Re: Improve OR conditions on joined columns (common star schema problem)
Date:	2018-10-02 14:39:41
Message-ID:	20181002143941.GA219060@rfd.leadboat.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Mon, Oct 01, 2018 at 09:32:10PM -0400, Tom Lane wrote:
> Michael Paquier <michael(at)paquier(dot)xyz> writes:
> > On Mon, Sep 03, 2018 at 06:59:10PM -0700, Noah Misch wrote:
> >> If you're going to keep this highly-simplified estimate, please expand the
> >> comment to say why it doesn't matter or what makes it hard to do better. The
> >> non-planunionor.c path for the same query computes its own estimate of the
> >> same underlying quantity. Though it may be too difficult in today's system,
> >> one could copy the competing path's row count estimate here. Perhaps it
> >> doesn't matter because higher-level processing already assumes equal row count
> >> among competitors?
>
> > As there has been no replies to Noah's review for one month, I am
> > marking this patch as returned with feedback for now.
>
> FWIW, my problem with this patch is that I remain unconvinced of the basic
> correctness of the transform (specifically the unique-ification approach).
> Noah's points would be important to address if we were moving the patch
> towards commit, but I don't see much reason to put effort into it until
> we can think of a way to prove whether that works.

Not even effort to fix the assertion failures I reported?

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Noah Misch <noah(at)leadboat(dot)com>
Cc:	Michael Paquier <michael(at)paquier(dot)xyz>, Jim Nasby <jim(dot)nasby(at)openscg(dot)com>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject:	Re: [HACKERS] Re: Improve OR conditions on joined columns (common star schema problem)
Date:	2018-10-02 14:53:40
Message-ID:	1696.1538492020@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Noah Misch <noah(at)leadboat(dot)com> writes:
> On Mon, Oct 01, 2018 at 09:32:10PM -0400, Tom Lane wrote:
>> FWIW, my problem with this patch is that I remain unconvinced of the basic
>> correctness of the transform (specifically the unique-ification approach).
>> Noah's points would be important to address if we were moving the patch
>> towards commit, but I don't see much reason to put effort into it until
>> we can think of a way to prove whether that works.

> Not even effort to fix the assertion failures I reported?

If it seemed relevant to the proof-of-correctness problem, I would look
into it, but ...

regards, tom lane

From:	Noah Misch <noah(at)leadboat(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Michael Paquier <michael(at)paquier(dot)xyz>, Jim Nasby <jim(dot)nasby(at)openscg(dot)com>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject:	Re: [HACKERS] Re: Improve OR conditions on joined columns (common star schema problem)
Date:	2019-03-18 00:09:42
Message-ID:	20190318000942.GA2804728@rfd.leadboat.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Tue, Oct 02, 2018 at 10:53:40AM -0400, Tom Lane wrote:
> Noah Misch <noah(at)leadboat(dot)com> writes:
> > On Mon, Oct 01, 2018 at 09:32:10PM -0400, Tom Lane wrote:
> >> FWIW, my problem with this patch is that I remain unconvinced of the basic
> >> correctness of the transform (specifically the unique-ification approach).
> >> Noah's points would be important to address if we were moving the patch
> >> towards commit, but I don't see much reason to put effort into it until
> >> we can think of a way to prove whether that works.
>
> > Not even effort to fix the assertion failures I reported?
>
> If it seemed relevant to the proof-of-correctness problem, I would look
> into it, but ...

I put some hours into theoretical study of the proof, and I didn't find any
holes. When the planner removes "l0 LEFT JOIN r1", it does that because
there's one output row per l0.ctid regardless of the rows of r1. Hence,
deduplicating on (l0.ctid) is equivalent to deduplicating on (l0.ctid,
r1.ctid). In the bad FULL JOIN case, (l0.ctid, r1.ctid) would be sufficient
as a key, but we're out of luck because removing the join makes us have only
l0.ctid for some tuples and only r1.ctid for other tuples.

If PostgreSQL ever gets inner join removal, it would be harder to preserve
this optimization in this form. At that point, perhaps we'd cost the path
that retains the join for the benefit of $SUBJECT. Given the performance test
results, $SUBJECT may already need a cost-based decision on whether to use it.

From:	Andy Fan <zhihui(dot)fan1213(at)gmail(dot)com>
To:	Noah Misch <noah(at)leadboat(dot)com>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Michael Paquier <michael(at)paquier(dot)xyz>, Jim Nasby <jim(dot)nasby(at)openscg(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject:	Re: [HACKERS] Re: Improve OR conditions on joined columns (common star schema problem)
Date:	2021-05-16 02:03:07
Message-ID:	CAKU4AWp-O6goWhtVggUd1Lh=x90W5OMazUh0s71KkRC1u5dKJA@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Mon, Mar 18, 2019 at 8:09 AM Noah Misch <noah(at)leadboat(dot)com> wrote:

> On Tue, Oct 02, 2018 at 10:53:40AM -0400, Tom Lane wrote:
> > Noah Misch <noah(at)leadboat(dot)com> writes:
> > > On Mon, Oct 01, 2018 at 09:32:10PM -0400, Tom Lane wrote:
> > >> FWIW, my problem with this patch is that I remain unconvinced of the
> basic
> > >> correctness of the transform (specifically the unique-ification
> approach).
> > >> Noah's points would be important to address if we were moving the
> patch
> > >> towards commit, but I don't see much reason to put effort into it
> until
> > >> we can think of a way to prove whether that works.
> >
> > > Not even effort to fix the assertion failures I reported?
> >
> > If it seemed relevant to the proof-of-correctness problem, I would look
> > into it, but ...
>
I put some hours into theoretical study of the proof, and I didn't find any
> holes. When the planner removes "l0 LEFT JOIN r1", it does that because
> there's one output row per l0.ctid regardless of the rows of r1. Hence,
> deduplicating on (l0.ctid) is equivalent to deduplicating on (l0.ctid,
> r1.ctid). In the bad FULL JOIN case, (l0.ctid, r1.ctid) would be
> sufficient
> as a key, but we're out of luck because removing the join makes us have
> only
> l0.ctid for some tuples and only r1.ctid for other tuples.
>
> If PostgreSQL ever gets inner join removal, it would be harder to preserve
> this optimization in this form. At that point, perhaps we'd cost the path
> that retains the join for the benefit of $SUBJECT. Given the performance
> test
> results, $SUBJECT may already need a cost-based decision on whether to use
> it.
>
>
As for the proof-of-correctness problem, I think the below strategy should
be
easy to understand.

SELECT * FROM any_v WHERE (A OR B OR C).
=>
SELECT * FROM any_v WHERE A
UNION ALL
SELECT * FROM any_v WHERE B AND !A
UNION ALL
SELECT * FROM any_v WHERE C AND !A AND !(B AND !A);
where !(Expr) means (NOT (expr) OR (EXPR) IS NULL)
In this way, there is no duplicated data at the first. Oracle uses a similar
way like this[1].

- A shared planner info idea.

About the planning time case, suppose we have a query below.

SELECT * FROM t1, t2, ... t20 WHERE (t1.a = 1 or t2.a = 1) and xxx;

With the current strategy, t3 .. t20 would be planned many times, including
base
relation like (t3.. t20) and joinrel like (t{3, 4}, t(3,4,5}, t(3,4,5,6})
and
so on. I guess all these relations would get the same result at every
time. we
don't need to do that at every section of the Union ALL case. So an idea of
PlannerInfo A can access the planning result of PlannerInfo B comes, this
idea
doesn't come in my mind for the first time, many cost based
transformations may
need this.

Then I have the following POC in my mind.

PlannerInfo
{

PlannerInfo *shared_planner_info;
Relids or_relids; // this field can be improved later.
};

With this way, if we find a 'relids' is not related to or_relids. We can
grab
the planning result from shared_planner_info (in this or-union case, we can
set
that after we plan the first section of UNION).

Then I found this one would not work after planning time partition prune
since
the relid would change. For example:

P (P1, P2) PARTITION BY A
Q (Q1, Q2) PARTITION BY A

SELECT * FROM t, p, q where (t.a = 1 or p.a = 1). So in the un-transformed
case, Q2 would have relid = 7. After the transform, Q2 probably has relid
= 6.

So basically, I have 4 questions now.
1. Would investing on the shared_planner_info idea be a good idea?
2. Without planning time prune, shall the above design work?
3. Currently I want to use relid to identify a resultset and pathlists which
have the planning time prune issue, should we consider other methods?
4. Any other suggestion about to resolve the 'duplicated planning case'
besides
the shared_planner_info method?

[1]
https://blogs.oracle.com/optimizer/optimizer-transformations:-or-expansion

--
Best Regards
Andy Fan (https://www.aliyun.com/)