Re: Project proposal/comments please - query optimization

From: Kim Bisgaard <kib+pg(at)dmi(dot)dk>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Project proposal/comments please - query optimization
Date: 2005-08-12 08:11:31
Message-ID: 42FC59B3.1040902@dmi.dk
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Tom Lane wrote:

>Kim Bisgaard <kib+pg(at)dmi(dot)dk> writes:
>
>
>>I have noticed a deficiency in the current query optimizer related to
>>"full outer joins". Tom Lane has confirmed to me that it will not be 8.1
>>material.
>>
>>
>
>... There
>are related issues involving N-way joins that we're still not very
>good at.
>...
>
>Consider this version of Kim Bisgaard's example:
> SELECT FROM a join (b full join c using (id)) using (id)
>If A is small and B,C have indexes on ID then it is interesting to consider
>a plan like
> Nest Loop
> Scan A
> Merge Full Join
> Indexscan B using id = outer.id
> Indexscan C using id = outer.id
>We are fairly far from being able to do this. generate_outer_join_implications
>could easily be modified to generate derived equalities (I think it works to
>allow a deduction against any clause not overlapping the outerjoin itself)
>but the planner would want to evaluate them at the wrong level, and the
>executor doesn't have support for passing the outer variable down more than
>one level of join. This is why the existing hack works only for equalities
>to pseudoconstants. We could maybe mark join RestrictInfos as "valid only
>below xxx" and ignore them when processing a join that includes all of the
>indicated rels? Still not clear how you get the planner to recognize the
>above as an inner indexscan situation though.
>
The query samples I gave was the smallest test I could find to provoke
the behavior. Tom is right in that the full case I am ideally want
solved is of the form above with lots (below 20) of full outer joined
tables.

I am still a little intrigued by the effect of substituting "full" with
"left" in my examples; maybe an alternative to Toms idea could be to
work in the direction of treating "full" more like "left/right"

There are some examples of my problems (on nightly builds of yesterday)
at the bottom of the mail.

Regards,
Kim.

obsdb=> explain analyse select wmo_id,timeobs,temp_dry_at_2m,temp_grass
from station
join (temp_grass
full join temp_dry_at_2m using (station_id, timeobs)
) using (station_id)
where wmo_id=6065 and '2004-1-2 6:0' between startdate and enddate
and timeobs = '2004-1-2 06:0:0';


QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Hash Join (cost=5.04..372928.92 rows=1349 width=28) (actual
time=23986.480..46301.966 rows=1 loops=1)
Hash Cond: (COALESCE("outer".station_id, "outer".station_id) =
"inner".station_id)
-> Merge Full Join (cost=0.00..338124.90 rows=6957100 width=32)
(actual time=23965.761..46281.380 rows=76 loops=1)
Merge Cond: (("outer".timeobs = "inner".timeobs) AND
("outer".station_id = "inner".station_id))
Filter: (COALESCE("outer".timeobs, "inner".timeobs) =
'2004-01-02 06:00:00'::timestamp without time zone)
-> Index Scan using temp_grass_idx on temp_grass
(cost=0.00..75312.59 rows=2406292 width=16) (actual
time=12.436..4916.043 rows=2406292 loops=1)
-> Index Scan using temp_dry_at_2m_idx on temp_dry_at_2m
(cost=0.00..210390.85 rows=6957100 width=16) (actual
time=13.696..21363.054 rows=6956994 loops=1)
-> Hash (cost=5.03..5.03 rows=1 width=8) (actual
time=19.612..19.612 rows=0 loops=1)
-> Index Scan using wmo_idx on station (cost=0.00..5.03
rows=1 width=8) (actual time=19.586..19.591 rows=1 loops=1)
Index Cond: ((wmo_id = 6065) AND ('2004-01-02
06:00:00'::timestamp without time zone >= startdate) AND ('2004-01-02
06:00:00'::timestamp without time zone <= enddate))
Total runtime: 46302.208 ms
(11 rows)

obsdb=> explain analyse select
wmo_id,timeobs,wind_dir_10m,temp_dry_at_2m,temp_grass
from station
join (temp_grass
full join temp_dry_at_2m using (station_id, timeobs)
full join wind_dir_10m using (station_id, timeobs)
) using (station_id)
where wmo_id=6065 and '2004-1-2 6:0' between startdate and enddate
and timeobs = '2004-1-2 06:0:0';


QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Hash Join (cost=1249443.54..1700082.70 rows=1392 width=40) (actual
time=331437.803..384389.174 rows=1 loops=1)
Hash Cond: (COALESCE(COALESCE("outer".station_id,
"outer".station_id), "outer".station_id) = "inner".station_id)
-> Merge Full Join (cost=1249438.51..1664170.44 rows=7178660
width=48) (actual time=291406.613..384359.322 rows=82 loops=1)
Merge Cond: (("outer".station_id = "inner"."?column7?") AND
("outer".timeobs = "inner"."?column8?"))
Filter: (COALESCE("inner"."?column8?", "outer".timeobs) =
'2004-01-02 06:00:00'::timestamp without time zone)
-> Index Scan using wind_dir_10m_idx on wind_dir_10m
(cost=0.00..299534.38 rows=7178660 width=16) (actual
time=15.194..52605.785 rows=7178639 loops=1)
-> Sort (cost=1249438.51..1266831.26 rows=6957100 width=32)
(actual time=290255.163..302211.520 rows=8743745 loops=1)
Sort Key: COALESCE(temp_grass.station_id,
temp_dry_at_2m.station_id), COALESCE(temp_grass.timeobs,
temp_dry_at_2m.timeobs)
-> Merge Full Join (cost=0.00..337004.00 rows=6957100
width=32) (actual time=28.909..73512.533 rows=8743745 loops=1)
Merge Cond: (("outer".timeobs = "inner".timeobs)
AND ("outer".station_id = "inner".station_id))
-> Index Scan using temp_grass_idx on temp_grass
(cost=0.00..75312.59 rows=2406292 width=16) (actual
time=15.178..6088.732 rows=2406292 loops=1)
-> Index Scan using temp_dry_at_2m_idx on
temp_dry_at_2m (cost=0.00..210390.85 rows=6957100 width=16) (actual
time=13.694..25573.509 rows=6956994 loops=1)
-> Hash (cost=5.03..5.03 rows=1 width=8) (actual
time=28.609..28.609 rows=0 loops=1)
-> Index Scan using wmo_idx on station (cost=0.00..5.03
rows=1 width=8) (actual time=28.581..28.585 rows=1 loops=1)
Index Cond: ((wmo_id = 6065) AND ('2004-01-02
06:00:00'::timestamp without time zone >= startdate) AND ('2004-01-02
06:00:00'::timestamp without time zone <= enddate))
Total runtime: 384979.287 ms
(16 rows)

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message William ZHANG 2005-08-12 10:11:54 CREATE USER and pg_user
Previous Message Martijn van Oosterhout 2005-08-12 07:56:48 Re: Determining return type of polymorphic function