Re: Left Outer Join much faster than non-outer Join?

From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: Ron Mayer <rm_pg(at)cheapcomplexdevices(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Left Outer Join much faster than non-outer Join?
Date: 2005-03-31 18:01:13
Message-ID: 1112292073.16721.327.camel@localhost.localdomain
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

On Thu, 2005-03-31 at 00:15 -0800, Ron Mayer wrote:
> Ron Mayer wrote:
> > Tom Lane wrote:
> >> rm_pg(at)cheapcomplexdevices(dot)com writes:
> >>> select *
> >>> from streetname_lookup as sl
> >>> join city_lookup as cl on (true)
> >>> left outer join tlid_smaller as ts on (sl.geo_streetname_id =
> >>> ts.geo_streetname_id and cl.geo_city_id=ts.geo_city_id)
> >>> where str_name='alamo' and city='san antonio' and state='TX'
> >>> ;
> >> That's a fairly odd query;
> >
> >
> > I think it's a very common type of query in data warehousing.
> >
> > It's reasonably typical of a traditional star schema where
> > "streetname_lookup" and "city_lookup" are dimension tables
> > and "tlid_smaller" is the central fact table.
>

Yes, agreed.

> Although looking again I must admit the query was
> written unconventionally. Perhaps those queries are
> remnants dating back to a version when you could
> force join orders this way?
>
> Perhaps a more common way of writing it would have been:
>
> select * from tlid_smaller
> where geo_streetname_id in (select geo_streetname_id from streetname_lookup where str_name='$str_name')
> and geo_city_id in (select geo_city_id from city_lookup where city='$city' and state='$state');
>
> However this query also fails to use the multi-column
> index on (geo_streetname_id,geo_city_id). Explain
> analyze shown below.

...which is my understanding too.

> In cases where I can be sure only one result will come
> from each of the lookup queries I guess I can do this:
>
> select * from tlid_smaller
> where geo_streetname_id = (select geo_streetname_id from streetname_lookup where str_name='$str_name')
> and geo_city_id = (select geo_city_id from city_lookup where city='$city' and state='$state');
>
> which has the nicest plan of them all (explain analyze
> also shown below).

Which is not the case for the generalised star join.

The general case query here is:
SELECT (whatever)
FROM FACT, DIMENSION1 D1, DIMENSION2 D2, DIMENSION3 D3etc..
WHERE
FACT.dimension1_pk = D1.dimension1_pk
AND FACT.dimension2_pk = D2.dimension2_pk
AND FACT.dimension3_pk = D3.dimension3_pk
AND D1.dimdescription = 'X'
AND D2.dimdescription = 'Y'
AND D3.dimdescription = 'Z'
...
with FACT PK=(dimension1_pk, dimension2_pk, dimension3_pk)

with a more specific example of
SELECT sum(item_price)
FROM Sales, Store, Item, TTime
WHERE
Sales.store_pk = Store.store_pk
AND Store.region = 'UK'
AND Sales.item_pk = Item.item_pk
AND Item.category = 'Cameras'
AND Sales.time_pk = TTime.time_pk
AND TTime.month = 3
AND TTime.year = 2005

A very good plan for solving this, under specific conditions is...
CartesianProduct(Store, Item, TTime) -> Sales.PK

which accesses the largest table only once.

As Tom says, the current optimizer won't go near that plan, for good
reason, without specifically tweaking collapse limits. I know full well
that any changes in that direction will need to be strong because that
execution plan is very sensitive to even minor changes in data
distribution.

The plan requires some fairly extensive checking to be put into place.
The selectivity of requests against the smaller tables needs to be very
well known, so that the upper bound estimate of cardinality of the
cartesian product is feasible AND still low enough to use the index on
Sales.

This is probably going to need information to be captured on multi-
column index selectivity, to ensure that last part.

It is likely that the statistics targets on the dimension tables would
need to be higher enough to identify MFVs or at least reduce the upper
bound of selectivity. It is also requires the table sizes to be
examined, to ensure this type of plan is considered pointlessly.
Some other systems that support this join type, turn off checking for it
by default. We could do the same with enable_starjoin = off.

Anyway, seems like a fair amount of work there... yes?

Best Regards, Simon Riggs

In response to

Browse pgsql-performance by date

  From Date Subject
Next Message Richard_D_Levine 2005-03-31 18:24:30 Re: Reading recommendations
Previous Message Steve Wampler 2005-03-31 15:48:09 Re: Reading recommendations