Re: updated join removal patch

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: updated join removal patch
Date: 2009-09-16 01:29:58
Message-ID: 26443.1253064598@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> Here we go again. Following Tom's advice to not insert crocks in
> add_path(), I've instead introduced a noopjoin path type which ignores
> its inner side. This could possibly be simplified to just a "noop"
> path that doesn't even include an outer side, but the way I've done it
> here doesn't really cost anything and might allow debug pprints to
> print more useful information. If, in the future, we have some other
> use for a noop path, we might want to reconsider, though.

I took a closer look at this version of the patch. Some comments:

* I'm not really happy with the "NoopJoinPath" representation. In the
first place, it isn't a join and should not be carrying a useless inner
path. The other nasty thing about it is that it violates the rule that
path.pathtype is supposed to be the nodetag for the executable plan node
that corresponds to the path. My original thought about this was to just
use the left input's paths as-is. I am not totally sure that that works
--- up to now, all the members of a reloptinfo's path list have had that
same reloptinfo as their parent --- but it seems less ugly than this.

A possible compromise is to use T_SubqueryScan as the pathtype, with
the idea that we're pretending a SubqueryScan is to be inserted and then
immediately optimized away. But I don't want the inner path in there
in any case.

* Speaking of the left input, it's not good enough to copy up only the
cheapest startup and cheapest total paths. If there are any other paths
there at all, they're probably there because they have interesting sort
orders. With the patch as-is, it'd be possible for join removal to be a
net loss because it forces an extra sort at an upper level. Now it's
possible that some of those sort orderings are only interesting to use for
mergejoining in the same join that we're removing; in which case they
could safely be discarded. But I'm not sure that's worth checking for,
and I am sure that throwing everything away shouldn't be the base
behavior.

* I'm quite unimpressed with the refactoring you did to make a single
function deal with merge clauses, hash clauses, *and* join removal
clauses. I think that makes it close to unmaintainable and is buying
little if anything speedwise. We can afford another iteration over
the join clauses, especially if there's a GUC to let people turn it
off. (BTW, do we really need that GUC, or is it only there for testing
the patch? I'm leaning towards not having it.)

* I'm not sure about this, because surely you would have tested it,
but isn't it looking at the wrong side of the join clauses? I thought
the idea is to prove the nullable (inner) side of the join unique.

* Not entirely sure where to put the code that does the hard work
(relation_is_distinct_for). I see you put it in pathnode.c beside
query_is_distinct_for, but the *only* reason the latter is where it is
is that it has only one caller, namely create_unique_path, which is
(and belongs) in that module. Opening up pathnode's API to include a
function unrelated to its purpose doesn't seem right. So I'm inclined
to think they should both go someplace else, just not sure where.
There doesn't seem to be any planner file whose charter is to export
this type of knowledge; maybe we need to add one.

* It shouldn't be that hard to support expression indexes. I think the
only reason you can't do it right now is that you chose an intermediate
data structure that presumes simple Vars ... but if you refactor to
have bespoke code examining the indexes and joinclauses together,
I think it would work all right.

* I wonder whether all the relevant clauses are really to be found as
join clauses. Consider
tab1 left join tab2 on (tab1.a = tab2.x and tab2.y = 42)
where tab2 has a unique index on (x,y). This would at least suggest
examining the inner rel's baserestrictclauses along with the current
join's clauses.

* I wouldn't really bother with cost_noopjoin. It does not, and never
will, encapsulate any interesting knowledge. In other places where we
have similar issues (eg no-op UniquePaths in create_unique_path), we
just copy up the child path's costs without any folderol.

Do you want to have another go at this, or shall I?

regards, tom lane

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Stephen Frost 2009-09-16 01:43:50 Re: query cancel issues in contrib/dblink
Previous Message Scott Mohekey 2009-09-15 23:56:26 Re: Timestamp to time_t