Re: updated join removal patch

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: updated join removal patch
Date: 2009-09-16 02:10:37
Message-ID: 603c8f070909151910x19c69651xda5b4854991301bd@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Sep 15, 2009 at 9:29 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> Here we go again.  Following Tom's advice to not insert crocks in
>> add_path(), I've instead introduced a noopjoin path type which ignores
>> its inner side.  This could possibly be simplified to just a "noop"
>> path that doesn't even include an outer side, but the way I've done it
>> here doesn't really cost anything and might allow debug pprints to
>> print more useful information.  If, in the future, we have some other
>> use for a noop path, we might want to reconsider, though.
>
> I took a closer look at this version of the patch.  Some comments:
>
> * I'm not really happy with the "NoopJoinPath" representation.  In the
> first place, it isn't a join and should not be carrying a useless inner
> path.  The other nasty thing about it is that it violates the rule that
> path.pathtype is supposed to be the nodetag for the executable plan node
> that corresponds to the path.  My original thought about this was to just
> use the left input's paths as-is.  I am not totally sure that that works
> --- up to now, all the members of a reloptinfo's path list have had that
> same reloptinfo as their parent --- but it seems less ugly than this.
>
> A possible compromise is to use T_SubqueryScan as the pathtype, with
> the idea that we're pretending a SubqueryScan is to be inserted and then
> immediately optimized away.  But I don't want the inner path in there
> in any case.

I don't have strong feelings about it. I thought about making just a
Noop path type, but I couldn't see any clear reason to prefer it one
way or the other. The only reason we need a path type at all is
because you didn't like the crocks I inserted into add_path() to avoid
pfree()-ing paths that might still be pointed to from elsewhere. But
those crocks were VASTLY simpler than this, which feels completely
Rube Goldberg-esque to me. If it were up to me, we'd be either
reinserting those crocks, or looking for some less crock-y variant of
them.

> * Speaking of the left input, it's not good enough to copy up only the
> cheapest startup and cheapest total paths.  If there are any other paths
> there at all, they're probably there because they have interesting sort
> orders.  With the patch as-is, it'd be possible for join removal to be a
> net loss because it forces an extra sort at an upper level.  Now it's
> possible that some of those sort orderings are only interesting to use for
> mergejoining in the same join that we're removing; in which case they
> could safely be discarded.  But I'm not sure that's worth checking for,
> and I am sure that throwing everything away shouldn't be the base
> behavior.

Good point.

> * I'm quite unimpressed with the refactoring you did to make a single
> function deal with merge clauses, hash clauses, *and* join removal
> clauses.  I think that makes it close to unmaintainable and is buying
> little if anything speedwise.  We can afford another iteration over
> the join clauses, especially if there's a GUC to let people turn it
> off.  (BTW, do we really need that GUC, or is it only there for testing
> the patch?  I'm leaning towards not having it.)

You're the committer; I'm not. But I completely disagree. There
isn't any reason at all to duplicate this logic in two separate
places, let alone three. I'd actually be in favor of merging the
existing two cases even if we weren't adding join removal. The
existing code does the identical tests in hash_inner_and_outer() in a
slightly different order than select_mergejoin_clauses() for no reason
at all.

As for a GUC, I think it would be useful to have for debugging, or in
case of bugs. It's really another join strategy, and we have enable_*
flags for all the others. But if you don't want it for some reason,
I'm not in a position to twist your arm.

> * I'm not sure about this, because surely you would have tested it,
> but isn't it looking at the wrong side of the join clauses?  I thought
> the idea is to prove the nullable (inner) side of the join unique.

Grr. I think it's more broken than that. Wow, this is really embarassing.

> * Not entirely sure where to put the code that does the hard work
> (relation_is_distinct_for).  I see you put it in pathnode.c beside
> query_is_distinct_for, but the *only* reason the latter is where it is
> is that it has only one caller, namely create_unique_path, which is
> (and belongs) in that module.  Opening up pathnode's API to include a
> function unrelated to its purpose doesn't seem right.  So I'm inclined
> to think they should both go someplace else, just not sure where.
> There doesn't seem to be any planner file whose charter is to export
> this type of knowledge; maybe we need to add one.

Yeah, it seemed a little weird to me. For a while I was reusing some
of the support functions that query_is_distinct_for() calls, but the
final version doesn't. I wonder if it should just be moved to
joinpath.c; it seems to fit in better with what is going on there.

> * It shouldn't be that hard to support expression indexes.  I think the
> only reason you can't do it right now is that you chose an intermediate
> data structure that presumes simple Vars ... but if you refactor to
> have bespoke code examining the indexes and joinclauses together,
> I think it would work all right.

I'm not sure how to compare exprs, or what you mean by bespoke code,
but I'd be happy to have someone generalize the approach.

> * I wonder whether all the relevant clauses are really to be found as
> join clauses.  Consider
>        tab1 left join tab2 on (tab1.a = tab2.x and tab2.y = 42)
> where tab2 has a unique index on (x,y).  This would at least suggest
> examining the inner rel's baserestrictclauses along with the current
> join's clauses.

Also a good point.

> * I wouldn't really bother with cost_noopjoin.  It does not, and never
> will, encapsulate any interesting knowledge.  In other places where we
> have similar issues (eg no-op UniquePaths in create_unique_path), we
> just copy up the child path's costs without any folderol.

Don't care one way or the other.

> Do you want to have another go at this, or shall I?

Please feel free if you're so inclined.

...Robert

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2009-09-16 02:25:18 Re: updated join removal patch
Previous Message Fujii Masao 2009-09-16 02:01:10 Re: PGCluster-II Progress