From: Anagh Lal <anaghlal2001(at)yahoo(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Projection while performing joins.
Date: 2003-02-11 04:41:58
Message-ID: 20030211044158.5415.qmail@web14306.mail.yahoo.com
Lists: pgsql-hackers

Hi,
In the merge join as well as in the nested loop join, we do
ExecProject() after we have found matching tuples from the
relations involved in the join.
For a join query involving just two relations where a merge
join is used, the outer plan will be a Sort node. The Sort node
will create temp files and keep a sorted version of the
relation in those temp files. (Correct?)
I was wondering: why do we not just store the attributes
required by the join (i.e. those in the join qual conditions
plus the ones in the select list) and then perform sorting and
retrieval on these narrower tuples, rather than on the possibly
larger tuples with extra attributes that we project away at the
end anyway?

Is there any reason why this is infeasible to implement, or is
it simply a wrong approach theoretically?

If not, is this technique already being used anywhere? (I hope
the answer is not Postgres... I would have embarrassed myself
to no end then.)

Another way of putting this would be: why don't we push the
projection operator as far down as possible? In that case we
also add the attributes used in the selection operator (of
relational algebra) to the projection, and then perform the
join on the smaller resulting relations.
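
For a concrete (made-up) example, with relations r(a, b, c) and
s(b, d) and the query SELECT r.a FROM r, s WHERE r.b = s.b, the
rewrite I have in mind is roughly

    \pi_{r.a}\bigl( r \bowtie_{r.b = s.b} s \bigr)
      = \pi_{r.a}\bigl( \pi_{r.a,\,r.b}(r) \bowtie_{r.b = s.b} \pi_{s.b}(s) \bigr)

so the sorts and the join only ever see the one or two columns
they actually need.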

Regards,

Anagh Lal.


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Anagh Lal <anaghlal2001(at)yahoo(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Projection while performing joins.
Date: 2003-02-11 05:45:09
Message-ID: 22865.1044942309@sss.pgh.pa.us
Lists: pgsql-hackers

Anagh Lal <anaghlal2001(at)yahoo(dot)com> writes:
> Why do we not just store the attributes required by the join
> (i.e. those in the join qual conditions plus the ones in the
> select list) and then perform sorting and retrieval on these
> narrower tuples, rather than on the possibly larger tuples
> with extra attributes that we project away at the end anyway?

We already do --- the scan nodes project out only the needed columns.

At least, that's true in released versions. Just a few days ago I put
a hack into CVS tip to skip the projection step for a scan node that's
not topmost in the plan. As you indicate, that's probably a net loss
when there's a Sort node directly above the scan node. Needs more
thought...
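
Roughly, the per-tuple flow at a scan node looks like the
sketch below. (This is a simplified illustration with made-up
types and names, not the actual executor code; the real thing
works with TupleTableSlots and target lists.)

    #include <stdio.h>
    #include <stddef.h>

    /* Simplified sketch of "the scan node projects out only the needed
     * columns".  Every type and name here is made up for illustration;
     * this is not the real PostgreSQL executor code. */

    typedef struct Tuple {
        int ncols;
        int values[8];
    } Tuple;

    typedef struct ScanNode {
        Tuple *(*next_raw)(struct ScanNode *self); /* fetch next raw tuple */
        int    nproj;       /* 0 = projection skipped, pass tuple through  */
        int    projcols[8]; /* source column numbers to keep               */
        Tuple  result;      /* reusable output slot                        */
    } ScanNode;

    /* Return the next tuple, narrowed to the needed columns (NULL at end). */
    static Tuple *scan_next(ScanNode *node)
    {
        Tuple *raw = node->next_raw(node);
        if (raw == NULL)
            return NULL;
        if (node->nproj == 0)
            return raw;     /* skip projecting: some node above will do it */

        node->result.ncols = node->nproj;
        for (int i = 0; i < node->nproj; i++)
            node->result.values[i] = raw->values[node->projcols[i]];
        return &node->result;
    }

    /* A toy three-row, four-column "heap", to show the narrowing. */
    static Tuple heap[3] = {
        {4, {1, 10, 100, 1000}},
        {4, {2, 20, 200, 2000}},
        {4, {3, 30, 300, 3000}},
    };
    static int pos = 0;

    static Tuple *fake_heap_next(ScanNode *self)
    {
        (void) self;
        return (pos < 3) ? &heap[pos++] : NULL;
    }

    int main(void)
    {
        /* Keep only columns 0 and 2, as if those were the only ones the
         * query's target list and quals referenced. */
        ScanNode scan = { fake_heap_next, 2, {0, 2}, {0, {0}} };
        Tuple *t;
        while ((t = scan_next(&scan)) != NULL)
            printf("(%d, %d)\n", t->values[0], t->values[1]);
        return 0;
    }

In this sketch, the change now in CVS tip amounts to forcing
nproj to 0 for a scan node that is not topmost in the plan,
which is exactly the case that hurts when a Sort node sits
directly above the scan.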

regards, tom lane


From: Anagh Lal <anaghlal2001(at)yahoo(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Projection while performing joins.
Date: 2003-02-11 21:39:36
Message-ID: 20030211213936.73424.qmail@web14303.mail.yahoo.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi,
Two parts to this mail.

1)
> We already do --- the scan nodes project out only
> the needed columns.

OK, that's great.
I tried looking for what you are describing in the source code
[before and after doing a cvs update], but I am still confused
by the following:
In /backend/executor/nodeMergeJoin.c, in ExecMergeJoin(), in
the EXEC_MJ_JOINTUPLES state (the switch case), we still do
ExecProject(). What does this do?
(This is what prompted my incorrect assumption in the first
place.)

I was able to trace the ExecProject() that happens for the
SeqScan node in the ExecScan() function of
/executor/execScan.c. (My error was in assuming that the tuple
returned by the access function passed as an argument to
ExecScan() was being returned directly.) So now I see what you
were saying in your mail, but I am still not clear about why we
have an ExecProject() higher up in the join nodes; it is there
in the nested loop node too.

2)
> As you indicate, that's probably a net loss
> when there's a Sort node directly above the scan
> node. Needs more thought...

Some food for thought: let's ignore the attributes listed in
the select clause and work only with the where-clause (join
condition) attributes. As a back reference, store the tuple id
of the original whole tuple in the "working" tuple. At the
final output stage, do a lookup to retrieve the select-clause
attributes of only the qualifying tuples. That would let us
work with really small tuples throughout the sort and join.
Worth trying out?
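
As a sketch of the shape I mean (made-up names and types, not
executor code, and ignoring duplicate join keys), the sort and
merge would run over narrow (key, tuple id) pairs, and the wide
rows would only be refetched for tuples that actually join:

    #include <stdio.h>
    #include <stdlib.h>

    /* Sketch of "sort only (join key, tuple id) pairs, refetch the wide
     * rows at output time".  Made-up names and types, not executor code. */

    typedef struct Row  { int id; int key; const char *payload; } Row;
    typedef struct Work { int key; int tid; } Work; /* narrow working tuple */

    static int cmp_work(const void *a, const void *b)
    {
        return ((const Work *) a)->key - ((const Work *) b)->key;
    }

    /* Stand-in for fetching the whole original tuple back by its tuple id. */
    static const Row *refetch(const Row *table, int tid)
    {
        return &table[tid];
    }

    int main(void)
    {
        Row r[] = {{0, 7, "r-seven"}, {1, 3, "r-three"}, {2, 5, "r-five"}};
        Row s[] = {{0, 5, "s-five"},  {1, 7, "s-seven"}};

        /* Build and sort the narrow working tuples for each input. */
        Work wr[3], ws[2];
        for (int i = 0; i < 3; i++) wr[i] = (Work){ r[i].key, r[i].id };
        for (int i = 0; i < 2; i++) ws[i] = (Work){ s[i].key, s[i].id };
        qsort(wr, 3, sizeof(Work), cmp_work);
        qsort(ws, 2, sizeof(Work), cmp_work);

        /* Merge the narrow tuples; refetch wide rows only for matches.
         * (Duplicate keys are not handled in this toy merge.) */
        for (int i = 0, j = 0; i < 3 && j < 2; ) {
            if (wr[i].key < ws[j].key)
                i++;
            else if (wr[i].key > ws[j].key)
                j++;
            else {
                const Row *a = refetch(r, wr[i].tid);
                const Row *b = refetch(s, ws[j].tid);
                printf("%s joins %s\n", a->payload, b->payload);
                i++;
                j++;
            }
        }
        return 0;
    }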

Thanks
Anagh Lal


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Anagh Lal <anaghlal2001(at)yahoo(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Projection while performing joins.
Date: 2003-02-12 05:51:22
Message-ID: 1072.1045029082@sss.pgh.pa.us
Lists: pgsql-hackers

Anagh Lal <anaghlal2001(at)yahoo(dot)com> writes:
> ... I am still confused by the following:
> In /backend/executor/nodeMergeJoin.c, in ExecMergeJoin(), in
> the EXEC_MJ_JOINTUPLES state (the switch case), we still do
> ExecProject(). What does this do?

Well, sure. A join node *must* do a projection, no? It can't simply
return either the left or the right input tuple (except in the
vanishingly-small fraction of cases where you don't actually need any
columns from the right or the left respectively; which are cases that
we don't currently bother to optimize). To create a tuple that's not
exactly one or the other you must project.
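
Schematically (made-up names, purely illustrative): the join's
output row is stitched together from the referenced columns of
both input rows, which is what the ExecProject() call in the
join node does by evaluating its target list.

    #include <stdio.h>

    /* Illustrative only: the join's output tuple takes some columns from
     * the outer input and some from the inner input. */

    typedef struct JTuple { int ncols; int values[8]; } JTuple;

    static JTuple join_project(const JTuple *outer, const JTuple *inner,
                               int nout, const int from_outer[],
                               const int srccol[])
    {
        JTuple result = { nout, {0} };
        for (int i = 0; i < nout; i++)
            result.values[i] = from_outer[i] ? outer->values[srccol[i]]
                                             : inner->values[srccol[i]];
        return result;
    }

    int main(void)
    {
        JTuple outer = { 3, {1, 2, 3} };  /* say, r(a, b, c) = (1, 2, 3) */
        JTuple inner = { 2, {2, 50} };    /* say, s(b, d)    = (2, 50)   */

        /* SELECT r.a, s.d: column 0 of the outer row, column 1 of the
         * inner row.  The result matches neither input tuple exactly. */
        int from_outer[] = { 1, 0 };
        int srccol[]     = { 0, 1 };
        JTuple out = join_project(&outer, &inner, 2, from_outer, srccol);
        printf("(%d, %d)\n", out.values[0], out.values[1]); /* (1, 50) */
        return 0;
    }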

> Some food for thought: let's ignore the attributes listed in
> the select clause and work only with the where-clause (join
> condition) attributes. As a back reference, store the tuple id
> of the original whole tuple in the "working" tuple. At the
> final output stage, do a lookup to retrieve the select-clause
> attributes of only the qualifying tuples. That would let us
> work with really small tuples throughout the sort and join.
> Worth trying out?

Not sure there's a lot of traction here. In many cases, the
bottom-level scan gets advanced one or more rows before the top-level
nodes can pop out a result. (Consider GROUP BY as an example.)
I don't see the advantage of adding bookkeeping/copying for past
rows in order to avoid copying current-row data around.

But feel free to prove me wrong ;-)

regards, tom lane