Re: Custom/Foreign-Join-APIs (Re: [v9.5] Custom Plan API)

From: Kouhei Kaigai <kaigai(at)ak(dot)jp(dot)nec(dot)com>
To: Shigeru Hanada <shigeru(dot)hanada(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Thom Brown <thom(at)linux(dot)com>, "pgsql-hackers(at)postgreSQL(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Custom/Foreign-Join-APIs (Re: [v9.5] Custom Plan API)
Date: 2015-03-17 06:23:47
Message-ID: 9A28C8860F777E439AA12E8AEA7694F8010C1B7E@BPXM15GP.gisp.nec.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

> -----Original Message-----
> From: Shigeru Hanada [mailto:shigeru(dot)hanada(at)gmail(dot)com]
> Sent: Monday, March 16, 2015 9:59 PM
> To: Robert Haas
> Cc: Tom Lane; Thom Brown; Kaigai Kouhei(海外 浩平); pgsql-hackers(at)postgreSQL(dot)org
> Subject: ##freemail## Re: Custom/Foreign-Join-APIs (Re: [HACKERS] [v9.5] Custom
> Plan API)
>
> 2015-03-14 7:18 GMT+09:00 Robert Haas <robertmhaas(at)gmail(dot)com>:
> > I think the foreign data wrapper join pushdown case, which also aims
> > to substitute a scan for a join, is interesting to think about, even
> > though it's likely to be handled by a new FDW method instead of via
> > the hook. Where should the FDW method get called from? Currently,
> > the FDW method in KaiGai's patch is GetForeignJoinPaths, and that gets
> > called from add_paths_to_joinrel(). The patch at
> >
> http://www.postgresql.org/message-id/CAEZqfEfy7p=uRpwN-Q-NNgzb8kwHbfqF82YSb9
> ztFZG7zN64Xw(at)mail(dot)gmail(dot)com
> > uses that to implement join pushdown in postgres_fdw; if you have A
> > JOIN B JOIN C all on server X, we'll notice that the join with A and B
> > can be turned into a foreign scan on A JOIN B, and similarly for A-C
> > and B-C. Then, if it turns out that the cheapest path for A-B is the
> > foreign join, and the cheapest path for C is a foreign scan, we'll
> > arrive at the idea of a foreign scan on A-B-C, and we'll realize the
> > same thing in each of the other combinations as well. So, eventually
> > the foreign join gets pushed down.
>
> From the viewpoint of postgres_fdw, incremental approach seemed
> natural way, although postgres_fdw should consider paths in pathlist
> in additon to cheapest one as you mentioned in another thread. This
> approarch allows FDW to use SQL statement generated for underlying
> scans as parts of FROM clause, as postgres_fdw does in the join
> push-down patch.
>
> > But there's another possible approach: suppose that
> > join_search_one_level, after considering left-sided and right-sided
> > joins and after considering bushy joins, checks whether every relation
> > it's got is from the same foreign server, and if so, asks that foreign
> > server whether it would like to contribute any paths. Would that be
> > better or worse? A disadvantage is that if you've got something like
> > A LEFT JOIN B LEFT JOIN C LEFT JOIN D LEFT JOIN E LEFT JOIN F LEFT
> > JOIN G LEFT JOIN H LEFT JOIN I but none of the joins can be pushed
> > down (say, each join clause calls a non-pushdown-safe function) you'll
> > end up examining a pile of joinrels - at every level of the join tree
> > - and individually rejecting each one. With the
> > build-it-up-incrementally approach, you'll figure that all out at
> > level 2, and then after that there's nothing to do but give up
> > quickly. On the other hand, I'm afraid the incremental approach might
> > miss a trick: consider small LEFT JOIN (big INNER JOIN huge ON big.x =
> > huge.x) ON small.y = big.y AND small.z = huge.z, where all three are
> > foreign tables on the same server. If the output of the big/huge join
> > is big, none of those paths are going to survive at level 2, but the
> > overall join size might be very small, so we surely want a chance to
> > recover at level 3. (We discussed test cases of this form quite a bit
> > in the context of e2fa76d80ba571d4de8992de6386536867250474.)
>
> Interesting, I overlooked that pattern. As you pointed out, join
> between big foregin tables might be dominated, perhaps by a MergeJoin
> path. Leaving dominated ForeignPath in pathlist for more optimization
> in the future (in higher join level) is an idea, but it would make
> planning time longer (and use more cycle and memory).
>
> Tom's idea sounds good for saving the path b), but I worry that
> whether FDW can get enough information at that timing, just before
> set_cheapest. It would not be good I/F if each FDW needs to copy many
> code form joinrel.c...
>
I had a call to discuss this topic with Hanada-san. Even though he
expected FDW driver needs to check and extract relations involved
in a particular join, it also means we have less problem as long as
core backend can handle these common portion for all FDW/CSP drivers.
Thus, we need care about two hook locations. The first one is
add_paths_to_joinrel() as current patch doing, for custom-scan that
adds an alternative join logic and takes underlying child nodes as
input. The other one is standard_join_search() as Tom pointed out,
for foreign-scan of remote join, or for custom-scan that replaces
an entire join subtree.

One positive aspect of this approach is, postgres_fdw can handle
whole-row-reference much simpler than bottom-up approach, according
to Hanada-san.

Remaining issue is, how to implement the core portion that extracts
relations in a particular join, and to identify join type to be
applied on a particular relations.
One rough idea is, we pull relids bitmap from the target joinrel,
then references the SpecialJoinInfo with identical union bitmap
of left/righthand. It allows to inform FDW driver which relations
and which another relations shall be joined in this level.
For example, if relids=0x007 and relids=0x0018 are left joined,
PlannerInfo shall have a SpecialJoinInfo that fits the requirement.
Also, both of left/right side is not singleton, FDW driver will
takes recursive process to construct remote join query on relids=0x007
and relids=0x0018. If all of them are inner-join, we don't need to
take care about this. All FDW driver needs to do is, just putting
the involved relation names in FROM-clause.

It is my rough idea, thus, here may be better idea to extract
relations involved in a particular join on a certain level.
Please tell me, if you have some other ideas.

Thanks,
--
NEC OSS Promotion Center / PG-Strom Project
KaiGai Kohei <kaigai(at)ak(dot)jp(dot)nec(dot)com>

Browse pgsql-hackers by date

  From Date Subject
Next Message Noah Misch 2015-03-17 06:26:18 Re: assessing parallel-safety
Previous Message Craig Ringer 2015-03-17 06:09:34 Re: Moving Pivotal's Greenplum work upstream