Re: Partition-wise join for join between (declaratively) partitioned tables

Lists:	pgsql-hackers

From:	Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
To:	Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
Cc:	Robert Haas <robertmhaas(at)gmail(dot)com>, Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Partition-wise join for join between (declaratively) partitioned tables
Date:	2017-08-10 06:23:04
Message-ID:	CAFjFpRfa6_n10cn3vXjN9hdTqneH6A1rfnLXy0PnCP63T2putw@mail.gmail.com

On Thu, Aug 10, 2017 at 9:28 AM, Thomas Munro
<thomas(dot)munro(at)enterprisedb(dot)com> wrote:
> On Thu, Aug 10, 2017 at 1:39 AM, Thomas Munro
> <thomas(dot)munro(at)enterprisedb(dot)com> wrote:
>> On my computer it took ~1.5 seconds to plan a 1000 partition join,
>> ~7.1 seconds to plan a 2000 partition join, and ~50 seconds to plan a
>> 4000 partition join. I poked around in a profiler a bit and saw that
>> for the 2000 partition case I spent almost half the time in
>> create_plan->...->prepare_sort_from_pathkeys->find_ec_member_for_tle,
>> and about half of that was in bms_is_subset. The other half the time
>> was in query_planner->make_one_rel which spent 2/3 of its time in
>> set_rel_size->add_child_rel_equivalences->bms_overlap and the other
>> 1/3 in standard_join_search.
>
> Ashutosh asked me how I did that. Please see attached. I was
> explaining simple joins like SELECT * FROM foofoo JOIN barbar USING
> (a, b). Here also is the experimental hack I tried when I saw
> bitmapset.c eating my CPU.
>

On my machine I observed the following planning times:
1000 partitions: 100ms without partition-wise join; 500ms with it
2000 partitions: 320ms without partition-wise join; 2.2s with it
4000 partitions: 1.3s without partition-wise join; 17s with it

So, even without partition-wise join the planning time increases at a
superlinear rate with the number of partitions.

Your patch didn't improve planning time without partition-wise join,
so it's something that would be good to have along with partition-wise
join. Given that Bitmapsets are used in other parts of the code as
well, the optimization may affect those parts too, especially through
the overhead of maintaining first_non_empty_wordnum.

The comment at the beginning of the file bitmapset.c says:
 * bitmapset.c
 *	  PostgreSQL generic bitmap set package
 *
 * A bitmap set can represent any set of nonnegative integers, although
 * it is mainly intended for sets where the maximum value is not large,
 * say at most a few hundred.

When we created thousands of children, we certainly crossed the
few-hundred threshold. So there may be other optimizations possible
there. We should probably leave that out of the partition-wise join
patches. Do you think solving this problem is a prerequisite for
partition-wise join? Or should we propose that patch as a separate
enhancement?

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

From:	Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
To:	Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
Cc:	Robert Haas <robertmhaas(at)gmail(dot)com>, Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Partition-wise join for join between (declaratively) partitioned tables
Date:	2017-08-10 09:43:44
Message-ID:	CAEepm=3cDKOBsvKA7cmDKo0UCx6X+mFMoKuigMF3+-25_rji0g@mail.gmail.com

On Thu, Aug 10, 2017 at 6:23 PM, Ashutosh Bapat
<ashutosh(dot)bapat(at)enterprisedb(dot)com> wrote:
> Your patch didn't improve planning time without partition-wise join,
> so it's something that would be good to have along with partition-wise
> join. Given that Bitmapsets are used in other parts of the code as
> well, the optimization may affect those parts too, especially through
> the overhead of maintaining first_non_empty_wordnum.

Maybe, but if you consider that this container already deals with the
upper bound moving up by reallocating and copying the whole thing,
adjusting an int when the lower bound moves down doesn't seem like
anything to worry about...
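
To make the trade-off concrete, here is a minimal sketch of the idea,
not the actual patch: a toy fixed-size bitmap set that remembers the
index of its first non-empty word, so an overlap test can skip leading
zero words. ToySet, toy_init, toy_add and toy_overlap are hypothetical
names for illustration; only first_non_empty_wordnum comes from the
discussion above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define TOY_WORDS 64            /* fixed capacity: members 0..4095 */
#define BITS_PER_WORD 64

typedef struct ToySet
{
    int      first_non_empty_wordnum;   /* TOY_WORDS when the set is empty */
    uint64_t words[TOY_WORDS];
} ToySet;

static void
toy_init(ToySet *s)
{
    memset(s->words, 0, sizeof(s->words));
    s->first_non_empty_wordnum = TOY_WORDS;
}

/* Add member x; the bookkeeping cost is one integer comparison. */
static void
toy_add(ToySet *s, int x)
{
    int wordnum;

    assert(x >= 0 && x < TOY_WORDS * BITS_PER_WORD);
    wordnum = x / BITS_PER_WORD;
    s->words[wordnum] |= UINT64_C(1) << (x % BITS_PER_WORD);
    if (wordnum < s->first_non_empty_wordnum)
        s->first_non_empty_wordnum = wordnum;
}

/*
 * Do a and b share a member?  Scanning starts at the later of the two
 * lower bounds; *words_scanned reports how many words were examined.
 */
static bool
toy_overlap(const ToySet *a, const ToySet *b, int *words_scanned)
{
    int start = a->first_non_empty_wordnum > b->first_non_empty_wordnum
        ? a->first_non_empty_wordnum : b->first_non_empty_wordnum;

    *words_scanned = 0;
    for (int i = start; i < TOY_WORDS; i++)
    {
        (*words_scanned)++;
        if (a->words[i] & b->words[i])
            return true;
    }
    return false;
}
```

For two singleton sets {4000}, this toy overlap test examines one word
instead of 63, which is the kind of saving that matters when relids
contain only high-numbered child relations.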

> The comment at the beginning of the file bitmapset.c says
>  * bitmapset.c
>  *	  PostgreSQL generic bitmap set package
>  *
>  * A bitmap set can represent any set of nonnegative integers, although
>  * it is mainly intended for sets where the maximum value is not large,
>  * say at most a few hundred.
>
> When we created thousands of children, we certainly crossed the
> few-hundred threshold. So there may be other optimizations possible
> there. We should probably leave that out of the partition-wise join
> patches.

> Do you think solving this problem is a prerequisite for
> partition-wise join? Or should we propose that patch as a separate
> enhancement?

No, I'm not proposing anything yet. For now I just wanted to share
this observation about where hot CPU time goes in simple tests, and
since it turned out to be a loop in a loop that I could see an easy
way to fix for singleton sets and sets with a small range, I couldn't
help trying it... But I'm still trying to understand the bigger
picture. I'll be interested to compare profiles with the ordered
append_rel_list version you have mentioned, to see how that moves the
hot spots.

I guess one very practical question to ask is: can we plan queries
with realistic numbers of partitioned tables and partitions in
reasonable times? Well, it certainly looks very good for hundreds of
partitions so far... My own experience of partitioning with other
RDBMSs has been on that order, 'monthly partitions covering the past
10 years' and similar, but on the other hand it wouldn't be surprising
to learn that people want to go to many thousands, especially for
schemas which just keep adding partitions over time and don't want to
drop them. As for hash partitioning, that seems to be typically done
with numbers like 16, 32 or 64 in other products from what I can
glean. Speculation: perhaps hash partitioning is more motivated by
parallelism than data maintenance and thus somehow anchored to the
ground by core counts; if so no planning performance worries there I
guess (until core counts double quite a few more times).

One nice thing about the planning time is that restrictions on the
partition key cut down planning time; so where I measure ~7 seconds to
plan SELECT * FROM foofoo JOIN barbar USING (a, b) with 2k partitions,
if I add WHERE a > 50 it's ~4 seconds and WHERE a > 99 it's ~0.8s, so
if someone has a keep-adding-more-partitions-over-time model then at
least their prunable current day/week/whatever queries will not suffer
quite so badly. (Yeah, my computer seems to be a lot slower than yours
for these tests; clang -O2, no asserts, on a mid-2014 MBP with an i7 @
2.2GHz).

Curious: would you consider joins between partitioned tables and
non-partitioned tables where the join is pushed down to be a kind of
"partition-wise join", or something else? If so, would that be a
special case, or just the logical extreme case for
0014-WIP-Partition-wise-join-for-1-1-1-0-0-1-partition-ma.patch, where
one single "partition" on the non-partitioned side maps to all the
partitions on the partitioned side?

--
Thomas Munro
http://www.enterprisedb.com

From:	Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
To:	Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
Cc:	Robert Haas <robertmhaas(at)gmail(dot)com>, Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Partition-wise join for join between (declaratively) partitioned tables
Date:	2017-08-10 12:00:22
Message-ID:	CAFjFpRfkr7igCGBBWH1PQ__W-XPy1O79Y-qxCmJc6FizLqFz7Q@mail.gmail.com

On Wed, Aug 9, 2017 at 7:09 PM, Thomas Munro
<thomas(dot)munro(at)enterprisedb(dot)com> wrote:
>
> I started reviewing this. It's nicely commented, but it's also very
> complicated, and it's going to take me several rounds to understand
> what all the parts do, but here's some assorted feedback after reading
> some parts of the patches, some tinkering and quite a bit of time
> spent trying to break it (so far unsuccessfully).
>

Thanks for testing the patch. Good to know it has withstood your testing.

> On my computer it took ~1.5 seconds to plan a 1000 partition join,
> ~7.1 seconds to plan a 2000 partition join, and ~50 seconds to plan a
> 4000 partition join. I poked around in a profiler a bit and saw that
> for the 2000 partition case I spent almost half the time in
> create_plan->...->prepare_sort_from_pathkeys->find_ec_member_for_tle,
> and about half of that was in bms_is_subset. The other half the time
> was in query_planner->make_one_rel which spent 2/3 of its time in
> set_rel_size->add_child_rel_equivalences->bms_overlap and the other
> 1/3 in standard_join_search.

Thanks for profiling.

I have separately mailed about bitmapset improvements.

Equivalence classes keep all the expressions known to be equal in
EquivalenceClass::ec_members. For a partitioned table, there will be as
many child expressions as there are children. The child expressions are
marked as em_is_child and are looked at only when child relids are
available to the function scanning the members. The number of
equivalence members increases linearly with the number of partitions,
and the number of words in the bitmaps also increases linearly with the
number of partitions, so effectively the number of words scanned
increases quadratically. Hence the superlinear increase in time with
the number of partitions. When I took separate profiles with 1000, 2000
and 4000 partitions, I saw 15%, 29% and 40% of the time, respectively,
spent in bms_is_subset().
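
As a rough worked example of that quadratic growth, the sketch below
counts bitmap words under a deliberately simplified model: one child
equivalence member per partition, each checked against a relids bitmap
of about nparts/64 words. The function name words_scanned and the
64-bit word size are assumptions for illustration, not planner code.

```c
#include <assert.h>

#define BITS_PER_WORD 64        /* assumption: 64-bit bitmap words */

/*
 * Simplified model: nparts child members, each compared against a
 * bitmap of ceil(nparts / BITS_PER_WORD) words.
 */
static long
words_scanned(long nparts)
{
    long words_per_set;

    assert(nparts >= 0);
    words_per_set = (nparts + BITS_PER_WORD - 1) / BITS_PER_WORD;
    return nparts * words_per_set;
}
```

Under this model, words_scanned(1000) is 16000, words_scanned(2000) is
64000 and words_scanned(4000) is 252000, so each doubling of the
partition count roughly quadruples the words examined in one pass over
the members.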

I am not sure how much we can do in this patchset to reduce this
problem. Apart from your bitmapset optimization, we could perhaps use
a data structure more efficient than a list to search members by
relids, or somehow re-use the parent's expressions for the children.
I have been thinking about the second option, but never got a chance
to work on it.

>
> When using list-based partitions, it must be possible to omit the part
> of a join key that is implied by the partition because the partition
> has only one list value. For example, if I create a two level
> hierarchy with one partition per US state and then time-based range
> partitions under that, the state part of this merge condition is
> redundant:
>
> Merge Cond: ((sales_wy_2017_10.state =
> purchases_wy_2017_10.state) AND (sales_wy_2017_10.created =
> purchases_wy_2017_10.created))

That's a good idea. In fact, we could use a similar trick when the
condition is sales_wy_2017_10.state = 'state'. We cannot use the
trick in case of DML or when there are locking clauses, since we need
to evaluate the qual in case the row underneath changes while locking
it. We also cannot do this when one of the keys being compared is a
nullable partition key (a concept explained in the partition-wise join
implementation patch), since a partition can also have rows with
NULL values for such partition keys.

I think the idea has merit, although I think we should handle it by
targeting more generic cases like the one stated above.

>
> 0003-Refactor-partition_bounds_equal-to-be-used-without-P.patch
>
> -partition_bounds_equal(PartitionKey key,
> +partition_bounds_equal(int partnatts, int16 *parttyplen, bool *parttypbyval,
> PartitionBoundInfo b1,
> PartitionBoundInfo b2)
>
> I wonder is there any value in creating a struct to represent the
> common part of PartitionKey and PartitionScheme that functions like
> this and others need? Maybe not. Perhaps you didn't want to make
> PartitionKey contain a PartitionScheme because then you'd lose the
> property that every pointer to PartitionScheme in the system must be a
> pointer to an interned (canonical) PartitionScheme, so it's always
> safe to compare pointers to test for equality?

Right. Another reason to keep those two separate is that we might
change the contents of PartitionScheme as we move forward with the
reviews. Maybe we should revisit it after we have finalised the design.
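
The interning property mentioned above can be illustrated with a small
sketch. It assumes a toy registry of schemes identified by just two
fields; ToyScheme, toy_schemes and toy_find_scheme are hypothetical
names, and the real PartitionScheme carries much more information.

```c
#include <stddef.h>

#define TOY_MAX_SCHEMES 16

typedef struct ToyScheme
{
    char strategy;      /* e.g. 'r' for range, 'l' for list */
    int  partnatts;     /* number of partition key columns */
} ToyScheme;

static ToyScheme toy_schemes[TOY_MAX_SCHEMES];
static int       toy_nschemes = 0;

/*
 * Return the canonical ToyScheme for the given properties, creating it
 * on first use.  Returns NULL if the toy registry is full.
 */
static const ToyScheme *
toy_find_scheme(char strategy, int partnatts)
{
    for (int i = 0; i < toy_nschemes; i++)
    {
        if (toy_schemes[i].strategy == strategy &&
            toy_schemes[i].partnatts == partnatts)
            return &toy_schemes[i];
    }
    if (toy_nschemes == TOY_MAX_SCHEMES)
        return NULL;
    toy_schemes[toy_nschemes].strategy = strategy;
    toy_schemes[toy_nschemes].partnatts = partnatts;
    return &toy_schemes[toy_nschemes++];
}
```

Because every caller goes through the lookup, two relations with
matching properties receive the same pointer, and equality can be
tested with a plain pointer comparison instead of a field-by-field one.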

>
> 0005-Canonical-partition-scheme.patch:
>
> +/*
> + * get_relation_partition_info
> + *
> + * Retrieves partitioning information for a given relation.
> + *
> + * For a partitioned table it sets partitioning scheme, partition key
> + * expressions, number of partitions and OIDs of partitions in the given
> + * RelOptInfo.
> + */
> +static void
> +get_relation_partition_info(PlannerInfo *root, RelOptInfo *rel,
> + Relation relation)
>
> Would this be better called "set_relation_partition_info"? It doesn't
> really "retrieve" the information, it "installs" it.

Yes. Done.

>
> +{
> + PartitionDesc part_desc;
> +
> + /* No partitioning information for an unpartitioned relation. */
> + if (relation->rd_rel->relkind != RELKIND_PARTITIONED_TABLE ||
> + !(rel->part_scheme = find_partition_scheme(root, relation)))
> + return;
>
> Here and elsewhere you use the idiom !(foo = bar), which is perfectly
> good C in my book but I understand the project convention is to avoid
> implicit pointer->boolean conversion and to prefer expressions like
> (foo = bar) != NULL and there is certainly a lot more code like that.

PG code uses both styles; search for "if (!" in execExpr.c or
createplan.c, for example.
I find this style useful when I want to say "if this
relation does not have a partitioning scheme" rather than "if this
relation has a NULL partitioning scheme".
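
For readers comparing the two spellings, here is a small sketch that
puts them side by side. The types and the lookup function are
hypothetical stand-ins, not the patch's code; both functions behave
identically.

```c
#include <stddef.h>

typedef struct ToyScheme { int partnatts; } ToyScheme;
typedef struct ToyRel { const ToyScheme *part_scheme; } ToyRel;

/* Hypothetical lookup: returns NULL when no scheme applies. */
static const ToyScheme *
toy_lookup(const ToyScheme *candidate)
{
    return candidate;
}

/* Style A: implicit pointer-to-boolean conversion. */
static int
set_info_implicit(ToyRel *rel, const ToyScheme *candidate)
{
    if (!(rel->part_scheme = toy_lookup(candidate)))
        return 0;               /* "does not have a scheme" */
    return rel->part_scheme->partnatts;
}

/* Style B: explicit comparison against NULL. */
static int
set_info_explicit(ToyRel *rel, const ToyScheme *candidate)
{
    if ((rel->part_scheme = toy_lookup(candidate)) == NULL)
        return 0;               /* "has a NULL scheme" */
    return rel->part_scheme->partnatts;
}
```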

>
> 0007-Partition-wise-join-implementation.patch
>
> + {"enable_partition_wise_join", PGC_USERSET, QUERY_TUNING_METHOD,
>
> This GUC should appear in postgresql.conf.sample.

Done.

Attached patches with the comments addressed.
--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Attachment	Content-Type	Size
pg_dp_join_patches_v25.tar.gz	application/x-gzip	148.5 KB

From:	Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
To:	Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
Cc:	Robert Haas <robertmhaas(at)gmail(dot)com>, Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Partition-wise join for join between (declaratively) partitioned tables
Date:	2017-08-10 12:14:57
Message-ID:	CAFjFpReHma+aVSf2Jynr-gKhnzezBkqMsPA7Y5O1uPgrd3zicg@mail.gmail.com

On Thu, Aug 10, 2017 at 3:13 PM, Thomas Munro
<thomas(dot)munro(at)enterprisedb(dot)com> wrote:
> On Thu, Aug 10, 2017 at 6:23 PM, Ashutosh Bapat
> <ashutosh(dot)bapat(at)enterprisedb(dot)com> wrote:
>> Your patch didn't improve planning time without partition-wise join,
>> so it's something that would be good to have along with partition-wise
>> join. Given that Bitmapsets are used in other parts of the code as
>> well, the optimization may affect those parts too, especially through
>> the overhead of maintaining first_non_empty_wordnum.
>
> Maybe, but if you consider that this container already deals with the
> upper bound moving up by reallocating and copying the whole thing,
> adjusting an int when the lower bound moves down doesn't seem like
> anything to worry about...

Yeah. Maybe we should check whether that makes any difference to
the planning times of TPC-H queries.

>
>> Do you think solving this problem is a prerequisite for
>> partition-wise join? Or should we propose that patch as a separate
>> enhancement?
>
> No, I'm not proposing anything yet. For now I just wanted to share
> this observation about where hot CPU time goes in simple tests, and
> since it turned out to be a loop in a loop that I could see an easy
> way to fix for singleton sets and sets with a small range, I couldn't
> help trying it... But I'm still trying to understand the bigger
> picture. I'll be interested to compare profiles with the ordered
> append_rel_list version you have mentioned, to see how that moves the
> hot spots.

build_simple_rel(), which contains that loop, takes only 0.23% of
planning time. So I doubt that's going to change much.
+ 0.23% postgres postgres [.] build_simple_rel

>
> I guess one very practical question to ask is: can we plan queries
> with realistic numbers of partitioned tables and partitions in
> reasonable times? Well, it certainly looks very good for hundreds of
> partitions so far... My own experience of partitioning with other
> RDBMSs has been on that order, 'monthly partitions covering the past
> 10 years' and similar, but on the other hand it wouldn't be surprising
> to learn that people want to go to many thousands, especially for
> schemas which just keep adding partitions over time and don't want to
> drop them. As for hash partitioning, that seems to be typically done
> with numbers like 16, 32 or 64 in other products from what I can
> glean. Speculation: perhaps hash partitioning is more motivated by
> parallelism than data maintenance and thus somehow anchored to the
> ground by core counts; if so no planning performance worries there I
> guess (until core counts double quite a few more times).

Agreed.

>
> One nice thing about the planning time is that restrictions on the
> partition key cut down planning time; so where I measure ~7 seconds to
> plan SELECT * FROM foofoo JOIN barbar USING (a, b) with 2k partitions,
> if I add WHERE a > 50 it's ~4 seconds and WHERE a > 99 it's ~0.8s, so
> if someone has a keep-adding-more-partitions-over-time model then at
> least their prunable current day/week/whatever queries will not suffer
> quite so badly. (Yeah, my computer seems to be a lot slower than yours
> for these tests; clang -O2, no asserts, on a mid-2014 MBP with an i7 @
> 2.2GHz).

That's an interesting observation. Thanks for sharing it.

>
> Curious: would you consider joins between partitioned tables and
> non-partitioned tables where the join is pushed down to be a kind of
> "partition-wise join", or something else? If so, would that be a
> special case, or just the logical extreme case for
> 0014-WIP-Partition-wise-join-for-1-1-1-0-0-1-partition-ma.patch, where
> one single "partition" on the non-partitioned side maps to all the
> partitions on the partitioned side?
>

Parameterized nested loop joins with the partition key as a parameter
simulate something like that. Apart from that case, I don't see any
case where such a join would be more efficient than the current
method of ganging all partitions together and joining them to the
unpartitioned table. But oh wait, that could be useful in sharding,
when the unpartitioned table is replicated and the partitioned table is
distributed across shards. So, yes, that's a useful case. I am not sure
whether it's some kind of partition-wise join; it doesn't matter, it
looks useful. That said, I am not planning to handle it in the near
future.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
Cc:	Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>, Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Partition-wise join for join between (declaratively) partitioned tables
Date:	2017-08-10 13:26:48
Message-ID:	CA+Tgmob+cXfSJ_iUhQybRr3HjqF7YSWO4C5zVe621+enftjN0Q@mail.gmail.com

On Thu, Aug 10, 2017 at 5:43 AM, Thomas Munro
<thomas(dot)munro(at)enterprisedb(dot)com> wrote:
>> Do you think solving this problem is a prerequisite for
>> partition-wise join? Or should we propose that patch as a separate
>> enhancement?
>
> No, I'm not proposing anything yet. For now I just wanted to share
> this observation about where hot CPU time goes in simple tests, and
> since it turned out to be a loop in a loop that I could see an easy
> way to fix for singleton sets and sets with a small range, I couldn't
> help trying it... But I'm still trying to understand the bigger
> picture. I'll be interested to compare profiles with the ordered
> append_rel_list version you have mentioned, to see how that moves the
> hot spots.

Perhaps this is stating the obvious, but it's often better to optimize
things like this at a higher level, rather than by tinkering with
stuff like Bitmapset. On the other hand, sometimes
micro-optimizations are the way to go, because optimizing
find_ec_member_for_tle(), for example, might involve a much broader
rethink of the planner code than we want to undertake right now.

> I guess one very practical question to ask is: can we plan queries
> with realistic numbers of partitioned tables and partitions in
> reasonable times? Well, it certainly looks very good for hundreds of
> partitions so far... My own experience of partitioning with other
> RDBMSs has been on that order, 'monthly partitions covering the past
> 10 years' and similar, but on the other hand it wouldn't be surprising
> to learn that people want to go to many thousands, especially for
> schemas which just keep adding partitions over time and don't want to
> drop them.

I've been thinking that it would be good if this feature - and other
new partitioning features - could scale to about 1000 partitions
without too much trouble. Eventually, it might be nice to scale
higher, but there's not much point in making partition-wise join scale
to 100,000 partitions if we've got some other part of the system that
runs into trouble beyond 250.

> Curious: would you consider joins between partitioned tables and
> non-partitioned tables where the join is pushed down to be a kind of
> "partition-wise join", or something else? If so, would that be a
> special case, or just the logical extreme case for
> 0014-WIP-Partition-wise-join-for-1-1-1-0-0-1-partition-ma.patch, where
> one single "partition" on the non-partitioned side maps to all the
> partitions on the partitioned side?

I think this is actually a really important case which we've just
excluded from the initial scope because the problem is hard enough
already. But it's quite possible that if you are joining partitioned
tables A and B with unpartitioned table X, the right join order could
be A-X-B; the A-X join might knock out a lot of rows. It's not great
to have to pick between doing the A-B join partitionwise and doing the
A-X join first; you want to get both things. But we can't do
everything at once.

Further down the road, there's more than one way of doing the A-X
join. You could join each partition of A to all of X, which is likely
optimal if for example you are doing a nested loop with an inner index
scan on X. But you could also partition X on the fly using A's
partitioning scheme and then join partitions of A against the
on-the-fly-partitioned version of X. That's likely to be a lot better
for a merge join with an underlying sort on X.
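
A minimal sketch of the "partition X on the fly" idea, assuming a
single integer key and range partitions described by ascending
exclusive upper bounds. The names toy_route and toy_partition_counts
are hypothetical, and a real implementation would move tuples rather
than count keys.

```c
#include <stddef.h>

/*
 * Return the index of the range partition containing key, given nbounds
 * ascending upper bounds (exclusive).  Partition i holds keys in
 * [bounds[i-1], bounds[i]); keys at or beyond the last bound return
 * nbounds, which a caller can treat as "no matching partition".
 */
static int
toy_route(const int *bounds, int nbounds, int key)
{
    int lo = 0;
    int hi = nbounds;

    while (lo < hi)
    {
        int mid = lo + (hi - lo) / 2;

        if (key < bounds[mid])
            hi = mid;
        else
            lo = mid + 1;
    }
    return lo;
}

/*
 * Count how many rows of X land in each of A's partitions.  counts must
 * have room for nbounds + 1 entries; the last one collects unmatched keys.
 */
static void
toy_partition_counts(const int *bounds, int nbounds,
                     const int *xkeys, int nx, int *counts)
{
    for (int i = 0; i <= nbounds; i++)
        counts[i] = 0;
    for (int i = 0; i < nx; i++)
        counts[toy_route(bounds, nbounds, xkeys[i])]++;
}
```

With bounds {10, 20, 30} and X keys {5, 10, 25, 30, 7}, the counts come
out as {2, 1, 1, 1}: each bucket of X would then be joined only against
the matching partition of A.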

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
Cc:	Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Partition-wise join for join between (declaratively) partitioned tables
Date:	2017-08-15 16:45:11
Message-ID:	CA+TgmoaVZuMdNyNZhwgOX+XajsGtaDwZ7x9z_mu+hgQdKh7DZA@mail.gmail.com

On Thu, Aug 10, 2017 at 8:00 AM, Ashutosh Bapat
<ashutosh(dot)bapat(at)enterprisedb(dot)com> wrote:
> Attached patches with the comments addressed.

I have committed 0001-0003 as 480f1f4329f1bf8bfbbcda8ed233851e1b110ad4
and e139f1953f29db245f60a7acb72fccb1e05d2442.

0004 doesn't apply any more, probably due to commit
d57929afc7063431f80a0ac4c510fc39aacd22e6. I think something along
these lines could be separately committed prior to the main patch, and
I think that would be a good idea just to flush out any bugs in this
part independently of the rest. However, I also think that we
probably ought to try to get Amit Langote's changes to this function
to repair the locking order and expand in bound order committed before
proceeding with these changes.

In fact, I think there's a certain amount of conflict between what's
being discussed over there and what you're trying to do here. In that
thread, we propose to move partitioned tables at any level to the
front of the inheritance expansion. Here, however, you want to expand
level by level. I think partitioned-tables-first is the right
approach for the reasons discussed on the other thread; namely, we
want to be able to prune leaf partitions before expanding them, but
that requires us to expand all the non-leaf tables first to maintain a
consistent locking order in all scenarios. So the approach you've
taken in this patch may need to be re-thought somewhat.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Partition-wise join for join between (declaratively) partitioned tables
Date:	2017-08-16 07:31:39
Message-ID:	CAFjFpRfHRpVbpvFBsa1HbRcdJFizsHP8anDxLdfByOmphkmREA@mail.gmail.com

On Tue, Aug 15, 2017 at 10:15 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Thu, Aug 10, 2017 at 8:00 AM, Ashutosh Bapat
> <ashutosh(dot)bapat(at)enterprisedb(dot)com> wrote:
>> Attached patches with the comments addressed.
>
> I have committed 0001-0003 as 480f1f4329f1bf8bfbbcda8ed233851e1b110ad4
> and e139f1953f29db245f60a7acb72fccb1e05d2442.

Thanks a lot, Robert. A few fewer patches to maintain :).

>
> 0004 doesn't apply any more, probably due to commit
> d57929afc7063431f80a0ac4c510fc39aacd22e6. I think something along
> these lines could be separately committed prior to the main patch, and
> I think that would be a good idea just to flush out any bugs in this
> part independently of the rest. However, I also think that we
> probably ought to try to get Amit Langote's changes to this function
> to repair the locking order and expand in bound order committed before
> proceeding with these changes.

I am reviewing those changes and will contribute to that thread if necessary.

>
> In fact, I think there's a certain amount of conflict between what's
> being discussed over there and what you're trying to do here. In that
> thread, we propose to move partitioned tables at any level to the
> front of the inheritance expansion. Here, however, you want to expand
> level by level. I think partitioned-tables-first is the right
> approach for the reasons discussed on the other thread; namely, we
> want to be able to prune leaf partitions before expanding them, but
> that requires us to expand all the non-leaf tables first to maintain a
> consistent locking order in all scenarios. So the approach you've
> taken in this patch may need to be re-thought somewhat.
>

There are two ways we can do this:
1. In expand_inherited_rtentry(), remember (childRTE and childRTIndex)
or just childRTIndex (using this we can fetch childRTE by calling
rt_fetch()) of intermediate partitioned tables. Once we are done
expanding immediate children, call expand_inherited_rtentry()
recursively on this list.

2. expand_inherited_tables() scans root->parse->rtable only up to the
end of the original range table list. Make it go beyond that end,
expanding any new entries added for intermediate partitions.

FWIW, the first option allows us to keep all AppendRelInfos
corresponding to one partitioned relation together and also expands
the whole partition hierarchy in one go. The second will require minimal
changes to expand_inherited_rtentry(). Both approaches will spend time
scanning the same number of RTEs; the first will have them in different
lists, and the second will have them in root->parse->rtable. I don't see
one being more attractive than the other. Do you have any opinion?

I will submit the rebased patches after reviewing/adjusting Amit's
changes, along with the changes in expand_inherited_rtentry() once we
have settled on the approach to be taken.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
Cc:	Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Partition-wise join for join between (declaratively) partitioned tables
Date:	2017-08-16 11:51:43
Message-ID:	CA+TgmoavmhBYGNa=1kdw7fyNtnP3ov30G-YC_YH2=Np18K-hcw@mail.gmail.com

On Wed, Aug 16, 2017 at 3:31 AM, Ashutosh Bapat
<ashutosh(dot)bapat(at)enterprisedb(dot)com> wrote:
> There are two ways we can do this:
> 1. In expand_inherited_rtentry(), remember (childRTE and childRTIndex)
> or just childRTIndex (using this we can fetch childRTE by calling
> rt_fetch()) of intermediate partitioned tables. Once we are done
> expanding immediate children, call expand_inherited_rtentry()
> recursively on this list.
>
> 2. expand_inherited_tables() scans root->parse->rtable only up to the
> end of the original range table list. Make it go beyond that end,
> expanding any new entries added for intermediate partitions.
>
> FWIW, the first option allows us to keep all AppendRelInfos
> corresponding to one partitioned relation together and also expands
> the whole partition hierarchy in one go. The second will require minimal
> changes to expand_inherited_rtentry(). Both approaches will spend time
> scanning the same number of RTEs; the first will have them in different
> lists, and the second will have them in root->parse->rtable. I don't see
> one being more attractive than the other. Do you have any opinion?

I don't like option (2). I'm not sure about option (1). I think
maybe we should have two nested loops in expand_inherited_rtentry(),
the outer one iterating over partitioned tables (or just the original
parent RTE if partitioning is not involved) and the inner one looping
over individual leaf partitions for each partitioned table. Probably
we'd end up wanting to move at least some of the logic inside the
existing loop into a subroutine.
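
To make the nested-loop idea concrete, here is a minimal sketch of one possible reading of it. It is not taken from any patch in this thread: ToyRel, toy_expand() and MAX_RELS are hypothetical stand-ins, and the real expand_inherited_rtentry() works on range table entries, not on a toy array. The outer loop walks a worklist of partitioned tables, starting with the root; the inner loop visits each table's immediate children, recording leaves and queueing partitioned children for a later outer iteration.

```c
#include <assert.h>

#define MAX_RELS 16

/* Hypothetical stand-in for a relation; not a PostgreSQL struct. */
typedef struct ToyRel
{
    int is_partitioned;     /* nonzero if this relation has partitions */
    int nchildren;          /* number of immediate children */
    int children[4];        /* indexes of the children in the rels array */
} ToyRel;

/*
 * Expand the hierarchy rooted at 'root' without recursion.
 * 'parents' receives every partitioned table visited (root first) and
 * 'leaves' receives every leaf partition.  Returns the number of leaves.
 */
int
toy_expand(const ToyRel *rels, int root,
           int *leaves, int *parents, int *nparents)
{
    int nleaves = 0;
    int np = 0;
    int cur;
    int i;

    parents[np++] = root;

    /* Outer loop: one iteration per partitioned table (or just the root). */
    for (cur = 0; cur < np; cur++)
    {
        const ToyRel *parent = &rels[parents[cur]];

        /* Inner loop: the immediate children of that table. */
        for (i = 0; i < parent->nchildren; i++)
        {
            int child = parent->children[i];

            if (rels[child].is_partitioned)
            {
                assert(np < MAX_RELS);
                parents[np++] = child;      /* expand it in a later pass */
            }
            else
            {
                assert(nleaves < MAX_RELS);
                leaves[nleaves++] = child;
            }
        }
    }

    *nparents = np;
    return nleaves;
}
```

With a root that has one partitioned child and one leaf child, and the partitioned child having two leaves, the sketch visits two partitioned tables and reports three leaves, while keeping all entries of one hierarchy together.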

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Partition-wise join for join between (declaratively) partitioned tables
Date:	2017-08-21 07:03:49
Message-ID:	CAFjFpRd9Vqh_=-Ldv-XqWY006d07TJ+VXuhXCbdj=P1jukYBrw@mail.gmail.com
Lists:	pgsql-hackers

On Wed, Aug 16, 2017 at 5:21 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Wed, Aug 16, 2017 at 3:31 AM, Ashutosh Bapat
> <ashutosh(dot)bapat(at)enterprisedb(dot)com> wrote:
>> There are two ways we can do this
>> 1. In expand_inherited_rtentry(), remember (childRTE and childRTIndex)
>> or just childRTIndex (using this we can fetch childRTE calling
>> rtfetch()) of intermediate partitioned tables. Once we are done
>> expanding immediate children, call expand_inherited_rtentry()
>> recursively on this list.
>>
>> 2. expand_inherited_tables() scans root->parse->rtable only up to the
>> end of original range table list. Make it go beyond that end,
>> expanding any new entries added for intermediate partitions.
>>
>> FWIW, the first option allows us to keep all AppendRelInfos
>> corresponding to one partitioned relation together and also expands
>> the whole partition hierarchy in one go. Second will require minimal
>> changes to expand_inherited_rtentry(). Both approaches will spend time
>> scanning same number of RTE; the first will have them in different
>> lists, and second will have them in root->parse->rtable. I don't see
>> one being more attractive than the other. Do you have any opinion?
>
> I don't like option (2). I'm not sure about option (1). I think
> maybe we should have two nested loops in expand_inherited_rtentry(),
> the outer one iterating over partitioned tables (or just the original
> parent RTE if partitioning is not involved) and the inner one looping
> over individual leaf partitions for each partitioned table. Probably
> we'd end up wanting to move at least some of the logic inside the
> existing loop into a subroutine.

I originally thought to provide it along with the changes to
expand_inherited_rtentry(), but that thread is taking longer. Jeevan
Chalke needs rebased patches for his work on aggregate pushdown and
Thomas might need them for further review. So, here they are. The last
two patches in this set implement the advanced partition matching
algorithm. Those patches are here for ready reference. One can observe
that they don't change much of the basic partition-wise join
implementation. I am starting a new thread for discussing the advanced
partition matching algorithm.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Attachment	Content-Type	Size
pg_dp_join_patches_v26.tar.gz	application/x-gzip	143.4 KB

From:	Antonin Houska <ah(at)cybertec(dot)at>
To:	Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Partition-wise join for join between (declaratively) partitioned tables
Date:	2017-09-01 12:35:32
Message-ID:	31895.1504269332@localhost
Lists:	pgsql-hackers

Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com> wrote:

> I originally thought to provide it along with the changes to
> expand_inherited_rtentry(), but that thread is taking longer. Jeevan
> Chalke needs rebased patches for his work on aggregate pushdown and
> Thomas might need them for further review. So, here they are.

Since I have a related patch in the current commitfest
(https://commitfest.postgresql.org/14/1247/), I spent some time reviewing your
patch:

* generate_partition_wise_join_paths()

Right parenthesis is missing in the prologue.

* get_partitioned_child_rels_for_join()

I think the Assert() statement is easier to understand inside the loop, see
the assert.diff attachment.

* have_partkey_equi_join()

As the function handles generic join, this comment doesn't seem to me
relevant:

    /*
     * The equi-join between partition keys is strict if equi-join between
     * at least one partition key is using a strict operator. See
     * explanation about outer join reordering identity 3 in
     * optimizer/README
     */
    strict_op = op_strict(opexpr->opno);

And I think the function can return true even if strict_op is false for all
the operators evaluated in the loop.

* match_expr_to_partition_keys()

I'm not sure this comment is clear enough:

    /*
     * If it's a strict equi-join a NULL partition key on one side will
     * not join a NULL partition key on the other side. So, rows with NULL
     * partition key from a partition on one side can not join with those
     * from a non-matching partition on the other side. So, search the
     * nullable partition keys as well.
     */
    if (!strict_op)
        continue;

My understanding of the problem of NULL values generated by outer join is:
these NULL values --- if evaluated by non-strict expression --- can make row
of N-th partition on one side of the join match row(s) of *other than* N-th
partition(s) on the other side. Thus the nullable input expressions may only
be evaluated by strict operators. I think it'd be clearer if you stressed that
(undesired) *match* of partition keys can be a problem, rather than mismatch.

If you insist on your wording, then I think you should at least move the
comment below to the part that only deals with strict operators.
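
The difference between strict and non-strict comparison of NULL keys can be illustrated with a small standalone sketch. Nothing here is PostgreSQL code: ToyNullableInt, ToyTriBool and the three functions are hypothetical names. The sketch assumes the usual SQL semantics, where "=" on integers is strict (any NULL input yields NULL) and IS NOT DISTINCT FROM is not strict (two NULLs compare as equal).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical nullable integer, standing in for a partition key value. */
typedef struct ToyNullableInt
{
    bool isnull;
    int  value;
} ToyNullableInt;

/* Three-valued logic result of a comparison. */
typedef enum ToyTriBool
{
    TOY_FALSE,
    TOY_TRUE,
    TOY_NULL
} ToyTriBool;

/* Strict comparison, like SQL "=": a NULL input gives a NULL result. */
ToyTriBool
toy_strict_eq(ToyNullableInt a, ToyNullableInt b)
{
    if (a.isnull || b.isnull)
        return TOY_NULL;
    return (a.value == b.value) ? TOY_TRUE : TOY_FALSE;
}

/* Non-strict comparison, like IS NOT DISTINCT FROM: NULL matches NULL. */
ToyTriBool
toy_not_distinct(ToyNullableInt a, ToyNullableInt b)
{
    if (a.isnull && b.isnull)
        return TOY_TRUE;
    if (a.isnull || b.isnull)
        return TOY_FALSE;
    return (a.value == b.value) ? TOY_TRUE : TOY_FALSE;
}

/* A join qual lets a row pair through only when it evaluates to true. */
bool
toy_join_qual_passes(ToyTriBool result)
{
    return result == TOY_TRUE;
}
```

A key that was nulled by an outer join therefore never matches anything under the strict operator, but under the non-strict one it can match a NULL key coming from any partition on the other side, which is the undesired match described above.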

* There are several places where lfirst_node() macro should be used. For
example

rel = lfirst_node(RelOptInfo, lc);

instead of

rel = (RelOptInfo *) lfirst(lc);
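
For readers unfamiliar with the macro: in PostgreSQL, lfirst_node(type, lc) is defined as castNode(type, lfirst(lc)), which verifies the node tag in assert-enabled builds, whereas the plain cast checks nothing. The sketch below imitates that idea outside PostgreSQL; ToyTag, ToyRelOptInfo, ToyPath, toy_tag_matches(), toy_cast_node() and toy_cast() are all hypothetical names invented for the illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical node tags; PostgreSQL's real ones are the NodeTag enum. */
typedef enum ToyTag
{
    T_ToyRelOptInfo = 1,
    T_ToyPath
} ToyTag;

/* Every toy node starts with its tag, so the tag can be read generically. */
typedef struct ToyRelOptInfo
{
    ToyTag type;
    int    relid;
} ToyRelOptInfo;

typedef struct ToyPath
{
    ToyTag type;
    double cost;
} ToyPath;

/* Return nonzero if ptr is NULL or points to a node with the expected tag. */
int
toy_tag_matches(const void *ptr, ToyTag expected)
{
    return ptr == NULL || *(const ToyTag *) ptr == expected;
}

/* Checked cast helper: aborts in assert-enabled builds on a tag mismatch. */
void *
toy_cast_node(ToyTag expected, void *ptr)
{
    assert(toy_tag_matches(ptr, expected));
    return ptr;
}

/* Analogue of lfirst_node(): a cast that also verifies the node tag. */
#define toy_cast(type, ptr) ((type *) toy_cast_node(T_##type, (ptr)))
```

The checked form costs nothing in production builds (the assertion compiles away with NDEBUG) but catches a list that holds the wrong node type during development.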

* map_and_merge_partitions()

Besides a few changes proposed in map_and_merge_partitions.diff (a few of them
to suppress compiler warnings) I think that this part needs more thought:

    {
        Assert(mergemap1[index1] != mergemap2[index2] &&
               mergemap1[index1] >= 0 && mergemap2[index2] >= 0);

        /*
         * Both the partitions map to different merged partitions. This
         * means that multiple partitions from one relation matches to one
         * partition from the other relation. Partition-wise join does not
         * handle this case right now, since it requires ganging multiple
         * partitions together (into one RelOptInfo).
         */
        merged_index = -1;
    }

I could hit this path with the following test:

CREATE TABLE a(i int) PARTITION BY LIST(i);
CREATE TABLE a_0 PARTITION OF a FOR VALUES IN (0, 2);
CREATE TABLE b(j int) PARTITION BY LIST(j);
CREATE TABLE b_0 PARTITION OF b FOR VALUES IN (1, 2);

SET enable_partition_wise_join TO on;

SELECT *
FROM a
FULL JOIN
b ON i = j;

I don't think there's a reason not to join a_0 partition to b_0, is there?
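
The intuition behind the question can be sketched as follows. This is only an illustration of overlapping LIST bounds, not the algorithm in map_and_merge_partitions(); ToyListPart, toy_parts_share_value() and toy_count_overlapping() are hypothetical names. Two partitions can contain joining rows only if their value lists share at least one value, and a one-to-one pairing is possible when each partition overlaps with at most one partition on the other side.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical description of the values one LIST partition accepts. */
typedef struct ToyListPart
{
    const int *values;
    int        nvalues;
} ToyListPart;

/* True if the two partitions accept at least one common value. */
bool
toy_parts_share_value(ToyListPart p1, ToyListPart p2)
{
    int i;
    int j;

    assert(p1.nvalues >= 0 && p2.nvalues >= 0);
    for (i = 0; i < p1.nvalues; i++)
        for (j = 0; j < p2.nvalues; j++)
            if (p1.values[i] == p2.values[j])
                return true;
    return false;
}

/* Count the partitions in 'others' whose value lists overlap with 'part'. */
int
toy_count_overlapping(ToyListPart part, const ToyListPart *others, int nothers)
{
    int count = 0;
    int i;

    for (i = 0; i < nothers; i++)
        if (toy_parts_share_value(part, others[i]))
            count++;
    return count;
}
```

For the test above, a_0 accepts (0, 2) and b_0 accepts (1, 2); they share the value 2 and each side has exactly one overlapping partner, so pairing a_0 with b_0 looks safe under this simplified view.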

--
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at

Attachment	Content-Type	Size
assert.diff	text/x-diff	959 bytes
map_and_merge_partitions.diff	text/x-diff	3.1 KB

From:	Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Partition-wise join for join between (declaratively) partitioned tables
Date:	2017-09-01 19:10:22
Message-ID:	CAFjFpRfPeezggVRpwJLcCpL+CoyUMLZDhMTbSTq1KthmBN48WA@mail.gmail.com
Lists:	pgsql-hackers

Here's a set of patches rebased on the latest head.

This rebase mainly changes patch 0001, which translates partition
hierarchy into inheritance hierarchy creating AppendRelInfos and
RelOptInfos for partitioned partitions. Because of that, it's not
necessary to record the partitioned partitions in a
PartitionedChildRelInfos::child_rels. The only RTI that goes in there
is the RTI of the child RTE, which is the same as the parent RTE except
for the inh flag. I tried removing that with a series of changes but it
seems that the following code in ExecInitModifyTable() requires it.
    /* The root table RT index is at the head of the partitioned_rels list */
    if (node->partitioned_rels)
    {
        Index       root_rti;
        Oid         root_oid;

        root_rti = linitial_int(node->partitioned_rels);
        root_oid = getrelid(root_rti, estate->es_range_table);
        rel = heap_open(root_oid, NoLock);  /* locked by InitPlan */
    }
    else
        rel = mtstate->resultRelInfo->ri_RelationDesc;

I don't know whether we could change this code not to use
PartitionedChildRelInfos::child_rels. Removing
PartitionedChildRelInfos machinery seems like a separate patch.

The last two patches implement the advanced partition matching
algorithm and are here in this set for ready reference. Please use [1]
for discussing/reviewing those.

[1] https://www.postgresql.org/message-id/CAFjFpRdjQvaUEV5DJX3TW6pU5eq54NCkadtxHX2JiJG_GvbrCA@mail.gmail.com
--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Attachment	Content-Type	Size
pg_dp_join_patches.v27.tar.gz	application/x-gzip	157.1 KB

From:	Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>
To:	Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Partition-wise join for join between (declaratively) partitioned tables
Date:	2017-09-04 03:38:55
Message-ID:	65dc81c5-5b63-520a-5b89-d73d8a2bb6d9@lab.ntt.co.jp
Lists:	pgsql-hackers

On 2017/09/02 4:10, Ashutosh Bapat wrote:
> This rebase mainly changes patch 0001, which translates partition
> hierarchy into inheritance hierarchy creating AppendRelInfos and
> RelOptInfos for partitioned partitions. Because of that, it's not
> necessary to record the partitioned partitions in a
> PartitionedChildRelInfos::child_rels. The only RTI that goes in there
> is the RTI of child RTE which is same as the parent RTE except inh
> flag. I tried removing that with a series of changes but it seems that
> following code in ExecInitModifyTable() requires it.
>     /* The root table RT index is at the head of the partitioned_rels list */
>     if (node->partitioned_rels)
>     {
>         Index       root_rti;
>         Oid         root_oid;
>
>         root_rti = linitial_int(node->partitioned_rels);
>         root_oid = getrelid(root_rti, estate->es_range_table);
>         rel = heap_open(root_oid, NoLock);  /* locked by InitPlan */
>     }
>     else
>         rel = mtstate->resultRelInfo->ri_RelationDesc;
>
> I don't know whether we could change this code not to use
> PartitionedChildRelInfos::child_rels.
Though I haven't read the patch yet, I think the above code is useless.
And I proposed a patch to clean it up before [1]. I'll add that patch
to the next commitfest.

Best regards,
Etsuro Fujita

[1]
https://www.postgresql.org/message-id/93cf9816-2f7d-0f67-8ed2-4a4e497a6ab8%40lab.ntt.co.jp

From:	Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>
To:	Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Partition-wise join for join between (declaratively) partitioned tables
Date:	2017-09-04 04:34:37
Message-ID:	cfc94b13-fd1a-fa0f-94e2-94790e8110a3@lab.ntt.co.jp
Lists:	pgsql-hackers

On 2017/09/04 12:38, Etsuro Fujita wrote:
> On 2017/09/02 4:10, Ashutosh Bapat wrote:
>> This rebase mainly changes patch 0001, which translates partition
>> hierarchy into inheritance hierarchy creating AppendRelInfos and
>> RelOptInfos for partitioned partitions. Because of that, it's not
>> necessary to record the partitioned partitions in a
>> PartitionedChildRelInfos::child_rels. The only RTI that goes in there
>> is the RTI of child RTE which is same as the parent RTE except inh
>> flag. I tried removing that with a series of changes but it seems that
>> following code in ExecInitModifyTable() requires it.
>>     /* The root table RT index is at the head of the partitioned_rels list */
>>     if (node->partitioned_rels)
>>     {
>>         Index       root_rti;
>>         Oid         root_oid;
>>
>>         root_rti = linitial_int(node->partitioned_rels);
>>         root_oid = getrelid(root_rti, estate->es_range_table);
>>         rel = heap_open(root_oid, NoLock);  /* locked by InitPlan */
>>     }
>>     else
>>         rel = mtstate->resultRelInfo->ri_RelationDesc;
>>
>> I don't know whether we could change this code not to use
>> PartitionedChildRelInfos::child_rels.

For a root partitioned table, ModifyTable.partitioned_rels comes from
PartitionedChildRelInfo.child_rels recorded for the table by
expand_inherited_rtentry(). In fact, the latter is copied verbatim to
ModifyTablePath (or AppendPath/MergeAppendPath) when creating the same.
The only point of keeping those RT indexes around in the ModifyTable node
is for the executor to be able to locate partitioned table RT entries and
lock them. Without them, the executor wouldn't know about those tables at
all, because there won't be subplans corresponding to partitioned tables
in the tree and hence their RT indexes won't appear in the
ModifyTable.resultRelations list. If your patch adds partitioned child
rel AppendRelInfos back for whatever reason, you should also make sure in
inheritance_planner() that their RT indexes don't end up in the
resultRelations list. See this piece of code in inheritance_planner():

        /* Build list of sub-paths */
        subpaths = lappend(subpaths, subpath);

        /* Build list of modified subroots, too */
        subroots = lappend(subroots, subroot);

        /* Build list of target-relation RT indexes */
        resultRelations = lappend_int(resultRelations, appinfo->child_relid);

Maybe it won't happen, because if this appinfo corresponds to a
partitioned child table, recursion would occur and we'll get to this piece
of code for only the leaf children.

By the way, if you want to get rid of PartitionedChildRelInfo, you can do
that as long as you find some other way of putting together the
partitioned_rels list to add into the ModifyTable (Append/MergeAppend)
node created for the root partitioned table. Currently,
PartitionedChildRelInfo (and the root->pcinfo_list) is the way for
expand_inherited_rtentry() to pass that information to the planner's
path-generating code. We may be able to generate that list when actually
creating the path using set_append_rel_pathlist() or
inheritance_planner(), without having created a PartitionedChildRelInfo
node beforehand.

> Though I haven't read the patch yet, I think the above code is useless.
> And I proposed a patch to clean it up before [1]. I'll add that patch to
> the next commitfest.

+1.

Thanks,
Amit

From:	Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
To:	Antonin Houska <ah(at)cybertec(dot)at>
Cc:	pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Partition-wise join for join between (declaratively) partitioned tables
Date:	2017-09-04 11:38:17
Message-ID:	CAFjFpRe-8P5pzfET4YZRH0Vawd0_o2TK4_zo+AjZ04mfCB3O0A@mail.gmail.com
Lists:	pgsql-hackers

On Fri, Sep 1, 2017 at 6:05 PM, Antonin Houska <ah(at)cybertec(dot)at> wrote:
> Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com> wrote:
>
>> I originally thought to provide it along with the changes to
>> expand_inherited_rtentry(), but that thread is taking longer. Jeevan
>> Chalke needs rebased patches for his work on aggregate pushdown and
>> Thomas might need them for further review. So, here they are.
>
> Since I have a related patch in the current commitfest
> (https://commitfest.postgresql.org/14/1247/), I spent some time reviewing your
> patch:
>
> * generate_partition_wise_join_paths()
>
> Right parenthesis is missing in the prologue.

Thanks for pointing that out. Fixed.

>
>
> * get_partitioned_child_rels_for_join()
>
> I think the Assert() statement is easier to understand inside the loop, see
> the assert.diff attachment.

The assert at the end of the function also checks that we have got
child_rels lists for all the parents passed in. That is not checked by
your version. Furthermore, we would have checked that each child_rels
has at least one element while building paths for base relations.
Checking the same again for joins doesn't add any value.

>
>
> * have_partkey_equi_join()
>
> As the function handles generic join, this comment doesn't seem to me
> relevant:
>
> /*
> * The equi-join between partition keys is strict if equi-join between
> * at least one partition key is using a strict operator. See
> * explanation about outer join reordering identity 3 in
> * optimizer/README
> */
> strict_op = op_strict(opexpr->opno);

What in that comment is not exactly relevant?

>
> And I think the function can return true even if strict_op is false for all
> the operators evaluated in the loop.

I think it does that. Do you have a case where it doesn't?

>
>
> * match_expr_to_partition_keys()
>
> I'm not sure this comment is clear enough:
>
> /*
> * If it's a strict equi-join a NULL partition key on one side will
> * not join a NULL partition key on the other side. So, rows with NULL
> * partition key from a partition on one side can not join with those
> * from a non-matching partition on the other side. So, search the
> * nullable partition keys as well.
> */
> if (!strict_op)
> continue;
>
> My understanding of the problem of NULL values generated by outer join is:
> these NULL values --- if evaluated by non-strict expression --- can make row
> of N-th partition on one side of the join match row(s) of *other than* N-th
> partition(s) on the other side. Thus the nullable input expressions may only
> be evaluated by strict operators. I think it'd be clearer if you stressed that
> (undesired) *match* of partition keys can be a problem, rather than mismatch

Sorry, I am not able to understand this. To me it looks like my
wording conveys what you are saying.

>
> If you insist on your wording, then I think you should at least move the
> comment below to the part that only deals with strict operators.

Done.

>
>
> * There are several places where lfirst_node() macro should be used. For
> example
>
> rel = lfirst_node(RelOptInfo, lc);
>
> instead of
>
> rel = (RelOptInfo *) lfirst(lc);

Thanks for that.

>
>
> * map_and_merge_partitions()
>
> Besides a few changes proposed in map_and_merge_partitions.diff (a few of them
> to suppress compiler warnings) I think that this part needs more thought:
>
> {
> Assert(mergemap1[index1] != mergemap2[index2] &&
> mergemap1[index1] >= 0 && mergemap2[index2] >= 0);
>
> /*
> * Both the partitions map to different merged partitions. This
> * means that multiple partitions from one relation matches to one
> * partition from the other relation. Partition-wise join does not
> * handle this case right now, since it requires ganging multiple
> * partitions together (into one RelOptInfo).
> */
> merged_index = -1;
> }
>
> I could hit this path with the following test:
>
> CREATE TABLE a(i int) PARTITION BY LIST(i);
> CREATE TABLE a_0 PARTITION OF a FOR VALUES IN (0, 2);
> CREATE TABLE b(j int) PARTITION BY LIST(j);
> CREATE TABLE b_0 PARTITION OF b FOR VALUES IN (1, 2);
>
> SET enable_partition_wise_join TO on;
>
> SELECT *
> FROM a
> FULL JOIN
> b ON i = j;
>
> I don't think there's a reason not to join a_0 partition to b_0, is there?

With the latest patchset I am seeing that partition-wise join is used
in this case. I have started a new thread [1] for advanced partition
matching patches. Please post review comments about the last two
patches on that thread.

[1] https://www.postgresql.org/message-id/CAFjFpRdjQvaUEV5DJX3TW6pU5eq54NCkadtxHX2JiJG_GvbrCA@mail.gmail.com

Attached patchset with above comments addressed.
--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Attachment	Content-Type	Size
pg_dp_join_patches_v28.tar.gz	application/x-gzip	159.8 KB

From:	Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
To:	Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>
Cc:	Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, Robert Haas <robertmhaas(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Partition-wise join for join between (declaratively) partitioned tables
Date:	2017-09-04 12:32:14
Message-ID:	CAFjFpRc9J+Dtw-tT6EW3uzsFiCvOHJj2g_PQeDLMvW1i9FVyDw@mail.gmail.com
Lists:	pgsql-hackers

On Mon, Sep 4, 2017 at 10:04 AM, Amit Langote
<Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp> wrote:
> On 2017/09/04 12:38, Etsuro Fujita wrote:
>> On 2017/09/02 4:10, Ashutosh Bapat wrote:
>>> This rebase mainly changes patch 0001, which translates partition
>>> hierarchy into inheritance hierarchy creating AppendRelInfos and
>>> RelOptInfos for partitioned partitions. Because of that, it's not
>>> necessary to record the partitioned partitions in a
>>> PartitionedChildRelInfos::child_rels. The only RTI that goes in there
>>> is the RTI of child RTE which is same as the parent RTE except inh
>>> flag. I tried removing that with a series of changes but it seems that
>>> following code in ExecInitModifyTable() requires it.
>>>     /* The root table RT index is at the head of the partitioned_rels list */
>>>     if (node->partitioned_rels)
>>>     {
>>>         Index       root_rti;
>>>         Oid         root_oid;
>>>
>>>         root_rti = linitial_int(node->partitioned_rels);
>>>         root_oid = getrelid(root_rti, estate->es_range_table);
>>>         rel = heap_open(root_oid, NoLock);  /* locked by InitPlan */
>>>     }
>>>     else
>>>         rel = mtstate->resultRelInfo->ri_RelationDesc;
>>>
>>> I don't know whether we could change this code not to use
>>> PartitionedChildRelInfos::child_rels.
>
> For a root partitioned table, ModifyTable.partitioned_rels comes from
> PartitionedChildRelInfo.child_rels recorded for the table by
> expand_inherited_rtentry(). In fact, the latter is copied verbatim to
> ModifyTablePath (or AppendPath/MergeAppendPath) when creating the same.
> The only point of keeping those RT indexes around in the ModifyTable node
> is for the executor to be able to locate partitioned table RT entries and
> lock them. Without them, the executor wouldn't know about those tables at
> all, because there won't be subplans corresponding to partitioned tables
> in the tree and hence their RT indexes won't appear in the
> ModifyTable.resultRelations list. If your patch adds partitioned child
> rel AppendRelInfos back for whatever reason, you should also make sure in
> inheritance_planner() that their RT indexes don't end up in the
> resultRelations list. See this piece of code in inheritance_planner():
>
>         /* Build list of sub-paths */
>         subpaths = lappend(subpaths, subpath);
>
>         /* Build list of modified subroots, too */
>         subroots = lappend(subroots, subroot);
>
>         /* Build list of target-relation RT indexes */
>         resultRelations = lappend_int(resultRelations, appinfo->child_relid);
>
> Maybe it won't happen, because if this appinfo corresponds to a
> partitioned child table, recursion would occur and we'll get to this piece
> of code for only the leaf children.

You are right. We don't execute above lines for partitioned partitions.

>
> By the way, if you want to get rid of PartitionedChildRelInfo, you can do
> that as long as you find some other way of putting together the
> partitioned_rels list to add into the ModifyTable (Append/MergeAppend)
> node created for the root partitioned table. Currently,
> PartitionedChildRelInfo (and the root->pcinfo_list) is the way for
> expand_inherited_rtentry() to pass that information to the planner's
> path-generating code. We may be able to generate that list when actually
> creating the path using set_append_rel_pathlist() or
> inheritance_planner(), without having created a PartitionedChildRelInfo
> node beforehand.

AFAIU, the list contained RTIs of the relations which didn't have
corresponding AppendRelInfos, in order to lock those relations. Now that
we create AppendRelInfos even for partitioned partitions, I don't think
we need the list to take care of the locks. Is there any other reason
why we maintain that list (apart from the trigger case I have raised,
for which Fujita-san says the list is not required either)?

>
>> Though I haven't read the patch yet, I think the above code is useless.
>> And I proposed a patch to clean it up before [1]. I'll add that patch to
>> the next commitfest.
>
> +1.
+1. Will Fujita-san's patch also handle getting rid of partitioned_rels list?

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

From:	Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>
To:	Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>, Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>
Cc:	Robert Haas <robertmhaas(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Partition-wise join for join between (declaratively) partitioned tables
Date:	2017-09-05 03:53:00
Message-ID:	d56f3142-14cc-a572-7110-7b94595f06ab@lab.ntt.co.jp
Lists:	pgsql-hackers

On 2017/09/04 21:32, Ashutosh Bapat wrote:
> On Mon, Sep 4, 2017 at 10:04 AM, Amit Langote
>> By the way, if you want to get rid of PartitionedChildRelInfo, you can do
>> that as long as you find some other way of putting together the
>> partitioned_rels list to add into the ModifyTable (Append/MergeAppend)
>> node created for the root partitioned table. Currently,
>> PartitionedChildRelInfo (and the root->pcinfo_list) is the way for
>> expand_inherited_rtentry() to pass that information to the planner's
>> path-generating code. We may be able to generate that list when actually
>> creating the path using set_append_rel_pathlist() or
>> inheritance_planner(), without having created a PartitionedChildRelInfo
>> node beforehand.
>
> AFAIU, the list contained RTIs of the relations, which didn't have
> corresponding AppendRelInfos to lock those relations. Now that we
> create AppendRelInfos even for partitioned partitions, I don't think
> we need the list to take care of the locks.
I don't think so either. (Since I haven't followed discussions on this
thread in detail yet, I don't understand the idea/need of creating
AppendRelInfos for partitioned partitions, though.)

>>> Though I haven't read the patch yet, I think the above code is useless.
>>> And I proposed a patch to clean it up before [1]. I'll add that patch to
>>> the next commitfest.
>>
>> +1.
> +1. Will Fujita-san's patch also handle getting rid of partitioned_rels list?

No. The patch just removes the partitioned_rels list from
nodeModifyTable.c.

Best regards,
Etsuro Fujita

From:	Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>
To:	Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
Cc:	Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, Robert Haas <robertmhaas(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Partition-wise join for join between (declaratively) partitioned tables
Date:	2017-09-05 04:20:16
Message-ID:	df1d4144-774d-5e9b-f0d8-62225ae52e71@lab.ntt.co.jp
Lists:	pgsql-hackers

On 2017/09/04 21:32, Ashutosh Bapat wrote:
> On Mon, Sep 4, 2017 at 10:04 AM, Amit Langote wrote:
>> By the way, if you want to get rid of PartitionedChildRelInfo, you can do
>> that as long as you find some other way of putting together the
>> partitioned_rels list to add into the ModifyTable (Append/MergeAppend)
>> node created for the root partitioned table. Currently,
>> PartitionedChildRelInfo (and the root->pcinfo_list) is the way for
>> expand_inherited_rtentry() to pass that information to the planner's
>> path-generating code. We may be able to generate that list when actually
>> creating the path using set_append_rel_pathlist() or
>> inheritance_planner(), without having created a PartitionedChildRelInfo
>> node beforehand.
>
> AFAIU, the list contained RTIs of the relations, which didn't have
> corresponding AppendRelInfos to lock those relations. Now that we
> create AppendRelInfos even for partitioned partitions, I don't think
> we need the list to take care of the locks. Is there any other reason
> why we maintain that list (apart from the trigger case I have raised
> and Fujita-san says that the list is not required in that case as
> well.)

We do *need* the list in the ModifyTable (Append/MergeAppend) node itself.
We can, however, get rid of the PartitionedChildRelInfo node that carries
the partitioned child RT indexes from an earlier planning phase
(expand_inherited_rtentry) to a later phase
(create_{modifytable|append|merge_append}_path). The later phase can
build that list from the AppendRelInfos that you mention we now [1] build.

As Fujita-san mentioned, his patch won't get rid of the partitioned_rels
list. Actually, I suppose he didn't say that partitioned_rels itself is
useless, just that its particular usage in ExecInitModifyTable is. We
still need that list for the planner to tell the executor that there are
some RT entries the latter would need to lock before executing a given
plan. Without that dedicated list, the executor cannot know at all that
certain tables in the partition tree (viz. the partitioned ones) need to
be locked. I mentioned the reason: (Merge)Append.subplans and
ModifyTable.resultRelations do not contain entries corresponding to the
partitioned tables, and traditionally, the executor looks at those lists
to figure out the tables to lock.
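
A toy model of the point about locking, under loud assumptions: this is not how the executor is written, and toy_collect_lock_targets() and toy_is_lock_target() are hypothetical helpers. It only shows that if the lock targets are derived from the leaf-level list alone, the partitioned tables never show up, so a separate list of their RT indexes has to be carried in the plan node.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Gather the RT indexes to lock: first the partitioned tables (root at the
 * head, as in partitioned_rels), then the leaf result relations.
 * Returns the number of entries written to 'out'.
 */
int
toy_collect_lock_targets(const int *partitioned_rels, int npartitioned,
                         const int *result_rels, int nresult,
                         int *out, int outcap)
{
    int n = 0;
    int i;

    assert(npartitioned + nresult <= outcap);
    for (i = 0; i < npartitioned; i++)
        out[n++] = partitioned_rels[i];
    for (i = 0; i < nresult; i++)
        out[n++] = result_rels[i];
    return n;
}

/* True if the RT index 'rti' appears among the collected lock targets. */
bool
toy_is_lock_target(const int *targets, int ntargets, int rti)
{
    int i;

    for (i = 0; i < ntargets; i++)
        if (targets[i] == rti)
            return true;
    return false;
}
```

Calling the collector with an empty partitioned list models a plan node without partitioned_rels: the root and intermediate partitioned tables are then absent from the lock targets even though the plan scans or modifies their leaves.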

Thanks,
Amit

[1] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=30833ba154

From:	Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>
To:	Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
Cc:	Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, Robert Haas <robertmhaas(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Partition-wise join for join between (declaratively) partitioned tables
Date:	2017-09-05 06:24:57
Message-ID:	d56793a6-1b13-f40c-f0cb-ce5576303e9c@lab.ntt.co.jp
Lists:	pgsql-hackers

On 2017/09/05 13:20, Amit Langote wrote:
> The later phase can
> build that list from the AppendRelInfos that you mention we now [1] build.
>
> [1] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=30833ba154

Looking at that commit again, AppendRelInfos are still not created for
partitioned child tables. Looking at the code in
expand_single_inheritance_child(), which exists as of 30833ba154:

    /*
     * Build an AppendRelInfo for this parent and child, unless the child is a
     * partitioned table.
     */
    if (childrte->relkind != RELKIND_PARTITIONED_TABLE && !childrte->inh)
    {
        ...code that builds AppendRelInfo...
    }
    else
        *partitioned_child_rels = lappend_int(*partitioned_child_rels,
                                              childRTindex);

you can see that an AppendRelInfo won't get built for partitioned child
tables.

Also, even if the commit changed things so that the child RT entries (and
AppendRelInfos) now get built in an order determined by depth-first
traversal of the partition tree, the same original parent RT index is used
to mark all AppendRelInfos, so the expansion essentially flattens the
hierarchy. In the updated patch I will post on the "path toward faster
partition pruning" thread [1], I am planning to rejigger things so that
two things start to happen:

1. For partitioned child tables, build the child RT entry with inh = true
and also build an AppendRelInfo

2. When recursively expanding a partitioned child table in
expand_partitioned_rtentry(), pass its new RT index as the
parentRTindex to the recursive call of expand_partitioned_rtentry(), so
that the resulting AppendRelInfos reflect immediate parent-child
relationship

With 1 in place, build_simple_rel() will build RelOptInfos even for
partitioned child tables, so that for each one, we can recursively build
an Append path. So, instead of just one Append path for the root
partitioned table, there is one for each partitioned table in the tree.
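
To make the difference between the flattened and the hierarchical expansion concrete, here is a minimal sketch. It is not PostgreSQL code: ToyPartition, toy_expand() and toy_parent_of_grandchild() are hypothetical names, and the only thing modelled is which parent RT index gets recorded for each child while RT indexes are assigned depth-first.

```c
#include <assert.h>

#define TOY_MAX_RELS 8

/* Hypothetical stand-in for a partition tree node; not a PostgreSQL struct. */
typedef struct ToyPartition
{
    int nchildren;      /* 0 for a leaf partition */
    int children[2];    /* indexes into the tree array */
} ToyPartition;

/*
 * Assign RT indexes depth-first and record, for each child RT index, the
 * parent RT index that its (toy) AppendRelInfo would carry.  With flatten
 * set, every child is marked with the root's RT index; otherwise each child
 * is marked with its immediate parent's RT index.
 */
void
toy_expand(const ToyPartition *tree, int node, int node_rti, int root_rti,
           int flatten, int *parent_of, int *next_rti)
{
    int i;

    for (i = 0; i < tree[node].nchildren; i++)
    {
        int child = tree[node].children[i];
        int child_rti = (*next_rti)++;

        assert(child_rti < TOY_MAX_RELS);
        parent_of[child_rti] = flatten ? root_rti : node_rti;

        /* A partitioned child is expanded with its own RT index as parent. */
        if (tree[child].nchildren > 0)
            toy_expand(tree, child, child_rti, root_rti, flatten,
                       parent_of, next_rti);
    }
}

/*
 * Tree: root (RT index 1) -> { partitioned child (2) -> { leaf (3) }, leaf (4) }.
 * Returns the parent RT index recorded for the leaf with RT index 3.
 */
int
toy_parent_of_grandchild(int flatten)
{
    static const ToyPartition tree[] = {
        {2, {1, 2}},    /* node 0: root, partitioned */
        {1, {3, 0}},    /* node 1: partitioned child */
        {0, {0, 0}},    /* node 2: leaf */
        {0, {0, 0}},    /* node 3: leaf under node 1 */
    };
    int parent_of[TOY_MAX_RELS] = {0};
    int next_rti = 2;   /* the root itself has RT index 1 */

    toy_expand(tree, 0, 1, 1, flatten, parent_of, &next_rti);
    return parent_of[3];
}
```

In this toy model, toy_parent_of_grandchild(1) reports the root as the parent of the grandchild (the flattened behaviour described above), while toy_parent_of_grandchild(0) reports the intermediate partitioned table, which is what item 2 aims for.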

I will be including the above described change in the partition-pruning
patch, because the other code in that patch relies on the same and I know
Ashutosh has wanted that for a long time. :)

Thanks,
Amit

[1]
https://www.postgresql.org/message-id/044e2e09-9690-7aff-1749-2d318da38a11%40lab.ntt.co.jp

From:	Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
To:	Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>
Cc:	Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, Robert Haas <robertmhaas(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Partition-wise join for join between (declaratively) partitioned tables
Date:	2017-09-05 06:30:02
Message-ID:	CAFjFpRdMNrd_y8AhR7zjkWmwfe5y0iR7Z6XBHBXzkY=TSBDqTg@mail.gmail.com
Lists:	pgsql-hackers

On Tue, Sep 5, 2017 at 11:54 AM, Amit Langote
<Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp> wrote:
> On 2017/09/05 13:20, Amit Langote wrote:
>> The later phase can
>> build that list from the AppendRelInfos that you mention we now [1] build.
>>
>> [1] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=30833ba154
>
> Looking at that commit again, AppendRelInfos are still not created for
> partitioned child tables. Looking at the code in
> expand_single_inheritance_child(), which exists as of 30833ba154:
>
>
> /*
> * Build an AppendRelInfo for this parent and child, unless the child is a
> * partitioned table.
> */
> if (childrte->relkind != RELKIND_PARTITIONED_TABLE && !childrte->inh)
> {
> ...code that builds AppendRelInfo...
> }
> else
> *partitioned_child_rels = lappend_int(*partitioned_child_rels,
> childRTindex);
>
> you can see that an AppendRelInfo won't get built for partitioned child
> tables.
>
> Also, even if the commit changed things so that the child RT entries (and
> AppendRelInfos) now get built in an order determined by depth-first
> traversal of the partition tree, the same original parent RT index is used
> to mark all AppendRelInfos, so the expansion essentially flattens the
> hierarchy. In the updated patch I will post on the "path toward faster
> partition pruning" thread [1], I am planning to rejigger things so that
> two things start to happen:
>
> 1. For partitioned child tables, build the child RT entry with inh = true
> and also build an AppendRelInfos
>
> 2. When recursively expanding a partitioned child table in
> expand_partitioned_rtentry(), pass its new RT index as the
> parentRTindex to the recursive call of expand_partitioned_rtentry(), so
> that the resulting AppendRelInfos reflect immediate parent-child
> relationship
>
> With 1 in place, build_simple_rel() will build RelOptInfos even for
> partitioned child tables, so that for each one, we can recursively build
> an Append path. So, instead of just one Append path for the root
> partitioned table, there is one for each partitioned table in the tree.
>
> I will be including the above described change in the partition-pruning
> patch, because the other code in that patch relies on the same and I know
> Ashuotsh has wanted that for a long time. :)

Those changes are already part of my updated 0001 patch, aren't they?
Maybe you should just review those and see if they are suitable for
you?

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

From:	Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>
To:	Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
Cc:	Robert Haas <robertmhaas(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Partition-wise join for join between (declaratively) partitioned tables
Date:	2017-09-05 06:35:20
Message-ID:	c5e54c43-7578-5934-bd48-61be9e6c2df7@lab.ntt.co.jp
Lists:	pgsql-hackers

On 2017/09/05 13:20, Amit Langote wrote:
> On 2017/09/04 21:32, Ashutosh Bapat wrote:

>> +1. Will Fujita-san's patch also handle getting rid of partitioned_rels list?
>
> As Fujita-san mentioned, his patch won't. Actually, I suppose he didn't
> say that partitioned_rels itself is useless, just that its particular
> usage in ExecInitModifyTable is.

That's right. (I thought there would probably be no need to create that
list if we created AppendRelInfos even for partitioned partitions.)

> We still need that list for planner to
> tell the executor that there are some RT entries the latter would need to
> lock before executing a given plan. Without that dedicated list, the
> executor cannot know at all that certain tables in the partition tree
> (viz. the partitioned ones) need to be locked. I mentioned the reason -
> (Merge)Append.subplans, ModifyTable.resultRelations does not contain
> respective entries corresponding to the partitioned tables, and
> traditionally, the executor looks at those lists to figure out the tables
> to lock.

I think so too.

Best regards,
Etsuro Fujita

From:	Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>
To:	Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
Cc:	Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, Robert Haas <robertmhaas(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Partition-wise join for join between (declaratively) partitioned tables
Date:	2017-09-05 06:36:30
Message-ID:	d3a3bf11-c53c-cf50-f8e2-098f1b657fda@lab.ntt.co.jp
Lists:	pgsql-hackers

On 2017/09/05 15:30, Ashutosh Bapat wrote:
> Those changes are already part of my updated 0001 patch. Aren't they?
> May be you should just review those and see if those are suitable for
> you?

Yeah, I think it's going to be the same patch, functionality-wise.

And sorry, I didn't realize you were talking about the case after applying
your patch on HEAD.

Thanks,
Amit

From:	Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
To:	Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>
Cc:	Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, Robert Haas <robertmhaas(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Partition-wise join for join between (declaratively) partitioned tables
Date:	2017-09-05 06:43:21
Message-ID:	CAFjFpRfSB5YU87sUKXV+Z3hioGkuH+wJR22qY6D64UTqGhiwOg@mail.gmail.com
Lists:	pgsql-hackers

On Tue, Sep 5, 2017 at 12:06 PM, Amit Langote
<Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp> wrote:
> On 2017/09/05 15:30, Ashutosh Bapat wrote:
>> Those changes are already part of my updated 0001 patch. Aren't they?
>> May be you should just review those and see if those are suitable for
>> you?
>
> Yeah, I think it's going to be the same patch, functionality-wise.
>
> And sorry, I didn't realize you were talking about the case after applying
> your patch on HEAD.
>

Ok. Can you please answer my previous questions?

AFAIU, the list contained RTIs of the relations which didn't have
corresponding AppendRelInfos, so that those relations could be locked.
Now that we create AppendRelInfos even for partitioned partitions with my
0001 patch, I don't think we need the list to take care of the locks. Is
there any other reason why we maintain that list (apart from the trigger
case I have raised, where Fujita-san says that the list is not required
either)?

Having asked that, I think my patch shouldn't deal with removing
partitioned_rels lists and related structures and code. It should be
done as a separate patch.
--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

From:	Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>
To:	Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
Cc:	Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, Robert Haas <robertmhaas(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Partition-wise join for join between (declaratively) partitioned tables
Date:	2017-09-05 07:46:15
Message-ID:	34e99e97-f099-12f1-73f7-657366d74606@lab.ntt.co.jp
Lists:	pgsql-hackers

On 2017/09/05 15:43, Ashutosh Bapat wrote:
> Ok. Can you please answer my previous questions?
>
> AFAIU, the list contained RTIs of the relations, which didnt' have
> corresponding AppendRelInfos to lock those relations. Now that we
> create AppendRelInfos even for partitioned partitions with my 0001
> patch, I don't think
> we need the list to take care of the locks. Is there any other reason
> why we maintain that list (apart from the trigger case I have raised
> and Fujita-san says that the list is not required in that case as
> well.)?

AppendRelInfos exist within the planner (they come to be and go away
within the planner). Once we leave the planner, that information is gone.

Executor will receive a plan node that won't contain that information:

1. Append has an appendplans field, which contains one plan tree for every
leaf partition. None of its fields, other than partitioned_rels,
contains the RT indexes of the partitioned tables. Similarly in the
case of MergeAppend.

2. ModifyTable has a resultRelations field which contains a list of leaf
partition RT indexes and a plans field which contains one plan tree for
every RT index in the resultRelations list (that is, a plan tree that
will scan the particular leaf partition). None of its fields, other
than partitioned_rels, contains the RT indexes of the partitioned
tables.

I learned over the course of developing the patch that added this
partitioned_rels field [1] that the executor needs to identify all the
affected tables by a given plan tree so that it could lock them. Executor
needs to lock them separately even if the plan itself was built after
locking all the relevant tables [2]. For example, see
ExecLockNonLeafAppendTables(), which will lock the tables in the
(Merge)Append.partitioned_rels list.

While I've been thinking all along that the same thing must be happening
for RT indexes in ModifyTable.partitioned_rels list (I said so a couple of
times on this thread), it's actually not. Instead,
ModifyTable.partitioned_rels of all ModifyTable nodes in a given query are
merged into PlannedStmt.nonleafResultRelations (which happens in
set_plan_refs) and that's where the executor finds them to lock them
(which happens in InitPlan).

So, it appears that ModifyTable.partitioned_rels is indeed unused in the
executor. But we still can't get rid of it from the ModifyTable node
itself without figuring out a way (a channel) to transfer that information
into PlannedStmt.nonleafResultRelations.
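
The flow described above can be sketched roughly as follows. This is a toy model, not the real planner or executor: the struct and function names (ToyModifyTable, ToyPlannedStmt, toy_collect_nonleaf, toy_lock_tables, toy_locked_count) are hypothetical, and "locking" is reduced to copying RT indexes into an array. It only shows that, once the per-node list stops feeding the statement-level list, the executor side no longer sees the partitioned tables.

```c
#include <assert.h>

#define TOY_MAX 4

/* Hypothetical, simplified plan node; field names only echo the thread. */
typedef struct ToyModifyTable
{
    int nresult;
    int resultRelations[TOY_MAX];   /* leaf partition RT indexes only */
    int npartitioned;
    int partitioned_rels[TOY_MAX];  /* partitioned table RT indexes */
} ToyModifyTable;

/* Hypothetical, simplified statement-level container. */
typedef struct ToyPlannedStmt
{
    int nnonleaf;
    int nonleafResultRelations[2 * TOY_MAX];
} ToyPlannedStmt;

/* Planner side: merge every node's partitioned_rels into the statement. */
void
toy_collect_nonleaf(ToyPlannedStmt *stmt, const ToyModifyTable *nodes,
                    int nnodes)
{
    int i, j;

    stmt->nnonleaf = 0;
    for (i = 0; i < nnodes; i++)
        for (j = 0; j < nodes[i].npartitioned; j++)
        {
            assert(stmt->nnonleaf < 2 * TOY_MAX);
            stmt->nonleafResultRelations[stmt->nnonleaf++] =
                nodes[i].partitioned_rels[j];
        }
}

/*
 * Executor side: "lock" the non-leaf tables found in the statement, then the
 * leaf result relations of the node.  Returns the number of tables locked.
 */
int
toy_lock_tables(const ToyPlannedStmt *stmt, const ToyModifyTable *node,
                int *locked)
{
    int i, n = 0;

    for (i = 0; i < stmt->nnonleaf; i++)
        locked[n++] = stmt->nonleafResultRelations[i];
    for (i = 0; i < node->nresult; i++)
        locked[n++] = node->resultRelations[i];
    return n;
}

/* Two partitioned tables (RT indexes 1, 2) above two leaves (3, 4). */
int
toy_locked_count(int keep_channel)
{
    ToyModifyTable node = {2, {3, 4}, 2, {1, 2}};
    ToyPlannedStmt stmt;
    int locked[3 * TOY_MAX];

    if (!keep_channel)
        node.npartitioned = 0;  /* pretend the per-node list was removed */
    toy_collect_nonleaf(&stmt, &node, 1);
    return toy_lock_tables(&stmt, &node, locked);
}
```

With the channel in place, all four tables get locked in this model; without it, only the two leaf partitions do.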

> Having asked that, I think my patch shouldn't deal with removing
> partitioned_rels lists and related structures and code. It should be
> done as a separate patch.

Going back to your original email which started this discussion, it seems
that we agree that the PartitionedChildRelInfo node can be removed, and
I agree that it shouldn't be done in the partitionwise-join patch series
but as a separate patch. As described above, we shouldn't try yet to get
rid of the partitioned_rels list that appears in some plan nodes.

Thanks,
Amit

[1]
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=d3cc37f1d8

[2]
https://www.postgresql.org/message-id/CA%2BTgmoYiwviCDRi3Zk%2BQuXj1r7uMu9T_kDNq%2B17PCWgzrbzw8A%40mail.gmail.com

From:	Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
To:	Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>
Cc:	Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, Robert Haas <robertmhaas(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Partition-wise join for join between (declaratively) partitioned tables
Date:	2017-09-05 09:30:57
Message-ID:	CAFjFpRfd_8UBY_FCq4SfKLQaG+ERu8aqnHc7ejwcc170yt-QJA@mail.gmail.com
Lists:	pgsql-hackers

On Tue, Sep 5, 2017 at 1:16 PM, Amit Langote
<Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp> wrote:
> On 2017/09/05 15:43, Ashutosh Bapat wrote:
>> Ok. Can you please answer my previous questions?
>>
>> AFAIU, the list contained RTIs of the relations, which didnt' have
>> corresponding AppendRelInfos to lock those relations. Now that we
>> create AppendRelInfos even for partitioned partitions with my 0001
>> patch, I don't think
>> we need the list to take care of the locks. Is there any other reason
>> why we maintain that list (apart from the trigger case I have raised
>> and Fujita-san says that the list is not required in that case as
>> well.)?
>
> AppendRelInfos exist within the planner (they come to be and go away
> within the planner). Once we leave the planner, that information is gone.
>
> Executor will receive a plan node that won't contain that information:
>
> 1. Append has an appendplans field, which contains one plan tree for every
> leaf partition. None of its fields, other than partitined_rels,
> contains the RT indexes of the partitioned tables. Similarly in the
> case of MergeAppend.
>
> 2. ModifyTable has a resultRelations fields which contains a list of leaf
> partition RT indexes and a plans field which contains one plan tree for
> every RT index in the resultRelations list (that is a plan tree that
> will scan the particular leaf partition). None of its fields, other
> than partitined_rels, contains the RT indexes of the partitioned
> tables.
>
> I learned over the course of developing the patch that added this
> partitioned_rels field [1] that the executor needs to identify all the
> affected tables by a given plan tree so that it could lock them. Executor
> needs to lock them separately even if the plan itself was built after
> locking all the relevant tables [2]. For example, see
> ExecLockNonLeafAppendTables(), which will lock the tables in the
> (Merge)Append.partitioned_rels list.
>
> While I've been thinking all along that the same thing must be happening
> for RT indexes in ModifyTable.partitioned_rels list (I said so a couple of
> times on this thread), it's actually not. Instead,
> ModifyTable.partitioned_rels of all ModifyTable nodes in a give query are
> merged into PlannedStmt.nonleafResultRelations (which happens in
> set_plan_refs) and that's where the executor finds them to lock them
> (which happens in InitPlan).
>
> So, it appears that ModifyTable.partitioned_rels is indeed unused in the
> executor. But we still can't get rid of it from the ModifyTable node
> itself without figuring out a way (a channel) to transfer that information
> into PlannedStmt.nonleafResultRelations.

Thanks a lot for this detailed analysis. IIUC, in my 0001 patch, I am
not adding any partitioned partition other than the parent itself. But
since every partitioned partition in turn acts as a parent, it appears in
its own list. The list obtained by concatenating all such lists
together contains all the partitioned partition RTIs. In my patch, I
need to teach accumulate_append_subpath() to accumulate
partitioned_rels as well.

>
>> Having asked that, I think my patch shouldn't deal with removing
>> partitioned_rels lists and related structures and code. It should be
>> done as a separate patch.
>
> Going back to your original email which started this discussion, it seems
> that we agree on that the PartitionedChildRelInfo node can be removed, and
> I agree that it shouldn't be done in the partitionwise-join patch series
> but as a separate patch. As described above, we shouldn't try yet to get
> rid of the partitioned_rels list that appears in some plan nodes.

Thanks.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

From:	Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
To:	Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>
Cc:	Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, Robert Haas <robertmhaas(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Partition-wise join for join between (declaratively) partitioned tables
Date:	2017-09-05 11:01:31
Message-ID:	CAFjFpRfRDhWp=oguNjyzN=NMoOD+RCC3wS+b+xbGKwKUk0dRKg@mail.gmail.com
Lists:	pgsql-hackers

On Tue, Sep 5, 2017 at 3:00 PM, Ashutosh Bapat
<ashutosh(dot)bapat(at)enterprisedb(dot)com> wrote:
> On Tue, Sep 5, 2017 at 1:16 PM, Amit Langote
> <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp> wrote:
>> On 2017/09/05 15:43, Ashutosh Bapat wrote:
>>> Ok. Can you please answer my previous questions?
>>>
>>> AFAIU, the list contained RTIs of the relations, which didnt' have
>>> corresponding AppendRelInfos to lock those relations. Now that we
>>> create AppendRelInfos even for partitioned partitions with my 0001
>>> patch, I don't think
>>> we need the list to take care of the locks. Is there any other reason
>>> why we maintain that list (apart from the trigger case I have raised
>>> and Fujita-san says that the list is not required in that case as
>>> well.)?
>>
>> AppendRelInfos exist within the planner (they come to be and go away
>> within the planner). Once we leave the planner, that information is gone.
>>
>> Executor will receive a plan node that won't contain that information:
>>
>> 1. Append has an appendplans field, which contains one plan tree for every
>> leaf partition. None of its fields, other than partitined_rels,
>> contains the RT indexes of the partitioned tables. Similarly in the
>> case of MergeAppend.
>>
>> 2. ModifyTable has a resultRelations fields which contains a list of leaf
>> partition RT indexes and a plans field which contains one plan tree for
>> every RT index in the resultRelations list (that is a plan tree that
>> will scan the particular leaf partition). None of its fields, other
>> than partitined_rels, contains the RT indexes of the partitioned
>> tables.
>>
>> I learned over the course of developing the patch that added this
>> partitioned_rels field [1] that the executor needs to identify all the
>> affected tables by a given plan tree so that it could lock them. Executor
>> needs to lock them separately even if the plan itself was built after
>> locking all the relevant tables [2]. For example, see
>> ExecLockNonLeafAppendTables(), which will lock the tables in the
>> (Merge)Append.partitioned_rels list.
>>
>> While I've been thinking all along that the same thing must be happening
>> for RT indexes in ModifyTable.partitioned_rels list (I said so a couple of
>> times on this thread), it's actually not. Instead,
>> ModifyTable.partitioned_rels of all ModifyTable nodes in a give query are
>> merged into PlannedStmt.nonleafResultRelations (which happens in
>> set_plan_refs) and that's where the executor finds them to lock them
>> (which happens in InitPlan).
>>
>> So, it appears that ModifyTable.partitioned_rels is indeed unused in the
>> executor. But we still can't get rid of it from the ModifyTable node
>> itself without figuring out a way (a channel) to transfer that information
>> into PlannedStmt.nonleafResultRelations.
>
> Thanks a lot for this detailed analysis. IIUC, in my 0001 patch, I am
> not adding any partitioned partition other than the parent itself. But
> since every partitioned partition in turn acts as parent, it appears
> its own list. The list obtained by concatenating all such lists
> together contains all the partitioned partition RTIs. In my patch, I
> need to teach accumulate_append_subpath() to accumulate
> partitioned_rels as well.
>

accumulate_append_subpath() is executed for every path instead of
every relation, so changing it would collect the same list multiple
times. Instead, I found that the old way of associating all intermediate
partitions with the root partitioned relation works better. Here's the
updated patch set.
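
To illustrate the duplication concern, here is one possible reading of it as a toy model; it is not taken from the patch. ToyChildRel and the toy_* functions are hypothetical. The assumption is simply that a partitioned child has several candidate paths, and that a collector invoked once per path appends that child's partitioned_rels each time it is invoked.

```c
#include <assert.h>

#define TOY_MAX 16

/* Hypothetical child relation; not a PostgreSQL struct. */
typedef struct ToyChildRel
{
    int npaths;                 /* candidate paths built for this relation */
    int npartitioned;           /* entries used in partitioned_rels */
    int partitioned_rels[2];    /* RT indexes of partitioned tables below it */
} ToyChildRel;

/* Append one relation's list to out[], returning the new length. */
int
toy_append_list(const ToyChildRel *rel, int *out, int n)
{
    int i;

    for (i = 0; i < rel->npartitioned; i++)
    {
        assert(n < TOY_MAX);
        out[n++] = rel->partitioned_rels[i];
    }
    return n;
}

/* Collect while visiting every path: each list is appended npaths times. */
int
toy_collect_per_path(const ToyChildRel *rels, int nrels, int *out)
{
    int r, p, n = 0;

    for (r = 0; r < nrels; r++)
        for (p = 0; p < rels[r].npaths; p++)
            n = toy_append_list(&rels[r], out, n);
    return n;
}

/* Collect while visiting every relation: each list is appended once. */
int
toy_collect_per_rel(const ToyChildRel *rels, int nrels, int *out)
{
    int r, n = 0;

    for (r = 0; r < nrels; r++)
        n = toy_append_list(&rels[r], out, n);
    return n;
}

/* One partitioned child (RT index 2) for which three paths were built. */
int
toy_demo_count(int per_path)
{
    ToyChildRel rels[1] = {{3, 1, {2, 0}}};
    int out[TOY_MAX];

    return per_path ? toy_collect_per_path(rels, 1, out)
                    : toy_collect_per_rel(rels, 1, out);
}
```

In this model the per-path collector ends up with three copies of the same RT index, whereas collecting once per relation yields a single entry.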
--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Attachment	Content-Type	Size
pg_dp_join_patches_v29.tar.gz	application/x-gzip	165.6 KB

From:	Antonin Houska <ah(at)cybertec(dot)at>
To:	Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
Cc:	pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Partition-wise join for join between (declaratively) partitioned tables
Date:	2017-09-07 11:02:55
Message-ID:	11888.1504782175@localhost
Lists:	pgsql-hackers

Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com> wrote:

> On Fri, Sep 1, 2017 at 6:05 PM, Antonin Houska <ah(at)cybertec(dot)at> wrote:
> >
> >
> >
> > * get_partitioned_child_rels_for_join()
> >
> > I think the Assert() statement is easier to understand inside the loop, see
> > the assert.diff attachment.

> The assert at the end of function also checks that we have got
> child_rels lists for all the parents passed in.

Really? I can imagine that some instances of PartitionedChildRelInfo have the
child_rels list empty, while other ones have these lists long enough to
compensate for the empty lists.
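
A small sketch of the scenario I have in mind. It assumes, and this is only my assumption about the check, that the assert at the end of the function compares the total number of collected child rels with the number of parents. The function names are hypothetical and the lists are reduced to their lengths.

```c
#include <assert.h>

/*
 * Aggregate check (assumed shape of an end-of-function assert): the total
 * number of collected child entries is at least the number of parents.
 */
int
toy_total_check(const int *list_lengths, int nparents)
{
    int i, total = 0;

    for (i = 0; i < nparents; i++)
        total += list_lengths[i];
    return total >= nparents;
}

/* Per-parent check (assert inside the loop): every list is non-empty. */
int
toy_per_parent_check(const int *list_lengths, int nparents)
{
    int i;

    for (i = 0; i < nparents; i++)
        if (list_lengths[i] < 1)
            return 0;
    return 1;
}

/* One parent with an empty list, another with two entries. */
int
toy_demo_checks_disagree(void)
{
    int lengths[2] = {0, 2};

    return toy_total_check(lengths, 2) && !toy_per_parent_check(lengths, 2);
}
```

With lengths {0, 2} the aggregate check passes although one parent has no child rels at all, which only the per-parent check notices.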

> >
> >
> > * have_partkey_equi_join()
> >
> > As the function handles generic join, this comment doesn't seem to me
> > relevant:
> >
> > /*
> > * The equi-join between partition keys is strict if equi-join between
> > * at least one partition key is using a strict operator. See
> > * explanation about outer join reordering identity 3 in
> > * optimizer/README
> > */
> > strict_op = op_strict(opexpr->opno);
>
> What in that comment is not exactly relevant?

Basically I don't understand why you mention join reordering here. The join
ordering questions must all have been resolved by the time
have_partkey_equi_join() is called.

> >
> > And I think the function can return true even if strict_op is false for all
> > the operators evaluated in the loop.
>
> I think it does that. Do you have a case where it doesn't?

Here I refer to this part of the comment above:

"... if equi-join between at least one partition key is using a strict
operator."

My understanding of the code (especially match_expr_to_partition_keys) is that
no operator actually needs to be strict as long as each operator involved in
the join matches at least one non-nullable expression on both sides of the
join.

> > * match_expr_to_partition_keys()
> >
> > I'm not sure this comment is clear enough:
> >
> > /*
> > * If it's a strict equi-join a NULL partition key on one side will
> > * not join a NULL partition key on the other side. So, rows with NULL
> > * partition key from a partition on one side can not join with those
> > * from a non-matching partition on the other side. So, search the
> > * nullable partition keys as well.
> > */
> > if (!strict_op)
> > continue;
> >
> > My understanding of the problem of NULL values generated by outer join is:
> > these NULL values --- if evaluated by non-strict expression --- can make row
> > of N-th partition on one side of the join match row(s) of *other than* N-th
> > partition(s) on the other side. Thus the nullable input expressions may only
> > be evaluated by strict operators. I think it'd be clearer if you stressed that
> > (undesired) *match* of partition keys can be a problem, rather than mismatch
>
> Sorry, I am not able to understand this. To me it looks like my
> wording conveys what you are saying.

I just tried to express the idea in a way that is clearer to me. I think we
both mean the same. Not sure I should spend more effort on another version of
the comment.

> > If you insist on your wording, then I think you should at least move the
> > comment below to the part that only deals with strict operators.
>
> Done.

o.k.

> >
> > * map_and_merge_partitions()
> >
> > Besides a few changes proposed in map_and_merge_partitions.diff (a few of them
> > to suppress compiler warnings) I think that this part needs more thought:
> >
> > {
> > Assert(mergemap1[index1] != mergemap2[index2] &&
> > mergemap1[index1] >= 0 && mergemap2[index2] >= 0);
> >
> > /*
> > * Both the partitions map to different merged partitions. This
> > * means that multiple partitions from one relation matches to one
> > * partition from the other relation. Partition-wise join does not
> > * handle this case right now, since it requires ganging multiple
> > * partitions together (into one RelOptInfo).
> > */
> > merged_index = -1;
> > }
> >
> > I could hit this path with the following test:
> >
> > CREATE TABLE a(i int) PARTITION BY LIST(i);
> > CREATE TABLE a_0 PARTITION OF a FOR VALUES IN (0, 2);
> > CREATE TABLE b(j int) PARTITION BY LIST(j);
> > CREATE TABLE b_0 PARTITION OF b FOR VALUES IN (1, 2);
> >
> > SET enable_partition_wise_join TO on;
> >
> > SELECT *
> > FROM a
> > FULL JOIN
> > b ON i = j;
> >
> > I don't think there's a reason not to join a_0 partition to b_0, is there?
>
> With the latest patchset I am seeing that partition-wise join is used
> in this case. I have started a new thread [1] for advanced partition
> matching patches.

What plan do you get with the patches from the message below?

https://www.postgresql.org/message-id/CAFjFpRfdXpuSu0pxON3dKcr8WndJkaXLzHUVax_Laod0Tgc6UQ@mail.gmail.com

I still see the join above Append, not below:

                               QUERY PLAN
-------------------------------------------------------------------------
 Merge Full Join  (cost=359.57..860.00 rows=32512 width=8)
   Merge Cond: (a_0.i = b_0.j)
   ->  Sort  (cost=179.78..186.16 rows=2550 width=4)
         Sort Key: a_0.i
         ->  Append  (cost=0.00..35.50 rows=2550 width=4)
               ->  Seq Scan on a_0  (cost=0.00..35.50 rows=2550 width=4)
   ->  Sort  (cost=179.78..186.16 rows=2550 width=4)
         Sort Key: b_0.j
         ->  Append  (cost=0.00..35.50 rows=2550 width=4)
               ->  Seq Scan on b_0  (cost=0.00..35.50 rows=2550 width=4)

> Please post review comments about the last two patches on that thread.

OK, I'll do so if I find any problems.

> [1] https://www.postgresql.org/message-id/CAFjFpRdjQvaUEV5DJX3TW6pU5eq54NCkadtxHX2JiJG_GvbrCA@mail.gmail.com

--
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at

From:	Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
To:	Antonin Houska <ah(at)cybertec(dot)at>
Cc:	pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Partition-wise join for join between (declaratively) partitioned tables
Date:	2017-09-07 12:07:59
Message-ID:	CAFjFpRc8DJw_fwVC9NNv9fjcsOevqY2foaCSziB6ohCbsD7dDA@mail.gmail.com
Lists:	pgsql-hackers

On Thu, Sep 7, 2017 at 4:32 PM, Antonin Houska <ah(at)cybertec(dot)at> wrote:
> Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com> wrote:
>
>> On Fri, Sep 1, 2017 at 6:05 PM, Antonin Houska <ah(at)cybertec(dot)at> wrote:
>> >
>> >
>> >
>> > * get_partitioned_child_rels_for_join()
>> >
>> > I think the Assert() statement is easier to understand inside the loop, see
>> > the assert.diff attachment.
>
>> The assert at the end of function also checks that we have got
>> child_rels lists for all the parents passed in.
>
> Really? I can imagine that some instances of PartitionedChildRelInfo have the
> child_rels list empty, while other ones have these lists long enough to
> compensate for the empty lists.
>

That isn't true. Each child_rels list will at least have one entry.
Please see get_partitioned_child_rels().

>> >
>> >
>> > * have_partkey_equi_join()
>> >
>> > As the function handles generic join, this comment doesn't seem to me
>> > relevant:
>> >
>> > /*
>> > * The equi-join between partition keys is strict if equi-join between
>> > * at least one partition key is using a strict operator. See
>> > * explanation about outer join reordering identity 3 in
>> > * optimizer/README
>> > */
>> > strict_op = op_strict(opexpr->opno);
>>
>> What in that comment is not exactly relevant?
>
> Basically I don't understand why you mention join reordering here. The join
> ordering questions must all have been resolved by the time
> have_partkey_equi_join() is called.

I am referring to a particular section in README which talks about the
relation between strict operator and legal join order.

>
>> >
>> > And I think the function can return true even if strict_op is false for all
>> > the operators evaluated in the loop.
>>
>> I think it does that. Do you have a case where it doesn't?
>
> Here I refer to this part of the comment above:
>
> "... if equi-join between at least one partition key is using a strict
> operator."
>
> My understanding of the code (especially match_expr_to_partition_keys) is that
> no operator actually needs to be strict as long as each operator involved in
> the join matches at least one non-nullable expression on both sides of the
> join.

I don't think so. A strict operator returns NULL when either of the
inputs is NULL. We cannot say the same for non-strict operators, which
may deem NULL and non-NULL arguments equal, even though that looks
insane.
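
As an illustration only, the difference can be sketched in C with a
hypothetical nullable integer type. Neither function below is a PostgreSQL
operator; odd_nonstrict_eq is a contrived example of the kind of non-strict
behaviour described above, assumed here to treat NULL like 0.

```c
#include <stdbool.h>

/* Hypothetical nullable integer standing in for a SQL value. */
typedef struct
{
    bool isnull;
    int  value;
} NullableInt;

/* Three-valued comparison result. */
typedef enum { TV_FALSE, TV_TRUE, TV_NULL } TriBool;

/* Strict equality: any NULL input yields NULL, so such rows never join. */
TriBool
strict_eq(NullableInt a, NullableInt b)
{
    if (a.isnull || b.isnull)
        return TV_NULL;
    return (a.value == b.value) ? TV_TRUE : TV_FALSE;
}

/*
 * Contrived non-strict operator (an assumption for illustration): it
 * treats NULL as 0, so it can report a NULL and a non-NULL input as equal.
 */
TriBool
odd_nonstrict_eq(NullableInt a, NullableInt b)
{
    int av = a.isnull ? 0 : a.value;
    int bv = b.isnull ? 0 : b.value;

    return (av == bv) ? TV_TRUE : TV_FALSE;
}
```

With the strict operator, a row whose key is NULL can never satisfy the
join clause; with the contrived operator, a NULL key matches a key of 0.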

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
Cc:	Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Partition-wise join for join between (declaratively) partitioned tables
Date:	2017-09-07 19:04:14
Message-ID:	CA+TgmoZEUonD9dUZH1FBEyq=PEv_KvE3wC=A=0zm-_tRz_917A@mail.gmail.com
Lists:	pgsql-hackers

On Tue, Sep 5, 2017 at 7:01 AM, Ashutosh Bapat
<ashutosh(dot)bapat(at)enterprisedb(dot)com> wrote:
> accumulate_append_subpath() is executed for every path instead of
> every relation, so changing it would collect the same list multiple
> times. Instead, I found the old way of associating all intermediate
> partitions with the root partitioned relation work better. Here's the
> updated patch set.

When I tried out patch 0001, it crashed repeatedly during 'make check'
because of an assertion failure in get_partitioned_child_rels. It
seemed to me that the way the patch was refactoring
expand_inherited_rtentry involved more code rearrangement than
necessary, so I reverted all the code rearrangement and just kept the
functional changes, and all the crashes went away. (That refactoring
also failed to initialize has_child properly.) In so doing, I
reintroduced the problem that the PartitionedChildRelInfo lists
weren't getting set up correctly, but after some thought I realized
that was just because expand_single_inheritance_child() was choosing
between adding an RTE and adding the OID to partitioned_child_rels,
whereas for an intermediate partitioned table it needs to do both. So
I inserted a trivial fix for that problem (replacing "else" with a new
"if"-test), basically:

- else
+
+ if (childrte->relkind == RELKIND_PARTITIONED_TABLE)
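
The effect of that one-line change can be sketched as follows. This is a
simplified model, not the patch itself: ExpandOutcome and both functions
are hypothetical, and needs_rte is a stand-in for whatever condition guards
the first branch, which the fragment above does not show.

```c
#include <stdbool.h>

/* relkind letters as used in pg_class */
#define RELKIND_RELATION          'r'
#define RELKIND_PARTITIONED_TABLE 'p'

/* What happened to one child during expansion (hypothetical model). */
typedef struct
{
    bool rte_added;                 /* an RTE was added for the child */
    bool recorded_as_partitioned;   /* added to partitioned_child_rels */
} ExpandOutcome;

/* Before the fix: the two actions are mutually exclusive. */
ExpandOutcome
expand_child_either_or(char child_relkind, bool needs_rte)
{
    ExpandOutcome out = {false, false};

    (void) child_relkind;           /* the "else" never looks at relkind */
    if (needs_rte)
        out.rte_added = true;
    else
        out.recorded_as_partitioned = true;
    return out;
}

/* After the fix: the second action has its own relkind test. */
ExpandOutcome
expand_child_both(char child_relkind, bool needs_rte)
{
    ExpandOutcome out = {false, false};

    if (needs_rte)
        out.rte_added = true;
    if (child_relkind == RELKIND_PARTITIONED_TABLE)
        out.recorded_as_partitioned = true;
    return out;
}
```

In this model an intermediate partitioned table that needs an RTE gets only
the RTE in the first version, and both the RTE and the
partitioned_child_rels entry in the second.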

Please check out the attached version of the patch. In addition to
the above simplifications, I did some adjustments to the comments in
various places - some just grammar and others a bit more substantive.
And I think I broke a long line in one place, too.

One thing I notice is that if I rip out the changes to initsplan.c,
the new regression test still passes. If it's possible to write a
test that fails without those changes, I think it would be a good idea
to include one in the patch. That's certainly one of the subtler
parts of this patch, IMHO.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment	Content-Type	Size
expand-stepwise-rmh.patch	application/octet-stream	14.3 KB

From:	Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>, Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
Cc:	Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Partition-wise join for join between (declaratively) partitioned tables
Date:	2017-09-08 05:47:20
Message-ID:	d2f1cdcb-ebb4-76c5-e471-79348ca5d7a7@lab.ntt.co.jp
Lists:	pgsql-hackers

On 2017/09/08 4:04, Robert Haas wrote:
> On Tue, Sep 5, 2017 at 7:01 AM, Ashutosh Bapat
> <ashutosh(dot)bapat(at)enterprisedb(dot)com> wrote:
>> accumulate_append_subpath() is executed for every path instead of
>> every relation, so changing it would collect the same list multiple
>> times. Instead, I found the old way of associating all intermediate
>> partitions with the root partitioned relation work better. Here's the
>> updated patch set.
>
> When I tried out patch 0001, it crashed repeatedly during 'make check'
> because of an assertion failure in get_partitioned_child_rels. It
> seemed to me that the way the patch was refactoring
> expand_inherited_rtentry involved more code rearrangement than
> necessary, so I reverted all the code rearrangement and just kept the
> functional changes, and all the crashes went away. (That refactoring
> also failed to initialize has_child properly.)

When I tried the attached patch, it doesn't seem to expand partitioning
inheritance in a step-wise manner as the patch's title says. I think the
rewritten patch forgot to include Ashutosh's changes to
expand_single_inheritance_child() whereby the AppendRelInfo of the child
will be marked with the direct parent instead of always the root parent.

I updated the patch to include just those changes. I'm not sure about
one of Ashutosh's changes whereby the child PlanRowMark is also passed
to expand_partitioned_rtentry() to use as the parent PlanRowMark. I think
the child RTE, child RT index and child Relation are fine, because they
are necessary for creating AppendRelInfos in a desired way for later
planning steps. But PlanRowMarks are not processed within the planner
afterwards and do not need to be marked with the immediate parent-child
association in the same way that AppendRelInfos need to be.

I also included the changes to add_paths_to_append_rel() from my patch on
the "path toward faster partition pruning" thread. We'd need that change,
because while add_paths_to_append_rel() is called for all partitioned
table RTEs in a given partition tree, expand_inherited_rtentry() would
have set up a PartitionedChildRelInfo only for the root parent, so
get_partitioned_child_rels() would not find the same for non-root
partitioned table rels and crash, failing the Assert. The changes I made
are such that we call get_partitioned_child_rels() only for the parent
rels that are known to correspond to root partitioned tables (or, as you
pointed out on the thread, "the table named in the query" as opposed to
those added to the query as a result of inheritance expansion). In addition to
the relkind check on the input RTE, it would seem that checking that the
reloptkind is RELOPT_BASEREL would be enough. But actually not, because
if a partitioned table is accessed in a UNION ALL query, reloptkind even
for the root partitioned table (the table named in the query) would be
RELOPT_OTHER_MEMBER_REL. The only way to confirm that the input rel is
actually the root partitioned table is to check whether its parent rel is
not RTE_RELATION, because the parent in case of UNION ALL Append is a
RTE_SUBQUERY RT entry.

> One thing I notice is that if I rip out the changes to initsplan.c,
> the new regression test still passes. If it's possible to write a
> test that fails without those changes, I think it would be a good idea
> to include one in the patch. That's certainly one of the subtler
> parts of this patch, IMHO.

Back when this (step-wise expansion of partition inheritance) used to be a
patch in the original declarative partitioning patch series, Ashutosh had
reported a test query [1] that would fail getting a plan, for which we
came up with the initsplan.c changes in this patch as the solution:

ERROR: could not devise a query plan for the given query

I tried that query again without the initsplan.c changes and somehow the
same error does not occur anymore. It's strange because without the
initsplan.c changes, there is no way for partitions lower in the tree than
the first level to get the direct_lateral_relids and lateral_relids from
the root parent rel. Maybe, Ashutosh has a way to devise the failing
query again.

I also confirmed that the partition-pruning patch set works fine with this
patch instead of the patch on that thread with the same functionality,
which I will now drop from that patch set. Sorry about the wasted time.

Thanks,
Amit

[1]
https://www.postgresql.org/message-id/CAFjFpReZF34MDbY95xoATi0xVj2mAry4-LHBWVBayOc8gj%3Diqg%40mail.gmail.com

Attachment	Content-Type	Size
expand-stepwise-rmh-2.patch	text/plain	19.6 KB

From:	Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>, Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
Cc:	Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Partition-wise join for join between (declaratively) partitioned tables
Date:	2017-09-08 05:53:03
Message-ID:	49dc5182-fe4f-7ab1-0d58-7a57aa81a312@lab.ntt.co.jp
Lists:	pgsql-hackers

On 2017/09/08 14:47, Amit Langote wrote:
> When I tried the attached patch, it doesn't seem to expand partitioning
> inheritance in step-wise manner as the patch's title says.

Oops. By "attached patch", I had meant to say Robert's patch, that
is, expand-stepwise-rmh.patch. Not expand-stepwise-rmh-2.patch, which is
the updated version of the patch attached to the quoted message.

Thanks,
Amit

From:	Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Partition-wise join for join between (declaratively) partitioned tables
Date:	2017-09-08 07:43:39
Message-ID:	CAFjFpRcZ3nuMr+8B6D4_2qaGHeK7s81Hk2oVqxbZJVXe1gqOxQ@mail.gmail.com
Lists:	pgsql-hackers

On Fri, Sep 8, 2017 at 12:34 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Tue, Sep 5, 2017 at 7:01 AM, Ashutosh Bapat
> <ashutosh(dot)bapat(at)enterprisedb(dot)com> wrote:
>> accumulate_append_subpath() is executed for every path instead of
>> every relation, so changing it would collect the same list multiple
>> times. Instead, I found the old way of associating all intermediate
>> partitions with the root partitioned relation work better. Here's the
>> updated patch set.
>
> When I tried out patch 0001, it crashed repeatedly during 'make check'
> because of an assertion failure in get_partitioned_child_rels.

Running "make check" on the whole patchset doesn't give that failure,
so I didn't notice the crash since I was running the regression tests on
the whole patchset. Sorry for that. Fortunately git rebase -i allows us to
execute shell commands while applying patches, so I have set it up to
compile each patch and run the regression tests. Hopefully that will catch
such errors in the future. That process showed me that patch
0003-In-add_paths_to_append_rel-get-partitioned_rels-for-.patch fixes
that crash by calling get_partitioned_child_rels() only on the root
partitioned table for which we have set up child_rels list. Amit
Langote has a similar fix reported in his reply. So, we will discuss
it there.

> It
> seemed to me that the way the patch was refactoring
> expand_inherited_rtentry involved more code rearrangement than
> necessary, so I reverted all the code rearrangement and just kept the
> functional changes, and all the crashes went away. (That refactoring
> also failed to initialize has_child properly.) In so doing, I
> reintroduced the problem that the PartitionedChildRelInfo lists
> weren't getting set up correctly, but after some thought I realized
> that was just because expand_single_inheritance_child() was choosing
> between adding an RTE and adding the OID to partitioned_child_rels,
> whereas for an intermediate partitioned table it needs to do both. So
> I inserted a trivial fix for that problem (replacing "else" with a new
> "if"-test), basically:
>
> - else
> +
> + if (childrte->relkind == RELKIND_PARTITIONED_TABLE)
>
> Please check out the attached version of the patch. In addition to
> the above simplifications, I did some adjustments to the comments in
> various places - some just grammar and others a bit more substantive.
> And I think I broke a long line in one place, too.
>
> One thing I notice is that if I rip out the changes to initsplan.c,
> the new regression test still passes. If it's possible to write a
> test that fails without those changes, I think it would be a good idea
> to include one in the patch. That's certainly one of the subtler
> parts of this patch, IMHO.

Amit Langote has replied on these points as well. So, I will comment
in a reply to his reply.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>
Cc:	Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>, Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Partition-wise join for join between (declaratively) partitioned tables
Date:	2017-09-08 09:36:54
Message-ID:	CA+TgmoauESHTNwz8wgrKZoFpVCAkW-J5vAaMsS4iRrY5W_d4SQ@mail.gmail.com
Lists:	pgsql-hackers

On Fri, Sep 8, 2017 at 1:47 AM, Amit Langote
<Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp> wrote:
> When I tried the attached patch, it doesn't seem to expand partitioning
> inheritance in step-wise manner as the patch's title says. I think the
> rewritten patch forgot to include Ashutosh's changes to
> expand_single_inheritance_child() whereby the AppendRelInfo of the child
> will be marked with the direct parent instead of always the root parent.

Woops.

> I updated the patch to include just those changes. I'm not sure about
> one of the Ashutosh's changes whereby the child PlanRowMark is also passed
> to expand_partitioned_rtentry() to use as the parent PlanRowMark. I think
> the child RTE, child RT index and child Relation are fine, because they
> are necessary for creating AppendRelInfos in a desired way for later
> planning steps. But PlanRowMarks are not processed within the planner
> afterwards and do not need to be marked with the immediate parent-child
> association in the same way that AppendRelInfos need to be.

We probably need some better comments to explain which things need to
be marked using the immediate parent and which need to be marked using
the baserel, and why.

> I also included the changes to add_paths_to_append_rel() from my patch on
> the "path toward faster partition pruning" thread. We'd need that change,
> because while add_paths_to_append_rel() is called for all partitioned
> table RTEs in a given partition tree, expand_inherited_rtentry() would
> have set up a PartitionedChildRelInfo only for the root parent, so
> get_partitioned_child_rels() would not find the same for non-root
> partitioned table rels and crash failing the Assert. The changes I made
> are such that we call get_partitioned_child_rels() only for the parent
> rels that are known to correspond root partitioned tables (or as you
> pointed out on the thread, "the table named in the query" as opposed those
> added to the query as result of inheritance expansion). In addition to
> the relkind check on the input RTE, it would seem that checking that the
> reloptkind is RELOPT_BASEREL would be enough. But actually not, because
> if a partitioned table is accessed in a UNION ALL query, reloptkind even
> for the root partitioned table (the table named in the query) would be
> RELOPT_OTHER_MEMBER_REL. The only way to confirm that the input rel is
> actually the root partitioned table is to check whether its parent rel is
> not RTE_RELATION, because the parent in case of UNION ALL Append is a
> RTE_SUBQUERY RT entry.

OK, so this needs some good comments, too...

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
To:	Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>
Cc:	Robert Haas <robertmhaas(at)gmail(dot)com>, Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Partition-wise join for join between (declaratively) partitioned tables
Date:	2017-09-08 17:38:47
Message-ID:	CAFjFpRfHkJW3G=_PnSUc6PbXJE48AWYwyRzaGqtfKzzoU4wXXw@mail.gmail.com
Lists:	pgsql-hackers

On Fri, Sep 8, 2017 at 11:17 AM, Amit Langote
<Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp> wrote:
> On 2017/09/08 4:04, Robert Haas wrote:
>> On Tue, Sep 5, 2017 at 7:01 AM, Ashutosh Bapat
>> <ashutosh(dot)bapat(at)enterprisedb(dot)com> wrote:
>>> accumulate_append_subpath() is executed for every path instead of
>>> every relation, so changing it would collect the same list multiple
>>> times. Instead, I found the old way of associating all intermediate
>>> partitions with the root partitioned relation work better. Here's the
>>> updated patch set.
>>
>> When I tried out patch 0001, it crashed repeatedly during 'make check'
>> because of an assertion failure in get_partitioned_child_rels. It
>> seemed to me that the way the patch was refactoring
>> expand_inherited_rtentry involved more code rearrangement than
>> necessary, so I reverted all the code rearrangement and just kept the
>> functional changes, and all the crashes went away. (That refactoring
>> also failed to initialize has_child properly.)
>
> When I tried the attached patch, it doesn't seem to expand partitioning
> inheritance in step-wise manner as the patch's title says. I think the
> rewritten patch forgot to include Ashutosh's changes to
> expand_single_inheritance_child() whereby the AppendRelInfo of the child
> will be marked with the direct parent instead of always the root parent.

Right. If we apply 0002 from the partition-wise join patchset, which
changes build_simple_rel() to collect the direct children of a given
partitioned table, it introduces another crash because of an assertion
failure; for a partitioned table build_simple_rel() finds more
children than expected because indirect children are also counted as
direct children.
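
A small model in C shows why the count goes wrong. AppendLinkModel and
count_children are hypothetical; they keep only the parent and child RT
indexes of an AppendRelInfo and mimic a scan that treats every entry naming
a given parent as a direct child of that parent.

```c
#include <stddef.h>

/* Hypothetical stand-in for AppendRelInfo: parent and child RT indexes. */
typedef struct
{
    int parent_relid;
    int child_relid;
} AppendLinkModel;

/* Count the entries that name `parent` as their parent. */
size_t
count_children(const AppendLinkModel *links, size_t nlinks, int parent)
{
    size_t count = 0;

    for (size_t i = 0; i < nlinks; i++)
    {
        if (links[i].parent_relid == parent)
            count++;
    }
    return count;
}
```

Take a root table 1 with partitions 2 and 3, where 3 is itself partitioned
into 4 and 5. If each entry names the direct parent, the links are (1,2),
(1,3), (3,4), (3,5) and the root has two children. If every entry names the
root, the links are (1,2), (1,3), (1,4), (1,5) and the root appears to have
four.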

>
> I updated the patch to include just those changes. I'm not sure about
> one of the Ashutosh's changes whereby the child PlanRowMark is also passed
> to expand_partitioned_rtentry() to use as the parent PlanRowMark. I think
> the child RTE, child RT index and child Relation are fine, because they
> are necessary for creating AppendRelInfos in a desired way for later
> planning steps. But PlanRowMarks are not processed within the planner
> afterwards and do not need to be marked with the immediate parent-child
> association in the same way that AppendRelInfos need to be.

Passing the top parent's row mark works today, since there is no
parent-child specific translation happening there. But if we introduce
such a translation in the future, row marks for indirect children in a
multi-level partitioned hierarchy won't be accurate. So, I think it's
better to pass the row mark of the direct parent.

>
> I also included the changes to add_paths_to_append_rel() from my patch on
> the "path toward faster partition pruning" thread. We'd need that change,
> because while add_paths_to_append_rel() is called for all partitioned
> table RTEs in a given partition tree, expand_inherited_rtentry() would
> have set up a PartitionedChildRelInfo only for the root parent, so
> get_partitioned_child_rels() would not find the same for non-root
> partitioned table rels and crash failing the Assert. The changes I made
> are such that we call get_partitioned_child_rels() only for the parent
> rels that are known to correspond root partitioned tables (or as you
> pointed out on the thread, "the table named in the query" as opposed those
> added to the query as result of inheritance expansion). In addition to
> the relkind check on the input RTE, it would seem that checking that the
> reloptkind is RELOPT_BASEREL would be enough. But actually not, because
> if a partitioned table is accessed in a UNION ALL query, reloptkind even
> for the root partitioned table (the table named in the query) would be
> RELOPT_OTHER_MEMBER_REL. The only way to confirm that the input rel is
> actually the root partitioned table is to check whether its parent rel is
> not RTE_RELATION, because the parent in case of UNION ALL Append is a
> RTE_SUBQUERY RT entry.
>

There was a change in my 0003 patch which fixed the crash. It checked
for RELOPT_BASEREL and RELKIND_PARTITIONED_TABLE. I have pulled it into
my 0001 patch. It no longer crashes. I tried various queries involving
set operations and bare multi-level partitioned table scans with my
patch, but none of them showed any anomaly. Do you have a testcase
which shows a problem with my patch? Maybe your suggestion is correct,
but the corresponding code is slightly longer than I would expect. So,
we should go with it only if there is a corresponding testcase which
shows why it's needed.

In your patch
+ parent_rel = root->simple_rel_array[parent_relid];
+ get_pcinfo = (parent_rel->rtekind == RTE_SUBQUERY);
Do you mean RTE_RELATION as you explained above?

>> One thing I notice is that if I rip out the changes to initsplan.c,
>> the new regression test still passes. If it's possible to write a
>> test that fails without those changes, I think it would be a good idea
>> to include one in the patch. That's certainly one of the subtler
>> parts of this patch, IMHO.
>
> Back when this (step-wise expansion of partition inheritance) used to be a
> patch in the original declarative partitioning patch series, Ashutosh had
> reported a test query [1] that would fail getting a plan, for which we
> came up with the initsplan.c changes in this patch as the solution:
>
> ERROR: could not devise a query plan for the given query
>
> I tried that query again without the initsplan.c changes and somehow the
> same error does not occur anymore. It's strange because without the
> initsplan.c changes, there is no way for partitions lower in the tree than
> the first level to get the direct_lateral_relids and lateral_relids from
> the root parent rel. Maybe, Ashutosh has a way to devise the failing
> query again.
>

Thanks a lot for the reference. I devised a testcase by slightly
modifying my original test. I have included the test in the latest
patch set.

I have included Robert's changes to parts other than
expand_inherited_rtentry() in the patch.

>
> I also confirmed that the partition-pruning patch set works fine with this
> patch instead of the patch on that thread with the same functionality,
> which I will now drop from that patch set. Sorry about the wasted time.
>

Thanks a lot. Please review the patch in the updated patchset.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Attachment	Content-Type	Size
pg_dp_join_patches_v30.tar.gz	application/x-gzip	166.9 KB

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
Cc:	Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Partition-wise join for join between (declaratively) partitioned tables
Date:	2017-09-09 00:58:25
Message-ID:	CA+TgmoaD8WiqNCzsVuu88WstWL4dysckc9cX5SWd8yAb--a5qw@mail.gmail.com
Lists:	pgsql-hackers

On Fri, Sep 8, 2017 at 1:38 PM, Ashutosh Bapat
<ashutosh(dot)bapat(at)enterprisedb(dot)com> wrote:
>> I also confirmed that the partition-pruning patch set works fine with this
>> patch instead of the patch on that thread with the same functionality,
>> which I will now drop from that patch set. Sorry about the wasted time.
>
> Thanks a lot. Please review the patch in the updated patchset.

In set_append_rel_size(), I don't find the comment too clear (and this
part was taken from Amit's patch, right?). I suggest:

+ /*
+ * Associate the partitioned tables which are descendents of the table
+ * named in the query with the topmost append path (i.e. the one where
+ * rel->reloptkind is RELOPT_BASEREL). This ensures that they get properly
+ * locked at execution time.
+ */

I'm a bit suspicious about the fact that there are now executor
changes related to the PlanRowMarks. If the rowmark's prti is now the
intermediate parent's RT index rather than the top-parent's RT index,
it'd seem like that'd matter somehow. Maybe it doesn't, because the
code that cares about prti seems to only care about whether it's
different from rti. But if that's true everywhere, then why even
change this? I think we might be well off not to tinker with things
that don't need to be changed.
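
As a hedged illustration of that observation, here is a model in C.
RowMarkModel and is_child_rowmark are hypothetical; only the field names
rti and prti follow PlanRowMark. If callers only ever compare the two
fields, it makes no difference which ancestor prti names.

```c
#include <stdbool.h>

/* Hypothetical model of PlanRowMark: only the two RT indexes. */
typedef struct
{
    unsigned int rti;   /* RT index of the relation being marked */
    unsigned int prti;  /* RT index of its parent, or rti if none */
} RowMarkModel;

/* The only question asked of prti in this model: is it different from rti? */
bool
is_child_rowmark(const RowMarkModel *rc)
{
    return rc->rti != rc->prti;
}
```

For a leaf partition with rti 4, the result is the same whether prti is 1
(the top parent) or 3 (an intermediate parent).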

Apart from that concern, now that I understand (from my own failed
attempt and some off-list discussion) why this patch works the way it
does, I think this is in fairly good shape.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>
To:	Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
Cc:	Robert Haas <robertmhaas(at)gmail(dot)com>, Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Partition-wise join for join between (declaratively) partitioned tables
Date:	2017-09-11 06:46:59
Message-ID:	05a4e009-3ea6-b294-4668-be745e2ca836@lab.ntt.co.jp
Lists:	pgsql-hackers

On 2017/09/09 2:38, Ashutosh Bapat wrote:
> On Fri, Sep 8, 2017 at 11:17 AM, Amit Langote wrote:
>> I updated the patch to include just those changes. I'm not sure about
>> one of the Ashutosh's changes whereby the child PlanRowMark is also passed
>> to expand_partitioned_rtentry() to use as the parent PlanRowMark. I think
>> the child RTE, child RT index and child Relation are fine, because they
>> are necessary for creating AppendRelInfos in a desired way for later
>> planning steps. But PlanRowMarks are not processed within the planner
>> afterwards and do not need to be marked with the immediate parent-child
>> association in the same way that AppendRelInfos need to be.
>
> Passing top parent's row mark works today, since there is no
> parent-child specific translation happening there. But if in future,
> we introduce such a translation, row marks for indirect children in a
> multi-level partitioned hierarchy won't be accurate. So, I think it's
> better to pass row marks of the direct parent.

IMHO, we should make it the responsibility of the future patch to set a
child PlanRowMark's prti to the direct parent's RT index, when we actually
know that it's needed for something. We clearly know today why we need to
pass the other objects like child RT entry, RT index, and Relation, so we
should limit this patch to pass only those objects to the recursive call.
That makes this patch a relatively easy to understand change.

>> I also included the changes to add_paths_to_append_rel() from my patch on
>> the "path toward faster partition pruning" thread. We'd need that change,
>> because while add_paths_to_append_rel() is called for all partitioned
>> table RTEs in a given partition tree, expand_inherited_rtentry() would
>> have set up a PartitionedChildRelInfo only for the root parent, so
>> get_partitioned_child_rels() would not find the same for non-root
>> partitioned table rels and crash failing the Assert. The changes I made
>> are such that we call get_partitioned_child_rels() only for the parent
>> rels that are known to correspond root partitioned tables (or as you
>> pointed out on the thread, "the table named in the query" as opposed those
>> added to the query as result of inheritance expansion). In addition to
>> the relkind check on the input RTE, it would seem that checking that the
>> reloptkind is RELOPT_BASEREL would be enough. But actually not, because
>> if a partitioned table is accessed in a UNION ALL query, reloptkind even
>> for the root partitioned table (the table named in the query) would be
>> RELOPT_OTHER_MEMBER_REL. The only way to confirm that the input rel is
>> actually the root partitioned table is to check whether its parent rel is
>> not RTE_RELATION, because the parent in case of UNION ALL Append is a
>> RTE_SUBQUERY RT entry.
>>
>
> There was a change in my 0003 patch, which fixed the crash. It checked
> for RELOPT_BASEREL and RELKIND_PARTITIONED_TABLE. I have pulled it in
> my 0001 patch. It no more crashes. I tried various queries involving
> set operations and bare multi-level partitioned table scan with my
> patch, but none of them showed any anomaly. Do you have a testcase
> which shows problem with my patch? May be your suggestion is correct,
> but corresponding code implementation is slightly longer than I would
> expect. So, we should go with it, if there is corresponding testcase
> which shows why it's needed.

If we go with your patch, partitioned tables won't get locked, for
example, in case of the following query (p is a partitioned table):

select 1 from p union all select 2 from p;

That's because the RelOptInfos for the two instances of p in the above
query are RELOPT_OTHER_MEMBER_REL, not RELOPT_BASEREL. They are children
of the Append corresponding to the UNION ALL subquery RTE. So,
partitioned_rels does not get set per your proposed code.

>
> In your patch
>
> + parent_rel = root->simple_rel_array[parent_relid];
> + get_pcinfo = (parent_rel->rtekind == RTE_SUBQUERY);
>
> Do you mean RTE_RELATION as you explained above?

No, I mean RTE_SUBQUERY.

If the partitioned table RTE in question corresponds to one named in the
query, we should be able to find its pcinfo in root->pcinfo_list. If the
partitioned table RTE is one added as result of inheritance expansion, it
won't have an associated pcinfo. So, we should find a way to distinguish
them from one another. The first idea that had occurred to me was the
same as yours -- RelOptInfo of the partitioned table RTE named in the
query would be of reloptkind RELOPT_BASEREL and those of the partitioned
table RTE added as result of inheritance expansion will be of reloptkind
RELOPT_OTHER_MEMBER_REL. Although the latter is always true, the former
is not. If the partitioned table named in the query appears under UNION
ALL query, then its reloptkind will be RELOPT_OTHER_MEMBER_REL. That
means we have to use some other means to distinguish partitioned table
RTEs that have an associated pcinfo from those that don't. So, I devised
this method of looking at the parent RTE (if any) for distinguishing the
two. A partitioned table named in the query either has no parent or, if
it has one, the parent can only ever be a UNION ALL subquery
(RTE_SUBQUERY). Partitioned tables added as part of inheritance expansion
always have a parent, and that parent can only ever be a table
(RTE_RELATION).
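
The rule described above can be sketched in isolation as follows. This is only an illustration under stated assumptions: SketchRel and has_pcinfo() are hypothetical stand-ins, not the planner's RelOptInfo or any function in the patch, and only the enum value names mirror PostgreSQL's.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for the planner's RTE kinds (value names mirror PostgreSQL's). */
typedef enum { RTE_RELATION, RTE_SUBQUERY } RTEKind;

/* Hypothetical, pared-down rel: only the fields the check needs. */
typedef struct SketchRel
{
    RTEKind rtekind;                 /* kind of this rel's range table entry */
    bool    is_partitioned;          /* stands in for RELKIND_PARTITIONED_TABLE */
    const struct SketchRel *parent;  /* NULL when the rel has no parent */
} SketchRel;

/*
 * Returns true when the rel is a partitioned table named in the query and
 * should therefore have a pcinfo: either it has no parent, or its parent is
 * a UNION ALL subquery rather than another table.
 */
bool
has_pcinfo(const SketchRel *rel)
{
    assert(rel != NULL);

    if (rel->rtekind != RTE_RELATION || !rel->is_partitioned)
        return false;
    if (rel->parent == NULL)
        return true;
    return rel->parent->rtekind == RTE_SUBQUERY;
}
```

Note that the sketch never consults reloptkind, which is the point of the argument: a table named under UNION ALL and a partition added by expansion are both other member rels, and only the parent's kind tells them apart.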

>>> One thing I notice is that if I rip out the changes to initsplan.c,
>>> the new regression test still passes. If it's possible to write a
>>> test that fails without those changes, I think it would be a good idea
>>> to include one in the patch. That's certainly one of the subtler
>>> parts of this patch, IMHO.
>>
>> Back when this (step-wise expansion of partition inheritance) used to be a
>> patch in the original declarative partitioning patch series, Ashutosh had
>> reported a test query [1] that would fail getting a plan, for which we
>> came up with the initsplan.c changes in this patch as the solution:
>>
>> ERROR: could not devise a query plan for the given query
>>
>> I tried that query again without the initsplan.c changes and somehow the
>> same error does not occur anymore. It's strange because without the
>> initsplan.c changes, there is no way for partitions lower in the tree than
>> the first level to get the direct_lateral_relids and lateral_relids from
>> the root parent rel. Maybe, Ashutosh has a way to devise the failing
>> query again.
>
> Thanks a lot for the reference. I devised a testcase slightly
> modifying my original test. I have included the test in the latest
> patch set.
>
> I have included Robert's changes to parts other than
> expand_inherited_rtentry() in the patch.
>
>> I also confirmed that the partition-pruning patch set works fine with this
>> patch instead of the patch on that thread with the same functionality,
>> which I will now drop from that patch set. Sorry about the wasted time.
>>
>
> Thanks a lot. Please review the patch in the updated patchset.

Some comments:

In create_lateral_join_info():

+ Assert(IS_SIMPLE_REL(brel));
+ Assert(brte);

The second Assert is either unnecessary or should be placed first.

The following comment could be made a bit clearer.

+ * In the case of table inheritance, the parent RTE is directly linked
+ * to every child table via an AppendRelInfo. In the case of table
+ * partitioning, the inheritance hierarchy is expanded one level at a
+ * time rather than flattened. Therefore, an other member rel that is
+ * a partitioned table may have children of its own, and must
+ * therefore be marked with the appropriate lateral info so that those
+ * children eventually get marked also.

How about: In the case of partitioned table inheritance, the original
parent RTE is linked, via AppendRelInfo, only to its immediate partitions.
Partitions below the first level are accessible only via their immediate
parent's RelOptInfo, which would be of kind RELOPT_OTHER_MEMBER_REL, so
consider those as well.
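
To illustrate why intermediate partitioned tables must carry the lateral info themselves, here is a minimal sketch of level-by-level propagation. It is not the code in create_lateral_join_info(): SketchLateralRel and sketch_propagate_lateral() are hypothetical, a plain bitmask stands in for a Relids set, and the sketch assumes that children always come after their parents in the array.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, pared-down rel for this sketch only. */
typedef struct SketchLateralRel
{
    int      parent;          /* array index of the immediate parent, or -1 */
    unsigned lateral_relids;  /* bitmask standing in for a Relids set */
} SketchLateralRel;

/*
 * Copy each rel's lateral info from its immediate parent. Because the sketch
 * assumes parents precede their children in the array, a single forward pass
 * reaches every level: by the time a grandchild is visited, its parent has
 * already inherited the root's info.
 */
void
sketch_propagate_lateral(SketchLateralRel *rels, size_t nrels)
{
    size_t i;

    assert(rels != NULL || nrels == 0);

    for (i = 0; i < nrels; i++)
    {
        int parent = rels[i].parent;

        if (parent < 0)
            continue;           /* no parent: nothing to inherit */

        /* Ordering assumption of this sketch: the parent was visited first. */
        assert((size_t) parent < i);
        rels[i].lateral_relids |= rels[parent].lateral_relids;
    }
}
```

If the intermediate level were skipped, the grandchild would have nothing to copy from, which is the failure mode the comment is trying to describe.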

In expand_inherited_rtentry(), the following comment fragment is obsolete,
because we *do* now create AppendRelInfo's for partitioned children:

+ /*
+ * We keep a list of objects in root, each of which maps a partitioned
+ * parent RT index to the list of RT indexes of its partitioned child
+ * tables which do not have AppendRelInfos associated with those.

By the way, when we call expand_single_inheritance_child() in the
non-partitioned inheritance case, we should pass NULL for childrte_p,
childRTindex_p, childrc_p, instead of declaring variables that won't be
used. Hence, expand_single_inheritance_child() should make those
arguments optional.

+
+ /*
+ * If the partitioned table has no partitions or all the partitions are
+ * temporary tables from other backends, treat this as non-inheritance
+ * case.
+ */
+ if (!has_child)
+ parentrte->inh = false;

I guess the above applies to all partitioned tables in the tree, so, I
think we should update the comment in set_rel_size():

    else if (rte->relkind == RELKIND_PARTITIONED_TABLE)
    {
        /*
         * A partitioned table without leaf partitions is marked
         * as a dummy rel.
         */
        set_dummy_rel_pathlist(rel);
    }

to say: a partitioned table without partitions is marked as a dummy rel.

Thanks,
Amit

From:	Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>, Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
Cc:	Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Partition-wise join for join between (declaratively) partitioned tables
Date:	2017-09-11 07:04:05
Message-ID:	16c95b91-0fe4-6279-05a4-f97885cdcf41@lab.ntt.co.jp
Lists:	pgsql-hackers

On 2017/09/09 9:58, Robert Haas wrote:
> I'm a bit suspicious about the fact that there are now executor
> changes related to the PlanRowMarks. If the rowmark's prti is now the
> intermediate parent's RT index rather than the top-parent's RT index,
> it'd seem like that'd matter somehow. Maybe it doesn't, because the
> code that cares about prti seems to only care about whether it's
> different from rti.

Yes, it doesn't matter. The important point though is that nothing we
want to do in the short term requires us to set a child PlanRowMark's prti
to its immediate parent's RT index, as I also mentioned in reply to Ashutosh.

> But if that's true everywhere, then why even
> change this? I think we might be well off not to tinker with things
> that don't need to be changed.

+1.

> Apart from that concern, now that I understand (from my own failed
> attempt and some off-list discussion) why this patch works the way it
> does, I think this is in fairly good shape.

I think so too, except that we still need to incorporate the changes to
add_paths_to_append_rel() necessary to correctly set partitioned_rels, as
I explained in my reply to Ashutosh.

Thanks,
Amit

From:	Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Partition-wise join for join between (declaratively) partitioned tables
Date:	2017-09-11 07:23:09
Message-ID:	CAFjFpRfJesPswN_auC_7bnNVRG9VPb=eo5D5noFAs4e=YF8Ycw@mail.gmail.com
Lists:	pgsql-hackers

On Sat, Sep 9, 2017 at 6:28 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Fri, Sep 8, 2017 at 1:38 PM, Ashutosh Bapat
> <ashutosh(dot)bapat(at)enterprisedb(dot)com> wrote:
>>> I also confirmed that the partition-pruning patch set works fine with this
>>> patch instead of the patch on that thread with the same functionality,
>>> which I will now drop from that patch set. Sorry about the wasted time.
>>
>> Thanks a lot. Please review the patch in the updated patchset.
>
> In set_append_rel_size(), I don't find the comment too clear (and this
> part was taken from Amit's patch, right?). I suggest:

No, I didn't take it from Amit's patch. We each had different wording,
but yours is better than both. I have included it in the
attached patches.

>
> + /*
> + * Associate the partitioned tables which are descendents of the table
> + * named in the query with the topmost append path (i.e. the one where
> + * rel->reloptkind is RELOPT_BASEREL). This ensures that they get properly
> + * locked at execution time.
> + */
>
> I'm a bit suspicious about the fact that there are now executor
> changes related to the PlanRowMarks. If the rowmark's prti is now the
> intermediate parent's RT index rather than the top-parent's RT index,
> it'd seem like that'd matter somehow. Maybe it doesn't, because the
> code that cares about prti seems to only care about whether it's
> different from rti. But if that's true everywhere, then why even
> change this? I think we might be well off not to tinker with things
> that don't need to be changed.

In the definition of ExecRowMark, I see
    Index prti; /* parent range table index, if child */
It just says "parent", which I take to mean the direct parent. For
inheritance, which previously flattened the inheritance hierarchy, the
direct parent was the top parent, so probably nobody considered whether a
parent is the direct parent or the top parent. But now that we have
introduced that distinction, we need to interpret this comment anew, and I
think interpreting it as the direct parent is non-lossy. If we set the top
parent's index, the parent RTI in AppendRelInfo and PlanRowMark would not
agree. So it looks quite natural to set the direct parent's index in
PlanRowMark. From that POV, we aren't changing anything; we are
setting the same parent RTI in AppendRelInfo and PlanRowMark. Setting
different parent RTIs in those two structures would be the real change.
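
The two positions can be made concrete with a small sketch. It is illustrative only: SketchRowMark and both helpers are hypothetical, not executor code, and the RT indexes used in the usage note (1 for the root, 2 for an intermediate partition, 3 for a leaf) are made up.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef unsigned Index;     /* RT index */

/* Hypothetical, pared-down row mark: just the two indexes under discussion. */
typedef struct SketchRowMark
{
    Index rti;      /* this relation's RT index */
    Index prti;     /* parent range table index, if child */
} SketchRowMark;

/* The kind of check Robert describes: only "prti differs from rti" matters. */
bool
rowmark_is_child(const SketchRowMark *rm)
{
    assert(rm != NULL);
    return rm->prti != rm->rti;
}

/* True when the row mark agrees with an AppendRelInfo-style parent index. */
bool
rowmark_matches_appinfo_parent(const SketchRowMark *rm, Index appinfo_parent_relid)
{
    assert(rm != NULL);
    return rm->prti == appinfo_parent_relid;
}
```

For a leaf at index 3, rowmark_is_child() is true whether prti is 2 (direct parent) or 1 (top parent), which is why the choice does not matter to such checks. Only the direct-parent choice also satisfies rowmark_matches_appinfo_parent() against a parent index of 2.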

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

From:	Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>
To:	Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Partition-wise join for join between (declaratively) partitioned tables
Date:	2017-09-11 08:46:57
Message-ID:	42b0831d-5daa-8aa9-0075-065b21582118@lab.ntt.co.jp
Lists:	pgsql-hackers

On 2017/09/11 16:23, Ashutosh Bapat wrote:
> On Sat, Sep 9, 2017 at 6:28 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>> I'm a bit suspicious about the fact that there are now executor
>> changes related to the PlanRowMarks. If the rowmark's prti is now the
>> intermediate parent's RT index rather than the top-parent's RT index,
>> it'd seem like that'd matter somehow. Maybe it doesn't, because the
>> code that cares about prti seems to only care about whether it's
>> different from rti. But if that's true everywhere, then why even
>> change this? I think we might be well off not to tinker with things
>> that don't need to be changed.
>
> In the definition of ExecRowMark, I see
> Index prti; /* parent range table index, if child */
> It just says parent, by which I take as direct parent. For
> inheritance, which earlier flattened inheritance hierarchy, direct
> parent was top parent. So, probably nobody thought whether a parent is
> direct parent or top parent. But now that we have introduced that
> concept we need to interpret this comment anew. And I think
> interpreting it as direct parent is non-lossy.

But then we also don't have anything to say about why we're making that
change. If you could describe what non-lossy is in this context, then
fine. But that we'd like to match with what we're going to do for
AppendRelInfos does not seem to be a sufficient explanation for this change.

> If we set top parent's
> index, parent RTI in AppendRelInfo and PlanRowMark would not agree.
> So, it looks quite natural that we set the direct parent's index in
> PlanRowMark.

They would not agree, yes, but aren't they unrelated? If we have a reason
for them to agree, (for example, row-locking breaks in the inherited table
case if we didn't), then we should definitely make them agree.

Updating the comment for prti definition might be something that this
patch could (should?) do, but I'm not quite sure about that either.

Thanks,
Amit

From:	Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
To:	Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>
Cc:	Robert Haas <robertmhaas(at)gmail(dot)com>, Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Partition-wise join for join between (declaratively) partitioned tables
Date:	2017-09-11 10:45:44
Message-ID:	CAFjFpRfJ3GRRmmOugaMA-q4i=se5P6yjZ_C6A6HDRDQQTGXy1A@mail.gmail.com
Lists:	pgsql-hackers

On Mon, Sep 11, 2017 at 12:16 PM, Amit Langote
<Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp> wrote:
> On 2017/09/09 2:38, Ashutosh Bapat wrote:
>> On Fri, Sep 8, 2017 at 11:17 AM, Amit Langote wrote:
>>> I updated the patch to include just those changes. I'm not sure about
>>> one of Ashutosh's changes whereby the child PlanRowMark is also passed
>>> to expand_partitioned_rtentry() to use as the parent PlanRowMark. I think
>>> the child RTE, child RT index and child Relation are fine, because they
>>> are necessary for creating AppendRelInfos in a desired way for later
>>> planning steps. But PlanRowMarks are not processed within the planner
>>> afterwards and do not need to be marked with the immediate parent-child
>>> association in the same way that AppendRelInfos need to be.
>>
>> Passing top parent's row mark works today, since there is no
>> parent-child specific translation happening there. But if in future,
>> we introduce such a translation, row marks for indirect children in a
>> multi-level partitioned hierarchy won't be accurate. So, I think it's
>> better to pass row marks of the direct parent.
>
> IMHO, we should make it the responsibility of the future patch to set a
> child PlanRowMark's prti to the direct parent's RT index, when we actually
> know that it's needed for something. We clearly know today why we need to
> pass the other objects like child RT entry, RT index, and Relation, so we
> should limit this patch to pass only those objects to the recursive call.
> That makes this patch a relatively easy to understand change.

I think you are mixing two issues here: (1) setting the parent RTI in the
child PlanRowMark, and (2) passing the immediate parent's PlanRowMark to
expand_single_inheritance_child().

I have discussed (1) in my reply to Robert.

About (2), you haven't commented on my reply. To me
it looks like it's this patch that introduces the notion of
multi-level expansion, so it's natural for this patch to pass the
PlanRowMark in cascaded fashion, similar to other structures.

>
>>> I also included the changes to add_paths_to_append_rel() from my patch on
>>> the "path toward faster partition pruning" thread. We'd need that change,
>>> because while add_paths_to_append_rel() is called for all partitioned
>>> table RTEs in a given partition tree, expand_inherited_rtentry() would
>>> have set up a PartitionedChildRelInfo only for the root parent, so
>>> get_partitioned_child_rels() would not find one for non-root
>>> partitioned table rels and would crash on the failed Assert. The changes
>>> I made are such that we call get_partitioned_child_rels() only for the
>>> parent rels that are known to correspond to root partitioned tables (or,
>>> as you pointed out on the thread, "the table named in the query" as
>>> opposed to those added to the query as a result of inheritance
>>> expansion). In addition to
>>> the relkind check on the input RTE, it would seem that checking that the
>>> reloptkind is RELOPT_BASEREL would be enough. But actually not, because
>>> if a partitioned table is accessed in a UNION ALL query, reloptkind even
>>> for the root partitioned table (the table named in the query) would be
>>> RELOPT_OTHER_MEMBER_REL. The only way to confirm that the input rel is
>>> actually the root partitioned table is to check whether its parent rel is
>>> not RTE_RELATION, because the parent in case of UNION ALL Append is a
>>> RTE_SUBQUERY RT entry.
>>>
>>
>> There was a change in my 0003 patch, which fixed the crash. It checked
>> for RELOPT_BASEREL and RELKIND_PARTITIONED_TABLE. I have pulled it in
>> my 0001 patch. It no longer crashes. I tried various queries involving
>> set operations and bare multi-level partitioned table scans with my
>> patch, but none of them showed any anomaly. Do you have a test case
>> that shows a problem with my patch? Maybe your suggestion is correct,
>> but the corresponding code is slightly longer than I would expect. So
>> we should go with it only if there is a corresponding test case that
>> shows why it's needed.
>
> If we go with your patch, partitioned tables won't get locked, for
> example, in case of the following query (p is a partitioned table):
>
> select 1 from p union all select 2 from p;
>
> That's because the RelOptInfos for the two instances of p in the above
> query are RELOPT_OTHER_MEMBER_REL, not RELOPT_BASEREL. They are children
> of the Append corresponding to the UNION ALL subquery RTE. So,
> partitioned_rels does not get set per your proposed code.

Session 1:
postgres=# begin;
BEGIN
postgres=# select 1 from t1 union all select 2 from t1;
 ?column?
----------
(0 rows)

postgres=# select pg_backend_pid();
 pg_backend_pid
----------------
          28843
(1 row)

Session 2:
postgres=# select locktype, relation::regclass, virtualxid,
virtualtransaction, pid, mode, granted, fastpath from pg_locks;
  locktype  | relation | virtualxid | virtualtransaction |  pid  |      mode       | granted | fastpath
------------+----------+------------+--------------------+-------+-----------------+---------+----------
 relation   | pg_locks |            | 4/14               | 28854 | AccessShareLock | t       | t
 virtualxid |          | 4/14       | 4/14               | 28854 | ExclusiveLock   | t       | t
 relation   | t1p1p1   |            | 3/9                | 28843 | AccessShareLock | t       | t
 relation   | t1p1     |            | 3/9                | 28843 | AccessShareLock | t       | t
 relation   | t1       |            | 3/9                | 28843 | AccessShareLock | t       | t
 virtualxid |          | 3/9        | 3/9                | 28843 | ExclusiveLock   | t       | t
(6 rows)

So, all partitioned partitions are getting locked correctly. Am I
missing something?

>
>>
>> In your patch
>>
>> + parent_rel = root->simple_rel_array[parent_relid];
>> + get_pcinfo = (parent_rel->rtekind == RTE_SUBQUERY);
>>
>> Do you mean RTE_RELATION as you explained above?
>
> No, I mean RTE_SUBQUERY.
>
> If the partitioned table RTE in question corresponds to one named in the
> query, we should be able to find its pcinfo in root->pcinfo_list. If the
> partitioned table RTE is one added as result of inheritance expansion, it
> won't have an associated pcinfo. So, we should find a way to distinguish
> them from one another. The first idea that had occurred to me was the
> same as yours -- RelOptInfo of the partitioned table RTE named in the
> query would be of reloptkind RELOPT_BASEREL and those of the partitioned
> table RTE added as result of inheritance expansion will be of reloptkind
> RELOPT_OTHER_MEMBER_REL. Although the latter is always true, the former
> is not. If the partitioned table named in the query appears under UNION
> ALL query, then its reloptkind will be RELOPT_OTHER_MEMBER_REL. That
> means we have to use some other means to distinguish partitioned table
> RTEs that have an associated pcinfo from those that don't. So, I devised
> this method of looking at the parent RTE (if any) for distinguishing the
> two. A partitioned table named in the query either has no parent or, if
> it has one, the parent can only ever be a UNION ALL subquery
> (RTE_SUBQUERY). Partitioned tables added as part of inheritance expansion
> always have a parent, and that parent can only ever be a table
> (RTE_RELATION).
>

Actually, the original problem that started this discussion was an
assertion failure in get_partitioned_child_rels():
Assert(list_length(result) >= 1);

This assertion fails because result is NIL when an intermediate partitioned
table is passed. Maybe we should assert (result == NIL ||
list_length(result) == 1) and allow that function to be called even
for intermediate partitioned partitions, for which the function will
return NIL. That would keep the code in add_paths_to_append_rel()
simple. Thoughts?
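
A minimal sketch of such a relaxed lookup follows. It uses simplified stand-in types rather than the real planner structures: SketchPcInfo and sketch_get_partitioned_child_rels() are hypothetical, and an array with a count replaces the List. One further assumption: for the non-empty case the sketch keeps the existing ">= 1" bound rather than the "== 1" written above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for PartitionedChildRelInfo: one entry per root parent. */
typedef struct SketchPcInfo
{
    unsigned        parent_relid;   /* RT index of the root partitioned table */
    const unsigned *child_rels;     /* RT indexes of partitioned tables in its tree */
    size_t          nchild_rels;    /* number of entries in child_rels */
} SketchPcInfo;

/*
 * Look up the partitioned child rels recorded for rti. On a match, returns
 * the array and stores its length in *nfound. When nothing is recorded for
 * rti (for example, an intermediate partitioned table), returns NULL with
 * *nfound set to 0 instead of failing an assertion.
 */
const unsigned *
sketch_get_partitioned_child_rels(const SketchPcInfo *pcinfos, size_t npcinfos,
                                  unsigned rti, size_t *nfound)
{
    const unsigned *result = NULL;
    size_t i;

    assert(nfound != NULL);
    *nfound = 0;

    for (i = 0; i < npcinfos; i++)
    {
        if (pcinfos[i].parent_relid == rti)
        {
            result = pcinfos[i].child_rels;
            *nfound = pcinfos[i].nchild_rels;
            break;
        }
    }

    /* Relaxed check: either nothing was recorded, or the list is non-empty. */
    assert(result == NULL || *nfound >= 1);

    return result;
}
```

With this shape, a caller can invoke the lookup for every partitioned rel and simply treat an empty result as "nothing to record".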

>
> In create_lateral_join_info():
>
> + Assert(IS_SIMPLE_REL(brel));
> + Assert(brte);
>
> The second Assert is either unnecessary or should be placed first.

simple_rte_array[] may have some NULL entries. Second assert makes
sure that we aren't dealing with a NULL entry. Any particular reason
to reorder the asserts?

>
> The following comment could be made a bit clearer.
>
> + * In the case of table inheritance, the parent RTE is directly linked
> + * to every child table via an AppendRelInfo. In the case of table
> + * partitioning, the inheritance hierarchy is expanded one level at a
> + * time rather than flattened. Therefore, an other member rel that is
> + * a partitioned table may have children of its own, and must
> + * therefore be marked with the appropriate lateral info so that those
> + * children eventually get marked also.
>
> How about: In the case of partitioned table inheritance, the original
> parent RTE is linked, via AppendRelInfo, only to its immediate partitions.
> Partitions below the first level are accessible only via their immediate
> parent's RelOptInfo, which would be of kind RELOPT_OTHER_MEMBER_REL, so
> consider those as well.

I don't see much difference between those two. We usually do not use
macro names in comments, so comments usually say "other member rel".
Let's leave this for the committer to judge.

>
> In expand_inherited_rtentry(), the following comment fragment is obsolete,
> because we *do* now create AppendRelInfo's for partitioned children:
>
> + /*
> + * We keep a list of objects in root, each of which maps a partitioned
> + * parent RT index to the list of RT indexes of its partitioned child
> + * tables which do not have AppendRelInfos associated with those.

Good catch. I have reworded it as
/*
* We keep a list of objects in root, each of which maps a root
* partitioned parent RT index to the list of RT indexes of descendant
* partitioned child tables.

Does that look good?

>
>
> By the way, when we call expand_single_inheritance_child() in the
> non-partitioned inheritance case, we should pass NULL for childrte_p,
> childRTindex_p, childrc_p, instead of declaring variables that won't be
> used. Hence, expand_single_inheritance_child() should make those
> arguments optional.

That introduces an extra "if" condition, which is costlier than an
assignment. Both styles are used in the code, and I have previously
received the opposite feedback. So, I am not sure.
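
For reference, the optional-argument style under discussion looks roughly like this. The helper below is hypothetical and much smaller than expand_single_inheritance_child(); it only shows the NULL check that the extra "if" refers to.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned Index;     /* RT index */

/*
 * Hypothetical helper: assigns the next free RT index to a new child and
 * reports it through child_rti_p only when the caller asked for it.
 */
void
sketch_add_child(Index *next_rti, Index *child_rti_p)
{
    Index child_rti;

    assert(next_rti != NULL);
    child_rti = (*next_rti)++;

    /* Optional output: callers that don't need the index pass NULL. */
    if (child_rti_p != NULL)
        *child_rti_p = child_rti;
}
```

A caller on the partitioned path would pass the address of a local variable, while a caller on the plain inheritance path would pass NULL and skip declaring the variable. The cost is one pointer comparison per call.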

>
> +
> + /*
> + * If the partitioned table has no partitions or all the partitions are
> + * temporary tables from other backends, treat this as non-inheritance
> + * case.
> + */
> + if (!has_child)
> + parentrte->inh = false;
>
> I guess the above applies to all partitioned tables in the tree, so, I
> think we should update the comment in set_rel_size():
>
>     else if (rte->relkind == RELKIND_PARTITIONED_TABLE)
>     {
>         /*
>          * A partitioned table without leaf partitions is marked
>          * as a dummy rel.
>          */
>         set_dummy_rel_pathlist(rel);
>     }
>
> to say: a partitioned table without partitions is marked as a dummy rel.

Done. Thanks again for the catch.

I will update the patches once we have some resolution on two points:
(1) prti in PlanRowMarks and (2) detection of the root partitioned table
in add_paths_to_append_rel().

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

From:	Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
To:	Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>
Cc:	Robert Haas <robertmhaas(at)gmail(dot)com>, Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Partition-wise join for join between (declaratively) partitioned tables
Date:	2017-09-11 10:51:37
Message-ID:	CAFjFpRfAHDCrzr85FA93dmBeUOSsks=LC8dmJH7DJhigNDfs1g@mail.gmail.com
Lists:	pgsql-hackers

On Mon, Sep 11, 2017 at 2:16 PM, Amit Langote
<Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp> wrote:
> On 2017/09/11 16:23, Ashutosh Bapat wrote:
>> On Sat, Sep 9, 2017 at 6:28 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>>> I'm a bit suspicious about the fact that there are now executor
>>> changes related to the PlanRowMarks. If the rowmark's prti is now the
>>> intermediate parent's RT index rather than the top-parent's RT index,
>>> it'd seem like that'd matter somehow. Maybe it doesn't, because the
>>> code that cares about prti seems to only care about whether it's
>>> different from rti. But if that's true everywhere, then why even
>>> change this? I think we might be well off not to tinker with things
>>> that don't need to be changed.
>>
>> In the definition of ExecRowMark, I see
>> Index prti; /* parent range table index, if child */
>> It just says parent, by which I take as direct parent. For
>> inheritance, which earlier flattened inheritance hierarchy, direct
>> parent was top parent. So, probably nobody thought whether a parent is
>> direct parent or top parent. But now that we have introduced that
>> concept we need to interpret this comment anew. And I think
>> interpreting it as direct parent is non-lossy.
>
> But then we also don't have anything to say about why we're making that
> change. If you could describe what non-lossy is in this context, then
> fine.

By setting prti to the topmost parent RTI we are losing the information
that this child may be an intermediate child, similar to what we did
earlier to AppendRelInfo. That's the lossiness in this context.

> But that we'd like to match with what we're going to do for
> AppendRelInfos does not seem to be a sufficient explanation for this change.

The purpose of this patch is to change the parent-child linkages for
partitioned tables, and prti is one of them. So, in fact, I am wondering
why we would not change it along with AppendRelInfo.

>
>> If we set top parent's
>> index, parent RTI in AppendRelInfo and PlanRowMark would not agree.
>> So, it looks quite natural that we set the direct parent's index in
>> PlanRowMark.
>
> They would not agree, yes, but aren't they unrelated? If we have a reason
> for them to agree, (for example, row-locking breaks in the inherited table
> case if we didn't), then we should definitely make them agree.
>
> Updating the comment for prti definition might be something that this
> patch could (should?) do, but I'm not quite sure about that too.
>

To me that looks backwards again for the reasons described above.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
Cc:	Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Partition-wise join for join between (declaratively) partitioned tables
Date:	2017-09-11 11:49:11
Message-ID:	CA+TgmoZzjE98qXaKyeQgPjtugXdyKs2+Atwki2e2V9ZD+J2T4Q@mail.gmail.com
Lists:	pgsql-hackers

On Mon, Sep 11, 2017 at 6:45 AM, Ashutosh Bapat
<ashutosh(dot)bapat(at)enterprisedb(dot)com> wrote:
> So, all partitioned partitions are getting locked correctly. Am I
> missing something?

That's not a valid test. In that scenario, you're going to hold all
the locks acquired by the planner, all the locks acquired by the
rewriter, and all the locks acquired by the executor, but when using
prepared queries, it's possible to execute the plan after the planner
and rewriter locks are no longer held.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Partition-wise join for join between (declaratively) partitioned tables
Date:	2017-09-11 12:07:13
Message-ID:	CAFjFpRfhtX60bvGukg7vS9jf+uHqw7uwAuZEkcVESQf68tgg9g@mail.gmail.com
Lists:	pgsql-hackers

On Mon, Sep 11, 2017 at 5:19 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Mon, Sep 11, 2017 at 6:45 AM, Ashutosh Bapat
> <ashutosh(dot)bapat(at)enterprisedb(dot)com> wrote:
>> So, all partitioned partitions are getting locked correctly. Am I
>> missing something?
>
> That's not a valid test. In that scenario, you're going to hold all
> the locks acquired by the planner, all the locks acquired by the
> rewriter, and all the locks acquired by the executor, but when using
> prepared queries, it's possible to execute the plan after the planner
> and rewriter locks are no longer held.
>

I see the same thing when I use prepare and execute:

Session 1:
postgres=# prepare stmt as select 1 from t1 union all select 2 from t1;
PREPARE
postgres=# select pg_backend_pid();
 pg_backend_pid
----------------
          50912
(1 row)

postgres=# begin;
BEGIN
postgres=# execute stmt;
 ?column?
----------
(0 rows)

Session 2:
postgres=# select locktype, relation::regclass, virtualxid,
virtualtransaction, pid, mode, granted, fastpath from pg_locks;
  locktype  | relation | virtualxid | virtualtransaction |  pid  |      mode       | granted | fastpath
------------+----------+------------+--------------------+-------+-----------------+---------+----------
 relation   | pg_locks |            | 4/4                | 50914 | AccessShareLock | t       | t
 virtualxid |          | 4/4        | 4/4                | 50914 | ExclusiveLock   | t       | t
 relation   | t1p1p1   |            | 3/12               | 50912 | AccessShareLock | t       | t
 relation   | t1p1     |            | 3/12               | 50912 | AccessShareLock | t       | t
 relation   | t1       |            | 3/12               | 50912 | AccessShareLock | t       | t
 virtualxid |          | 3/12       | 3/12               | 50912 | ExclusiveLock   | t       | t
(6 rows)

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
Cc:	Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Partition-wise join for join between (declaratively) partitioned tables
Date:	2017-09-11 14:31:01
Message-ID:	CA+TgmoahYsThz6AWm-xyAkCz6ik_43hGnZNAvB=EzLMAG6wogg@mail.gmail.com
Lists:	pgsql-hackers

On Mon, Sep 11, 2017 at 8:07 AM, Ashutosh Bapat
<ashutosh(dot)bapat(at)enterprisedb(dot)com> wrote:
> I see the same thing when I use prepare and execute

Hmm. Well, that's good, but it doesn't prove there's no bug. We have
to understand where and why it's getting locked to know whether the
behavior will be correct in all cases. I haven't had time to look at
Amit's comments in detail yet so I don't know whether I agree with his
analysis or not, but we have to look at what's going on under the hood
to know whether the engine is working -- not just listen to the noise
it makes.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>
To:	Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
Cc:	Robert Haas <robertmhaas(at)gmail(dot)com>, Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Partition-wise join for join between (declaratively) partitioned tables
Date:	2017-09-12 02:01:09
Message-ID:	17bb51b2-52bf-3a0b-d540-04051cf3e781@lab.ntt.co.jp
Lists:	pgsql-hackers

On 2017/09/11 19:45, Ashutosh Bapat wrote:
> On Mon, Sep 11, 2017 at 12:16 PM, Amit Langote wrote:
>> IMHO, we should make it the responsibility of the future patch to set a
>> child PlanRowMark's prti to the direct parent's RT index, when we actually
>> know that it's needed for something. We clearly know today why we need to
>> pass the other objects like child RT entry, RT index, and Relation, so we
>> should limit this patch to pass only those objects to the recursive call.
>> That makes this patch a relatively easy to understand change.
>
> I think you are mixing two issues here 1. setting parent RTI in child
> PlanRowMark and 2. passing immediate parent's PlanRowMark to
> expand_single_inheritance_child().
>
> I have discussed 1 in my reply to Robert.
>
> About 2 you haven't given any particular comments to my reply. To me
> it looks like it's this patch that introduces the notion of
> multi-level expansion, so it's natural for this patch to pass
> PlanRowMark in cascaded fashion similar to other structures.

Your patch does 2 to be able to do 1, doesn't it? That is, to be able to
set the child PlanRowMark's prti to the direct parent's RT index, you pass
the immediate parent's PlanRowMark to the recursive call of
expand_single_inheritance_child().

All I am trying to say is that this patch's mission is to expand
inheritance step-wise to be able to do certain things in the *planner*
that weren't possible before. The patch accomplishes that by creating
child AppendRelInfos such that its parent_relid field is set to the
immediate parent's RT index. It's quite clear why we're doing so. It's
not clear why we should do so for PlanRowMarks too. Maybe it's fine as
long as nothing breaks.
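
To illustrate the step-wise expansion described above, here is a minimal standalone sketch, not PostgreSQL source. It assumes a hypothetical array parent_of[] standing in for the parent_relid field of each child's AppendRelInfo, indexed by the child's RT index, with 0 meaning "no parent". The function names and MAX_RELS are made up for the example, and the links are assumed to be acyclic.

```c
#include <assert.h>

/* Hypothetical upper bound on RT indexes used by this sketch. */
#define MAX_RELS 8

/*
 * parent_of[] is a stand-in for AppendRelInfo.parent_relid, indexed by the
 * child's RT index; 0 means the relation has no parent (it is a root).
 * Follow immediate-parent links until the root parent is reached.
 */
int
find_root_parent(const int parent_of[], int rti)
{
    assert(rti > 0 && rti < MAX_RELS);
    while (parent_of[rti] != 0)
        rti = parent_of[rti];
    return rti;
}

/* Count how many levels of expansion separate rti from its root parent. */
int
expansion_depth(const int parent_of[], int rti)
{
    int depth = 0;

    assert(rti > 0 && rti < MAX_RELS);
    while (parent_of[rti] != 0)
    {
        rti = parent_of[rti];
        depth++;
    }
    return depth;
}
```

With a flattened hierarchy every child would report depth 1; with one-level-at-a-time expansion a sub-partition reports its real distance from the root, and the root can still be recovered by walking the links.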

>> If we go with your patch, partitioned tables won't get locked, for
>> example, in case of the following query (p is a partitioned table):
>>
>> select 1 from p union all select 2 from p;
>>
>> That's because the RelOptInfos for the two instances of p in the above
>> query are RELOPT_OTHER_MEMBER_REL, not RELOPT_BASEREL. They are children
>> of the Append corresponding to the UNION ALL subquery RTE. So,
>> partitioned_rels does not get set per your proposed code.
>

[...]

> So, all partitioned partitions are getting locked correctly. Am I
> missing something?

Will reply to this separately to your other email.

> Actually, the original problem that caused this discussion started
> with an assertion failure in get_partitioned_child_rels() as
> Assert(list_length(result) >= 1);
>
> This assertion fails if result is NIL when an intermediate partitioned
> table is passed. Maybe we should assert (result == NIL ||
> list_length(result) == 1) and allow that function to be called even
> for intermediate partitioned partitions for which the function will
> return NIL. That will leave the code in add_paths_to_append_rel()
> simple. Thoughts?

Yeah, I guess that could work. We'll just have to write comments to
describe why the Assert is written that way.

>> In create_lateral_join_info():
>>
>> + Assert(IS_SIMPLE_REL(brel));
>> + Assert(brte);
>>
>> The second Assert is either unnecessary or should be placed first.
>
> simple_rte_array[] may have some NULL entries. Second assert makes
> sure that we aren't dealing with a NULL entry. Any particular reason
> to reorder the asserts?

Sorry, I missed that the 2nd Assert has b"rte". I thought it's b"rel".

>> The following comment could be made a bit clearer.
>>
>> + * In the case of table inheritance, the parent RTE is directly linked
>> + * to every child table via an AppendRelInfo. In the case of table
>> + * partitioning, the inheritance hierarchy is expanded one level at a
>> + * time rather than flattened. Therefore, an other member rel that is
>> + * a partitioned table may have children of its own, and must
>> + * therefore be marked with the appropriate lateral info so that those
>> + * children eventually get marked also.
>>
>> How about: In the case of partitioned table inheritance, the original
>> parent RTE is linked, via AppendRelInfo, only to its immediate partitions.
>> Partitions below the first level are accessible only via their immediate
>> parent's RelOptInfo, which would be of kind RELOPT_OTHER_MEMBER_REL, so
>> consider those as well.
>
> I don't see much difference between those two. We usually do not use
> macros in comments, so usually comments mention "other member" rel.
> Let's leave this for the committer to judge.

Sure.

>> In expand_inherited_rtentry(), the following comment fragment is obsolete,
>> because we *do* now create AppendRelInfo's for partitioned children:
>>
>> + /*
>> + * We keep a list of objects in root, each of which maps a partitioned
>> + * parent RT index to the list of RT indexes of its partitioned child
>> + * tables which do not have AppendRelInfos associated with those.
>
> Good catch. I have reworded it as
> /*
> * We keep a list of objects in root, each of which maps a root
> * partitioned parent RT index to the list of RT indexes of descendant
> * partitioned child tables.
>
> Does that look good?

Looks fine.

>> By the way, when we call expand_single_inheritance_child() in the
>> non-partitioned inheritance case, we should pass NULL for childrte_p,
>> childRTindex_p, childrc_p, instead of declaring variables that won't be
>> used. Hence, expand_single_inheritance_child() should make those
>> arguments optional.
>
> That introduces an extra "if" condition, which is costlier than an
> assignment. We have used both the styles in the code. Previously, I
> have got comments otherwise. So, I am not sure.

OK. expand_single_inheritance_child's header comment does not mention the
new result fields. Maybe add a comment describing what their role is and
that they're not optional arguments.

> I will update the patches once we have some resolution about 1. prti
> in PlanRowMarks and 2. detection of root partitioned table in
> add_paths_to_append_rel().

OK.

About 2, I somewhat agree with your proposed solution above, which might
be simpler to explain in comments than the code I proposed.

Thanks,
Amit

From:	Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
To:	Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>
Cc:	Robert Haas <robertmhaas(at)gmail(dot)com>, Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Partition-wise join for join between (declaratively) partitioned tables
Date:	2017-09-12 07:39:02
Message-ID:	CAFjFpRdZOv5-s5na6nptU=k3r_EOYz20jxwMbOkVxCc8wtSoug@mail.gmail.com
Lists:	pgsql-hackers

On Tue, Sep 12, 2017 at 7:31 AM, Amit Langote
<Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp> wrote:
> On 2017/09/11 19:45, Ashutosh Bapat wrote:
>> On Mon, Sep 11, 2017 at 12:16 PM, Amit Langote wrote:
>>> IMHO, we should make it the responsibility of the future patch to set a
>>> child PlanRowMark's prti to the direct parent's RT index, when we actually
>>> know that it's needed for something. We clearly know today why we need to
>>> pass the other objects like child RT entry, RT index, and Relation, so we
>>> should limit this patch to pass only those objects to the recursive call.
>>> That makes this patch a relatively easy to understand change.
>>
>> I think you are mixing two issues here 1. setting parent RTI in child
>> PlanRowMark and 2. passing immediate parent's PlanRowMark to
>> expand_single_inheritance_child().
>>
>> I have discussed 1 in my reply to Robert.
>>
>> About 2 you haven't given any particular comments to my reply. To me
>> it looks like it's this patch that introduces the notion of
>> multi-level expansion, so it's natural for this patch to pass
>> PlanRowMark in cascaded fashion similar to other structures.
>
> Your patch does 2 to be able to do 1, doesn't it? That is, to be able to
> set the child PlanRowMark's prti to the direct parent's RT index, you pass
> the immediate parent's PlanRowMark to the recursive call of
> expand_single_inheritance_child().

No. child PlanRowMark's prti is set to parentRTIndex, which is a
separate argument and is used to also set parent_relid in
AppendRelInfo.

>
>> Actually, the original problem that caused this discussion started
>> with an assertion failure in get_partitioned_child_rels() as
>> Assert(list_length(result) >= 1);
>>
>> This assertion fails if result is NIL when an intermediate partitioned
>> table is passed. Maybe we should assert (result == NIL ||
>> list_length(result) == 1) and allow that function to be called even
>> for intermediate partitioned partitions for which the function will
>> return NIL. That will leave the code in add_paths_to_append_rel()
>> simple. Thoughts?
>
> Yeah, I guess that could work. We'll just have to write comments to
> describe why the Assert is written that way.
>
>>> By the way, when we call expand_single_inheritance_child() in the
>>> non-partitioned inheritance case, we should pass NULL for childrte_p,
>>> childRTindex_p, childrc_p, instead of declaring variables that won't be
>>> used. Hence, expand_single_inheritance_child() should make those
>>> arguments optional.
>>
>> That introduces an extra "if" condition, which is costlier than an
>> assignment. We have used both the styles in the code. Previously, I
>> have got comments otherwise. So, I am not sure.
>
> OK. expand_single_inheritance_child's header comment does not mention the
> new result fields. Maybe add a comment describing what their role is and
> that they're not optional arguments.
>
>> I will update the patches once we have some resolution about 1. prti
>> in PlanRowMarks and 2. detection of root partitioned table in
>> add_paths_to_append_rel().
>
> OK.
>
> About 2, I somewhat agree with your proposed solution above, which might
> be simpler to explain in comments than the code I proposed.

After testing a few queries I am getting a feeling that
ExecLockNonLeafAppendTables isn't really locking anything. I will
write more about that in my reply to Robert's mail.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

From:	Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>
To:	Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Partition-wise join for join between (declaratively) partitioned tables
Date:	2017-09-12 07:46:58
Message-ID:	040c1322-cf92-ab47-5246-dae08d5b6f7a@lab.ntt.co.jp
Lists:	pgsql-hackers

On 2017/09/11 21:07, Ashutosh Bapat wrote:
> On Mon, Sep 11, 2017 at 5:19 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>> On Mon, Sep 11, 2017 at 6:45 AM, Ashutosh Bapat
>> <ashutosh(dot)bapat(at)enterprisedb(dot)com> wrote:
>>> So, all partitioned partitions are getting locked correctly. Am I
>>> missing something?
>>
>> That's not a valid test. In that scenario, you're going to hold all
>> the locks acquired by the planner, all the locks acquired by the
>> rewriter, and all the locks acquired by the executor, but when using
>> prepared queries, it's possible to execute the plan after the planner
>> and rewriter locks are no longer held.
>
> I see the same thing when I use prepare and execute

So I looked at this a bit closely and came to the conclusion that we may
not need to keep partitioned table RT indexes in the
(Merge)Append.partitioned_rels after all, as far as execution-time locking
is concerned.

Consider two cases:

1. Plan is created and executed in the same transaction

In this case, locks taken on the partitioned tables by the planner will
suffice.

2. Plan is executed in a different transaction from the one in which it
was created (a cached plan)

In this case, AcquireExecutorLocks will lock all the relations in
PlannedStmt.rtable, which must include all partitioned tables of all
partition trees involved in the query. Of those, it will lock the tables
whose RT indexes appear in PlannedStmt.nonleafResultRelations with
RowExclusiveLock mode. PlannedStmt.nonleafResultRelations is a global
list of all partitioned table RT indexes obtained by concatenating
partitioned_rels lists of all ModifyTable nodes involved in the query
(set_plan_refs does that). We need to distinguish nonleafResultRelations,
because we need to take the stronger lock on a given table before any
weaker one if it happens to appear in the query as a non-result relation
too, to avoid lock strength upgrade deadlock hazard.
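
The lock-mode choice in case 2 can be sketched as follows. This is a simplified standalone model, not the AcquireExecutorLocks source: the enum, the function names, and the array-based lists are hypothetical, leaf result relations are included alongside the non-leaf ones as an assumption, and row-mark handling is left out entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical lock levels, weaker value first; local to this sketch. */
typedef enum
{
    SKETCH_ACCESS_SHARE_LOCK = 1,
    SKETCH_ROW_EXCLUSIVE_LOCK = 3
} SketchLockMode;

/* Return true if rti appears in the list of n RT indexes. */
bool
rti_in_list(const int *list, size_t n, int rti)
{
    for (size_t i = 0; i < n; i++)
        if (list[i] == rti)
            return true;
    return false;
}

/*
 * Decide the lock mode for one range table entry. Result relations,
 * leaf or non-leaf, get the stronger lock up front so that a weaker
 * lock never has to be upgraded later; everything else gets the
 * weaker lock.
 */
SketchLockMode
choose_lock_mode(int rti,
                 const int *result_rels, size_t n_result,
                 const int *nonleaf_result_rels, size_t n_nonleaf)
{
    if (rti_in_list(result_rels, n_result, rti) ||
        rti_in_list(nonleaf_result_rels, n_nonleaf, rti))
        return SKETCH_ROW_EXCLUSIVE_LOCK;
    return SKETCH_ACCESS_SHARE_LOCK;
}
```

Note that nothing in this decision consults a per-Append list: only the range table position and the two global result-relation lists matter.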

Moreover, because all the tables from plannedstmt->rtable, including the
partitioned tables, will be added to PlannedStmt.relationOids, any
invalidation events affecting the partitioned tables (for example,
add/remove a partition) will cause the plan involving partitioned tables
to be recreated.

In none of this do we rely on the partitioned table RT indexes appearing
in the (Merge)Append node itself. Maybe, we should just remove
partitioned_rels from (Merge)AppendPath and (Merge)Append node in a
separate patch and move on.

Thoughts?

Thanks,
Amit

From:	Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
To:	Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>
Cc:	Robert Haas <robertmhaas(at)gmail(dot)com>, Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Partition-wise join for join between (declaratively) partitioned tables
Date:	2017-09-12 07:55:21
Message-ID:	CAFjFpRcb8yiH9=yNFkOqWBWmiNrJoEUN+oP9SBtsf4fieqq4Fw@mail.gmail.com
Lists:	pgsql-hackers

On Tue, Sep 12, 2017 at 1:16 PM, Amit Langote
<Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp> wrote:
> On 2017/09/11 21:07, Ashutosh Bapat wrote:
>> On Mon, Sep 11, 2017 at 5:19 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>>> On Mon, Sep 11, 2017 at 6:45 AM, Ashutosh Bapat
>>> <ashutosh(dot)bapat(at)enterprisedb(dot)com> wrote:
>>>> So, all partitioned partitions are getting locked correctly. Am I
>>>> missing something?
>>>
>>> That's not a valid test. In that scenario, you're going to hold all
>>> the locks acquired by the planner, all the locks acquired by the
>>> rewriter, and all the locks acquired by the executor, but when using
>>> prepared queries, it's possible to execute the plan after the planner
>>> and rewriter locks are no longer held.
>>
>> I see the same thing when I use prepare and execute
>
> So I looked at this a bit closely and came to the conclusion that we may
> not need to keep partitioned table RT indexes in the
> (Merge)Append.partitioned_rels after all, as far as execution-time locking
> is concerned.
>
> Consider two cases:
>
> 1. Plan is created and executed in the same transaction
>
> In this case, locks taken on the partitioned tables by the planner will
> suffice.
>
> 2. Plan is executed in a different transaction from the one in which it
> was created (a cached plan)
>
> In this case, AcquireExecutorLocks will lock all the relations in
> PlannedStmt.rtable, which must include all partitioned tables of all
> partition trees involved in the query. Of those, it will lock the tables
> whose RT indexes appear in PlannedStmt.nonleafResultRelations with
> RowExclusiveLock mode. PlannedStmt.nonleafResultRelations is a global
> list of all partitioned table RT indexes obtained by concatenating
> partitioned_rels lists of all ModifyTable nodes involved in the query
> (set_plan_refs does that). We need to distinguish nonleafResultRelations,
> because we need to take the stronger lock on a given table before any
> weaker one if it happens to appear in the query as a non-result relation
> too, to avoid lock strength upgrade deadlock hazard.
>
> Moreover, because all the tables from plannedstmt->rtable, including the
> partitioned tables, will be added to PlannedStmt.relationOids, any
> invalidation events affecting the partitioned tables (for example,
> add/remove a partition) will cause the plan involving partitioned tables
> to be recreated.
>
> In none of this do we rely on the partitioned table RT indexes appearing
> in the (Merge)Append node itself. Maybe, we should just remove
> partitioned_rels from (Merge)AppendPath and (Merge)Append node in a
> separate patch and move on.
>
> Thoughts?

Yes, I did the same analysis (to which I refer in my earlier reply to
you). I too think we should just remove partitioned_rels from Append
paths. But then the question is that those are transferred to the
ModifyTable node in create_modifytable_plan() and used for something
else. What should we do about that code? I don't think we are really
using that list from the ModifyTable node either, so maybe we could
remove it from there as well. What do you think? Does that mean
partitioned_rels isn't used at all in the code?

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

From:	Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>
To:	Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
Cc:	Robert Haas <robertmhaas(at)gmail(dot)com>, Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Partition-wise join for join between (declaratively) partitioned tables
Date:	2017-09-12 08:12:23
Message-ID:	bb0ca046-0780-53d0-66ae-4d06162d8fd8@lab.ntt.co.jp
Lists:	pgsql-hackers

On 2017/09/12 16:55, Ashutosh Bapat wrote:
> On Tue, Sep 12, 2017 at 1:16 PM, Amit Langote wrote:
>> So I looked at this a bit closely and came to the conclusion that we may
>> not need to keep partitioned table RT indexes in the
>> (Merge)Append.partitioned_rels after all, as far as execution-time locking
>> is concerned.
>>
>> Consider two cases:
>>
>> 1. Plan is created and executed in the same transaction
>>
>> In this case, locks taken on the partitioned tables by the planner will
>> suffice.
>>
>> 2. Plan is executed in a different transaction from the one in which it
>> was created (a cached plan)
>>
>> In this case, AcquireExecutorLocks will lock all the relations in
>> PlannedStmt.rtable, which must include all partitioned tables of all
>> partition trees involved in the query. Of those, it will lock the tables
>> whose RT indexes appear in PlannedStmt.nonleafResultRelations with
>> RowExclusiveLock mode. PlannedStmt.nonleafResultRelations is a global
>> list of all partitioned table RT indexes obtained by concatenating
>> partitioned_rels lists of all ModifyTable nodes involved in the query
>> (set_plan_refs does that). We need to distinguish nonleafResultRelations,
>> because we need to take the stronger lock on a given table before any
>> weaker one if it happens to appear in the query as a non-result relation
>> too, to avoid lock strength upgrade deadlock hazard.
>>
>> Moreover, because all the tables from plannedstmt->rtable, including the
>> partitioned tables, will be added to PlannedStmt.relationOids, any
>> invalidation events affecting the partitioned tables (for example,
>> add/remove a partition) will cause the plan involving partitioned tables
>> to be recreated.
>>
>> In none of this do we rely on the partitioned table RT indexes appearing
>> in the (Merge)Append node itself. Maybe, we should just remove
>> partitioned_rels from (Merge)AppendPath and (Merge)Append node in a
>> separate patch and move on.
>>
>> Thoughts?
>
> Yes, I did the same analysis (to which I refer in my earlier reply to
> you). I too think we should just remove partitioned_rels from Append
> paths. But then the question is that those are transferred to the
> ModifyTable node in create_modifytable_plan() and used for something
> else. What should we do about that code? I don't think we are really
> using that list from the ModifyTable node either, so maybe we could
> remove it from there as well. What do you think? Does that mean
> partitioned_rels isn't used at all in the code?

No, we cannot simply get rid of partitioned_rels altogether. We'll need
to keep it in the ModifyTable node, because we *do* need the
nonleafResultRelations list in PlannedStmt to distinguish partitioned
table result relations, which set_plan_refs builds by concatenating
partitioned_rels lists of various ModifyTable nodes of the query. The
PlannedStmt.nonleafResultRelations list actually has some use (which
parallels PlannedStmt.resultRelations), but partitioned_rels list in the
individual (Merge)Append, as it turns out, doesn't.

So, we can remove partitioned_rels from (Merge)AppendPath and
(Merge)Append nodes and remove ExecLockNonLeafAppendTables().

Thanks,
Amit

From:	Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>
To:	Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
Cc:	Robert Haas <robertmhaas(at)gmail(dot)com>, Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Partition-wise join for join between (declaratively) partitioned tables
Date:	2017-09-12 08:47:04
Message-ID:	8f87e8e6-5c28-1ade-ad1d-f6ba07f76f0a@lab.ntt.co.jp
Lists:	pgsql-hackers

On 2017/09/12 16:39, Ashutosh Bapat wrote:
> On Tue, Sep 12, 2017 at 7:31 AM, Amit Langote
> <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp> wrote:
>> On 2017/09/11 19:45, Ashutosh Bapat wrote:
>>> On Mon, Sep 11, 2017 at 12:16 PM, Amit Langote wrote:
>>>> IMHO, we should make it the responsibility of the future patch to set a
>>>> child PlanRowMark's prti to the direct parent's RT index, when we actually
>>>> know that it's needed for something. We clearly know today why we need to
>>>> pass the other objects like child RT entry, RT index, and Relation, so we
>>>> should limit this patch to pass only those objects to the recursive call.
>>>> That makes this patch a relatively easy to understand change.
>>>
>>> I think you are mixing two issues here 1. setting parent RTI in child
>>> PlanRowMark and 2. passing immediate parent's PlanRowMark to
>>> expand_single_inheritance_child().
>>>
>>> I have discussed 1 in my reply to Robert.
>>>
>>> About 2 you haven't given any particular comments to my reply. To me
>>> it looks like it's this patch that introduces the notion of
>>> multi-level expansion, so it's natural for this patch to pass
>>> PlanRowMark in cascaded fashion similar to other structures.
>>
>> Your patch does 2 to be able to do 1, doesn't it? That is, to be able to
>> set the child PlanRowMark's prti to the direct parent's RT index, you pass
>> the immediate parent's PlanRowMark to the recursive call of
>> expand_single_inheritance_child().
>
> No. child PlanRowMark's prti is set to parentRTIndex, which is a
> separate argument and is used to also set parent_relid in
> AppendRelInfo.

OK. So, to keep the old behavior (if at all), we'd actually need a new
argument rootParentRTindex, the old behavior being that all child
PlanRowMarks have rootParentRTindex as their prti.

It seems though that the new behavior where prti will now be set to the
direct parent's RT index is more or less harmless, because whatever we set
prti to, as long as it's different from rti, we can consider it a child
PlanRowMark. So it might be fine to set prti to direct parent's RT index.

That said, I noticed that we might need to be careful about what the value
of the root parent's PlanRowMark's allMarkTypes field gets set to. We need
to make sure that it reflects markType of all partitions in the tree,
including those that are not root parent's direct children. Is that true
with the proposed implementation?
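
A minimal standalone sketch of the requirement, not taken from the patch: the struct below is a simplified stand-in for PlanRowMark, the mark-kind numbering and function names are hypothetical, and the array is assumed to list every parent before its children. It shows one way bits from the deepest level can reach the root when each child records its immediate parent in prti.

```c
#include <assert.h>

/* Hypothetical mark kinds; the numbering is local to this sketch. */
typedef enum
{
    SKETCH_MARK_EXCLUSIVE = 0,
    SKETCH_MARK_SHARE = 2,
    SKETCH_MARK_COPY = 5
} SketchMarkType;

/* Simplified stand-in for a PlanRowMark. */
typedef struct SketchRowMark
{
    int            rti;          /* this relation's RT index */
    int            prti;         /* parent's RT index; equals rti for the root */
    SketchMarkType markType;
    int            allMarkTypes; /* OR of (1 << markType), self and descendants */
} SketchRowMark;

/* A mark is a child mark whenever prti differs from rti. */
int
is_child_mark(const SketchRowMark *rc)
{
    return rc->prti != rc->rti;
}

/*
 * Fold each mark's own markType into its allMarkTypes, then push the bits
 * up to its parent. Because parents precede children in marks[], walking
 * backwards lets bits from the deepest level reach the root through the
 * intermediate parents.
 */
void
propagate_mark_types(SketchRowMark marks[], int n)
{
    for (int i = n - 1; i >= 0; i--)
    {
        marks[i].allMarkTypes |= 1 << marks[i].markType;
        if (!is_child_mark(&marks[i]))
            continue;
        for (int j = 0; j < i; j++)
            if (marks[j].rti == marks[i].prti)
                marks[j].allMarkTypes |= marks[i].allMarkTypes;
    }
}
```

In this model the intermediate parent also ends up with the bits of its own children, while the root ends up with the bits of the whole tree.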

Thanks,
Amit

From:	Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
To:	Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>
Cc:	Robert Haas <robertmhaas(at)gmail(dot)com>, Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Partition-wise join for join between (declaratively) partitioned tables
Date:	2017-09-12 08:53:31
Message-ID:	CAFjFpRfWu3pVXv5DU0VgSh9JMwwzzWbzxXiPnpxVwVqRn82wyg@mail.gmail.com
Lists:	pgsql-hackers

On Tue, Sep 12, 2017 at 1:42 PM, Amit Langote
<Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp> wrote:
> On 2017/09/12 16:55, Ashutosh Bapat wrote:
>> On Tue, Sep 12, 2017 at 1:16 PM, Amit Langote wrote:
>>> So I looked at this a bit closely and came to the conclusion that we may
>>> not need to keep partitioned table RT indexes in the
>>> (Merge)Append.partitioned_rels after all, as far as execution-time locking
>>> is concerned.
>>>
>>> Consider two cases:
>>>
>>> 1. Plan is created and executed in the same transaction
>>>
>>> In this case, locks taken on the partitioned tables by the planner will
>>> suffice.
>>>
>>> 2. Plan is executed in a different transaction from the one in which it
>>> was created (a cached plan)
>>>
>>> In this case, AcquireExecutorLocks will lock all the relations in
>>> PlannedStmt.rtable, which must include all partitioned tables of all
>>> partition trees involved in the query. Of those, it will lock the tables
>>> whose RT indexes appear in PlannedStmt.nonleafResultRelations with
>>> RowExclusiveLock mode. PlannedStmt.nonleafResultRelations is a global
>>> list of all partitioned table RT indexes obtained by concatenating
>>> partitioned_rels lists of all ModifyTable nodes involved in the query
>>> (set_plan_refs does that). We need to distinguish nonleafResultRelations,
>>> because we need to take the stronger lock on a given table before any
>>> weaker one if it happens to appear in the query as a non-result relation
>>> too, to avoid lock strength upgrade deadlock hazard.
>>>
>>> Moreover, because all the tables from plannedstmt->rtable, including the
>>> partitioned tables, will be added to PlannedStmt.relationOids, any
>>> invalidation events affecting the partitioned tables (for example,
>>> add/remove a partition) will cause the plan involving partitioned tables
>>> to be recreated.
>>>
>>> In none of this do we rely on the partitioned table RT indexes appearing
>>> in the (Merge)Append node itself. Maybe, we should just remove
>>> partitioned_rels from (Merge)AppendPath and (Merge)Append node in a
>>> separate patch and move on.
>>>
>>> Thoughts?
>>
>> Yes, I did the same analysis (to which I refer in my earlier reply to
>> you). I too think we should just remove partitioned_rels from Append
>> paths. But then the question is that those are transferred to the
>> ModifyTable node in create_modifytable_plan() and used for something
>> else. What should we do about that code? I don't think we are really
>> using that list from the ModifyTable node either, so maybe we could
>> remove it from there as well. What do you think? Does that mean
>> partitioned_rels isn't used at all in the code?
>
> No, we cannot simply get rid of partitioned_rels altogether. We'll need
> to keep it in the ModifyTable node, because we *do* need the
> nonleafResultRelations list in PlannedStmt to distinguish partitioned
> table result relations, which set_plan_refs builds by concatenating
> partitioned_rels lists of various ModifyTable nodes of the query. The
> PlannedStmt.nonleafResultRelations list actually has some use (which
> parallels PlannedStmt.resultRelations), but partitioned_rels list in the
> individual (Merge)Append, as it turns out, doesn't.
>
> So, we can remove partitioned_rels from (Merge)AppendPath and
> (Merge)Append nodes and remove ExecLockNonLeafAppendTables().

Don't we need partitioned_rels from Append paths to be transferred to
ModifyTable node, or do we have a different way of calculating
nonleafResultRelations?

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

From:	Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>
To:	Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
Cc:	Robert Haas <robertmhaas(at)gmail(dot)com>, Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Partition-wise join for join between (declaratively) partitioned tables
Date:	2017-09-12 09:05:03
Message-ID:	09b8e4a7-cf72-8528-0e05-a13d92453ab9@lab.ntt.co.jp
Lists:	pgsql-hackers

On 2017/09/12 17:53, Ashutosh Bapat wrote:
> On Tue, Sep 12, 2017 at 1:42 PM, Amit Langote wrote:
>> So, we can remove partitioned_rels from (Merge)AppendPath and
>> (Merge)Append nodes and remove ExecLockNonLeafAppendTables().
>
> Don't we need partitioned_rels from Append paths to be transferred to
> ModifyTable node, or do we have a different way of calculating
> nonleafResultRelations?

No, we don't transfer partitioned_rels from Append path to ModifyTable
node. inheritance_planner(), which builds the ModifyTable path for
UPDATE/DELETE on a partitioned table, fetches partitioned_rels from
root->pcinfo_list itself and passes it to create_modifytable_path. No
Append path is involved in that case. PlannedStmt.nonleafResultRelations
is built by concatenating the partitioned_rels lists of all ModifyTable
nodes appearing in the query. It does not depend on Append's or
AppendPath's partitioned_rels.
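
To make that data flow concrete, here is a toy sketch in plain C. It is not
planner code: ToyModifyTable and toy_collect_nonleaf_result_rels() are
made-up names for illustration, and plain arrays stand in for List. It only
shows that the global list can be built from the ModifyTable nodes alone,
without consulting any Append node.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a ModifyTable node; only partitioned_rels matters. */
typedef struct ToyModifyTable
{
    const int  *partitioned_rels;   /* RT indexes of partitioned tables */
    size_t      nrels;
} ToyModifyTable;

/*
 * Concatenate the partitioned_rels of every ModifyTable node into out[],
 * in node order. Writes at most cap entries and returns the number written.
 */
static size_t
toy_collect_nonleaf_result_rels(const ToyModifyTable *nodes, size_t nnodes,
                                int *out, size_t cap)
{
    size_t      n = 0;

    for (size_t i = 0; i < nnodes; i++)
        for (size_t j = 0; j < nodes[i].nrels && n < cap; j++)
            out[n++] = nodes[i].partitioned_rels[j];

    return n;
}
```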

Thanks,
Amit

From:	Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
To:	Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>
Cc:	Robert Haas <robertmhaas(at)gmail(dot)com>, Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Partition-wise join for join between (declaratively) partitioned tables
Date:	2017-09-12 09:49:38
Message-ID:	CAFjFpRf7COj1buQbn2f=9+0nHaQP3DmMipTK2CXQ6iZOrcdFPQ@mail.gmail.com
Lists:	pgsql-hackers

On Tue, Sep 12, 2017 at 2:17 PM, Amit Langote
<Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp> wrote:
>
> That said, I noticed that we might need to be careful about what the value
> of the root parent's PlanRowMark's allMarkType field gets set to. We need
> to make sure that it reflects markType of all partitions in the tree,
> including those that are not root parent's direct children. Is that true
> with the proposed implementation?

Yes. We include the child's allMarkTypes in the parent's allMarkTypes. So
the top parent's PlanRowMark should have all descendants' allMarkTypes,
which is not happening in the patch right now. There are two ways to
fix that.

1. Pass the top parent's PlanRowMark all the way down to the leaf
partitions, so that the current expand_single_inheritance_child() collects
allMarkTypes of all children correctly. But this way, the PlanRowMark of
an intermediate parent does not reflect allMarkTypes of its children;
only the top root records that.
2. Pass the immediate parent's PlanRowMark to
expand_single_inheritance_child(), so that it records allMarkTypes of
its children. In expand_partitioned_rtentry(), use the following sequence

    expand_single_inheritance_child(root, parentrte, parentRTindex,
                                    parentrel, parentrc, childrel,
                                    appinfos, &childrte, &childRTindex,
                                    &childrc);

    /* If this child is itself partitioned, recurse */
    if (childrel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
    {
        expand_partitioned_rtentry(root, childrte, childRTindex,
                                   childrel, childrc, lockmode, appinfos,
                                   partitioned_child_rels);

        /* Include child's rowmark type in parent's allMarkTypes */
        parentrc->allMarkTypes |= childrc->allMarkTypes;
    }
so that we push allMarkTypes up the hierarchy.

I like the second way, since every intermediate parent records
allMarkTypes of its descendants.
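
The effect of the second approach can be sketched outside the planner with a
toy tree. Everything here is an assumption made for illustration: ToyRel and
toy_expand() are invented names, and mark types are represented directly as
bits rather than through PlanRowMark. The point is only that OR-ing a child's
accumulated mask into its immediate parent after recursing leaves every
intermediate parent with the marks of all its descendants.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a relation together with its row mark. */
typedef struct ToyRel
{
    unsigned        markType;       /* this rel's own mark, as a single bit */
    unsigned        allMarkTypes;   /* own mark OR'ed with descendants' marks */
    struct ToyRel **children;
    size_t          nchildren;
} ToyRel;

/* Expand depth-first, pushing each child's accumulated marks up to its parent. */
static void
toy_expand(ToyRel *parent)
{
    parent->allMarkTypes |= parent->markType;

    for (size_t i = 0; i < parent->nchildren; i++)
    {
        ToyRel     *child = parent->children[i];

        /* Recurse first so the child's mask already covers its own subtree */
        toy_expand(child);

        /* Include child's accumulated marks in the immediate parent */
        parent->allMarkTypes |= child->allMarkTypes;
    }
}
```

With a three-level tree whose leaf carries a mark that neither ancestor has,
both the intermediate parent and the root end up with that bit set.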

Thoughts?
--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

From:	Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>
To:	Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
Cc:	Robert Haas <robertmhaas(at)gmail(dot)com>, Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Partition-wise join for join between (declaratively) partitioned tables
Date:	2017-09-12 10:04:54
Message-ID:	7ebdbe52-1918-49dc-4ca2-5785c4aab26a@lab.ntt.co.jp
Lists:	pgsql-hackers

On 2017/09/12 18:49, Ashutosh Bapat wrote:
> On Tue, Sep 12, 2017 at 2:17 PM, Amit Langote
> <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp> wrote:
>>
>> That said, I noticed that we might need to be careful about what the value
>> of the root parent's PlanRowMark's allMarkType field gets set to. We need
>> to make sure that it reflects markType of all partitions in the tree,
>> including those that are not root parent's direct children. Is that true
>> with the proposed implementation?
>
> Yes. We include child's allMarkTypes into parent's allMarkTypes. So,
> top parent's PlanRowMarks should have all descendant's allMarkTypes,
> which is not happening in the patch right now. There are two ways to
> fix that.
>
> 1. Pass top parent's PlanRowMark all the way down to the leaf
> partitions, so that current expand_single_inheritance_child() collects
> allMarkTypes of all children correctly. But this way, PlanRowMarks of
> intermediate parent does not reflect allMarkTypes of its children,
> only top root records that.
> 2. Pass immediate parent's PlanRowMark to
> expand_single_inheritance_child(), so that it records allMarkTypes of
> its children. In expand_partitioned_rtentry() have following sequence
>
> expand_single_inheritance_child(root, parentrte, parentRTindex,
> parentrel, parentrc, childrel,
> appinfos, &childrte, &childRTindex,
> &childrc);
>
> /* If this child is itself partitioned, recurse */
> if (childrel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
> {
> expand_partitioned_rtentry(root, childrte, childRTindex,
> childrel, childrc, lockmode, appinfos,
> partitioned_child_rels);
>
> /* Include child's rowmark type in parent's allMarkTypes */
> parentrc->allMarkTypes |= childrc->allMarkTypes;
> }
> so that we push allMarkTypes up the hierarchy.
>
> I like the second way, since every intermediate parent records
> allMarkTypes of its descendants.

I like the second way, too.

Thanks,
Amit

From:	Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
To:	Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>
Cc:	Robert Haas <robertmhaas(at)gmail(dot)com>, Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Partition-wise join for join between (declaratively) partitioned tables
Date:	2017-09-12 10:56:38
Message-ID:	CAFjFpRe62H0rTb4Rb7wOVSR25xfNW+mt1Ncp-OtzGaEtZBTLwA@mail.gmail.com
Lists:	pgsql-hackers

On Tue, Sep 12, 2017 at 2:35 PM, Amit Langote
<Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp> wrote:
> On 2017/09/12 17:53, Ashutosh Bapat wrote:
>> On Tue, Sep 12, 2017 at 1:42 PM, Amit Langote wrote:
>>> So, we can remove partitioned_rels from (Merge)AppendPath and
>>> (Merge)Append nodes and remove ExecLockNonLeafAppendTables().
>>
>> Don't we need partitioned_rels from Append paths to be transferred to
>> ModifyTable node or we have a different way of calculating
>> nonleafResultRelations?
>
> No, we don't transfer partitioned_rels from Append path to ModifyTable
> node. inheritance_planner(), that builds the ModifyTable path for
> UPDATE/DELETE on a partitioned table, fetches partitioned_rels from
> root->pcinfo_list itself and passes it to create_modifytable_path. No
> Append path is involved in that case. PlannedStmt.nonleafResultRelations
> is built by concatenating the partitioned_rels lists of all ModifyTable
> nodes appearing in the query. It does not depend on Append's or
> AppendPath's partitioned_rels.

Ok. Thanks for the explanation.

This made me examine inheritance_planner() closely, and I think I have
spotted a thinko there. In inheritance_planner(), parent_rte is initially
set to the RTE of the parent, and then in the loop
1132     /*
1133      * And now we can get on with generating a plan for each child table.
1134      */
1135     foreach(lc, root->append_rel_list)
1136     {
... code clipped
1165         /*
1166          * If there are securityQuals attached to the parent, move them to the
1167          * child rel (they've already been transformed properly for that).
1168          */
1169         parent_rte = rt_fetch(parentRTindex, subroot->parse->rtable);
1170         child_rte = rt_fetch(appinfo->child_relid, subroot->parse->rtable);
1171         child_rte->securityQuals = parent_rte->securityQuals;
1172         parent_rte->securityQuals = NIL;

we set parent_rte to the one obtained from subroot->parse, which
happens to be the same (at least in contents) as original parent_rte.
Later we use this parent_rte to pull partitioned_rels outside that
loop

1371     if (parent_rte->relkind == RELKIND_PARTITIONED_TABLE)
1372     {
1373         partitioned_rels = get_partitioned_child_rels(root, parentRTindex);
1374         /* The root partitioned table is included as a child rel */
1375         Assert(list_length(partitioned_rels) >= 1);
1376     }

I think the code here expects the original parent_rte and not the one
we set around line 1169.

This isn't a bug right now, since both parent_rtes have the same
content. But I am not sure that will remain so. Here's a patch
to fix the thinko.
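
The general shape of the problem, and one way to sidestep it, can be shown
with a self-contained toy. This is only a sketch under assumptions, not the
attached patch: ToyRTE and toy_parent_is_partitioned() are invented for
illustration, and 'p' stands in for RELKIND_PARTITIONED_TABLE. The idea is to
record what we need from the original parent before the loop starts
reassigning the variable.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a range table entry; only relkind matters here. */
typedef struct ToyRTE
{
    char        relkind;
} ToyRTE;

/*
 * Report whether the ORIGINAL parent is partitioned, even though the loop
 * below reassigns parent_rte on every iteration.
 */
static bool
toy_parent_is_partitioned(const ToyRTE *orig_parent,
                          const ToyRTE *per_child_parents, size_t nchildren)
{
    const ToyRTE *parent_rte = orig_parent;

    /* Capture the property before the loop can change what parent_rte means */
    bool        is_parent_partitioned = (parent_rte->relkind == 'p');

    for (size_t i = 0; i < nchildren; i++)
        parent_rte = &per_child_parents[i];     /* mimics the reassignment */

    (void) parent_rte;          /* deliberately not consulted after the loop */
    return is_parent_partitioned;
}
```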

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Attachment	Content-Type	Size
inh_planner_prte.patch	application/octet-stream	1.1 KB

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>
Cc:	Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>, Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Partition-wise join for join between (declaratively) partitioned tables
Date:	2017-09-12 19:09:38
Message-ID:	CA+TgmobEygfb9wdz0nBZF-pCoVaoQK0w2uKfRsb2yjiq4WZ0sg@mail.gmail.com
Lists:	pgsql-hackers

On Tue, Sep 12, 2017 at 3:46 AM, Amit Langote
<Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp> wrote:
> In this case, AcquireExecutorLocks will lock all the relations in
> PlannedStmt.rtable, which must include all partitioned tables of all
> partition trees involved in the query. Of those, it will lock the tables
> whose RT indexes appear in PlannedStmt.nonleafResultRelations with
> RowExclusiveLock mode. PlannedStmt.nonleafResultRelations is a global
> list of all partitioned table RT indexes obtained by concatenating
> partitioned_rels lists of all ModifyTable nodes involved in the query
> (set_plan_refs does that). We need to distinguish nonleafResultRelations,
> because we need to take the stronger lock on a given table before any
> weaker one if it happens to appear in the query as a non-result relation
> too, to avoid lock strength upgrade deadlock hazard.

Hmm. The problem with this theory in my view is that it doesn't
explain why InitPlan() and ExecOpenScanRelation() lock the relations
instead of just assuming that they are already locked either by
AcquireExecutorLocks or by planning. If ExecLockNonLeafAppendTables()
doesn't really need to take locks, then ExecOpenScanRelation() must
not need to do it either. We invented ExecLockNonLeafAppendTables()
on the occasion of removing the scans of those tables which would
previously have caused ExecOpenScanRelation() to be invoked, so as to
keep the locking behavior unchanged.

AcquireExecutorLocks() looks like an odd bit of code to me. The
executor itself locks result tables in InitPlan() and then everything
else during InitPlan() and all of the others later on while walking
the plan tree -- comments in InitPlan() say that this is to avoid a
lock upgrade hazard if a result rel is also a source rel. But
AcquireExecutorLocks() has no such provision; it just locks everything
in RTE order. In theory, that's a deadlock hazard of another kind, as
we just talked about in the context of EIBO. In fact, expanding in
bound order has made the situation worse: before, expansion order and
locking order were the same, so maybe having AcquireExecutorLocks()
work in RTE order coincidentally happened to give the same result as
the executor code itself as long as there are no result relations.
But this is certainly not true any more. I'm not sure it's worth
expending a lot of time on this -- it's evidently not a problem in
practice, or somebody probably would've complained before now.
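
As a toy illustration of the difference between the two orderings (a sketch
only, not executor or plancache.c code; ToyRangeEntry and
toy_result_first_order() are names invented for this example), the function
below produces a result-relations-first order from a range table, so that a
relation appearing both as a source and as a result is visited through its
result entry first. Locking purely in RTE order would instead visit entries
0, 1, 2, ... regardless of which ones are result relations.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical range table entry: a relation id and whether it is a result rel. */
typedef struct ToyRangeEntry
{
    int         relid;
    bool        is_result;
} ToyRangeEntry;

/*
 * Fill order[] with range table positions in the order a
 * result-relations-first scheme would lock them: all result entries first,
 * then the remaining entries in RTE order. order[] must have room for n
 * entries. Returns the number of positions written.
 */
static size_t
toy_result_first_order(const ToyRangeEntry *rtable, size_t n, size_t *order)
{
    size_t      k = 0;

    for (size_t i = 0; i < n; i++)
        if (rtable[i].is_result)
            order[k++] = i;

    for (size_t i = 0; i < n; i++)
        if (!rtable[i].is_result)
            order[k++] = i;

    return k;
}
```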

But that having been said, I don't think we should assume that all the
locks taken from the executor are worthless because plancache.c will
always do the job for us. I don't know of a case where we execute a
saved plan without going through the plan cache, but that doesn't mean
that there isn't one or that there couldn't be one in the future.
It's not the job of these partitioning patches to whack around the way
we do locking in general -- they should preserve the existing behavior
as much as possible. If we want to get rid of the locking in the
executor altogether, that's a separate discussion where, I have a
feeling, there will prove to be better reasons for the way things are
than we are right now supposing.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>
To:	Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
Cc:	Robert Haas <robertmhaas(at)gmail(dot)com>, Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Partition-wise join for join between (declaratively) partitioned tables
Date:	2017-09-13 05:59:53
Message-ID:	fad6c239-e8d9-9bdd-14f9-0a3d44bec36e@lab.ntt.co.jp
Lists:	pgsql-hackers

On 2017/09/12 19:56, Ashutosh Bapat wrote:
> I think the code here expects the original parent_rte and not the one
> we set around line 1169.
>
> This isn't a bug right now, since both the parent_rte s have same
> content. But I am not sure if that will remain to be so. Here's patch
> to fix the thinko.

Instead of the new bool is_parent_partitioned, why not move the code that
sets partitioned_rels to the block where you're now setting
is_parent_partitioned?

Also, since we know this isn't a bug at the moment but will turn into one
once we have step-wise expansion, why not include this fix in that patch
itself?

Thanks,
Amit

From:	Amit Khandekar <amitdkhan(dot)pg(at)gmail(dot)com>
To:	Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
Cc:	Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, Robert Haas <robertmhaas(at)gmail(dot)com>, Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Partition-wise join for join between (declaratively) partitioned tables
Date:	2017-09-13 07:02:22
Message-ID:	CAJ3gD9ctVgv6r0-7B6js7Z5uPHXx+KA5jK-3=uFsGwKOXfTddg@mail.gmail.com
Lists:	pgsql-hackers

Hi,

Rafia had done some testing on TPCH queries using the Partition-wise join
patch along with the Parallel Append patch.

There, we observed that for query 4, even though the partition-wise
joins are under a Parallel Append, the joins are all non-partial.

Specifically, the partition-wise join has non-partial nested loop
joins when it was actually expected to have partial nested loop joins.
(The difference can be seen from the fact that the outer relation
of that join is scanned by a non-parallel Bitmap Heap Scan when it
should have used a Parallel Bitmap Heap Scan.)

Here is the detailed analysis, including where I think the issue is:

https://www.postgresql.org/message-id/CAJ3gD9cZms1ND3p%3DNN%3DhDYDFt_SeKq1htMBhbj85bOmvJwY5fg%40mail.gmail.com

All the TPCH results are posted in that same mail thread.

Thanks
-Amit

From:	Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Partition-wise join for join between (declaratively) partitioned tables
Date:	2017-09-13 07:21:36
Message-ID:	CAFjFpRfA464OVY25R4eC3MZPwbmiV=1fsAqLX_rG0Wrj5_NEgQ@mail.gmail.com
Lists:	pgsql-hackers

On Wed, Sep 13, 2017 at 12:39 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Tue, Sep 12, 2017 at 3:46 AM, Amit Langote
> <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp> wrote:
>> In this case, AcquireExecutorLocks will lock all the relations in
>> PlannedStmt.rtable, which must include all partitioned tables of all
>> partition trees involved in the query. Of those, it will lock the tables
>> whose RT indexes appear in PlannedStmt.nonleafResultRelations with
>> RowExclusiveLock mode. PlannedStmt.nonleafResultRelations is a global
>> list of all partitioned table RT indexes obtained by concatenating
>> partitioned_rels lists of all ModifyTable nodes involved in the query
>> (set_plan_refs does that). We need to distinguish nonleafResultRelations,
>> because we need to take the stronger lock on a given table before any
>> weaker one if it happens to appear in the query as a non-result relation
>> too, to avoid lock strength upgrade deadlock hazard.
>
> Hmm. The problem with this theory in my view is that it doesn't
> explain why InitPlan() and ExecOpenScanRelation() lock the relations
> instead of just assuming that they are already locked either by
> AcquireExecutorLocks or by planning. If ExecLockNonLeafAppendTables()
> doesn't really need to take locks, then ExecOpenScanRelation() must
> not need to do it either. We invented ExecLockNonLeafAppendTables()
> on the occasion of removing the scans of those tables which would
> previously have caused ExecOpenScanRelation() to be invoked, so as to
> keep the locking behavior unchanged.
>
> AcquireExecutorLocks() looks like an odd bit of code to me. The
> executor itself locks result tables in InitPlan() and then everything
> else during InitPlan() and all of the others later on while walking
> the plan tree -- comments in InitPlan() say that this is to avoid a
> lock upgrade hazard if a result rel is also a source rel. But
> AcquireExecutorLocks() has no such provision; it just locks everything
> in RTE order. In theory, that's a deadlock hazard of another kind, as
> we just talked about in the context of EIBO. In fact, expanding in
> bound order has made the situation worse: before, expansion order and
> locking order were the same, so maybe having AcquireExecutorLocks()
> work in RTE order coincidentally happened to give the same result as
> the executor code itself as long as there are no result relations.
> But this is certainly not true any more. I'm not sure it's worth
> expending a lot of time on this -- it's evidently not a problem in
> practice, or somebody probably would've complained before now.
>
> But that having been said, I don't think we should assume that all the
> locks taken from the executor are worthless because plancache.c will
> always do the job for us. I don't know of a case where we execute a
> saved plan without going through the plan cache, but that doesn't mean
> that there isn't one or that there couldn't be one in the future.
> It's not the job of these partitioning patches to whack around the way
> we do locking in general -- they should preserve the existing behavior
> as much as possible. If we want to get rid of the locking in the
> executor altogether, that's a separate discussion where, I have a
> feeling, there will prove to be better reasons for the way things are
> than we are right now supposing.
>

I agree that it's not the job of these patches to change the locking
or even get rid of partitioned_rels. In order to continue returning
partitioned_rels in Append paths, especially for queries involving
set operations on a partitioned table, e.g. "select 1 from t1 union all
select 2 from t1;" in which t1 is a multi-level partitioned table, we
need a fix in add_paths_to_append_rel(). The fix provided in [1] is
correct, but we will need a longer explanation of why we have to
involve RTE_SUBQUERY with RELKIND_PARTITIONED_TABLE. The explanation
is complicated. If we get rid of partitioned_rels, we don't need to
fix that code in add_paths_to_append_rel().

I suggested that [2]
-- (excerpt from [2])

Actually, the original problem that caused this discussion started
with an assertion failure in get_partitioned_child_rels() as
Assert(list_length(result) >= 1);

Amit Langote agrees with this. It kind of makes the assertion lame but
keeps the code sane. What do you think?

[1] https://www.postgresql.org/message-id/d2f1cdcb-ebb4-76c5-e471-79348ca5d7a7@lab.ntt.co.jp
[2] https://www.postgresql.org/message-id/CAFjFpRfJ3GRRmmOugaMA-q4i=se5P6yjZ_C6A6HDRDQQTGXy1A@mail.gmail.com
--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

From:	Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
To:	Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>
Cc:	Robert Haas <robertmhaas(at)gmail(dot)com>, Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Partition-wise join for join between (declaratively) partitioned tables
Date:	2017-09-13 07:30:17
Message-ID:	CAFjFpRdujTaE+qvQN1m_2EQactnY8sEcxLRzwTOWjHpDHSd=vg@mail.gmail.com
Lists:	pgsql-hackers

On Wed, Sep 13, 2017 at 11:29 AM, Amit Langote
<Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp> wrote:
> On 2017/09/12 19:56, Ashutosh Bapat wrote:
>> I think the code here expects the original parent_rte and not the one
>> we set around line 1169.
>>
>> This isn't a bug right now, since both the parent_rte s have same
>> content. But I am not sure if that will remain to be so. Here's patch
>> to fix the thinko.
>
> Instead of the new bool is_parent_partitioned, why not move the code to
> set partitioned_rels to the block where you're now setting
> is_parent_partitioned.
>
> Also, since we know this isn't a bug at the moment but will turn into one
> once we have step-wise expansion, why not include this fix in that patch
> itself?

It won't turn into a bug with step-wise expansion since every
parent_rte will have RELKIND_PARTITIONED_TABLE for a partitioned top
parent, which is used to extract the partitioned_rels. But I guess,
it's better to fix the thinko in step-wise expansion since parent_rte
itself changes.
--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

From:	Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
To:	Amit Khandekar <amitdkhan(dot)pg(at)gmail(dot)com>
Cc:	Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, Robert Haas <robertmhaas(at)gmail(dot)com>, Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Partition-wise join for join between (declaratively) partitioned tables
Date:	2017-09-13 07:35:51
Message-ID:	CAFjFpRdfa9aCqV5Bv-1Biuurzexw00yQJ7v_Wn4VDLdMnrNv8w@mail.gmail.com
Lists:	pgsql-hackers

On Wed, Sep 13, 2017 at 12:32 PM, Amit Khandekar <amitdkhan(dot)pg(at)gmail(dot)com> wrote:
> Hi,
>
> Rafia had done some testing on TPCH queries using Partition-wise join
> patch along with Parallel Append patch.
>
> There, we had observed that for query 4, even though the partition
> wise joins are under a Parallel Append, the join are all non-partial.
>
> Specifically, the partition-wise join has non-partial nested loop
> joins when actually it was expected to have partial nested loop joins.
> (The difference can be seen by the observation that the outer relation
> of that join is scanned by non-parallel Bitmap Heap scan when it
> should have used Parallel Bitmap Heap Scan).
>
> Here is the detailed analysis , including where I think is the issue :
>
> https://www.postgresql.org/message-id/CAJ3gD9cZms1ND3p%3DNN%3DhDYDFt_SeKq1htMBhbj85bOmvJwY5fg%40mail.gmail.com
>
> All the TPCH results are posted in the same above mail thread.

Can you please check whether the attached patch fixes the issue?

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Attachment	Content-Type	Size
reparameterize_partial_nestloop_inner.patch	text/x-patch	3.0 KB

From:	Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>
To:	Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Partition-wise join for join between (declaratively) partitioned tables
Date:	2017-09-13 07:38:09
Message-ID:	602a2d63-fd1d-53c3-687e-09eb9b282eed@lab.ntt.co.jp
Lists:	pgsql-hackers

On 2017/09/13 16:21, Ashutosh Bapat wrote:
> On Wed, Sep 13, 2017 at 12:39 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>> locks taken from the executor are worthless because plancache.c will
>> always do the job for us. I don't know of a case where we execute a
>> saved plan without going through the plan cache, but that doesn't mean
>> that there isn't one or that there couldn't be one in the future.
>> It's not the job of these partitioning patches to whack around the way
>> we do locking in general -- they should preserve the existing behavior
>> as much as possible. If we want to get rid of the locking in the
>> executor altogether, that's a separate discussion where, I have a
>> feeling, there will prove to be better reasons for the way things are
>> than we are right now supposing.
>>
>
> I agree that it's not the job of these patches to change the locking
> or even get rid of partitioned_rels. In order to continue returning
> partitioned_rels in Append paths esp. in the case of queries involving
> set operations and partitioned table e.g "select 1 from t1 union all
> select 2 from t1;" in which t1 is multi-level partitioned table, we
> need a fix in add_paths_to_append_rels(). The fix provided in [1] is
> correct but we will need a longer explanation of why we have to
> involve RTE_SUBQUERY with RELKIND_PARTITIONED_TABLE. The explanation
> is complicated. If we get rid of partitioned_rels, we don't need to
> fix that code in add_paths_to_append_rel().

Yeah, let's get on with setting partitioned_rels in AppendPath correctly
in this patch. Ashutosh's suggested approach seems fine, although it
needlessly requires scanning root->pcinfo_list. But that list shouldn't be
longer than the number of partitioned tables in the query, so maybe that's
fine too. At least, it doesn't require us to add code to
add_paths_to_append_rel() that can be pretty hard to wrap one's head around.

That said, we might someday need to look carefully at some of the things
that Robert mentioned, especially around the order of locks taken by
AcquireExecutorLocks() in light of the EIBO patch getting committed.

Thanks,
Amit

From:	Amit Khandekar <amitdkhan(dot)pg(at)gmail(dot)com>
To:	Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
Cc:	Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, Robert Haas <robertmhaas(at)gmail(dot)com>, Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Partition-wise join for join between (declaratively) partitioned tables
Date:	2017-09-13 09:52:02
Message-ID:	CAJ3gD9cHpzyeAVxDJxYTs3ghQ1jUQJPznW6jjvQ0p1Jkmx6eMw@mail.gmail.com
Lists:	pgsql-hackers

On 13 September 2017 at 13:05, Ashutosh Bapat
<ashutosh(dot)bapat(at)enterprisedb(dot)com> wrote:
> On Wed, Sep 13, 2017 at 12:32 PM, Amit Khandekar <amitdkhan(dot)pg(at)gmail(dot)com> wrote:
>> Hi,
>>
>> Rafia had done some testing on TPCH queries using Partition-wise join
>> patch along with Parallel Append patch.
>>
>> There, we had observed that for query 4, even though the partition
>> wise joins are under a Parallel Append, the join are all non-partial.
>>
>> Specifically, the partition-wise join has non-partial nested loop
>> joins when actually it was expected to have partial nested loop joins.
>> (The difference can be seen by the observation that the outer relation
>> of that join is scanned by non-parallel Bitmap Heap scan when it
>> should have used Parallel Bitmap Heap Scan).
>>
>> Here is the detailed analysis , including where I think is the issue :
>>
>> https://www.postgresql.org/message-id/CAJ3gD9cZms1ND3p%3DNN%3DhDYDFt_SeKq1htMBhbj85bOmvJwY5fg%40mail.gmail.com
>>
>> All the TPCH results are posted in the same above mail thread.
>
> Can you please check if the attached patch fixes the issue.

Thanks Ashutosh. Yes, it does fix the issue. Partial Nested Loop joins
are generated now. If I see any unexpected differences in the
estimated or actual costs, I will report that in the Parallel Append
thread. As far as Partition-wise join is concerned, this issue is
solved, because Partial nested loop join does get created.

>
> --
> Best Wishes,
> Ashutosh Bapat
> EnterpriseDB Corporation
> The Postgres Database Company

--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company

From:	Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Partition-wise join for join between (declaratively) partitioned tables
Date:	2017-09-13 16:56:17
Message-ID:	CAFjFpRdHb_ZnoDTuBXqrudWXh3H1ibLkr6nHsCFT96fSK4DXtA@mail.gmail.com
Lists:	pgsql-hackers

On Wed, Sep 13, 2017 at 12:51 PM, Ashutosh Bapat
<ashutosh(dot)bapat(at)enterprisedb(dot)com> wrote:
> On Wed, Sep 13, 2017 at 12:39 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>> On Tue, Sep 12, 2017 at 3:46 AM, Amit Langote
>> <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp> wrote:
>>> In this case, AcquireExecutorLocks will lock all the relations in
>>> PlannedStmt.rtable, which must include all partitioned tables of all
>>> partition trees involved in the query. Of those, it will lock the tables
>>> whose RT indexes appear in PlannedStmt.nonleafResultRelations with
>>> RowExclusiveLock mode. PlannedStmt.nonleafResultRelations is a global
>>> list of all partitioned table RT indexes obtained by concatenating
>>> partitioned_rels lists of all ModifyTable nodes involved in the query
>>> (set_plan_refs does that). We need to distinguish nonleafResultRelations,
>>> because we need to take the stronger lock on a given table before any
>>> weaker one if it happens to appear in the query as a non-result relation
>>> too, to avoid lock strength upgrade deadlock hazard.
>>
>> Hmm. The problem with this theory in my view is that it doesn't
>> explain why InitPlan() and ExecOpenScanRelation() lock the relations
>> instead of just assuming that they are already locked either by
>> AcquireExecutorLocks or by planning. If ExecLockNonLeafAppendTables()
>> doesn't really need to take locks, then ExecOpenScanRelation() must
>> not need to do it either. We invented ExecLockNonLeafAppendTables()
>> on the occasion of removing the scans of those tables which would
>> previously have caused ExecOpenScanRelation() to be invoked, so as to
>> keep the locking behavior unchanged.
>>
>> AcquireExecutorLocks() looks like an odd bit of code to me. The
>> executor itself locks result tables in InitPlan() and then everything
>> else during InitPlan() and all of the others later on while walking
>> the plan tree -- comments in InitPlan() say that this is to avoid a
>> lock upgrade hazard if a result rel is also a source rel. But
>> AcquireExecutorLocks() has no such provision; it just locks everything
>> in RTE order. In theory, that's a deadlock hazard of another kind, as
>> we just talked about in the context of EIBO. In fact, expanding in
>> bound order has made the situation worse: before, expansion order and
>> locking order were the same, so maybe having AcquireExecutorLocks()
>> work in RTE order coincidentally happened to give the same result as
>> the executor code itself as long as there are no result relations.
>> But this is certainly not true any more. I'm not sure it's worth
>> expending a lot of time on this -- it's evidently not a problem in
>> practice, or somebody probably would've complained before now.
>>
>> But that having been said, I don't think we should assume that all the
>> locks taken from the executor are worthless because plancache.c will
>> always do the job for us. I don't know of a case where we execute a
>> saved plan without going through the plan cache, but that doesn't mean
>> that there isn't one or that there couldn't be one in the future.
>> It's not the job of these partitioning patches to whack around the way
>> we do locking in general -- they should preserve the existing behavior
>> as much as possible. If we want to get rid of the locking in the
>> executor altogether, that's a separate discussion where, I have a
>> feeling, there will prove to be better reasons for the way things are
>> than we are right now supposing.
>>
>
> I agree that it's not the job of these patches to change the locking
> or even get rid of partitioned_rels. In order to continue returning
> partitioned_rels in Append paths, especially for queries involving
> set operations and partitioned tables, e.g. "select 1 from t1 union all
> select 2 from t1;" in which t1 is a multi-level partitioned table, we
> need a fix in add_paths_to_append_rel(). The fix provided in [1] is
> correct but we will need a longer explanation of why we have to
> involve RTE_SUBQUERY with RELKIND_PARTITIONED_TABLE. The explanation
> is complicated. If we get rid of partitioned_rels, we don't need to
> fix that code in add_paths_to_append_rel().
>
> I suggested that [2]
> -- (excerpt from [2])
>
> Actually, the original problem that caused this discussion started
> with an assertion failure in get_partitioned_child_rels() as
> Assert(list_length(result) >= 1);
>
> This assertion fails if result is NIL when an intermediate partitioned
> table is passed. Maybe we should assert (result == NIL ||
> list_length(result) == 1) and allow that function to be called even
> for intermediate partitioned tables, for which the function will
> return NIL. That will leave the code in add_paths_to_append_rel()
> simple. Thoughts?
> --
>
> Amit Langote agrees with this. It kind of makes the assertion lame but
> keeps the code sane. What do you think?

I debugged what happens for the query "select 1 from t1 union all
select 2 from t1;" with the current HEAD (without the multi-level
expansion patch attached). It doesn't set partitioned_rels in the Append
path that gets converted into the Append plan. Remember t1 is a
multi-level partitioned table here with t1p1 as its immediate
partition and t1p1p1 as a partition of t1p1. So
set_append_rel_pathlist() recurses once, as shown in the following
stack trace.

#0 add_paths_to_append_rel (root=0x23e4308, rel=0x23fb768, live_childrels=0x23ff5f0) at allpaths.c:1281
#1 0x000000000076e170 in set_append_rel_pathlist (root=0x23e4308, rel=0x23fb768, rti=4, rte=0x23f3268) at allpaths.c:1262
#2 0x000000000076cf23 in set_rel_pathlist (root=0x23e4308, rel=0x23fb768, rti=4, rte=0x23f3268) at allpaths.c:431
#3 0x000000000076e0f6 in set_append_rel_pathlist (root=0x23e4308, rel=0x23fb478, rti=1, rte=0x2382070) at allpaths.c:1247
#4 0x000000000076cf23 in set_rel_pathlist (root=0x23e4308, rel=0x23fb478, rti=1, rte=0x2382070) at allpaths.c:431
#5 0x000000000076cc22 in set_base_rel_pathlists (root=0x23e4308) at allpaths.c:309

When add_paths_to_append_rel() (frame 0) is called for t1, it gets
partitioned_rels and stuffs it into the append path(s) it creates. But those
paths are flattened into the append paths created for the set
operations when add_paths_to_append_rel() is called from frame 3.
While flattening the append paths in accumulate_append_subpath() we do
not pull up any partitioned_rels stuffed in those paths, and thus
the final append path(s) created do not carry partitioned_rels.
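
To make the flattening step concrete, here is a minimal, self-contained C
sketch. It is a toy model only: ToyAppend, toy_accumulate_subpaths() and
toy_accumulate_with_partrels() are hypothetical names invented for
illustration, and fixed-size arrays stand in for PostgreSQL Lists. It is not
the code of AppendPath or accumulate_append_subpath(). The first function
copies only the subpaths, which mirrors the loss described above; the second
is shown purely for contrast and also carries partitioned_rels along.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_ITEMS 16

/* Toy stand-in for an Append path (hypothetical, not PostgreSQL's AppendPath). */
typedef struct ToyAppend
{
    int subpaths[MAX_ITEMS];          /* ids of leaf scans */
    int nsubpaths;
    int partitioned_rels[MAX_ITEMS];  /* RT indexes of partitioned parents */
    int npartitioned_rels;
} ToyAppend;

/*
 * Flatten child into parent by copying only its subpaths.
 * The child's partitioned_rels are silently left behind.
 */
void
toy_accumulate_subpaths(ToyAppend *parent, const ToyAppend *child)
{
    for (int i = 0; i < child->nsubpaths && parent->nsubpaths < MAX_ITEMS; i++)
        parent->subpaths[parent->nsubpaths++] = child->subpaths[i];
}

/*
 * Contrast variant (an assumption of this sketch): flatten the subpaths
 * and also append the child's partitioned_rels to the parent's list.
 */
void
toy_accumulate_with_partrels(ToyAppend *parent, const ToyAppend *child)
{
    toy_accumulate_subpaths(parent, child);
    for (int i = 0;
         i < child->npartitioned_rels && parent->npartitioned_rels < MAX_ITEMS;
         i++)
        parent->partitioned_rels[parent->npartitioned_rels++] =
            child->partitioned_rels[i];
}
```

Flattening a child whose partitioned_rels holds two entries with the first
function leaves the parent's list empty, while the second function leaves the
parent with both entries.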

The same behaviour is retained by my v30 patchset [1]. I think we
should go ahead by fixing add_paths_to_append_rel() as done in that
patchset. partitioned_rels needs to be removed from append paths
anyway, so that code will be removed when we do that.

[1] https://www.postgresql.org/message-id/CAFjFpRfHkJW3G=_PnSUc6PbXJE48AWYwyRzaGqtfKzzoU4wXXw@mail.gmail.com
--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
Cc:	Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Partition-wise join for join between (declaratively) partitioned tables
Date:	2017-09-13 22:43:33
Message-ID:	CA+TgmoaKd7-HM7e6d3T7gKeP4E2shoZo__-wimp=9fiXBdD3NQ@mail.gmail.com
Lists:	pgsql-hackers

On Wed, Sep 13, 2017 at 12:56 PM, Ashutosh Bapat
<ashutosh(dot)bapat(at)enterprisedb(dot)com> wrote:
> I debugged what happens in case of query "select 1 from t1 union all
> select 2 from t1;" with the current HEAD (without multi-level
> expansion patch attached). It doesn't set partitioned_rels in Append
> path that gets converted into Append plan. Remember t1 is a
> multi-level partitioned table here with t1p1 as its immediate
> partition and t1p1p1 as partition of t1p1. So, the
> set_append_rel_pathlist() recurses once as shown in the following
> stack trace.

Nice debugging. I spent some time today looking at this and I think
it's a bug in v10, and specifically in add_paths_to_append_rel(),
which only sets partitioned_rels correctly when the appendrel is a
partitioned rel, and not when it's a subquery RTE with one or more
partitioned queries beneath it.

Attached are two patches either one of which will fix it. First, I
wrote mechanical-partrels-fix.patch, which just mechanically
propagates partitioned_rels lists from accumulated subpaths into the
list used to construct the parent (Merge)AppendPath. I wasn't entirely
happy with that, because it ends up building multiple partitioned_rels
lists for the same RelOptInfo. That seems silly, but there's no
principled way to avoid it; avoiding it amounts to hoping that all the
paths for the same relation carry the same partitioned_rels list,
which is uncomfortable.

So then I wrote pcinfo-for-subquery.patch. That patch notices when an
RTE_SUBQUERY appendrel is processed and accumulates the
partitioned_rels of its immediate children; in case there can be
multiple nested levels of subqueries before we get down to the actual
partitioned rel, it also adds a PartitionedChildRelInfo for the
subquery RTE, so that there's no need to walk the whole tree to build
the partitioned_rels list at higher levels, just the immediate
children. I find this fix a lot more satisfying. It adds less code
and does no extra work in the common case.
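
The idea of collecting partitioned_rels from the immediate children of a
subquery parent can be sketched roughly as follows. This is a toy model under
stated assumptions, not the contents of pcinfo-for-subquery.patch: ToyPcinfo,
toy_find_pcinfo() and toy_collect_for_subquery() are hypothetical names, and
plain arrays stand in for PostgreSQL Lists.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_RELS 16

/* Toy stand-in for a per-parent record of partitioned child rels (hypothetical). */
typedef struct ToyPcinfo
{
    int parent_rti;            /* RT index of the parent */
    int child_rels[MAX_RELS];  /* RT indexes of partitioned tables under it */
    int nchild_rels;
} ToyPcinfo;

/* Return the record for parent_rti, or NULL when none was recorded. */
const ToyPcinfo *
toy_find_pcinfo(const ToyPcinfo *infos, int ninfos, int parent_rti)
{
    for (int i = 0; i < ninfos; i++)
        if (infos[i].parent_rti == parent_rti)
            return &infos[i];
    return NULL;
}

/*
 * For a subquery parent, concatenate the lists recorded for its immediate
 * children into out. Children without a record (plain relations in this
 * model) contribute nothing. Returns the number of entries written.
 */
int
toy_collect_for_subquery(const ToyPcinfo *infos, int ninfos,
                         const int *child_rtis, int nchildren,
                         int *out, int out_cap)
{
    int n = 0;

    for (int c = 0; c < nchildren; c++)
    {
        const ToyPcinfo *pc = toy_find_pcinfo(infos, ninfos, child_rtis[c]);

        if (pc == NULL)
            continue;
        for (int i = 0; i < pc->nchild_rels && n < out_cap; i++)
            out[n++] = pc->child_rels[i];
    }
    return n;
}
```

Only the immediate children are consulted, so no walk over the whole tree is
needed in this model; a child with no record simply adds nothing.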

Notice that the choice of fix we adopt has consequences for your
0001-Multi-level-partitioned-table-expansion.patch -- with
mechanical-partrels-fix.patch, that patch could either associate all
partitioned_rels with the top-parent or it could work level by level
and everything would get properly assembled later. But with
pcinfo-for-subquery.patch, we need everything associated with the
top-parent. That doesn't seem like a problem to me, but it's
something to note.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment	Content-Type	Size
mechanical-partrels-fix.patch	application/octet-stream	7.7 KB
pcinfo-for-subquery.patch	application/octet-stream	3.6 KB

From:	Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>, Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
Cc:	Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Partition-wise join for join between (declaratively) partitioned tables
Date:	2017-09-14 02:57:04
Message-ID:	ad18d62d-8e7b-5411-b164-a2b2580ae1a4@lab.ntt.co.jp
Lists:	pgsql-hackers

On 2017/09/14 7:43, Robert Haas wrote:
> On Wed, Sep 13, 2017 at 12:56 PM, Ashutosh Bapat
> <ashutosh(dot)bapat(at)enterprisedb(dot)com> wrote:
>> I debugged what happens in case of query "select 1 from t1 union all
>> select 2 from t1;" with the current HEAD (without multi-level
>> expansion patch attached). It doesn't set partitioned_rels in Append
>> path that gets converted into Append plan. Remember t1 is a
>> multi-level partitioned table here with t1p1 as its immediate
>> partition and t1p1p1 as partition of t1p1. So, the
>> set_append_rel_pathlist() recurses once as shown in the following
>> stack trace.
>
> Nice debugging.

+1.

> I spent some time today looking at this and I think
> it's a bug in v10, and specifically in add_paths_to_append_rel(),
> which only sets partitioned_rels correctly when the appendrel is a
> partitioned rel, and not when it's a subquery RTE with one or more
> partitioned queries beneath it.
>
> Attached are two patches either one of which will fix it. First, I
> wrote mechanical-partrels-fix.patch, which just mechanically
> propagates partitioned_rels lists from accumulated subpaths into the
> list used to construct the parent (Merge)AppendPath. I wasn't entirely
> happy with that, because it ends up building multiple partitioned_rels
> lists for the same RelOptInfo. That seems silly, but there's no
> principled way to avoid it; avoiding it amounts to hoping that all the
> paths for the same relation carry the same partitioned_rels list,
> which is uncomfortable.
>
> So then I wrote pcinfo-for-subquery.patch. That patch notices when an
> RTE_SUBQUERY appendrel is processed and accumulates the
> partitioned_rels of its immediate children; in case there can be
> multiple nested levels of subqueries before we get down to the actual
> partitioned rel, it also adds a PartitionedChildRelInfo for the
> subquery RTE, so that there's no need to walk the whole tree to build
> the partitioned_rels list at higher levels, just the immediate
> children. I find this fix a lot more satisfying. It adds less code
> and does no extra work in the common case.

I very much like pcinfo-for-subquery.patch, although I'm not sure if we
need to create PartitionedChildRelInfo for the sub-query parent RTE as the
patch teaches add_paths_to_append_rel() to do. ISTM, nested UNION ALL
subqueries are flattened way before we get to add_paths_to_append_rel();
if it could not be flattened, there wouldn't be a call to
add_paths_to_append_rel() in the first place, because no AppendRelInfos
would be generated. See what happens when is_simple_union_all_recurse()
returns false to flatten_simple_union_all() -- no AppendRelInfos will be
generated and added to root->append_rel_list in that case.

IOW, there won't be nested AppendRelInfos for nested UNION ALL sub-queries
like we're setting out to build for multi-level partitioned tables.

So, as things stand today, there can be at most one recursive call of
add_paths_to_append_rel() for a sub-query parent RTE, that is, if its child
sub-queries contain partitioned tables, but not more. The other patch
(multi-level expansion of partitioned tables) will change that, but even
then we won't need the sub-query's own PartitionedChildRelInfo.

> Notice that the choice of fix we adopt has consequences for your
> 0001-Multi-level-partitioned-table-expansion.patch -- with
> mechanical-partrels-fix.patch, that patch could either associate all
> partitioned_rels with the top-parent or it could work level by level
> and everything would get properly assembled later. But with
> pcinfo-for-subquery.patch, we need everything associated with the
> top-parent. That doesn't seem like a problem to me, but it's
> something to note.

I think it's fine.

With 0001-Multi-level-partitioned-table-expansion.patch,
get_partitioned_child_rels() will get called even for non-root partitioned
tables, for which it won't find a valid pcinfo. I think that patch must
also change its callers to stop Asserting that a valid pcinfo is returned.

Spotted a typo in pcinfo-for-subquery.patch:

+ * A plain relation will alread have

Thanks,
Amit

From:	Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Partition-wise join for join between (declaratively) partitioned tables
Date:	2017-09-14 12:06:58
Message-ID:	CAFjFpRd=1venqLL7oGU=C1dEkuvk2DJgvF+7uKbnPHaum1mvHQ@mail.gmail.com
Lists:	pgsql-hackers

On Thu, Sep 14, 2017 at 4:13 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Wed, Sep 13, 2017 at 12:56 PM, Ashutosh Bapat
> <ashutosh(dot)bapat(at)enterprisedb(dot)com> wrote:
>> I debugged what happens in case of query "select 1 from t1 union all
>> select 2 from t1;" with the current HEAD (without multi-level
>> expansion patch attached). It doesn't set partitioned_rels in Append
>> path that gets converted into Append plan. Remember t1 is a
>> multi-level partitioned table here with t1p1 as its immediate
>> partition and t1p1p1 as partition of t1p1. So, the
>> set_append_rel_pathlist() recurses once as shown in the following
>> stack trace.
>
> Nice debugging. I spent some time today looking at this and I think
> it's a bug in v10, and specifically in add_paths_to_append_rel(),
> which only sets partitioned_rels correctly when the appendrel is a
> partitioned rel, and not when it's a subquery RTE with one or more
> partitioned queries beneath it.
>
> Attached are two patches either one of which will fix it. First, I
> wrote mechanical-partrels-fix.patch, which just mechanically
> propagates partitioned_rels lists from accumulated subpaths into the
> list used to construct the parent (Merge)AppendPath. I wasn't entirely
> happy with that, because it ends up building multiple partitioned_rels
> lists for the same RelOptInfo. That seems silly, but there's no
> principled way to avoid it; avoiding it amounts to hoping that all the
> paths for the same relation carry the same partitioned_rels list,
> which is uncomfortable.
>
> So then I wrote pcinfo-for-subquery.patch. That patch notices when an
> RTE_SUBQUERY appendrel is processed and accumulates the
> partitioned_rels of its immediate children; in case there can be
> multiple nested levels of subqueries before we get down to the actual
> partitioned rel, it also adds a PartitionedChildRelInfo for the
> subquery RTE, so that there's no need to walk the whole tree to build
> the partitioned_rels list at higher levels, just the immediate
> children. I find this fix a lot more satisfying. It adds less code
> and does no extra work in the common case.

Thanks a lot for the patch. I have included pcinfo-for-subquery.patch
in my patchset as the first patch with typo corrections suggested by
Amit Langote.

>
> Notice that the choice of fix we adopt has consequences for your
> 0001-Multi-level-partitioned-table-expansion.patch -- with
> mechanical-partrels-fix.patch, that patch could either associate all
> partitioned_rels with the top-parent or it could work level by level
> and everything would get properly assembled later. But with
> pcinfo-for-subquery.patch, we need everything associated with the
> top-parent. That doesn't seem like a problem to me, but it's
> something to note.
>

I have a few changes to the multi-level expansion patch as per discussion in
earlier mails:
1. expand_single_inheritance_child() gets the top parent's PlanRowMark,
from which it builds the child's PlanRowMark and also updates
allMarkTypes of the top parent's PlanRowMark. The child's PlanRowMark
contains the RTI of the top parent, which is pulled from the top
parent's PlanRowMark. This is to keep the old behaviour intact.

2. Updated expand_single_inheritance_child's prologue to explain
various output arguments, per suggestion from Amit Langote. Also
included comments about the way we construct child PlanRowMark. Please
see if the comments look good.

3. As suggested by Amit Langote, with multi-level partitioned table
expansion, intermediate partitioned tables won't have pcinfo
associated with them. So, that patch removes the assertion
Assert(list_length(partitioned_rels) >= 1) in
add_paths_to_append_rel(). I didn't remove that assertion from your
patch so that you could cherry-pick that commit to v10 where that
assertion holds true.

4. Fixed inheritance_planner() to use the top parent's RTE to pull
partitioned_rels, per discussion with Amit a few mails back [1].
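
As a rough illustration of item 1, the following self-contained C sketch
shows a child row mark taking the top parent's RTI and folding its mark type
into the top parent's allMarkTypes. ToyRowMark and toy_make_child_rowmark()
are hypothetical; the field names only loosely follow PlanRowMark, and the
bitmask encoding (1 << markType) is an assumption of the sketch rather than
something stated in this mail.

```c
#include <assert.h>

/* Toy stand-in for a row mark (hypothetical, loosely modelled on PlanRowMark). */
typedef struct ToyRowMark
{
    int rti;           /* RT index of the relation this mark applies to */
    int prti;          /* RT index of the top parent */
    int markType;      /* small integer code for the mark type */
    int allMarkTypes;  /* bitmask of mark types used by this rel and its children */
} ToyRowMark;

/*
 * Build a child's row mark from the top parent's row mark.
 * The child records the top parent's RTI, and the top parent's
 * allMarkTypes is updated to include the child's mark type.
 */
ToyRowMark
toy_make_child_rowmark(ToyRowMark *top, int child_rti, int child_mark_type)
{
    ToyRowMark child;

    child.rti = child_rti;
    child.prti = top->rti;                      /* pulled from the top parent */
    child.markType = child_mark_type;
    child.allMarkTypes = 1 << child_mark_type;
    top->allMarkTypes |= child.allMarkTypes;    /* fold into the top parent */
    return child;
}
```

In this model a partition at any depth points at the top parent, never at an
intermediate partitioned table.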

Please let me know if I have missed anything; it's been a long discussion.

Apart from this, I have included a fix to reparameterize parallel nested
loop paths as per the discussion in [2].

Please note that I have removed the advanced partitioning patches from
the attached patchset since those need a rebase because of default
partition support.

[1] https://www.postgresql.org/message-id/CAFjFpRe62H0rTb4Rb7wOVSR25xfNW+mt1Ncp-OtzGaEtZBTLwA@mail.gmail.com
[2] https://www.postgresql.org/message-id/CAJ3gD9ctVgv6r0-7B6js7Z5uPHXx+KA5jK-3=uFsGwKOXfTddg@mail.gmail.com

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Attachment	Content-Type	Size
pg_dp_join_patches_v31.tar.gz	application/x-gzip	135.4 KB

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>
Cc:	Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>, Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Partition-wise join for join between (declaratively) partitioned tables
Date:	2017-09-14 15:06:40
Message-ID:	CA+TgmobBUVgLOi_y=NkXffbj4QBONeLvrKL8nHJvUQsq8HQ5jQ@mail.gmail.com
Lists:	pgsql-hackers

On Wed, Sep 13, 2017 at 10:57 PM, Amit Langote
<Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp> wrote:
> I very much like pcinfo-for-subquery.patch, although I'm not sure if we
> need to create PartitionedChildRelInfo for the sub-query parent RTE as the
> patch teaches add_paths_to_append_rel() to do. ISTM, nested UNION ALL
> subqueries are flattened way before we get to add_paths_to_append_rel();
> if it could not be flattened, there wouldn't be a call to
> add_paths_to_append_rel() in the first place, because no AppendRelInfos
> would be generated. See what happens when is_simple_union_all_recurse()
> returns false to flatten_simple_union_all() -- no AppendRelInfos will be
> generated and added to root->append_rel_list in that case.
>
> IOW, there won't be nested AppendRelInfos for nested UNION ALL sub-queries
> like we're setting out to build for multi-level partitioned tables.
>
> So, as things stand today, there can be at most one recursive call of
> add_paths_to_append_rel() for a sub-query parent RTE, that is, if its child
> sub-queries contain partitioned tables, but not more. The other patch
> (multi-level expansion of partitioned tables) will change that, but even
> then we won't need the sub-query's own PartitionedChildRelInfo.

OK, let's assume you're correct unless some contrary evidence emerges.
Committed without that part; thanks for the review.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
Cc:	Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Partition-wise join for join between (declaratively) partitioned tables
Date:	2017-09-14 19:43:50
Message-ID:	CA+TgmobJ4HgSnU_VO57fgkhYgVzCKC_6EeVtkBNcT62Arv2D2g@mail.gmail.com
Lists:	pgsql-hackers

On Thu, Sep 14, 2017 at 8:06 AM, Ashutosh Bapat
<ashutosh(dot)bapat(at)enterprisedb(dot)com> wrote:
> I have few changes to multi-level expansion patch as per discussion in
> earlier mails

OK, I have committed
0002-Multi-level-partitioned-table-expansion.patch with a few cosmetic
changes.

Phew, getting that sorted out has been an astonishing amount of work.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>, Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
Cc:	Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Partition-wise join for join between (declaratively) partitioned tables
Date:	2017-09-15 00:38:55
Message-ID:	7d926625-3f80-e63e-43d9-5b8acb9b401c@lab.ntt.co.jp
Lists:	pgsql-hackers

On 2017/09/15 4:43, Robert Haas wrote:
> On Thu, Sep 14, 2017 at 8:06 AM, Ashutosh Bapat
> <ashutosh(dot)bapat(at)enterprisedb(dot)com> wrote:
>> I have few changes to multi-level expansion patch as per discussion in
>> earlier mails
>
> OK, I have committed
> 0002-Multi-level-partitioned-table-expansion.patch with a few cosmetic
> changes.
>
> Phew, getting that sorted out has been an astonishing amount of work.

Yeah, thanks to both of you. Now on to other complicated stuff. :)

Regards,
Amit

From:	Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>
To:	Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>
Cc:	Robert Haas <robertmhaas(at)gmail(dot)com>, Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>, Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Partition-wise join for join between (declaratively) partitioned tables
Date:	2017-09-15 08:39:29
Message-ID:	CAOGQiiO4+ez-MKtcKiTUOqbN8_TFLotto9sNZEjfATVi+0FVSQ@mail.gmail.com
Lists:	pgsql-hackers

On Thu, Sep 14, 2017 at 8:27 AM, Amit Langote
<Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp> wrote:
> On 2017/09/14 7:43, Robert Haas wrote:
>> On Wed, Sep 13, 2017 at 12:56 PM, Ashutosh Bapat
>> <ashutosh(dot)bapat(at)enterprisedb(dot)com> wrote:
>>> I debugged what happens in case of query "select 1 from t1 union all
>>> select 2 from t1;" with the current HEAD (without multi-level
>>> expansion patch attached). It doesn't set partitioned_rels in Append
>>> path that gets converted into Append plan. Remember t1 is a
>>> multi-level partitioned table here with t1p1 as its immediate
>>> partition and t1p1p1 as partition of t1p1. So, the
>>> set_append_rel_pathlist() recurses once as shown in the following
>>> stack trace.
>>
>> Nice debugging.
>
> +1.
>
>> I spent some time today looking at this and I think
>> it's a bug in v10, and specifically in add_paths_to_append_rel(),
>> which only sets partitioned_rels correctly when the appendrel is a
>> partitioned rel, and not when it's a subquery RTE with one or more
>> partitioned queries beneath it.
>>
>> Attached are two patches either one of which will fix it. First, I
>> wrote mechanical-partrels-fix.patch, which just mechanically
>> propagates partitioned_rels lists from accumulated subpaths into the
>> list used to construct the parent (Merge)AppendPath. I wasn't entirely
>> happy with that, because it ends up building multiple partitioned_rels
>> lists for the same RelOptInfo. That seems silly, but there's no
>> principled way to avoid it; avoiding it amounts to hoping that all the
>> paths for the same relation carry the same partitioned_rels list,
>> which is uncomfortable.
>>
>> So then I wrote pcinfo-for-subquery.patch. That patch notices when an
>> RTE_SUBQUERY appendrel is processed and accumulates the
>> partitioned_rels of its immediate children; in case there can be
>> multiple nested levels of subqueries before we get down to the actual
>> partitioned rel, it also adds a PartitionedChildRelInfo for the
>> subquery RTE, so that there's no need to walk the whole tree to build
>> the partitioned_rels list at higher levels, just the immediate
>> children. I find this fix a lot more satisfying. It adds less code
>> and does no extra work in the common case.
>
> I very much like pcinfo-for-subquery.patch, although I'm not sure if we
> need to create PartitionedChildRelInfo for the sub-query parent RTE as the
> patch teaches add_paths_to_append_rel() to do. ISTM, nested UNION ALL
> subqueries are flattened way before we get to add_paths_to_append_rel();
> if it could not be flattened, there wouldn't be a call to
> add_paths_to_append_rel() in the first place, because no AppendRelInfos
> would be generated. See what happens when is_simple_union_all_recurse()
> returns false to flatten_simple_union_all() -- no AppendRelInfos will be
> generated and added to root->append_rel_list in that case.
>
> IOW, there won't be nested AppendRelInfos for nested UNION ALL sub-queries
> like we're setting out to build for multi-level partitioned tables.
>
> So, as things stand today, there can be at most one recursive call of
> add_paths_to_append_rel() for a sub-query parent RTE, that is, if its child
> sub-queries contain partitioned tables, but not more. The other patch
> (multi-level expansion of partitioned tables) will change that, but even
> then we won't need the sub-query's own PartitionedChildRelInfo.
>
>> Notice that the choice of fix we adopt has consequences for your
>> 0001-Multi-level-partitioned-table-expansion.patch -- with
>> mechanical-partrels-fix.patch, that patch could either associate all
>> partitioned_rels with the top-parent or it could work level by level
>> and everything would get properly assembled later. But with
>> pcinfo-for-subquery.patch, we need everything associated with the
>> top-parent. That doesn't seem like a problem to me, but it's
>> something to note.
>
> I think it's fine.
>
> With 0001-Multi-level-partitioned-table-expansion.patch,
> get_partitioned_child_rels() will get called even for non-root partitioned
> tables, for which it won't find a valid pcinfo. I think that patch must
> also change its callers to stop Asserting that a valid pcinfo is returned.
>
> Spotted a typo in pcinfo-for-subquery.patch:
>
> + * A plain relation will alread have
>
> Thanks,
> Amit
>
On TPC-H benchmarking of this patch, I found a regression in Q7. It
took about 1500s with the patch and about 900s without it.
Please find the attached pwj_reg.zip for the output of explain analyse
on head and with the patch.

The experimental settings used were,
commit-id = 0c504a80cf2e6f66df2cdea563e879bf4abd1629
patch-version = v26

Server settings:
work_mem = 1GB
shared_buffers = 10GB
effective_cache_size = 10GB
max_parallel_workers_per_gather = 4

Partitioning information:
Partitioning scheme = by range
Number of partitions in lineitem and orders table = 106
partition key for lineitem = l_orderkey
partition key for orders = o_orderkey

Apart from these, there is a regression case on a custom table: on head
the query completes in 20s and with this patch it takes 27s. Please find
the attached .out and .sql files for the output and schema of the test
case respectively. I have reported this case before (sometime around
March this year) as well, but I am not sure whether it was overlooked or is
an unimportant and expected behaviour for some reason.

--
Regards,
Rafia Sabih
EnterpriseDB: http://www.enterprisedb.com/

Attachment	Content-Type	Size
pwj_reg.out	application/octet-stream	31.7 KB
test_case_pwj.sql	application/octet-stream	17.6 KB
pwj_reg.zip	application/zip	81.9 KB

From:	Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Partition-wise join for join between (declaratively) partitioned tables
Date:	2017-09-15 10:11:23
Message-ID:	CAFjFpRcuqH84WEnKp2_sd2MYbOiuuQtGuKYZmetERVXLO1yOYQ@mail.gmail.com
Lists:	pgsql-hackers

On Fri, Sep 15, 2017 at 1:13 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Thu, Sep 14, 2017 at 8:06 AM, Ashutosh Bapat
> <ashutosh(dot)bapat(at)enterprisedb(dot)com> wrote:
>> I have a few changes to the multi-level expansion patch as per the
>> discussion in earlier mails
>
> OK, I have committed
> 0002-Multi-level-partitioned-table-expansion.patch with a few cosmetic
> changes.
>
> Phew, getting that sorted out has been an astonishing amount of work.

Thanks a lot Robert.

Here are rebased patches.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Attachment	Content-Type	Size
pg_dp_join_patches_v32.tar.gz	application/x-gzip	127.4 KB

From:	Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
To:	Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>
Cc:	Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, Robert Haas <robertmhaas(at)gmail(dot)com>, Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Partition-wise join for join between (declaratively) partitioned tables
Date:	2017-09-15 11:59:21
Message-ID:	CAFjFpRcT4DRCc1tON1pDEfixi0pp=5Sp2zcuypBFHwOcDcgvRQ@mail.gmail.com
Lists:	pgsql-hackers

On Fri, Sep 15, 2017 at 2:09 PM, Rafia Sabih
<rafia(dot)sabih(at)enterprisedb(dot)com> wrote:
> On TPC-H benchmarking of this patch, I found a regression in Q7. It
> was taking some 1500s with the patch and some 900s without the patch.
> Please find the attached pwj_reg.zip for the output of explain analyse
> on head and with patch.
>
> The experimental settings used were,
> commit-id = 0c504a80cf2e6f66df2cdea563e879bf4abd1629
> patch-version = v26
>
> Server settings:
> work_mem = 1GB
> shared_buffers = 10GB
> effective_cache_size = 10GB
> max_parallel_workers_per_gather = 4
>
> Partitioning information:
> Partitioning scheme = by range
> Number of partitions in lineitem and orders table = 106
> partition key for lineitem = l_orderkey
> partition key for orders = o_orderkey

I observe that with the partition-wise join patch the planner is using
GatherMerge along with partition-wise join, while on head it's not using
GatherMerge. Just to make sure that it's partition-wise join which is
causing the regression and not GatherMerge, can you please run the query
with enable_gathermerge = false?

I see the following lines in the explain analyze output 7_1.out without the patch
-> Sort (cost=84634030.40..84638520.55 rows=1796063
width=72) (actual time=1061001.435..1061106.608 rows=437209 loops=1)
Sort Key: n1.n_name, n2.n_name,
(date_part('year'::text, (lineitem_001.l_shipdate)::timestamp without
time zone))
Sort Method: quicksort Memory: 308912kB
-> Hash Join (cost=16080591.94..84447451.72
rows=1796063 width=72) (actual time=252745.701..1057447.219
rows=1749956 loops=1)
Since Sort doesn't filter any rows, we would expect it to output the
same number of rows as the hash join underneath it. But the number of rows
differs in this case. I am wondering whether there's some problem with
the explain analyze output itself.

>
> Apart from these, there is a regression case on a custom table: on head the
> query completes in 20s and with this patch it takes 27s. Please find
> the attached .out and .sql file for the output and schema for the test
> case respectively. I have reported this case before (sometime around
> March this year) as well, but I am not sure if it was overlooked or is
> an unimportant and expected behaviour for some reason.
>

Are you talking about [1]? I have explained the regression in
[2] and [3]. This looks like an issue with the existing costing model.

[1] https://www.postgresql.org/message-id/CAOGQiiMwcjNrunJ_fCDBscrTLeJ-CLp7exfzzipe2ut71n4LUA@mail.gmail.com
[2] https://www.postgresql.org/message-id/CAFjFpRedUZPa7tKbCLEGK3u5UWdDNQoN=eYfb7ieG5d0D1PbsQ@mail.gmail.com
[3] https://www.postgresql.org/message-id/CAFjFpReJKSdCfaeuZjGD79hOETzpz5BKDxLJgxr7qznrXX+TRw@mail.gmail.com
--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
Cc:	Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Partition-wise join for join between (declaratively) partitioned tables
Date:	2017-09-15 21:23:59
Message-ID:	CA+Tgmoa0c4QGS2Hf7izoqoUzb2CTE=kA2Tjn7k9++-ANKnpV=Q@mail.gmail.com
Lists:	pgsql-hackers

On Fri, Sep 15, 2017 at 6:11 AM, Ashutosh Bapat
<ashutosh(dot)bapat(at)enterprisedb(dot)com> wrote:
> Thanks a lot Robert.
>
> Here are rebased patches.

I didn't get quite as much time to look at these today as I would have
liked, but here's what I've got so far.

Comments on 0001:

- In the RelOptInfo, part_oids is defined in a completely different
part of the structure than nparts, but you can't use it without nparts
because you don't know how long it is. I suggest moving the
definition to just after nparts.

- On the other hand, maybe we should just remove it completely. I
don't see it in any of the subsequent patches. If it's used by the
advanced matching code, let's leave it out of 0001 for now and add it
back after the basic feature is committed.

- Similarly, partsupfunc isn't used by the later patches either. It
seems it could also be removed, at least for now.

- The comment for partexprs isn't very clear about how the lists
inside the array work. My understanding is that the lists have as
many members as the partition key has columns/expressions.

- I'm not entirely sure whether maintaining partexprs and
nullable_partexprs is the right design. If I understand correctly,
whether or not a partexpr is nullable is really a per-RTI property,
not a per-expression property. You could consider something like
"Relids nullable_rels".

Comments on 0002:

- The relationship between deciding to set partition scheme and
related details and the configured value of enable_partition_wise_join
needs some consideration. If the only use of the partition scheme is
partition-wise join, there's no point in setting it even for a baserel
unless enable_partition_wise_join is set -- but if there are other
important uses for that data, such as Amit's partition pruning work,
then we might want to always set it. And similarly for a join: if the
details are only needed in the partition-wise join case, then we only
need to set them in that case, but if there are other uses, then it's
different. If it turns out that setting these details for a baserel
is useful in other cases but that it's only for a joinrel in the
partition-wise join case, then the patch gets it exactly right. But
is that correct? I'm not sure.

- The naming of enable_partition_wise_join might also need some
thought. What happens when we also have partition-wise aggregate?
What about the proposal to strength-reduce MergeAppend to Append --
would that use this infrastructure? I wonder if we ought to call this
enable_partition_wise or enable_partition_wise_planning to make it a
bit more general. Then, too, I've never really liked having
partition_wise in the GUC name because it might make someone think
that it makes your partitions have a lot of wisdom. Removing the
underscore might help: partitionwise. Or maybe there is some whole
different name that would be better. If anyone wants to bikeshed,
now's the time.

- It seems to me that build_joinrel_partition_info() could be
simplified a bit. One thing is that list_copy() is perfectly capable
of handling a NIL input, so there's no need to test for that before
calling it.

Comments on 0003:

- Instead of reorganizing add_paths_to_append_rel as you did, could
you just add an RTE_JOIN case to the switch? Not sure if there's some
problem with that idea, but it seems like it might come out nicer.

On the overall patch set:

- I am curious to know how this has been tested. How much of the new
code is covered by the tests in 0007-Partition-wise-join-tests.patch?
How much does coverage improve with
0008-Extra-testcases-for-partition-wise-join-NOT-FOR-COMM.patch? What
code, if any, is not covered by either of those test suites? Could we
do meaningful testing of this with something like Andreas
Seltenreich's sqlsmith?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>, Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Partition-wise join for join between (declaratively) partitioned tables
Date:	2017-09-15 21:38:15
Message-ID:	CAEepm=0EvDYJFVYOSZuOBF52F3D2Df35Vij7ehm5nDpQ4ohSGQ@mail.gmail.com
Lists:	pgsql-hackers

On Sat, Sep 16, 2017 at 9:23 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On the overall patch set:
>
> - I am curious to know how this has been tested. How much of the new
> code is covered by the tests in 0007-Partition-wise-join-tests.patch?
> How much does coverage improve with
> 0008-Extra-testcases-for-partition-wise-join-NOT-FOR-COMM.patch? What
> code, if any, is not covered by either of those test suites? Could we
> do meaningful testing of this with something like Andreas
> Seltenreich's sqlsmith?

FWIW I'm working on an answer to both of those questions, but keep
getting distracted by other things catching on fire...

--
Thomas Munro
http://www.enterprisedb.com

From:	Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>, Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Partition-wise join for join between (declaratively) partitioned tables
Date:	2017-09-16 02:41:59
Message-ID:	CAEepm=1ZHk4+LBJXoy848bKKqnPcLWAkCbYSMAiZKfDbXYQunw@mail.gmail.com
Lists:	pgsql-hackers

On Sat, Sep 16, 2017 at 9:38 AM, Thomas Munro
<thomas(dot)munro(at)enterprisedb(dot)com> wrote:
> On Sat, Sep 16, 2017 at 9:23 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>> On the overall patch set:
>>
>> - I am curious to know how this has been tested. How much of the new
>> code is covered by the tests in 0007-Partition-wise-join-tests.patch?
>> How much does coverage improve with
>> 0008-Extra-testcases-for-partition-wise-join-NOT-FOR-COMM.patch? What
>> code, if any, is not covered by either of those test suites? Could we
>> do meaningful testing of this with something like Andreas
>> Seltenreich's sqlsmith?
>
> FWIW I'm working on an answer to both of those questions, but keep
> getting distracted by other things catching on fire...

I cobbled together some scripts to figure out the test coverage of
lines actually modified by this patch set. Please see attached.

I'm not sure if there is an established or better way to do this, but
I used git-blame to figure out which lines of gcov output can be
blamed on Ashutosh and prepended that to the lines of gcov's output.
That allowed me to find new/changed code not covered by "make check".
I found 94 untested new lines with 0007 applied and 88 untested new
lines with 0008 applied. The 6 lines that 0008 reaches and 0007
doesn't are:

======== src/backend/optimizer/path/allpaths.c ========
-[TOUCHED BY PATCH SET] #####: 3303: mark_dummy_rel(rel);
-[TOUCHED BY PATCH SET] #####: 3304: return;
-[TOUCHED BY PATCH SET] #####: 1515: continue;
-[TOUCHED BY PATCH SET] #####: 1526: continue;
======== src/backend/optimizer/util/pathnode.c ========
-[TOUCHED BY PATCH SET] #####: 3433: break;
-[TOUCHED BY PATCH SET] #####: 3435: return NULL;
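
The filtering step described above can be sketched roughly as follows.
This is only an illustration, not the attached scripts: it assumes the
usual gcov text layout of "count: lineno: source", where a count of
"#####" marks an executable line that never ran, and it assumes the set
of line numbers touched by the patch set has already been obtained from
git-blame by some other means. The function names are hypothetical.

```python
def untested_touched_lines(gcov_lines, touched_linenos):
    """Return (lineno, source) for touched lines that gcov reports as never executed."""
    hits = []
    for line in gcov_lines:
        # gcov lines look like "    #####: 3303:    mark_dummy_rel(rel);"
        parts = line.split(":", 2)
        if len(parts) < 3:
            continue
        count = parts[0].strip()
        lineno = parts[1].strip()
        source = parts[2].strip()
        if count == "#####" and lineno.isdigit() and int(lineno) in touched_linenos:
            hits.append((int(lineno), source))
    return hits


def newly_reached(untested_before, untested_after):
    """Lines that were untested in the first run but are reached in the second."""
    return sorted(set(untested_before) - set(untested_after))
```

Running the first function once per test suite and passing both results
to the second gives the kind of "reached by 0008 but not by 0007" list
shown above.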

--
Thomas Munro
http://www.enterprisedb.com

Attachment	Content-Type	Size
patchset-coverage-0007.txt.gz	application/x-gzip	503.1 KB

From:	Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>
To:	Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
Cc:	Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, Robert Haas <robertmhaas(at)gmail(dot)com>, Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Partition-wise join for join between (declaratively) partitioned tables
Date:	2017-09-18 04:48:24
Message-ID:	CAOGQiiN9m=KRf-et1T0AcimbyAB9hDzJqGkHnOBjWT4uF1z1BQ@mail.gmail.com
Lists:	pgsql-hackers

On Fri, Sep 15, 2017 at 5:29 PM, Ashutosh Bapat
<ashutosh(dot)bapat(at)enterprisedb(dot)com> wrote:
> On Fri, Sep 15, 2017 at 2:09 PM, Rafia Sabih
> <rafia(dot)sabih(at)enterprisedb(dot)com> wrote:
>> On TPC-H benchmarking of this patch, I found a regression in Q7. It
>> was taking some 1500s with the patch and some 900s without the patch.
>> Please find the attached pwj_reg.zip for the output of explain analyse
>> on head and with patch.
>>
>> The experimental settings used were,
>> commit-id = 0c504a80cf2e6f66df2cdea563e879bf4abd1629
>> patch-version = v26
>>
>> Server settings:
>> work_mem = 1GB
>> shared_buffers = 10GB
>> effective_cache_size = 10GB
>> max_parallel_workers_per_gather = 4
>>
>> Partitioning information:
>> Partitioning scheme = by range
>> Number of partitions in lineitem and orders table = 106
>> partition key for lineitem = l_orderkey
>> partition key for orders = o_orderkey
>
> I observe that with the partition-wise join patch the planner is using
> GatherMerge along with partition-wise join, while on head it's not using
> GatherMerge. Just to make sure that it's partition-wise join which is
> causing the regression and not GatherMerge, can you please run the query
> with enable_gathermerge = false?
>
That does not sound plausible, since around 130s are already spent up to
the Append node. Anyhow, I executed the query with enable_gathermerge =
false, and it still takes some 1500 secs. Please find the attached
file for the explain analyse output.

> I see the following lines in the explain analyze output 7_1.out without the patch
> -> Sort (cost=84634030.40..84638520.55 rows=1796063
> width=72) (actual time=1061001.435..1061106.608 rows=437209 loops=1)
> Sort Key: n1.n_name, n2.n_name,
> (date_part('year'::text, (lineitem_001.l_shipdate)::timestamp without
> time zone))
> Sort Method: quicksort Memory: 308912kB
> -> Hash Join (cost=16080591.94..84447451.72
> rows=1796063 width=72) (actual time=252745.701..1057447.219
> rows=1749956 loops=1)
> Since Sort doesn't filter any rows, we would expect it to output the
> same number of rows as the hash join underneath it. But the number of rows
> differs in this case. I am wondering whether there's some problem with
> the explain analyze output itself.
>

Limit (cost=83341943.28..83341943.35 rows=1 width=92) (actual
time=1556989.996..1556989.997 rows=1 loops=1)
-> Finalize GroupAggregate (cost=83341943.28..83342723.24
rows=10064 width=92) (actual time=1556989.994..1556989.994 rows=1
loops=1)
Group Key: n1.n_name, n2.n_name, (date_part('year'::text,
(lineitem_001.l_shipdate)::timestamp without time zone))
-> Sort (cost=83341943.28..83342043.92 rows=40256 width=92)
(actual time=1556989.910..1556989.911 rows=6 loops=1)
Sort Key: n1.n_name, n2.n_name,
(date_part('year'::text, (lineitem_001.l_shipdate)::timestamp without
time zone))
Sort Method: quicksort Memory: 27kB
-> Gather (cost=83326804.81..83338864.31 rows=40256
width=92) (actual time=1550598.855..1556989.760 rows=20 loops=1)
Workers Planned: 4
Workers Launched: 4

AFAICU the node above the Sort is a GroupAggregate, and above that there
is a Limit. The row count shown for the Sort node in explain analyse is
the number of rows it actually returned. So what is happening here is
that once one group is completed, it is aggregated and fetched by the
Limit; after that there is no need for the Sort to return any more rows,
hence the lower count.
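
A toy model of this pull-based behaviour, using Python generators in
place of executor nodes. It is only a sketch under simplifying
assumptions (rows are (key, value) tuples, the aggregate is a sum, and
all names are made up for illustration); it is not how the executor is
implemented.

```python
from itertools import groupby, islice


def counted(rows, emitted, name):
    """Pass rows through unchanged, counting how many were actually pulled."""
    for row in rows:
        emitted[name] = emitted.get(name, 0) + 1
        yield row


def sort_node(child):
    # A sort has to read its whole input before it can emit the first row.
    yield from sorted(child)


def group_aggregate(child):
    # Emits one (key, sum) row per run of equal keys in the sorted input.
    for key, rows in groupby(child, key=lambda row: row[0]):
        yield key, sum(value for _, value in rows)


def run_plan(rows, limit):
    """Limit -> GroupAggregate -> Sort -> join output, all pulled on demand."""
    emitted = {}
    join_out = counted(iter(rows), emitted, "join")
    sort_out = counted(sort_node(join_out), emitted, "sort")
    result = list(islice(group_aggregate(sort_out), limit))
    return result, emitted
```

With five input rows and a limit of 1, the join-side counter reaches 5
because the sort consumes everything, while the sort-side counter stops
at the first group plus the one row needed to detect the group boundary.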
>>
>> Apart from these, there is a regression case on a custom table: on head the
>> query completes in 20s and with this patch it takes 27s. Please find
>> the attached .out and .sql file for the output and schema for the test
>> case respectively. I have reported this case before (sometime around
>> March this year) as well, but I am not sure if it was overlooked or is
>> an unimportant and expected behaviour for some reason.
>>
>
> Are you talking about [1]? I have explained the regression in
> [2] and [3]. This looks like an issue with the existing costing model.
>
> [1] https://www.postgresql.org/message-id/CAOGQiiMwcjNrunJ_fCDBscrTLeJ-CLp7exfzzipe2ut71n4LUA@mail.gmail.com
> [2] https://www.postgresql.org/message-id/CAFjFpRedUZPa7tKbCLEGK3u5UWdDNQoN=eYfb7ieG5d0D1PbsQ@mail.gmail.com
> [3] https://www.postgresql.org/message-id/CAFjFpReJKSdCfaeuZjGD79hOETzpz5BKDxLJgxr7qznrXX+TRw@mail.gmail.com
> --
> Best Wishes,
> Ashutosh Bapat
> EnterpriseDB Corporation
> The Postgres Database Company

--
Regards,
Rafia Sabih
EnterpriseDB: http://www.enterprisedb.com/

Attachment	Content-Type	Size
7_gm_false.out	application/octet-stream	129.3 KB

From:	Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Partition-wise join for join between (declaratively) partitioned tables
Date:	2017-09-18 12:02:06
Message-ID:	CAFjFpRfHLrgni-1+C14Nj1R96dje-rGNorgEs1qvGJtqTM6=vQ@mail.gmail.com
Lists:	pgsql-hackers

On Sat, Sep 16, 2017 at 2:53 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Fri, Sep 15, 2017 at 6:11 AM, Ashutosh Bapat
> <ashutosh(dot)bapat(at)enterprisedb(dot)com> wrote:
>> Thanks a lot Robert.
>>
>> Here are rebased patches.
>
> I didn't get quite as much time to look at these today as I would have
> liked, but here's what I've got so far.
>
> Comments on 0001:
>
> - In the RelOptInfo, part_oids is defined in a completely different
> part of the structure than nparts, but you can't use it without nparts
> because you don't know how long it is. I suggest moving the
> definition to just after nparts.
>
> - On the other hand, maybe we should just remove it completely. I
> don't see it in any of the subsequent patches. If it's used by the
> advanced matching code, let's leave it out of 0001 for now and add it
> back after the basic feature is committed.

No, it's not used by the advanced partition matching code. It was used
to match OIDs with the child rels to order those in the array. But now
that we are expanding in EIBO fashion, it is not useful. I should have
removed it earlier. Removed now.

>
> - Similarly, partsupfunc isn't used by the later patches either. It
> seems it could also be removed, at least for now.

It's used by the advanced partition matching code to compare bounds. It
will also be required by the partition pruning patch. But removed for now.

>
> - The comment for partexprs isn't very clear about how the lists
> inside the array work. My understanding is that the lists have as
> many members as the partition key has columns/expressions.

Actually we are doing some preparation for partition-wise join here.
partexprs and nullable_partexprs are used in the partition-wise join
implementation patch. I have updated the prologue of the RelOptInfo
structure with comments like the ones below

* Note: A base relation will always have only one set of partition keys. But a
* join relation has as many sets of partition keys as the number of joining
* relations. The number of partition keys is given by
* "part_scheme->partnatts". "partexprs" and "nullable_partexprs" are arrays
* containing part_scheme->partnatts elements. Each element of the array
* contains a list of partition key expressions. For a base relation each list
* contains only one expression. For a join relation each list contains at
* most as many expressions as the joining relations. The expressions in a list
* at a given position in the array correspond to the partition key at that
* position. "partexprs" contains partition keys of non-nullable joining
* relations and "nullable_partexprs" contains partition keys of nullable
* joining relations. For a base relation only "partexprs" is populated.

Let me know if this looks fine. The logic to match the partition keys of
joining relations in have_partkey_equi_join() and
match_expr_to_partition_keys() becomes simpler if we arrange the
partition key expressions as an array indexed by the position of the
partition key, with each array element being a list of partition key
expressions at that position.
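
A minimal sketch of that layout, assuming expressions are represented
as plain strings and ignoring the partexprs/nullable_partexprs split.
The helper names below are hypothetical and are not the functions from
the patch; they only illustrate why a position-indexed array of lists
makes the lookup simple.

```python
def base_rel_partexprs(partition_keys):
    """A base relation has exactly one expression per partition key position."""
    return [[key] for key in partition_keys]


def join_rel_partexprs(outer, inner):
    """A join relation collects, per position, the expressions of both sides."""
    if len(outer) != len(inner):
        raise ValueError("joining relations must have the same number of keys")
    return [outer_exprs + inner_exprs
            for outer_exprs, inner_exprs in zip(outer, inner)]


def find_partkey_position(expr, partexprs):
    """Return the key position whose list contains expr, or -1 if none does."""
    for position, exprs in enumerate(partexprs):
        if expr in exprs:
            return position
    return -1
```

For two relations partitioned by (x, y), the join ends up with
[["a.x", "b.x"], ["a.y", "b.y"]], so an equi-join clause is on the
partition key when both of its operands resolve to the same position.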

Partition pruning might need partexprs to look up relevant quals, but
nullable_partexprs doesn't have any use there. So maybe we should add
nullable_partexprs to RelOptInfo as part of 0002 (partition-wise join
implementation) instead of 0001. What do you think?

>
> - I'm not entirely sure whether maintaining partexprs and
> nullable_partexprs is the right design. If I understand correctly,
> whether or not a partexpr is nullable is really a per-RTI property,
> not a per-expression property. You could consider something like
> "Relids nullable_rels".

That's true. However, in order to decide whether an expression falls on
the nullable side of a join, we will need to call pull_varnos() on it and
check the output against nullable_rels. Separating the expressions
themselves avoids that step.
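
The trade-off can be sketched as below. This is a toy model: an
expression is assumed to be a tuple of (relid, column) pairs, and the
pull_varnos() here is a stand-in that merely collects the relids from
that tuple rather than walking a real expression tree. Treating any
overlap with nullable_rels as "nullable" is also an assumption made for
the sketch.

```python
def pull_varnos(expr):
    """Toy stand-in: collect the relids referenced by an expression."""
    return {relid for relid, _column in expr}


def is_nullable(expr, nullable_rels):
    """Per-RTI design: decide nullability at lookup time from the relids."""
    return bool(pull_varnos(expr) & nullable_rels)


def split_by_nullability(exprs, nullable_rels):
    """Two-array design: classify once, then keep the expressions apart."""
    partexprs, nullable_partexprs = [], []
    for expr in exprs:
        if is_nullable(expr, nullable_rels):
            nullable_partexprs.append(expr)
        else:
            partexprs.append(expr)
    return partexprs, nullable_partexprs
```

With the two-array design the classification cost is paid once when the
join relation is built; with a single list plus nullable_rels it is paid
on every lookup.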

>
> Comments on 0002:
>
> - The relationship between deciding to set partition scheme and
> related details and the configured value of enable_partition_wise_join
> needs some consideration. If the only use of the partition scheme is
> partition-wise join, there's no point in setting it even for a baserel
> unless enable_partition_wise_join is set -- but if there are other
> important uses for that data, such as Amit's partition pruning work,
> then we might want to always set it. And similarly for a join: if the
> details are only needed in the partition-wise join case, then we only
> need to set them in that case, but if there are other uses, then it's
> different. If it turns out that setting these details for a baserel
> is useful in other cases but that it's only for a joinrel in the
> partition-wise join case, then the patch gets it exactly right. But
> is that correct? I'm not sure.

Partition scheme contains the information about data types of
partition keys, which is required to compare partition bounds.
Partition pruning will need to compare constants with partition bounds
and hence will need information contained in partition scheme. So, we
will need to set it for base relations whether or not partition-wise
join is enabled.

>
> - The naming of enable_partition_wise_join might also need some
> thought. What happens when we also have partition-wise aggregate?
> What about the proposal to strength-reduce MergeAppend to Append --
> would that use this infrastructure? I wonder if we ought to call this
> enable_partition_wise or enable_partition_wise_planning to make it a
> bit more general. Then, too, I've never really liked having
> partition_wise in the GUC name because it might make someone think
> that it makes your partitions have a lot of wisdom. Removing the
> underscore might help: partitionwise. Or maybe there is some whole
> different name that would be better. If anyone wants to bikeshed,
> now's the time.

partitions having a lot of wisdom would be wise_partitions rather than
partition_wise ;).

If partition-wise join is disabled, partition-wise aggregates and
strength reduction of MergeAppend won't be possible on a join tree,
but those will be possible on a base relation. Even if partition-wise
join is enabled, one may want to disable other partition-wise
optimizations individually. So, they are somewhat independent
switches. I don't think we should bundle all of those into one.
Whatever names we choose for those GUCs, I think they should have the same
naming convention, e.g. "partition_wise_xyz". I am open to suggestions
about the names.

>
> - It seems to me that build_joinrel_partition_info() could be
> simplified a bit. One thing is that list_copy() is perfectly capable
> of handling a NIL input, so there's no need to test for that before
> calling it.

partexprs may be NULL for FULL JOIN and nullable_partexprs may be NULL
when there is no nullable relation. So, we have to check existence of
those arrays before accessing lists containing partition key
expressions. list_copy() is being called on individual array elements
and "if" conditions check for the existence of array.
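
To illustrate the distinction, here is a small sketch in which None
plays the role of a missing (NULL) array and an empty list plays the
role of NIL. The function name is hypothetical; the point is only that
copying an empty list is harmless, whereas indexing into a missing
array is not, so the guard has to sit on the array.

```python
def copy_key_list(array, position):
    """Copy the expression list at one key position, tolerating a missing array."""
    if array is None:
        # No array at all: indexing it would fail, so return an empty list.
        return []
    # The list itself may be empty; copying an empty list is fine.
    return list(array[position])
```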

The functions might have become complicated because I am using
outer/inner_partexprs to hold one of the lists and partexprs contains
the array of lists. We could use better names, but I don't have any
better ideas right now. Will think about them.

We could simplify that function according to your suggestion of
nullable_relids. Basically partexprs then contains partition key
expressions of all relations, nullable and non-nullable. nullable_relids +
pull_varnos() tells us which of those fall on the nullable side and which
ones don't. Is this how you are thinking of simplifying it? If we go
with this scheme, again nullable_relids will not be useful for
partition pruning, so maybe we should add it as part of 0002
(partition-wise join implementation) instead of 0001.

>
> Comments on 0003:
>
> - Instead of reorganizing add_paths_to_append_rel as you did, could
> you just add an RTE_JOIN case to the switch? Not sure if there's some
> problem with that idea, but it seems like it might come out nicer.

RTE_JOIN is created only for joins specified using the JOIN clause, i.e.
syntactic joins. The joins created during query planning, like rel1,
rel2, rel3, do not have an RTE_JOIN. So, we can't use RTE_JOIN there.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
Cc:	Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Partition-wise join for join between (declaratively) partitioned tables
Date:	2017-09-18 21:05:43
Message-ID:	CA+TgmoaCG72nMadCmAj7XPNhaeashj7TDjcvvQO6F4My8pQjgg@mail.gmail.com
Lists:	pgsql-hackers

On Mon, Sep 18, 2017 at 8:02 AM, Ashutosh Bapat
<ashutosh(dot)bapat(at)enterprisedb(dot)com> wrote:
> Partition pruning might need partexprs to look up relevant quals, but
> nullable_partexprs doesn't have any use there. So maybe we should add
> nullable_partexprs to RelOptInfo as part of 0002 (partition-wise join
> implementation) instead of 0001. What do you think?

+1.

>> - I'm not entirely sure whether maintaining partexprs and
>> nullable_partexprs is the right design. If I understand correctly,
>> whether or not a partexpr is nullable is really a per-RTI property,
>> not a per-expression property. You could consider something like
>> "Relids nullable_rels".
>
> That's true. However, in order to decide whether an expression falls on
> the nullable side of a join, we will need to call pull_varnos() on it and
> check the output against nullable_rels. Separating the expressions
> themselves avoids that step.

Good point. Also, I'm not sure about cases like this:

SELECT * FROM (SELECT b.x, b.y FROM a LEFT JOIN b ON a.x = b.x WHERE
a.y = b.y) w LEFT JOIN c ON w.x = c.x AND w.y = c.y;

Suppose the relations are all partitioned by (x, y) but that the =
operator is not strict. A partition-wise join is valid between a and
b, but we can't regard w as partitioned any more, because w.x might
contain nulls in partitions where the partitioning scheme wouldn't
allow them. On the other hand, if the subquery were to select a.x,
a.y then clearly it would be fine: there would be no possibility of a
NULL having been substituted for a proper value.

What if the subquery selected a.x, b.y? Initially, I thought that
would be OK too, because of the fact that the a.y = b.y clause is in
the WHERE clause rather than the join condition. But on further
thought I think that probably doesn't work, because with = being a
non-strict operator there's no guarantee that it would remove any
nulls introduced by the left join. Of course, if the subselect had a
WHERE clause saying that b.x/b.y IS NOT NULL then having the SELECT
list mention those columns would be fine.
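
The concern can be reproduced with a toy simulation of the inner
subquery. Everything here is an assumption made for illustration: rows
are (x, y) tuples, None stands for NULL, and lenient_eq is a made-up
non-strict operator that accepts NULL inputs. strict_eq models the
ordinary case, where a comparison involving NULL never passes a WHERE
clause.

```python
def left_join_b_side(a_rows, b_rows, where_eq):
    """a LEFT JOIN b ON a.x = b.x WHERE where_eq(a.y, b.y), projecting (b.x, b.y)."""
    result = []
    for ax, ay in a_rows:
        matches = [(bx, by) for bx, by in b_rows if bx == ax]
        if not matches:
            # No partner in b: the left join emits a null-extended row.
            matches = [(None, None)]
        for bx, by in matches:
            if where_eq(ay, by):
                result.append((bx, by))
    return result


def strict_eq(left, right):
    # A strict operator never yields true when an input is NULL.
    return left is not None and right is not None and left == right


def lenient_eq(left, right):
    # Hypothetical non-strict operator: NULL is accepted as a match.
    return left is None or right is None or left == right
```

With a = [(1, 10), (2, 20)] and b = [(1, 10)], the strict operator
filters out the null-extended row, while the non-strict one lets a
(NULL, NULL) row through. That row originates from the partition
holding a.x = 2 but carries a NULL in b.x, which is why w could no
longer be treated as partitioned on (x, y).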

>> - The naming of enable_partition_wise_join might also need some
>> thought. What happens when we also have partition-wise aggregate?
>> What about the proposal to strength-reduce MergeAppend to Append --
>> would that use this infrastructure? I wonder if we ought to call this
>> enable_partition_wise or enable_partition_wise_planning to make it a
>> bit more general. Then, too, I've never really liked having
>> partition_wise in the GUC name because it might make someone think
>> that it makes your partitions have a lot of wisdom. Removing the
>> underscore might help: partitionwise. Or maybe there is some whole
>> different name that would be better. If anyone wants to bikeshed,
>> now's the time.
>
> partitions having a lot of wisdom would be wise_partitions rather than
> partition_wise ;).

Well, maybe it's the joins that have a lot of wisdom, then.
enable_partition_wise_join could be read to mean that we should allow
partitioning of joins, but only if those joins know the secret of true
happiness.

> If partition-wise join is disabled, partition-wise aggregates and
> strength reduction of MergeAppend won't be possible on a join tree,
> but those will be possible on a base relation. Even if partition-wise
> join is enabled, one may want to disable other partition-wise
> optimizations individually. So, they are somewhat independent
> switches. I don't think we should bundle all of those into one.
> Whatever names we choose for those GUCs, I think they should have the same
> naming convention, e.g. "partition_wise_xyz". I am open to suggestions
> about the names.

I think the chances of you getting multiple GUCs for different
partition-wise optimizations past Tom are pretty low.

>> - Instead of reorganizing add_paths_to_append_rel as you did, could
>> you just add an RTE_JOIN case to the switch? Not sure if there's some
>> problem with that idea, but it seems like it might come out nicer.
>
> RTE_JOIN is created only for joins specified using the JOIN clause, i.e.
> syntactic joins. The joins created during query planning, like rel1,
> rel2, rel3, do not have an RTE_JOIN. So, we can't use RTE_JOIN there.

OK, never mind that then.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>
To:	Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>
Cc:	Robert Haas <robertmhaas(at)gmail(dot)com>, Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>, Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Partition-wise join for join between (declaratively) partitioned tables
Date:	2017-09-19 09:28:53
Message-ID:	CAOGQiiO7nX_-_0Z42hBLh8qgf0=KEnNgs3ARRrNiG_7OTdvk4w@mail.gmail.com
Lists:	pgsql-hackers

On Fri, Sep 15, 2017 at 2:09 PM, Rafia Sabih
<rafia(dot)sabih(at)enterprisedb(dot)com> wrote:
>>
> On TPC-H benchmarking of this patch, I found a regression in Q7. It
> was taking some 1500s with the patch and some 900s without the patch.
> Please find the attached pwd_reg.zip for the output of explain analyse
> on head and with patch.
>
> The experimental settings used were,
> commit-id = 0c504a80cf2e6f66df2cdea563e879bf4abd1629
> patch-version = v26
>
> Server settings:
> work_mem = 1GB
> shared_buffers = 10GB
> effective_cache_size = 10GB
> max_parallel_workers_per_gather = 4
>
> Partitioning information:
> Partitioning scheme = by range
> Number of partitions in lineitem and orders table = 106
> partition key for lineitem = l_orderkey
> partition key for orders = o_orderkey
>
> Apart from these there is a regression case on a custom table, on head
> query completes in 20s and with this patch it takes 27s. Please find
> the attached .out and .sql file for the output and schema for the test
> case respectively. I have reported this case before (sometime around
> March this year) as well, but I am not sure if it was overlooked or is
> an unimportant and expected behaviour for some reason.
>

On completing the benchmark for all queries with the above-mentioned
setup, the following performance improvements can be seen:
Query | Patch | Head
3 | 1455 | 1631
4 | 499 | 4344
5 | 1464 | 1606
10 | 1475 | 1599
12 | 1465 | 1790

Note that all values of execution time are in seconds.
To summarise, apart from Q4, all other queries show an improvement of
roughly 10-20%. That is good, but honestly I expected more from this
feature, at least at this scale factor. I have yet to compare these
numbers with the unpartitioned version of the database.
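
As a quick check of the arithmetic behind these figures, the relative
savings can be computed directly from the table. This is only a minimal
sketch: the dictionary restates the numbers reported above, and the
helper name improvement_pct is made up for illustration.

```python
# Execution times in seconds, copied from the table above: (patch, head).
timings = {
    3: (1455, 1631),
    4: (499, 4344),
    5: (1464, 1606),
    10: (1475, 1599),
    12: (1465, 1790),
}


def improvement_pct(patch, head):
    """Percentage of head's execution time saved by the patch."""
    return 100.0 * (head - patch) / head


# Map each query number to its relative saving.
improvements = {q: improvement_pct(p, h) for q, (p, h) in timings.items()}

for query, pct in improvements.items():
    print(f"Q{query}: {pct:.1f}% less execution time with the patch")
```

With these inputs Q4 comes out at roughly 88%, while the other four
queries fall between about 8% and 18%.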

Please find attached the output of EXPLAIN ANALYZE for all the queries
on head and with the patch.

--
Regards,
Rafia Sabih
EnterpriseDB: http://www.enterprisedb.com/

Attachment	Content-Type	Size
18sept.zip	application/zip	821.8 KB

From:	Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Partition-wise join for join between (declaratively) partitioned tables
Date:	2017-09-19 09:47:45
Message-ID:	CAFjFpRfneFG3H+F6BaiXemMrKF+FY-POpx3Ocy+RiH3yBmXSNw@mail.gmail.com
Lists:	pgsql-hackers

On Tue, Sep 19, 2017 at 2:35 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Mon, Sep 18, 2017 at 8:02 AM, Ashutosh Bapat
> <ashutosh(dot)bapat(at)enterprisedb(dot)com> wrote:
>> partition pruning might need partexprs to look up relevant quals, but
>> nullable_partexprs doesn't have any use there. So maybe we should add
>> nullable_partexprs to RelOptInfo as part of 0002 (partition-wise join
>> implementation) instead of 0001. What do you think?
>
> +1.

Done.

>
>>> - I'm not entirely sure whether maintaining partexprs and
>>> nullable_partexprs is the right design. If I understand correctly,
>>> whether or not a partexpr is nullable is really a per-RTI property,
>>> not a per-expression property. You could consider something like
>>> "Relids nullable_rels".
>>
>> That's true. However in order to decide whether an expression falls on
>> nullable side of a join, we will need to call pull_varnos() on it and
>> check the output against nullable_rels. Separating the expressions
>> themselves avoids that step.
>
> Good point. Also, I'm not sure about cases like this:
>
> SELECT * FROM (SELECT b.x, b.y FROM a LEFT JOIN b ON a.x = b.x WHERE
> a.y = b.y) w LEFT JOIN c ON w.x = c.x AND w.y = c.y;
>
> Suppose the relations are all partitioned by (x, y) but that the =
> operator is not strict. A partition-wise join is valid between a and
> b, but we can't regard w as partitioned any more, because w.x might
> contain nulls in partitions where the partitioning scheme wouldn't
> allow them. On the other hand, if the subquery were to select a.x,
> a.y then clearly it would be fine: there would be no possibility of a
> NULL having been substituted for a proper value.
>
> What if the subquery selected a.x, b.y? Initially, I thought that
> would be OK too, because of the fact that the a.y = b.y clause is in
> the WHERE clause rather than the join condition. But on further
> thought I think that probably doesn't work, because with = being a
> non-strict operator there's no guarantee that it would remove any
> nulls introduced by the left join. Of course, if the subselect had a
> WHERE clause saying that b.x/b.y IS NOT NULL then having the SELECT
> list mention those columns would be fine.
>

I am actually not sure whether we can use partition-wise join for a
LEFT JOIN b when the partition key equalities are spread across the ON
and WHERE clauses. I have not found a counterexample, but I cannot
prove it correct either. The reference I used for partition-wise join
[1] mentions JOIN conditions, i.e. ON clause conditions. But all the
examples in that paper are INNER joins, so I am not sure what exactly
the authors meant by JOIN conditions. Right now I am restricting the
patch to work only with conditions in the ON clause.

In practice, most operators are strict. If an OUTER join's WHERE
clause has any partition key equality with a strict operator, the
optimizer will turn that OUTER join into an INNER one, turning all
clauses into join clauses. That will enable partition-wise join. So the
current restriction doesn't rule out any practical cases.

OTOH, I have seen that treating ON and WHERE clauses as the same for an
OUTER join leads to surprising results. So, I am leaning towards
treating them separately for partition-wise join as well, using only ON
clause conditions. If we get complaints about partition-wise join not
being picked, we can fix them after proving that doing so is not
harmful. Lifting that restriction is not difficult:
have_partition_key_equijoin() ignores "pushed down" quals, and we just
have to change that condition.
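
The difference between putting an equality in the ON clause and putting
it in the WHERE clause of an OUTER join can be seen on a toy example.
The sketch below uses Python's built-in sqlite3 module purely for
illustration; it is not PostgreSQL, the tables are unpartitioned, and
the rows are made up. SQLite's = operator is strict, so this shows the
strict-operator case discussed above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (x INTEGER, y INTEGER);
    CREATE TABLE b (x INTEGER, y INTEGER);
    INSERT INTO a VALUES (1, 10), (2, 20);
    INSERT INTO b VALUES (1, 10), (3, 30);
""")

# Both equalities in the ON clause: the row of "a" without a match in "b"
# survives, padded with NULLs for b's columns.
on_only = conn.execute(
    "SELECT a.x, a.y, b.x, b.y FROM a LEFT JOIN b "
    "ON a.x = b.x AND a.y = b.y ORDER BY a.x"
).fetchall()

# One equality moved to WHERE: it is evaluated after the join, and the
# strict = operator rejects the NULL-extended row, so the result is the
# same as for an INNER join.
on_and_where = conn.execute(
    "SELECT a.x, a.y, b.x, b.y FROM a LEFT JOIN b "
    "ON a.x = b.x WHERE a.y = b.y ORDER BY a.x"
).fetchall()

print(on_only)
print(on_and_where)
```

The first query returns two rows, one of them NULL-extended; the second
returns only the matching row.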

Your last sentence about a clause b.x IS NOT NULL or b.y IS NOT NULL
is interesting. If those conditions are in the ON clause, we may still
get a result where b.x and b.y are NULL when a row in "a" matches no
row in "b". If those conditions are in the WHERE clause, I think the
optimizer will turn the join into an INNER join irrespective of whether
the equality operator is strict.

>
>> If partition-wise join is disabled, partition-wise aggregates,
>> strength reduction of MergeAppend won't be possible on a join tree,
>> but those will be possible on a base relation. Even if partition-wise
>> join enabled, one may want to disable other partition-wise
>> optimizations individually. So, they are somewhat independent
>> switches. I don't think we should bundle all of those into one.
>> Whatever names we choose for those GUCs, I think they should have same
>> naming convention e.g. "partition_wise_xyz". I am open to suggestions
>> about the names.
>
> I think the chances of you getting multiple GUCs for different
> partition-wise optimizations past Tom are pretty low.

We do have enable_hashjoin and enable_hashagg to control the use of
hashing for joins and aggregates. Along similar lines, we could have
three GUCs to enable the partition-wise strategy, one each for join,
aggregation and sorting. Granular switches would be useful for
debugging, and maybe for turning partition-wise strategies off when
they are not optimal. Do we want a switch to turn partition pruning
ON/OFF? That said, I am fine with a single GUC controlling all of them.
We won't set any partitioning information in RelOptInfo if that GUC is
turned OFF.

[1] https://pdfs.semanticscholar.org/27c2/ba75f8b6a39d4bce85d5579dace609c9abaa.pdf
--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Attachment	Content-Type	Size
pg_dp_join_patches_v33.tar.gz	application/x-gzip	127.8 KB

From:	Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
To:	Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>
Cc:	Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, Robert Haas <robertmhaas(at)gmail(dot)com>, Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>, Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Partition-wise join for join between (declaratively) partitioned tables
Date:	2017-09-19 10:20:05
Message-ID:	20170919102005.pn7rtotdymlitpyt@alvherre.pgsql
Lists:	pgsql-hackers

Rafia Sabih wrote:

> On completing the benchmark for all queries for the above mentioned
> setup, following performance improvement can be seen,
> Query | Patch | Head
> 3 | 1455 | 1631
> 4 | 499 | 4344
> 5 | 1464 | 1606
> 10 | 1475 | 1599
> 12 | 1465 | 1790
>
> Note that all values of execution time are in seconds.
> To summarise, apart from Q4, all other queries are showing somewhat
> 10-20% improvement.

Saving 90% of time on the slowest query looks like a worthy improvement
in its own right. However, you're reporting execution time only, right?
What happens to planning time? In a quick look,

$ grep 'Planning time' pg_part_*/4*
pg_part_head/4_1.out: Planning time: 3390.699 ms
pg_part_head/4_2.out: Planning time: 194.211 ms
pg_part_head/4_3.out: Planning time: 210.964 ms
pg_part_head/4_4.out: Planning time: 4150.647 ms
pg_part_patch/4_1.out: Planning time: 7577.247 ms
pg_part_patch/4_2.out: Planning time: 312.421 ms
pg_part_patch/4_3.out: Planning time: 304.697 ms
pg_part_patch/4_4.out: Planning time: 269.778 ms

I think the noise in these few results is too large to draw any
conclusions. Maybe a few dozen runs of EXPLAIN (w/o ANALYZE) would tell
something significant?
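
One way to quantify that noise is to compare the spread of the samples
with their central value. The following is a minimal sketch that uses
only the eight planning times shown above; the summarize helper is an
illustrative name, not part of any tool used in this thread.

```python
import statistics

# Planning times in milliseconds, copied from the grep output above.
head_ms = [3390.699, 194.211, 210.964, 4150.647]
patch_ms = [7577.247, 312.421, 304.697, 269.778]


def summarize(samples):
    """Return basic location and spread statistics for a list of timings."""
    return {
        "median": statistics.median(samples),
        "mean": statistics.mean(samples),
        "stdev": statistics.stdev(samples),  # sample standard deviation
    }


head_stats = summarize(head_ms)
patch_stats = summarize(patch_ms)

for label, stats in (("head", head_stats), ("patch", patch_stats)):
    print(label, {name: round(value, 1) for name, value in stats.items()})
```

For both sets the standard deviation is larger than the mean, which is
consistent with four samples being too few to compare planning times.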

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

From:	Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>, Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Partition-wise join for join between (declaratively) partitioned tables
Date:	2017-09-20 04:14:59
Message-ID:	CAEepm=0JfKkAS6Ea8HgsoPJUUnL4++V0q7mPucFEiqn7cPmO0A@mail.gmail.com
Lists:	pgsql-hackers

On Sat, Sep 16, 2017 at 2:41 PM, Thomas Munro
<thomas(dot)munro(at)enterprisedb(dot)com> wrote:
> On Sat, Sep 16, 2017 at 9:38 AM, Thomas Munro
> <thomas(dot)munro(at)enterprisedb(dot)com> wrote:
>> On Sat, Sep 16, 2017 at 9:23 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>>> On the overall patch set:
>>>
>>> - I am curious to know how this has been tested. How much of the new
>>> code is covered by the tests in 0007-Partition-wise-join-tests.patch?
>>> How much does coverage improve with
>>> 0008-Extra-testcases-for-partition-wise-join-NOT-FOR-COMM.patch? What
>>> code, if any, is not covered by either of those test suites? Could we
>>> do meaningful testing of this with something like Andreas
>>> Seltenreich's sqlsmith?
>>
>> FWIW I'm working on an answer to both of those questions, but keep
>> getting distracted by other things catching on fire...
>
> I cobbled together some scripts to figure out the test coverage of
> lines actually modified by this patch set. Please see attached.
>
> I'm not sure if there is an established or better way to do this, but
> I used git-blame to figure out which lines of gcov output can be
> blamed on Ashutosh and prepended that to the lines of gcov's output.
> That allowed me to find new/changed code not covered by "make check".
> I found 94 untested new lines with 0007 applied and 88 untested new
> lines with 0008 applied. The 6 lines that 0008 reaches and 0007
> doesn't are:
>
> ======== src/backend/optimizer/path/allpaths.c ========
> -[TOUCHED BY PATCH SET] #####: 3303: mark_dummy_rel(rel);
> -[TOUCHED BY PATCH SET] #####: 3304: return;
> -[TOUCHED BY PATCH SET] #####: 1515: continue;
> -[TOUCHED BY PATCH SET] #####: 1526: continue;
> ======== src/backend/optimizer/util/pathnode.c ========
> -[TOUCHED BY PATCH SET] #####: 3433: break;
> -[TOUCHED BY PATCH SET] #####: 3435: return NULL;

Two obvious questions:

1. What are we missing in the ~90 lines of non-covered code, and are
there bugs lurking there?

First, here's an easier-to-read report than the one I posted earlier.
It's based on the whole patch stack (including the extra tests) from
your v33 tarball:

https://codecov.io/gh/postgresql-cfbot/postgresql/commit/19dace6fca0d9c2bca5022158cf28d99aa237550

The main areas of uncovered lines are: code in
get_wholerow_ref_from_convert_row_type() and code that calls it, and
per node type cases in reparameterize_path_by_child(). It seems like
the former could use a test case, and I wonder if there is some way we
could write "flat-copy and then apply recursively to all subpaths"
code like this without having to handle these cases explicitly. There
are a couple of other tiny return cases other than just sanity check
errors which it might be interesting to hit too.

2. What queries in the 0008 patch are hitting lines that 0007 doesn't hit?

I thought about how to answer questions like this and came up with a
shell script that (1) makes computers run really hot for quite a long
time and (2) tells you which blocks of SQL hit which lines of C.
Please find attached the shell script and its output. The .sql files
have been annotated with "block" numbers (blocks being chunks of SQL
stuff separated by blank lines), and the C files annotated with
references to those block numbers where A<n> = block <n>
partition_join.sql and B<n> = block <n> in partition_join_extras.sql.

Then to find lines that B queries hit but A queries don't and know
which particular queries hit them, you might use something like:

grep -v 'SQL blocks: .*A[0-9]' < joinpath.c.aggregated_coverage | \
grep 'SQL blocks: .*B[0-9]'
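
For anyone who prefers to post-process the annotated files in a script,
the same filter can be sketched in Python. The sample lines below are
hypothetical; only the "SQL blocks:" marker and the A<n>/B<n> labels
come from the description above, and the real file layout may differ.

```python
import re

# Same patterns as the two grep invocations above.
HAS_A = re.compile(r"SQL blocks: .*A[0-9]")
HAS_B = re.compile(r"SQL blocks: .*B[0-9]")


def b_only_lines(lines):
    """Keep lines hit by at least one B<n> block and by no A<n> block."""
    return [ln for ln in lines if HAS_B.search(ln) and not HAS_A.search(ln)]


# Hypothetical annotated coverage lines, for illustration only.
sample = [
    "   12:  101: foo();  /* SQL blocks: A3 B7 */",
    "    4:  102: bar();  /* SQL blocks: B7 B9 */",
    "    9:  103: baz();  /* SQL blocks: A1 */",
    "#####:  104: qux();",
]

result = b_only_lines(sample)
for line in result:
    print(line)
```

Only the second sample line is kept, since it is the only one reached
by B blocks and by no A block.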

(Off topic but by way of explanation: the attachment name ending
.tarball.gz avoids .tgz or .tar.gz so my stupid cfbot doesn't think
it's a new patch set. I need to figure something better out for
that...)

--
Thomas Munro
http://www.enterprisedb.com

Attachment	Content-Type	Size
coverage.tarball.gz	application/x-gzip	570.2 KB
blame_coverage_on_queries.sh	application/x-sh	3.3 KB

From:	Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>
To:	Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
Cc:	Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, Robert Haas <robertmhaas(at)gmail(dot)com>, Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>, Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Partition-wise join for join between (declaratively) partitioned tables
Date:	2017-09-20 08:33:25
Message-ID:	CAOGQiiM7AtKBO5JXwjS4NU6ZuaPSFWF+axVih6+j-6pfs3LvoQ@mail.gmail.com
Lists:	pgsql-hackers

On Tue, Sep 19, 2017 at 3:50 PM, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org> wrote:
> Rafia Sabih wrote:
>
>> On completing the benchmark for all queries for the above mentioned
>> setup, following performance improvement can be seen,
>> Query | Patch | Head
>> 3 | 1455 | 1631
>> 4 | 499 | 4344
>> 5 | 1464 | 1606
>> 10 | 1475 | 1599
>> 12 | 1465 | 1790
>>
>> Note that all values of execution time are in seconds.
>> To summarise, apart from Q4, all other queries are showing somewhat
>> 10-20% improvement.
>
> Saving 90% of time on the slowest query looks like a worthy improvement
> in its own right. However, you're reporting execution time only, right?
> What happens to planning time? In a quick look,

Definitely. The planning time issue has been discussed upthread,

On Mon, Mar 20, 2017 at 12:07 PM, Rafia Sabih
<rafia(dot)sabih(at)enterprisedb(dot)com> wrote:

> Another minor thing to note that is planning time is almost twice with
> this patch, though I understand that this is for scenarios with really
> big 'big data' so this may not be a serious issue in such cases, but
> it'd be good if we can keep an eye on this that it doesn't exceed the
> computational bounds for a really large number of tables..

To which Robert replied:

Yes, this is definitely going to use significant additional planning
time and memory. There are several possible strategies for improving
that situation, but I think we need to get the basics in place first.
That's why the proposal is now to have this turned off by default.
People joining really big tables that happen to be equipartitioned are
likely to want to turn it on, though, even before those optimizations
are done.

--
Regards,
Rafia Sabih
EnterpriseDB: http://www.enterprisedb.com/

From:	Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
To:	Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
Cc:	Robert Haas <robertmhaas(at)gmail(dot)com>, Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Partition-wise join for join between (declaratively) partitioned tables
Date:	2017-09-20 09:43:40
Message-ID:	CAFjFpRcQV74TMERjbikBrY7TLVb9LSLXRusn9MJtfiwPbw3S6w@mail.gmail.com
Lists:	pgsql-hackers

On Wed, Sep 20, 2017 at 9:44 AM, Thomas Munro
<thomas(dot)munro(at)enterprisedb(dot)com> wrote:
>
> The main areas of uncovered lines are: code in
> get_wholerow_ref_from_convert_row_type() and code that calls it, and
> per node type cases in reparameterize_path_by_child(). It seems like
> the former could use a test case, and I wonder if there is some way we
> could write "flat-copy and then apply recursively to all subpaths"
> code like this without having to handle these cases explicitly. There
> are a couple of other tiny return cases other than just sanity check
> errors which it might be interesting to hit too.

Under the debugger, I checked that the following test in
partition_join.sql covers get_wholerow_ref_from_convert_row_type():

-- left outer join, with whole-row reference
EXPLAIN (COSTS OFF)
SELECT t1, t2 FROM prt1 t1 LEFT JOIN prt2 t2 ON t1.a = t2.b WHERE t1.b
= 0 ORDER BY t1.a, t2.b;
SELECT t1, t2 FROM prt1 t1 LEFT JOIN prt2 t2 ON t1.a = t2.b WHERE t1.b
= 0 ORDER BY t1.a, t2.b;

But it doesn't cover a couple of lines that handle a nested
ConvertRowTypeExpr in that function. We can add or modify a test case
in the multi-level partitioned table section to cover that.

Coverage of reparameterize_path_by_child() is hard. It would require
that many different kinds of paths survive in lower joins in the join
tree. It's hard to come up with queries that would do that with a
limited amount of data and a reasonable number of tests. Thomas and I
discussed his suggestion of "flat-copy and then apply recursively to
all subpaths" code, which he sees as a path tree mutator. It won't
improve the test coverage. Unlike expression tree mutation with
expression_tree_mutator(), path mutation is not a widely used
technique, so we do not yet know what the characteristics of a path
mutator should be. If we see more path mutation code in future, it's
an idea worth considering.

>
> 2. What queries in the 0008 patch are hitting lines that 0007 doesn't hit?
>
> I thought about how to answer questions like this and came up with a
> shell script that (1) makes computers run really hot for quite a long
> time and (2) tells you which blocks of SQL hit which lines of C.
> Please find attached the shell script and its output. The .sql files
> have been annotated with "block" numbers (blocks being chunks of SQL
> stuff separated by blank lines), and the C files annotated with
> references to those block numbers where A<n> = block <n>
> partition_join.sql and B<n> = block <n> in partition_join_extras.sql.
>
> Then to find lines that B queries hit but A queries don't and know
> which particular queries hit them, you might use something like:
>
> grep -v 'SQL blocks: .*A[0-9]' < joinpath.c.aggregated_coverage | \
> grep 'SQL blocks: .*B[0-9]'
>

Thanks for this. It generates a lot of output (970 lines over all the
coverage files), so it will take some time to get anything meaningful
out of it. Maybe there's a faster way, by looking only at the lines
that are covered by B but not A. BTW, I checked those lines to see if
there could be any bug there, but I don't see what could go wrong with
them.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

From:	Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Partition-wise join for join between (declaratively) partitioned tables
Date:	2017-09-20 12:55:47
Message-ID:	CAFjFpRdY9HvDoV-d7iXSZA1GwKuYudwo2-9OLs4tnJy4Ka6K0g@mail.gmail.com
Lists:	pgsql-hackers

On Tue, Sep 19, 2017 at 3:17 PM, Ashutosh Bapat
<ashutosh(dot)bapat(at)enterprisedb(dot)com> wrote:
>>
>>>> - I'm not entirely sure whether maintaining partexprs and
>>>> nullable_partexprs is the right design. If I understand correctly,
>>>> whether or not a partexpr is nullable is really a per-RTI property,
>>>> not a per-expression property. You could consider something like
>>>> "Relids nullable_rels".
>>>
>>> That's true. However in order to decide whether an expression falls on
>>> nullable side of a join, we will need to call pull_varnos() on it and
>>> check the output against nullable_rels. Separating the expressions
>>> themselves avoids that step.
>>
>> Good point. Also, I'm not sure about cases like this:
>>
>> SELECT * FROM (SELECT b.x, b.y FROM a LEFT JOIN b ON a.x = b.x WHERE
>> a.y = b.y) w LEFT JOIN c ON w.x = c.x AND w.y = c.y;
>>
>> Suppose the relations are all partitioned by (x, y) but that the =
>> operator is not strict. A partition-wise join is valid between a and
>> b, but we can't regard w as partitioned any more, because w.x might
>> contain nulls in partitions where the partitioning scheme wouldn't
>> allow them. On the other hand, if the subquery were to select a.x,
>> a.y then clearly it would be fine: there would be no possibility of a
>> NULL having been substituted for a proper value.
>>
>> What if the subquery selected a.x, b.y? Initially, I thought that
>> would be OK too, because of the fact that the a.y = b.y clause is in
>> the WHERE clause rather than the join condition. But on further
>> thought I think that probably doesn't work, because with = being a
>> non-strict operator there's no guarantee that it would remove any
>> nulls introduced by the left join. Of course, if the subselect had a
>> WHERE clause saying that b.x/b.y IS NOT NULL then having the SELECT
>> list mention those columns would be fine.
>>
>

In my previous reply to this, I probably didn't answer your question,
since I only explained the restriction on where equality conditions on
partition keys can appear. Here are answers to your questions, assuming
those restrictions don't exist. Actually, in the example you have
given, the optimizer flattens w into a LJ b, which makes the
explanations below a bit complicated.

1. SELECT * FROM (SELECT b.x, b.y FROM a LEFT JOIN b ON a.x = b.x
WHERE a.y = b.y) w LEFT JOIN c ON w.x = c.x AND w.y = c.y;
partition-wise join will be possible between a and b but not between w
and c for the reasons you have explained above.
2. SELECT * FROM (SELECT a.x, a.y FROM a LEFT JOIN b ON a.x = b.x
WHERE a.y = b.y) w LEFT JOIN c ON w.x = c.x AND w.y = c.y;
partition-wise join will be possible between a and b and also between
w and c for the reasons you have explained above.
3. SELECT * FROM (SELECT a.x, b.y FROM a LEFT JOIN b ON a.x = b.x
WHERE a.y = b.y) w LEFT JOIN c ON w.x = c.x AND w.y = c.y;
partition-wise join will be possible between a and b but not w and c
as you have explained.

In this case, b.x and b.y will appear as nullable_partexprs in w
(represented as a LJ b in the optimizer) and a.x and a.y will appear in
partexprs. Depending upon what gets projected out of w, the join
between w and c will use the corresponding keys for equality conditions.
Since the operator is non-strict, any expression which is part of
nullable_partexprs will be discarded in
match_expr_to_partition_keys().
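
The bookkeeping described above can be mimicked with a toy model. This
is only a sketch under simplifying assumptions, not the patch's actual
code: Rel, left_join and usable_as_partition_key are hypothetical
names, partition keys are plain strings, and the only rule modelled is
that keys from the nullable side of a LEFT JOIN are usable solely with
a strict equality operator.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rel:
    """Toy stand-in for the partition key bookkeeping of a relation."""
    partexprs: frozenset = frozenset()
    nullable_partexprs: frozenset = frozenset()


def left_join(outer, inner):
    # Outer-side keys keep their status; every key coming from the inner
    # (nullable) side may be NULL-extended after the join.
    return Rel(
        partexprs=outer.partexprs,
        nullable_partexprs=(
            outer.nullable_partexprs | inner.partexprs | inner.nullable_partexprs
        ),
    )


def usable_as_partition_key(rel, expr, operator_is_strict):
    # Non-nullable keys always qualify; nullable ones only when the
    # equality operator is strict.
    if expr in rel.partexprs:
        return True
    return operator_is_strict and expr in rel.nullable_partexprs


a = Rel(partexprs=frozenset({"a.x", "a.y"}))
b = Rel(partexprs=frozenset({"b.x", "b.y"}))
w = left_join(a, b)  # models "a LJ b"


def join_with_c_possible(projected, operator_is_strict):
    """True if every projected key of w can be matched against c's keys."""
    return all(
        usable_as_partition_key(w, expr, operator_is_strict) for expr in projected
    )


# The three cases above, with a non-strict equality operator.
for projected in (["b.x", "b.y"], ["a.x", "a.y"], ["a.x", "b.y"]):
    print(projected, join_with_c_possible(projected, operator_is_strict=False))
```

With a non-strict operator only the projection of a.x, a.y allows the
join between w and c, matching cases 1 to 3 above.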

Hope that helps.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

From:	Jeevan Chalke <jeevan(dot)chalke(at)enterprisedb(dot)com>
To:	Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
Cc:	Robert Haas <robertmhaas(at)gmail(dot)com>, Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Partition-wise join for join between (declaratively) partitioned tables
Date:	2017-09-20 13:21:07
Message-ID:	CAM2+6=U9P8ED2gf5_AA+1b-bDSo0eik31fAb8PXNU6gVZPS+Sw@mail.gmail.com
Lists:	pgsql-hackers

On Tue, Sep 19, 2017 at 3:17 PM, Ashutosh Bapat <
ashutosh(dot)bapat(at)enterprisedb(dot)com> wrote:

> On Tue, Sep 19, 2017 at 2:35 AM, Robert Haas <robertmhaas(at)gmail(dot)com>
> wrote:
> > On Mon, Sep 18, 2017 at 8:02 AM, Ashutosh Bapat
> > <ashutosh(dot)bapat(at)enterprisedb(dot)com> wrote:
> >> partition pruning might need partexprs to look up relevant quals, but
> >> nullable_partexprs doesn't have any use there. So maybe we should add
> >> nullable_partexprs to RelOptInfo as part of 0002 (partition-wise join
> >> implementation) instead of 0001. What do you think?
> >
> > +1.
>
> Done.
>
> >
> >>> - I'm not entirely sure whether maintaining partexprs and
> >>> nullable_partexprs is the right design. If I understand correctly,
> >>> whether or not a partexpr is nullable is really a per-RTI property,
> >>> not a per-expression property. You could consider something like
> >>> "Relids nullable_rels".
> >>
> >> That's true. However in order to decide whether an expression falls on
> >> nullable side of a join, we will need to call pull_varnos() on it and
> >> check the output against nullable_rels. Separating the expressions
> >> themselves avoids that step.
> >
> > Good point. Also, I'm not sure about cases like this:
> >
> > SELECT * FROM (SELECT b.x, b.y FROM a LEFT JOIN b ON a.x = b.x WHERE
> > a.y = b.y) w LEFT JOIN c ON w.x = c.x AND w.y = c.y;
> >
> > Suppose the relations are all partitioned by (x, y) but that the =
> > operator is not strict. A partition-wise join is valid between a and
> > b, but we can't regard w as partitioned any more, because w.x might
> > contain nulls in partitions where the partitioning scheme wouldn't
> > allow them. On the other hand, if the subquery were to select a.x,
> > a.y then clearly it would be fine: there would be no possibility of a
> > NULL having been substituted for a proper value.
> >
> > What if the subquery selected a.x, b.y? Initially, I thought that
> > would be OK too, because of the fact that the a.y = b.y clause is in
> > the WHERE clause rather than the join condition. But on further
> > thought I think that probably doesn't work, because with = being a
> > non-strict operator there's no guarantee that it would remove any
> > nulls introduced by the left join. Of course, if the subselect had a
> > WHERE clause saying that b.x/b.y IS NOT NULL then having the SELECT
> > list mention those columns would be fine.
> >
>
> I am actually not sure whether we can use partition-wise join for a
> LEFT JOIN b when the partition key equalities are spread across ON and
> WHERE clauses. I am not able to find any example against it, but I am
> not able to prove it as well. The reference I used for partition-wise
> join [1], mentions JOIN conditions i.e. ON clause conditions. But all
> the examples used in that paper are that of INNER join. So, I am not
> sure what exactly the authors meant by JOIN conditions. Right now I am
> restricting the patch to work with only conditions in the ON clause.
>
> In practice, most operators are strict. If an OUTER join's WHERE
> clause has any partition key equality with a strict operator, the
> optimizer will turn that OUTER join into an INNER one, turning all
> clauses into join clauses. That will enable partition-wise join. So the
> current restriction doesn't rule out any practical cases.
>
> OTOH, I have seen that treating ON and WHERE clauses as same for an
> OUTER join leads to surprising results. So, I am leaning to treat them
> separate for partition-wise join as well and only use ON clause
> conditions for partition-wise join. If we get complaints about
> partition-wise join not being picked we will fix them after proving
> that it's not harmful. Lifting that restriction is not so difficult.
> have_partition_key_equijoin() ignores "pushed down" quals. We have to
> just change that condition.
>
> Your last sentence about a clause b.x IS NOT NULL or b.y IS NOT NULL
> is interesting. If those conditions are in ON clause, we may still
> have a result where b.x and b.y as NULL when no row in "a" matches a
> row in "b". If those conditions are in WHERE clause, I think optimizer
> will turn the join into an INNER join irrespective of whether the
> equality operator is strict.
>
> >
> >> If partition-wise join is disabled, partition-wise aggregates,
> >> strength reduction of MergeAppend won't be possible on a join tree,
> >> but those will be possible on a base relation. Even if partition-wise
> >> join enabled, one may want to disable other partition-wise
> >> optimizations individually. So, they are somewhat independent
> >> switches. I don't think we should bundle all of those into one.
> >> Whatever names we choose for those GUCs, I think they should have same
> >> naming convention e.g. "partition_wise_xyz". I am open to suggestions
> >> about the names.
> >
> > I think the chances of you getting multiple GUCs for different
> > partition-wise optimizations past Tom are pretty low.
>
> We do have enable_hashjoin and enable_hashagg to control use of
> hashing for aggregate and join. On similar lines we can have three
> GUCs to enable use of partition-wise strategy, one for each of join,
> aggregation and sorting. Having granular switches would be useful for
> debugging and may be to turn partition-wise strategies off when they
> are not optimal.

I think having granular control over each of these optimizations will be
handy for DBAs too.

> Do we want a switch to turn ON/OFF partition pruning?
> That said, I am fine with a single GUC controlling all. We won't set any
> partitioning information in RelOptInfo if that GUC is turned OFF.
>
> [1] https://pdfs.semanticscholar.org/27c2/ba75f8b6a39d4bce85d5579dace609c9abaa.pdf
> --
> Best Wishes,
> Ashutosh Bapat
> EnterpriseDB Corporation
> The Postgres Database Company
>

--
Jeevan Chalke
Principal Software Engineer, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company

From:	Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>
To:	Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
Cc:	Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Partition-wise join for join between (declaratively) partitioned tables
Date:	2017-09-20 14:25:53
Message-ID:	CAKcux6n24b8pEXBzMoUH6mFsCArVgWwtgv-se7+M=rTOXA7Ksg@mail.gmail.com
Lists:	pgsql-hackers

On Wed, Sep 20, 2017 at 3:13 PM, Ashutosh Bapat
<ashutosh(dot)bapat(at)enterprisedb(dot)com> wrote:

> On Wed, Sep 20, 2017 at 9:44 AM, Thomas Munro
> <thomas(dot)munro(at)enterprisedb(dot)com> wrote:
> > 2. What queries in the 0008 patch are hitting lines that 0007 doesn't
> > hit?
> >
> > I thought about how to answer questions like this and came up with a
> > shell script that (1) makes computers run really hot for quite a long
> > time and (2) tells you which blocks of SQL hit which lines of C.
> > Please find attached the shell script and its output. The .sql files
> > have been annotated with "block" numbers (blocks being chunks of SQL
> > stuff separated by blank lines), and the C files annotated with
> > references to those block numbers where A<n> = block <n>
> > partition_join.sql and B<n> = block <n> in partition_join_extras.sql.
> >
> > Then to find lines that B queries hit but A queries don't and know
> > which particular queries hit them, you might use something like:
> >
> > grep -v 'SQL blocks: .*A[0-9]' < joinpath.c.aggregated_coverage | \
> > grep 'SQL blocks: .*B[0-9]'
> >
>
> Thanks for this. It generates a lot of output (970 lines over all the
> coverage files). It will take some time for getting anything
> meaningful out of this. May be there's some faster way by looking at
> the lines that are covered by B but not A. BTW, I checked those lines
> to see if there could be any bug there. But I don't see what could go
> wrong with those lines.
>
I have also tried to find test cases in B that hit extra lines not hit
by A, with the help of the results Thomas attached in
coverage.tarball_FILES.
It took a lot of time, but I was able to find some test cases which, when
added to partition_join.sql, increase the number of lines hit by 14. But
to hit these 14 extra lines, the attached patch makes 900+ line inserts
in the partition_join.sql and partition_join.out files.
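
The "hit by B but not by A" filter from the quoted grep pipeline can be sketched in Python as below. This is only an illustration: the two regular expressions are copied from the quoted command, while the sample lines and the function name are hypothetical, since the exact layout of the annotated coverage files is not shown in this thread.

```python
import re

# Patterns copied from the grep pipeline quoted above.
A_BLOCK = re.compile(r"SQL blocks: .*A[0-9]")
B_BLOCK = re.compile(r"SQL blocks: .*B[0-9]")


def lines_hit_only_by_b(lines):
    """Keep lines hit by at least one B block and by no A block."""
    return [ln for ln in lines if B_BLOCK.search(ln) and not A_BLOCK.search(ln)]


# Hypothetical annotated-coverage lines; the real file layout may differ.
sample = [
    "foo(); /* SQL blocks: A3 B7 */",
    "bar(); /* SQL blocks: B12 */",
    "baz(); /* SQL blocks: A1 */",
    "qux();",
]
only_b = lines_hit_only_by_b(sample)
print(only_b)
```

Each line that survives the filter points at a B block whose query could be a candidate for moving into partition_join.sql.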

I used gcov/lcov to measure coverage for the files changed by the
partition-wise join patches, with and without the attached patch. The
results are below.

Columns 2-3: with existing partition_join.sql
Columns 4-5: partition_join.sql + some test cases of partition_join_extra.sql

Modified Files                           | Line Coverage      | Functions        | Line Coverage      | Functions
src/backend/optimizer/geqo               | 79.4 % (269/339)   | 96.6 % (28/29)   | 79.4 % (269/339)   | 96.6 % (28/29)
src/backend/optimizer/path/allpaths.c    | 92.3 % (787/853)   | 95.5 % (42/44)   | 92.6 % (790/853)   | 95.5 % (42/44)
src/backend/optimizer/path/costsize.c    | 96.8 % (1415/1462) | 98.4 % (61/62)   | 96.9 % (1416/1462) | 98.4 % (61/62)
src/backend/optimizer/path/joinpath.c    | 95.5 % (404/423)   | 100.0 % (16/16)  | 95.5 % (404/423)   | 100.0 % (16/16)
src/backend/optimizer/path/joinrels.c    | 92.5 % (422/456)   | 100.0 % (16/16)  | 93.0 % (424/456)   | 100.0 % (16/16)
src/backend/optimizer/plan/createplan.c  | 90.9 % (1928/2122) | 96.3 % (103/107) | 91.0 % (1930/2122) | 96.3 % (103/107)
src/backend/optimizer/plan/planner.c     | 94.9 % (1609/1696) | 97.6 % (41/42)   | 94.9 % (1609/1696) | 97.6 % (41/42)
src/backend/optimizer/plan/setrefs.c     | 91.3 % (806/883)   | 94.3 % (33/35)   | 91.3 % (806/883)   | 94.3 % (33/35)
src/backend/optimizer/prep/prepunion.c   | 95.5 % (661/692)   | 100.0 % (25/25)  | 95.5 % (661/692)   | 100.0 % (25/25)
src/backend/optimizer/util/pathnode.c    | 88.7 % (1144/1290) | 98.1 % (52/53)   | 88.8 % (1146/1290) | 98.1 % (52/53)
src/backend/optimizer/util/placeholder.c | 96.5 % (139/144)   | 100.0 % (10/10)  | 96.5 % (139/144)   | 100.0 % (10/10)
src/backend/optimizer/util/plancat.c     | 89.0 % (540/607)   | 94.7 % (18/19)   | 89.6 % (544/607)   | 94.7 % (18/19)
src/backend/optimizer/util/relnode.c     | 95.3 % (548/575)   | 100.0 % (24/24)  | 95.3 % (548/575)   | 100.0 % (24/24)
src/backend/utils/misc/guc.c             | 67.4 % (1536/2278) | 89.7 % (113/126) | 67.4 % (1536/2278) | 89.7 % (113/126)
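
As a quick cross-check of the "14 extra lines" figure, the covered-line counts from the table can be differenced per file. The numbers below are taken from the table; the variable names are only illustrative.

```python
# (file, lines covered with existing tests, lines covered with the extra tests)
covered = [
    ("optimizer/geqo", 269, 269),
    ("optimizer/path/allpaths.c", 787, 790),
    ("optimizer/path/costsize.c", 1415, 1416),
    ("optimizer/path/joinpath.c", 404, 404),
    ("optimizer/path/joinrels.c", 422, 424),
    ("optimizer/plan/createplan.c", 1928, 1930),
    ("optimizer/plan/planner.c", 1609, 1609),
    ("optimizer/plan/setrefs.c", 806, 806),
    ("optimizer/prep/prepunion.c", 661, 661),
    ("optimizer/util/pathnode.c", 1144, 1146),
    ("optimizer/util/placeholder.c", 139, 139),
    ("optimizer/util/plancat.c", 540, 544),
    ("optimizer/util/relnode.c", 548, 548),
    ("utils/misc/guc.c", 1536, 1536),
]

# Per-file gain, keeping only files whose coverage actually changed.
gains = {name: after - before for name, before, after in covered if after != before}
extra_total = sum(gains.values())

for name, gain in gains.items():
    print(f"{name}: +{gain}")
print("total extra lines:", extra_total)
```

Six files gain lines, and the gains sum to the 14 lines mentioned above.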

Thanks & Regards,
Rajkumar Raghuwanshi
QMG, EnterpriseDB Corporation

Attachment	Content-Type	Size
partition_join_with_some_testcases_from_extra.patch	text/x-patch	54.9 KB

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
Cc:	Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Partition-wise join for join between (declaratively) partitioned tables
Date:	2017-09-21 03:42:46
Message-ID:	CA+TgmoawL8GWM5CcWM_SVb2G5EXkbt0WwKLn5DatTNBctkD9qA@mail.gmail.com
Lists:	pgsql-hackers

On Tue, Sep 19, 2017 at 5:47 AM, Ashutosh Bapat
<ashutosh(dot)bapat(at)enterprisedb(dot)com> wrote:
> Done.

Committed 0001 with extensive editorialization. I did not think it
was a good idea to include partition.h in a file in src/include/nodes,
so I worked around that. The include of pg_inherits_fn.h was
unneeded. I rewrote a lot of the comments and made some other style
tweaks.

Don't look now, but I think it might be about time for the main act.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Partition-wise join for join between (declaratively) partitioned tables
Date:	2017-09-21 12:21:30
Message-ID:	CAFjFpRc4UdCYknBai9pBu2GA1h4nZVNPDmzgs4jOkqFamT1huA@mail.gmail.com
Lists:	pgsql-hackers

On Thu, Sep 21, 2017 at 9:12 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Tue, Sep 19, 2017 at 5:47 AM, Ashutosh Bapat
> <ashutosh(dot)bapat(at)enterprisedb(dot)com> wrote:
>> Done.
>
> Committed 0001 with extensive editorialization. I did not think it
> was a good idea to include partition.h in a file in src/include/nodes,
> so I worked around that. The include of pg_inherits_fn.h was
> unneeded. I rewrote a lot of the comments and made some other style
> tweaks.

Thanks a lot Robert. Thanks for changing comments to be more precise and crisp.

Here's a set of rebased patches. The patch with extra tests is not for
committing. All other patches, except the last one, will need to be
committed together. The last patch may be committed along with other
patches or as a separate patch.

About your earlier comment on making build_joinrel_partition_info()
simpler: right now, the code assumes that partexprs or
nullable_partexpr can be NULL when either of them is not populated.
That may save sizeof(pointer) * (number of keys) bytes of memory.
Saving that much memory may not be worth the complexity of the code. So,
we could always allocate memory for those arrays and fill them with NIL
values when there are no key expressions to populate them. That will
simplify the code. I haven't made that change in this patchset. I was
busy debugging the Q7 regression. Let me know your comments about
that.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Attachment	Content-Type	Size
pg_dp_join_patches_v34.tar.gz	application/x-gzip	123.1 KB

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
Cc:	Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Partition-wise join for join between (declaratively) partitioned tables
Date:	2017-09-21 12:52:13
Message-ID:	CA+TgmoYf4Jz94nHK0=1se9ZsyRzVOoOpWFXyeb-tyWqYepfwSg@mail.gmail.com
Lists:	pgsql-hackers

On Thu, Sep 21, 2017 at 8:21 AM, Ashutosh Bapat
<ashutosh(dot)bapat(at)enterprisedb(dot)com> wrote:
> About your earlier comment of making build_joinrel_partition_info()
> simpler. Right now, the code assumes that partexprs or
> nullable_partexpr can be NULL when either of them is not populated.
> That may save sizeof(pointer) * (number of keys) bytes of memory.
> Saving that much memory may not be worth the complexity of code. So,
> we may always allocate memory for those arrays and fill it with NIL
> values when there are no key expressions to populate those. That will
> simplify the code. I haven't done that change in this patchset. I was
> busy debugging the Q7 regression. Let me know your comments about
> that.

Hmm, I'm not sure that's the best approach, but let me look at it more
carefully before I express a firm opinion.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
To:	Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>
Cc:	Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, Robert Haas <robertmhaas(at)gmail(dot)com>, Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Partition-wise join for join between (declaratively) partitioned tables
Date:	2017-09-21 13:07:25
Message-ID:	CAFjFpRee-0oj4Fq1RjZ78fiwhfSUFAvMa05WtAUtUUCjXOpH9A@mail.gmail.com
Lists:	pgsql-hackers

On Mon, Sep 18, 2017 at 10:18 AM, Rafia Sabih
<rafia(dot)sabih(at)enterprisedb(dot)com> wrote:
>>
>
> Limit  (cost=83341943.28..83341943.35 rows=1 width=92) (actual time=1556989.996..1556989.997 rows=1 loops=1)
>   ->  Finalize GroupAggregate  (cost=83341943.28..83342723.24 rows=10064 width=92) (actual time=1556989.994..1556989.994 rows=1 loops=1)
>         Group Key: n1.n_name, n2.n_name, (date_part('year'::text, (lineitem_001.l_shipdate)::timestamp without time zone))
>         ->  Sort  (cost=83341943.28..83342043.92 rows=40256 width=92) (actual time=1556989.910..1556989.911 rows=6 loops=1)
>               Sort Key: n1.n_name, n2.n_name, (date_part('year'::text, (lineitem_001.l_shipdate)::timestamp without time zone))
>               Sort Method: quicksort  Memory: 27kB
>               ->  Gather  (cost=83326804.81..83338864.31 rows=40256 width=92) (actual time=1550598.855..1556989.760 rows=20 loops=1)
>                     Workers Planned: 4
>                     Workers Launched: 4
>
> AFAICU the node above sort is group-aggregate and then there is limit,
> and the number of rows for sort node in explain analyse is returned
> number of rows. So, what is happening here is once one group is
> completed it is aggregated and fetched by limit, now there is no need
> for sort to return any more rows and hence the result.

Thanks for your explanation. That makes sense. I forgot about LIMIT node on top.

I debugged the plans today and performed some experiments. Here are my
observations.

The join order changes with and without partition-wise join. Without
partition-wise join it is
(lineitem, (supplier, nation1)), (orders, (customer, nation2)). The
join (lineitem, (supplier, nation1)) is executed by one Gather node
and (orders, (customer, nation2)) is executed by another. Thus the plan
has two Gather nodes, which feed the topmost join.
With partition-wise join the join order is ((lineitem, orders),
(supplier, nation1)), (customer, nation2). The join (lineitem, orders)
uses partition-wise join. This plan executes the whole join tree along
with partial group aggregation under a Gather Merge.

The rows estimated for various nodes under Gather/GatherMerge are
different from the actual rows e.g.
->  Hash Join  (cost=113164.47..61031454.40 rows=10789501 width=46) (actual time=3379.931..731987.943 rows=8744357 loops=5)
(in the non-partition-wise join plan) or
->  Append  (cost=179532.36..80681785.95 rows=134868761 width=24) (actual time=9437.573..1360219.567 rows=109372134 loops=5)
(in the partition-wise join plan).
I first thought that this was a real estimation error and spent some
time investigating it. But I eventually realised that this is just how
a parallel query plan is reported, when I saw that the Gather node
estimated the correct number of rows even though the nodes under it
showed this difference. Here's the explanation. There are 4
parallel workers, so the leader's contribution is estimated to
be 0 by get_parallel_divisor(). The estimates are therefore per worker,
and the total estimated rows produced by any of these nodes is 4 times
the reported number. But when the query actually runs, the leader also
participates, so the number of loops is 5 and the actual rows reported are
(total actual rows) / (number of loops, i.e. the number of backends that
executed the node). The total estimated rows and total actual rows
are roughly equal. So there's no real estimation error, as I thought
earlier. Maybe we want to make EXPLAIN (ANALYZE) output easier to
understand.
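
The per-worker versus per-loop arithmetic can be checked with a small sketch. The get_parallel_divisor() below is a Python paraphrase of the planner function as it stood around this time (workers plus a leader share of 1.0 - 0.3 * workers when that is positive); treat the 0.3 constant and the exact form as an assumption rather than a quotation of the source. The row counts are the two plan nodes quoted above.

```python
def get_parallel_divisor(parallel_workers):
    """Assumed planner logic: workers plus the leader's estimated share."""
    divisor = float(parallel_workers)
    leader_contribution = 1.0 - 0.3 * parallel_workers
    if leader_contribution > 0:
        divisor += leader_contribution
    return divisor


workers = 4
loops = 5  # 4 workers plus the leader, as reported by EXPLAIN ANALYZE

# node name -> (estimated rows per worker, actual rows averaged per loop)
nodes = {
    "Hash Join": (10789501, 8744357),
    "Append": (134868761, 109372134),
}

totals = {}
for name, (est_rows, actual_rows) in nodes.items():
    est_total = est_rows * get_parallel_divisor(workers)
    actual_total = actual_rows * loops
    totals[name] = (est_total, actual_total)
    print(f"{name}: estimated total {est_total:.0f}, actual total {actual_total}, "
          f"ratio {est_total / actual_total:.3f}")
```

With 4 workers the leader share comes out negative, so the divisor stays at 4, and both nodes end up with estimated and actual totals within about 1.5% of each other.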

When I tried the same query on my laptop with scale 20, I found that the
leader really contributes as much as the other workers. So, the
partial paths were created based on an estimate which was 20%
off. The cost difference between the partition-wise join plan and the
non-partition-wise join plan is hardly 1.5%. So, it's possible that if
we correct this estimation error, the partition-wise join plan won't be
chosen because it will have a higher cost. Remember there are two
Gather nodes in the non-partition-wise join plan and the partition-wise
join plan has one. So, the non-partition-wise join path gets the 20%
decreased estimates twice and the partition-wise join path gets it only
once.

The explain (analyze, verbose) of a parallel node looks like
->  Parallel Seq Scan on public.lineitem_002  (cost=0.00..168752.99 rows=573464 width=24) (actual time=1.395..3075.485 rows=454464 loops=5)
      Filter: ((lineitem_002.l_shipdate >= '1995-01-01'::date) AND (lineitem_002.l_shipdate <= '1996-12-31'::date))
      Rows Removed by Filter: 1045065
      Worker 0: actual time=3.358..3131.426 rows=458267 loops=1
      Worker 1: actual time=0.860..3146.282 rows=447231 loops=1
      Worker 2: actual time=1.317..3123.646 rows=489960 loops=1
      Worker 3: actual time=0.927..3130.497 rows=475545 loops=1
If we sum the rows returned by each worker they don't add up to
(actual rows) * (actual loops). So I assumed that the unreported
number of rows were processed by the leader. Is that right?
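
Under the hypothesis raised above (that the unreported rows belong to the leader), the leader's share can be worked out from the numbers in this plan node. This is only a sketch of that arithmetic: it assumes the top-line "rows" value is the per-loop average, so the result is approximate to within rounding.

```python
avg_rows_per_loop = 454464  # "rows" on the Parallel Seq Scan line
loops = 5                   # 4 workers plus the leader
worker_rows = [458267, 447231, 489960, 475545]  # Worker 0..3

total_rows = avg_rows_per_loop * loops
rows_from_workers = sum(worker_rows)
# Whatever the workers did not report is attributed to the leader.
leader_rows = total_rows - rows_from_workers

print("total:", total_rows)
print("workers:", rows_from_workers)
print("leader (approx.):", leader_rows)
```

The leftover is of the same order as each worker's count, which would be consistent with the leader doing a comparable share of the scan.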

I might be misunderstanding how parallel query works, but here's my
analysis so far. I will continue investigating further.

Any clues would be helpful.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

From:	Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>
To:	Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>
Cc:	Robert Haas <robertmhaas(at)gmail(dot)com>, Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>, Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Partition-wise join for join between (declaratively) partitioned tables
Date:	2017-09-22 05:15:47
Message-ID:	CAOGQiiP7dfdG4JgCtnJMz-ww0a15NitdjF6qjxV7SWmno6DMpQ@mail.gmail.com
Lists:	pgsql-hackers

On Tue, Sep 19, 2017 at 2:58 PM, Rafia Sabih
<rafia(dot)sabih(at)enterprisedb(dot)com> wrote:
> On Fri, Sep 15, 2017 at 2:09 PM, Rafia Sabih
> <rafia(dot)sabih(at)enterprisedb(dot)com> wrote:
>>>
>> On TPC-H benchmarking of this patch, I found a regression in Q7. It
>> was taking some 1500s with the patch and some 900s without the patch.
>> Please find the attached pwd_reg.zip for the output of explain analyse
>> on head and with patch.
>>
>> The experimental settings used were,
>> commit-id = 0c504a80cf2e6f66df2cdea563e879bf4abd1629
>> patch-version = v26
>>
>> Server settings:
>> work_mem = 1GB
>> shared_buffers = 10GB
>> effective_cache_size = 10GB
>> max_parallel_workers_per_gather = 4
>>
>> Partitioning information:
>> Partitioning scheme = by range
>> Number of partitions in lineitem and orders table = 106
>> partition key for lineitem = l_orderkey
>> partition key for orders = o_orderkey
>>
>> Apart from these there is a regression case on a custom table, on head
>> query completes in 20s and with this patch it takes 27s. Please find
>> the attached .out and .sql file for the output and schema for the test
>> case respectively. I have reported this case before (sometime around
>> March this year) as well, but I am not sure if it was overlooked or is
>> an unimportant and expected behaviour for some reason.
>>
>
> On completing the benchmark for all queries for the above mentioned
> setup, following performance improvement can be seen,
> Query | Patch | Head
> 3 | 1455 | 1631
> 4 | 499 | 4344
> 5 | 1464 | 1606
> 10 | 1475 | 1599
> 12 | 1465 | 1790
>
> Note that all values of execution time are in seconds.

I compared this experiment with non-partitioned database and following
is the result,
Query | Non-partitioned head
3 | 1752
4 | 315
5 | 2319
10 | 1535
12 | 1739

In summary, the query that appears slowest in the partitioned database is
not so otherwise. It is good to see that in Q4 partition-wise join
helps in achieving performance closer to its non-partitioned case;
otherwise partitioning alone causes it to suffer greatly. Apart from
Q4 it does not look like partitioning hurts anywhere else, though the
maximum improvement is ~35% for Q5.
Another point to note here is that the performance on partitioned and
unpartitioned heads is quite close (except Q4), which is something
at least I wasn't expecting. It looks like we need not partition the
tables anyway, or at least this set of queries doesn't benefit from
partitioning. Please let me know if somebody has better ideas on how
partitioning schemes should be applied to make them more beneficial for
these queries.

--
Regards,
Rafia Sabih
EnterpriseDB: http://www.enterprisedb.com/

Attachment	Content-Type	Size
pg_unpart.zip	application/zip	27.9 KB

From:	Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
To:	Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>
Cc:	Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, Robert Haas <robertmhaas(at)gmail(dot)com>, Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Partition-wise join for join between (declaratively) partitioned tables
Date:	2017-09-22 08:33:14
Message-ID:	CAFjFpRcuw4TtJsL+kAhF+GRz7Vv1MW=pnmhw2Zp2dAm3kcCh6A@mail.gmail.com
Lists:	pgsql-hackers

On Fri, Sep 22, 2017 at 10:45 AM, Rafia Sabih
<rafia(dot)sabih(at)enterprisedb(dot)com> wrote:
>>
>> On completing the benchmark for all queries for the above mentioned
>> setup, following performance improvement can be seen,
>> Query | Patch | Head
>> 3 | 1455 | 1631
>> 4 | 499 | 4344
>> 5 | 1464 | 1606
>> 10 | 1475 | 1599
>> 12 | 1465 | 1790
>>
>> Note that all values of execution time are in seconds.
>
> I compared this experiment with non-partitioned database and following
> is the result,
> Query | Non-partitioned head
> 3 | 1752
> 4 | 315
> 5 | 2319
> 10 | 1535
> 12 | 1739
>
> In summary, the query that appears slowest in the partitioned database is
> not so otherwise. It is good to see that in Q4 partition-wise join
> helps in achieving performance closer to its non-partitioned case;
> otherwise partitioning alone causes it to suffer greatly. Apart from
> Q4 it does not look like partitioning hurts anywhere else, though the
> maximum improvement is ~35% for Q5.
> Another point to note here is that the performance on partitioned and
> unpartitioned heads is quite close (except Q4), which is something
> at least I wasn't expecting. It looks like we need not partition the
> tables anyway, or at least this set of queries doesn't benefit from
> partitioning. Please let me know if somebody has better ideas on how
> partitioning schemes should be applied to make them more beneficial for
> these queries.

Just partitioning is not expected to improve query performance (but we
still see some performance improvement). Partitioning together with
partition-wise operations and pruning is expected to show performance
gains. IIUC the results you reported, Q3 takes 1752 seconds with
non-partitioned head, with partitioning it completes in 1631 seconds, and
with partition-wise join it completes in 1455 seconds, so the net
improvement because of partitioning is about 300 seconds, roughly 17%,
which is a lot for very large data. So, except Q4, every query improves
when the tables are partitioned. Am I interpreting the results correctly?

There may be some other way of partitioning, which may give better
results, but I think what we have now shows the importance of
partitioning in case of very large data e.g. scale 300 TPCH.
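
The comparison between the non-partitioned head and the patched, partitioned run can be tabulated from the numbers reported in this thread. The sketch below only restates that arithmetic; the dictionary layout and names are illustrative.

```python
# query -> (non-partitioned head, partitioned head, partitioned + patch), seconds
timings = {
    3: (1752, 1631, 1455),
    4: (315, 4344, 499),
    5: (2319, 1606, 1464),
    10: (1535, 1599, 1475),
    12: (1739, 1790, 1465),
}

# Percentage gain of the patched, partitioned run over the non-partitioned head.
gain_pct = {
    q: 100.0 * (nonpart - patch) / nonpart
    for q, (nonpart, _part_head, patch) in timings.items()
}
regressions = [q for q, pct in gain_pct.items() if pct < 0]

for q, pct in gain_pct.items():
    print(f"Q{q}: {pct:+.1f}%")
print("slower than non-partitioned head:", regressions)
```

Only Q4 comes out slower than the non-partitioned head once partition-wise join is enabled; without the patch, Q10 and Q12 on the partitioned head are also slightly slower than their non-partitioned runs.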

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

From:	Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
To:	Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>
Cc:	Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, Robert Haas <robertmhaas(at)gmail(dot)com>, Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Partition-wise join for join between (declaratively) partitioned tables
Date:	2017-09-22 13:06:52
Message-ID:	CAFjFpRfAreXz2s+qOEjjPM7p5x_=GZLU474diNgrAgDZ9rA62g@mail.gmail.com
Lists:	pgsql-hackers

On Fri, Sep 15, 2017 at 5:29 PM, Ashutosh Bapat
<ashutosh(dot)bapat(at)enterprisedb(dot)com> wrote:
>
>>
>> Apart from these there is a regression case on a custom table, on head
>> query completes in 20s and with this patch it takes 27s. Please find
>> the attached .out and .sql file for the output and schema for the test
>> case respectively. I have reported this case before (sometime around
>> March this year) as well, but I am not sure if it was overlooked or is
>> an unimportant and expected behaviour for some reason.
>>
>
> Are you talking about [1]? I have explained about the regression in
> [2] and [3]. This looks like an issue with the existing costing model.
>

I debugged this case further. There are two partitioned tables being
joined: prt (with partitions prt_p1, prt_p2 and so on) and prt2 (with
partitions prt2_p1, prt2_p2, and so on). When the join is executed without
partition-wise join, prt2 is used to build the hash table and prt is used
to probe it. prt2 has fewer rows than prt. But
when partition-wise join is used, individual partitions are joined in
the reverse order, i.e. partitions of prt are used to build the hash
table and partitions of prt2 are used to probe. This happens because
the path for the other join order (partition of prt2 used to build the
hash table and partition of prt used to probe) has a huge cost compared
to the first one (74459 and 313109), and a portion worth 259094 comes
from lines 3226/7 of final_cost_hashjoin():
3215     /*
3216      * The number of tuple comparisons needed is the number of outer
3217      * tuples times the typical number of tuples in a hash bucket, which
3218      * is the inner relation size times its bucketsize fraction. At each
3219      * one, we need to evaluate the hashjoin quals. But actually,
3220      * charging the full qual eval cost at each tuple is pessimistic,
3221      * since we don't evaluate the quals unless the hash values match
3222      * exactly. For lack of a better idea, halve the cost estimate to
3223      * allow for that.
3224      */
3225     startup_cost += hash_qual_cost.startup;
3226     run_cost += hash_qual_cost.per_tuple * outer_path_rows *
3227         clamp_row_est(inner_path_rows * innerbucketsize) * 0.5;

That's because for some reason innerbucketsize for a partition of prt is
22 times larger than that for a partition of prt2. It looks like we have
some error in estimating bucket sizes.
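
To see how strongly innerbucketsize drives that term, here is a small Python sketch of the expression on lines 3226/7. The clamp_row_est() below mirrors what the C helper is understood to do (round to an integer, never below 1), which is an assumption here; all input numbers are hypothetical and chosen only to show a 22x ratio, not taken from the actual test case.

```python
def clamp_row_est(nrows):
    """Assumed behaviour of the C helper: at least 1 row, rounded to an integer."""
    if nrows <= 1.0:
        return 1.0
    return float(round(nrows))


def hash_qual_run_cost(per_tuple, outer_rows, inner_rows, innerbucketsize):
    """The run_cost term charged for evaluating hash quals (lines 3226/7)."""
    return per_tuple * outer_rows * clamp_row_est(inner_rows * innerbucketsize) * 0.5


# Hypothetical inputs: identical except for the bucket size fraction.
per_tuple = 0.0025
outer_rows = 100000
inner_rows = 10000

cost_small = hash_qual_run_cost(per_tuple, outer_rows, inner_rows, 0.001)
cost_large = hash_qual_run_cost(per_tuple, outer_rows, inner_rows, 0.001 * 22)

print(cost_small, cost_large, cost_large / cost_small)
```

Because the term is linear in the clamped bucket occupancy, a 22x difference in innerbucketsize translates into roughly a 22x difference in this part of the cost, which can be enough to flip the chosen build and probe sides.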

If I force partitions to be joined with the same order as partitioned
tables (without partition-wise join), child-joins execute faster and
in turn partition-wise join performs better than the
non-partition-wise join. So, this is clearly some estimation and
costing problem with regular joins.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
Cc:	Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Partition-wise join for join between (declaratively) partitioned tables
Date:	2017-10-03 02:18:11
Message-ID:	CA+Tgmoa1icxn-+XMT1HGm1ogY6z8gcYqfrOAfbyyjp-RsaaY3A@mail.gmail.com
Lists:	pgsql-hackers

On Thu, Sep 21, 2017 at 8:21 AM, Ashutosh Bapat
<ashutosh(dot)bapat(at)enterprisedb(dot)com> wrote:
> Here's set of rebased patches. The patch with extra tests is not for
> committing. All other patches, except the last one, will need to be
> committed together. The last patch may be committed along with other
> patches or as a separate patch.

In set_append_rel_size, is it necessary to set attr_needed =
bms_copy(rel->attr_needed[index]) rather than just pointing to the
existing value? If so, perhaps the comments should explain the
reasons. I would have thought that the values wouldn't change after
this point, in which case it might not be necessary to copy them.

Regarding nomenclature and my previous griping about wisdom, I was
wondering about just calling this a "partition join" like you have in
the regression test. So the GUC would be enable_partition_join, you'd
have generate_partition_join_paths(), etc. Basically just delete
"wise" throughout.

The elog(DEBUG3) in try_partition_wise_join() doesn't follow message
style guidelines and I think should just be removed. It was useful
for development, I'm sure, but it's time for it to go.

+ elog(ERROR, "unrecognized path node type %d", (int) nodeTag(path));

I think we should use the same formulation as elsewhere, namely
"unrecognized node type: %d". And likewise probably "unexpected join
type: %d".

partition_join_extras.sql has a bunch of whitespace damage, although
it doesn't really matter since, as you say, that's not for commit.

(This is not a full review, just a few thoughts.)

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Partition-wise join for join between (declaratively) partitioned tables
Date:	2017-10-03 12:57:05
Message-ID:	CAFjFpRcpbMmsKv_eCn6SNoiVrNi=y4RXfE8b3UH=ZZOEDL_w2g@mail.gmail.com
Lists:	pgsql-hackers

On Tue, Oct 3, 2017 at 7:48 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Thu, Sep 21, 2017 at 8:21 AM, Ashutosh Bapat
> <ashutosh(dot)bapat(at)enterprisedb(dot)com> wrote:
>> Here's set of rebased patches. The patch with extra tests is not for
>> committing. All other patches, except the last one, will need to be
>> committed together. The last patch may be committed along with other
>> patches or as a separate patch.
>
> In set_append_rel_size, is it necessary to set attr_needed =
> bms_copy(rel->attr_needed[index]) rather than just pointing to the
> existing value? If so, perhaps the comments should explain the
> reasons. I would have thought that the values wouldn't change after
> this point, in which case it might not be necessary to copy them.

Right. The only places where attr_needed is changed are
remove_rel_from_query() (useless join removal) and
add_vars_to_targetlist(). Both of those happen before
set_append_rel_size(). Since the parent and child joins should project
the same attributes, having them share the Relids set makes more sense.
So, I changed it accordingly and explained the same in comments.

Also, changed list_nth() in the following code block to use list_nth_node().

>
> Regarding nomenclature and my previous griping about wisdom, I was
> wondering about just calling this a "partition join" like you have in
> the regression test. So the GUC would be enable_partition_join, you'd
> have generate_partition_join_paths(), etc. Basically just delete
> "wise" throughout.

Partition-wise join is standard term used in literature and in
documentation of other popular DBMSes, so partition_wise makes more
sense. But I am fine with partition_join as well. Do you want it
partition_join or partitionjoin like enable_mergejoin/enable_hashjoin
etc.?

>
> The elog(DEBUG3) in try_partition_wise_join() doesn't follow message
> style guidelines and I think should just be removed. It was useful
> for development, I'm sure, but it's time for it to go.

Done.

>
> + elog(ERROR, "unrecognized path node type %d", (int) nodeTag(path));
>
> I think we should use the same formulation as elsewhere, namely
> "unrecognized node type: %d". And likewise probably "unexpected join
> type: %d".

Changed "unrecognized path node type" to "unrecognized node type".

"unrecognized join type: %d" seems to be used everywhere except
postgres_fdw. So, used that. Also added a cast to int similar to other
places.

>
> partition_join_extras.sql has a bunch of whitespace damage, although
> it doesn't really matter since, as you say, that's not for commit.
>

Right. I will remove that patch from the patch-set when those tests
are no longer needed, i.e. once we are done with code changes to the
patches.

Attached the updated patch-set.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Attachment	Content-Type	Size
pg_dp_join_patches_v35.tar.gz	application/x-gzip	123.0 KB

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
Cc:	Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Partition-wise join for join between (declaratively) partitioned tables
Date:	2017-10-03 19:27:14
Message-ID:	CA+TgmoYfD00O908MCbxPzMSjrmemm5Lo8yW41S8EdFy81p8w7Q@mail.gmail.com
Lists:	pgsql-hackers

On Tue, Oct 3, 2017 at 8:57 AM, Ashutosh Bapat
<ashutosh(dot)bapat(at)enterprisedb(dot)com> wrote:
>> Regarding nomenclature and my previous griping about wisdom, I was
>> wondering about just calling this a "partition join" like you have in
>> the regression test. So the GUC would be enable_partition_join, you'd
>> have generate_partition_join_paths(), etc. Basically just delete
>> "wise" throughout.
>
> Partition-wise join is standard term used in literature and in
> documentation of other popular DBMSes, so partition_wise makes more
> sense. But I am fine with partition_join as well. Do you want it
> partition_join or partitionjoin like enable_mergejoin/enable_hashjoin
> etc.?

Well, you're making me have second thoughts. It's really just that
partition_wise looks a little awkward to me, and maybe that's not
enough reason to change anything. I suppose if I commit it this way
and somebody really hates it, it can always be changed later. We're
not getting a lot of input from anyone else at the moment.

> Attached the updated patch-set.

I decided to skip over 0001 for today and spend some time looking at
0002-0006. Comments below.

0002:

Looks fine.

0003:

The commit message mentions estimate_num_groups but the patch doesn't touch it.

I am concerned that this patch might introduce some problem fixed by
commit dd4134ea56cb8855aad3988febc45eca28851cd8. The comment in that
patch says, at one place, that "This protects against possible
incorrect matches to child expressions that contain no Vars."
However, if a child expression has no Vars, then I think em->em_relids
will be empty, so the bms_is_equal() test that is there now will fail
but your proposed bms_is_subset() test will pass.
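
To make the concern concrete, here is a minimal sketch in plain C,
assuming a 64-bit mask as a stand-in for a Bitmapset. The names
relid_set, set_is_equal and set_is_subset are hypothetical and are not
the bms_* functions themselves; the sketch only shows that an empty set
passes a subset test against any relids while failing an equality test.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a Bitmapset: bit i set means relid i is a member. */
typedef uint64_t relid_set;

/* True when both sets contain exactly the same members. */
static bool
set_is_equal(relid_set a, relid_set b)
{
    return a == b;
}

/* True when every member of a is also in b; the empty set is a subset of anything. */
static bool
set_is_subset(relid_set a, relid_set b)
{
    return (a & ~b) == 0;
}

/* Shape of the existing check: accept a child member only on an exact match. */
static bool
child_member_accepted_equal(relid_set em_relids, relid_set relids)
{
    return set_is_equal(em_relids, relids);
}

/* Shape of the proposed check: accept a child member whose relids are contained in relids. */
static bool
child_member_accepted_subset(relid_set em_relids, relid_set relids)
{
    return set_is_subset(em_relids, relids);
}
```

With relids describing a child join of relids 3 and 4, a member with
empty em_relids is rejected by the equality form and accepted by the
subset form, which is the case the question above is about.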

0004:

I suggest renaming get_wholerow_ref_from_convert_row_type to
is_converted_whole_row_reference and making it return a bool.

The coding of that function is a little strange; why not move Var to
an inner scope? Like this: if (IsA(convexpr->arg, Var)) { Var *var =
castNode(Var, convexpr->arg); if (var->varattno == 0) return var; }
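
Combined with the rename to a bool-returning function, the suggested
shape could look roughly like the sketch below. It is self-contained
and uses hypothetical mock node types (MockNode, MockVar, MockConvert)
in place of the real Node, Var and ConvertRowtypeExpr, and it handles
only a single level of conversion; it is an illustration of the inner
scope idea, not the patch's code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical mock node tags and structs; not the PostgreSQL definitions. */
typedef enum { MOCK_VAR, MOCK_CONVERT, MOCK_OTHER } MockTag;

typedef struct MockNode { MockTag tag; } MockNode;

typedef struct MockVar
{
    MockNode node;      /* must be first so a MockNode pointer can be cast */
    int      varattno;  /* 0 stands for a whole-row reference */
} MockVar;

typedef struct MockConvert
{
    MockNode  node;     /* must be first so a MockNode pointer can be cast */
    MockNode *arg;      /* expression being converted */
} MockConvert;

/* Returns true if node is a row-type conversion applied directly to a whole-row Var. */
static bool
is_converted_whole_row_reference(const MockNode *node)
{
    if (node == NULL || node->tag != MOCK_CONVERT)
        return false;

    const MockConvert *convexpr = (const MockConvert *) node;

    assert(convexpr->arg != NULL);
    if (convexpr->arg->tag == MOCK_VAR)
    {
        /* The Var lives in the inner scope, where it is actually needed. */
        const MockVar *var = (const MockVar *) convexpr->arg;

        if (var->varattno == 0)
            return true;
    }
    return false;
}
```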

Will the statement that "In case of multi-level partitioning, we will
have as many nested ConvertRowtypeExpr as there are levels in
partition hierarchy" be falsified by Amit Khandekar's pending patch to
avoid sticking a ConvertRowTypeExpr on top of another
ConvertRowTypeExpr? Even if the answer is "no", I think it might be
better to drop this part of the comment; it would be easy for it to
become false in the future, because we might want to optimize that
case in the future and we'll probably forget to update this comment
when we do.

In fix_upper_expr_mutator(), you have an if statement whose entire
contents are another if statement. I think you should use && instead,
and maybe reverse the order of the tests, since
context->subplan_itlist->has_conv_whole_rows is probably cheaper to
test than a function call. It's also a little strange that this code
isn't adjacent to, or merged with, the existing has_non_vars case.
Maybe:

converted_whole_row = is_converted_whole_row_reference(node);
if (context->outer_itlist && (context->outer_itlist->has_non_vars ||
(context->outer_itlist->has_conv_whole_rows && converted_whole_row)))
...
if (context->inner_itlist && (context->inner_itlist->has_non_vars ||
(context->inner_itlist->has_conv_whole_rows && converted_whole_row)))
...

0005:

The comment explaining why the ParamPathInfo is allocated in the same
context as the RelOptInfo is a modified copy of an existing comment
that still reads like the original, a manner of commenting I find a
bit undesirable as it leads to filling up the source base with
duplicate comments.

I don't think I believe that comment, either. In the case from which
that comment was copied (mark_dummy_rel), it was talking about a
RelOptInfo, and geqo_eval() takes care to remove any leftover pointers
to joinrels created during a GEQO cycle. But there's no similar
logic for ppilist, so I think what will happen here is that you'll end
up with a freed node in the middle of the list.

I think reparameterize_path_by_child() could use a helper function
reparameterize_pathlist_by_child() that iterates over a list of paths
and returns a list of paths. That would remove some of the loops.

I think the comments for reparameterize_path_by_child() need to be
expanded. They don't explain how you decided which nodes need to be
handled here or which fields within those nodes need some kind of
handling other than a flat-copy. I think these kinds of explanations
will be important for future maintenance of this code. You know why
you did it this way, I can mostly guess why you did it this way, but
what about the next person who comes along who hasn't made a detailed
study of partition-wise join?

I don't see much point in the T_SubqueryScanPath and T_ResultPath
cases in reparameterize_path_by_child(). It's just falling through to
the default case.

I wonder if reparameterize_path_by_child() ought to default to
returning NULL rather than throwing an error; the caller would then
have to be prepared for that and skip building the path. But that
would be more like what reparameterize_path() does, and it would make
failure to include some relevant path type here a corner-case
performance bug rather than a correctness issue. It seems like
someone adding a new path type could quite easily fail to realize that
it might need to be added here, or might be unsure whether it's
necessary to add it here.
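
The two suggestions above (a list helper, and NULL instead of an error
for unhandled path types) can be sketched together as follows. This is
a minimal, self-contained illustration under stated assumptions:
MockPath, reparameterize_one and reparameterize_many are hypothetical
names, malloc stands in for the planner's allocator, and a path is
reduced to a type tag plus the relid it is parameterized by.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical mock path; real paths carry far more state. */
typedef enum { MOCK_SEQSCAN, MOCK_NESTLOOP, MOCK_UNSUPPORTED } MockPathType;

typedef struct MockPath
{
    MockPathType type;
    int          param_rel;  /* relid this path is parameterized by */
} MockPath;

/*
 * Return a fresh copy of path parameterized by child_rel, or NULL when the
 * path type is not handled. NULL tells the caller to skip this path rather
 * than treating the gap as an error.
 */
static MockPath *
reparameterize_one(const MockPath *path, int child_rel)
{
    MockPath *copy;

    assert(path != NULL);
    switch (path->type)
    {
        case MOCK_SEQSCAN:
        case MOCK_NESTLOOP:
            break;
        default:
            return NULL;     /* unknown type: give up quietly */
    }

    copy = malloc(sizeof(MockPath));
    if (copy == NULL)
        return NULL;
    *copy = *path;           /* flat copy, then adjust the parameterization */
    copy->param_rel = child_rel;
    return copy;
}

/*
 * List helper: reparameterize n paths into out[]. All-or-nothing: if any
 * element cannot be handled, free what was built and return false.
 */
static bool
reparameterize_many(const MockPath *in, size_t n, int child_rel, MockPath **out)
{
    for (size_t i = 0; i < n; i++)
    {
        out[i] = reparameterize_one(&in[i], child_rel);
        if (out[i] == NULL)
        {
            while (i > 0)
                free(out[--i]);
            return false;
        }
    }
    return true;
}
```

A caller would check the boolean result and simply not build the
parameterized join path when it is false, which is what turns a missing
case into a lost optimization rather than a failure.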

0006:

I have some doubts about how stable all of the EXPLAIN outputs are
going to be on the buildfarm. I'm not sure what we can really do
about that in advance of trying them, but it's a lot of EXPLAIN
output. If you have any ideas about how to tighten it up without
losing test coverage, that would be good. For example, maybe the
"full outer join" case isn't needed given the following test case
which is also a full outer join but which covers additional behavior.

I think it would be good to have a test case that shows multi-level
partition-wise join working across multiple levels. I wrote the
attached test, which you're welcome to use if you like it, adapt if
you sorta like it, or replace if you dislike it. The table names at
least should be changed to something less likely to duplicate other
tests.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment	Content-Type	Size
mlpartjoin.sql	application/octet-stream	2.2 KB

From:	Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>, Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
Cc:	Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Partition-wise join for join between (declaratively) partitioned tables
Date:	2017-10-04 04:37:42
Message-ID:	163f9f69-563a-6d03-3e51-2e41703c28dc@lab.ntt.co.jp
Lists:	pgsql-hackers

On 2017/10/04 4:27, Robert Haas wrote:
> On Tue, Oct 3, 2017 at 8:57 AM, Ashutosh Bapat
> <ashutosh(dot)bapat(at)enterprisedb(dot)com> wrote:
>>> Regarding nomenclature and my previous griping about wisdom, I was
>>> wondering about just calling this a "partition join" like you have in
>>> the regression test. So the GUC would be enable_partition_join, you'd
>>> have generate_partition_join_paths(), etc. Basically just delete
>>> "wise" throughout.
>>
>> Partition-wise join is standard term used in literature and in
>> documentation of other popular DBMSes, so partition_wise makes more
>> sense. But I am fine with partition_join as well. Do you want it
>> partition_join or partitionjoin like enable_mergejoin/enable_hashjoin
>> etc.?
>
> Well, you're making me have second thoughts. It's really just that
> partition_wise looks a little awkward to me, and maybe that's not
> enough reason to change anything. I suppose if I commit it this way
> and somebody really hates it, it can always be changed later. We're
> not getting a lot of input from anyone else at the moment.

FWIW, the name enable_partition_join seems enough to convey the core
feature, that is, I see "_wise" as redundant, even though I'm now quite
used to seeing "_wise" in the emails here and saying it out loud every now
and then. Ashutosh may have a point though that users coming from other
databases might miss the "_wise". :)

Thanks,
Amit

From:	Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Partition-wise join for join between (declaratively) partitioned tables
Date:	2017-10-04 12:23:22
Message-ID:	CAFjFpRewe+dp46sOx50XU3HSiQ569DAsuiMZKtZLXL2GVvjicg@mail.gmail.com
Lists:	pgsql-hackers

On Wed, Oct 4, 2017 at 12:57 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>
> 0003:
>
> The commit message mentions estimate_num_groups but the patch doesn't touch it.

This was fixed when we converted many rel->reloptkind ==
RELOPT_BASEREL to IS_SIMPLE_REL(). I have removed this section from
the commit message.

>
> I am concerned that this patch might introduce some problem fixed by
> commit dd4134ea56cb8855aad3988febc45eca28851cd8. The comment in that
> patch say, at one place, that "This protects against possible
> incorrect matches to child expressions that contain no Vars."
> However, if a child expression has no Vars, then I think em->em_relids
> will be empty, so the bms_is_equal() test that is there now will fail
> but your proposed bms_is_subset() test will pass.

bms_is_equal() was enough when there was only a single member in
relids but it doesn't work now that there can be multiple of them.
bms_is_equal() was replaced with bms_is_subset() to accommodate
ec_members with only a subset of relids when we are searching for a
join relation.

I am not sure whether your assumption that expression with no Vars
would have em_relids empty is correct. I wonder whether we will add
any em_is_child members with empty em_relids; looking at
process_equivalence() those come from RestrictInfo::left/right_relids
which just indicates the relids at which that particular expression
can be evaluated. Placeholder vars are an example where that can
happen, but there may be others. To verify this, I tried the attached
patch on master and ran make check. The assertion didn't trip. If
em_relids is not NULL, bms_is_subset() is fine.

If em_relids could indeed go NULL when em_is_child is true, passing
NULL relids (for parent rels) to that function can cause unwanted
behaviour. bms_is_equal(em->em_relids, relids) will return true
turning the if (em->em_is_child && !bms_is_equal(em->em_relids,
relids)) to false. This means that we will consider a child member
with em_relids NULL even while matching a parent relation. What
surprises me is, that commit added a bunch of testcases and none of
them failed with this change.

Nonetheless, I have changed "matches" to "belongs to" in the
prologue of those functions since an exact match won't be possible
with child-joins.

>
> 0004:
>
> I suggest renaming get_wholerow_ref_from_convert_row_type to
> is_converted_whole_row_reference and making it return a bool.

Done.

>
> The coding of that function is a little strange; why not move Var to
> an inner scope? Like this: if (IsA(convexpr->arg, var)) { Var *var =
> castNode(Var, convexpr->arg; if (var->varattno == 0) return var; }

I probably went too far to avoid indented code :). Fixed now.

>
> Will the statement that "In case of multi-level partitioning, we will
> have as many nested ConvertRowtypeExpr as there are levels in
> partition hierarchy" be falsified by Amit Khandekar's pending patch to
> avoid sticking a ConvertRowTypeExpr on top of another
> ConvertRowTypeExpr? Even if the answer is "no", I think it might be
> better to drop this part of the comment; it would be easy for it to
> become false in the future, because we might want to optimize that
> case in the future and we'll probably forget to update this comment
> when we do.

That might keep someone wondering where the nested
ConvertRowtypeExpr's came from. But maybe in the future those can arise
from something other than multi-level partition hierarchy and in that
case too the comment would be rendered inaccurate. So done.

>
> In fix_upper_expr_mutator(), you have an if statement whose entire
> contents are another if statement. I think you should use && instead,
> and maybe reverse the order of the tests, since
> context->subplan_itlist->has_conv_whole_rows is probably cheaper to
> test than a function call. It's also a little strange that this code
> isn't adjacent too, or merged with, the existing has_non_vars case.
> Maybe:
>
> converted_whole_row = is_converted_whole_row_reference(node);
> if (context->outer_itlist && (context->outer_itlist->has_non_vars ||
> (context->outer_itlist->has_conv_whole_rows && converted_whole_row))
> ...
> if (context->inner_itlist && (context->inner_itlist->has_non_vars ||
> (context->inner_itlist->has_conv_whole_rows && converted_whole_row))

I placed it with the other node types since it's for a specific node
type, but I guess your suggestion avoids duplicates and looks better.
Done.

> ...
>
> 0005:
>
> The comment explaining why the ParamPathInfo is allocated in the same
> context as the RelOptInfo is a modified copy of an existing comment
> that still reads like the original, a manner of commenting I find a
> bit undesirable as it leads to filling up the source base with
> duplicate comments.

I have pointed to mark_dummy_rel() in that comment instead of
duplicating the whole paragraph.

>
> I don't think I believe that comment, either. In the case from which
> that comment was copied (mark_dummy_rel), it was talking about a
> RelOptInfo, and geqo_eval() takes care to remove any leftover pointers
> to joinrels creating during a GEQO cycle. But there's no similar
> logic for ppilist, so I think what will happen here is that you'll end
> up with a freed node in the middle of the list.

In mark_dummy_rel() it's not about RelOptInfo, it's about the pathlist
with dummy path being created in the same context as the RelOptInfo.
Same applies here. While reparameterizing a path tree, we may reach a
path for a base relation and create a PPI for a base relation. This
may happen when GEQO is planning a join, and thus we are in a
short-lived context created by that GEQO cycle. We don't want a base
rel PPI to be created in that context, so instead we use the context
of base rel itself. The other way round, we don't want to use a
longer-lived context for creating PPI for a join relation when it's
created by a GEQO cycle. So, we use the join relation's context. The
code doesn't free up a node in the middle of the list but it avoids
such an anomaly. See [1]

>
> I think reparameterize_path_by_chid() could use a helper function
> reparameterize_pathlist_by_child() that iterates over a list of paths
> and returns a list of paths. That would remove some of the loops.

That's a good idea. Done.

>
> I think the comments for reparameterize_path_by_child() need to be
> expanded. They don't explain how you decided which nodes need to be
> handled here or which fields within those nodes need some kind of
> handling other than a flat-copy. I think these kinds of explanations
> will be important for future maintenance of this code. You know why
> you did it this way, I can mostly guess what you did it this way, but
> what about the next person who comes along who hasn't made a detailed
> study of partition-wise join?

We need to reparameterize any path which contains further paths and/or
contains expressions that point to the parent relation. For a given
path we need to reparameterize any paths that it contains and
translate any expressions that are specific to that path. Expressions
common across the paths are translated after the switch case. I have
added this rule to the comment just above the switch case:
/*
* Copy of the given path. Reparameterize any paths referenced by the given
* path. Replace parent Vars in path specific expressions by corresponding
* child Vars.
*/
Does that look fine, or do we want to add an explanation for every
node handled here?

>
> I don't see much point in the T_SubqueryScanPath and T_ResultPath
> cases in reparameterize_path_by_child(). It's just falling through to
> the default case.

I added those cases separately to explain why we should not see those
cases in that switch case. I think that explanation is important
(esp. considering your comment above) and associating those comment
with "case" statement looks better. Are you suggesting that we should
add that explanation in default case?

>
> I wonder if reparameterize_path_by_child() ought to default to
> returning NULL rather than throwing an error; the caller would then
> have to be prepared for that and skip building the path. But that
> would be more like what reparameterize_path() does, and it would make
> failure to include some relevant path type here a corner-case
> performance bug rather than a correctness issue. It seems like
> someone adding a new path type could quite easily fail to realize that
> it might need to be added here, or might be unsure whether it's
> necessary to add it here.

I am OK with that. However reparameterize_path_by_child() and
reparameterize_paths_by_child() are callers of
reparameterize_path_by_child() so they will need to deal with NULL
return. I am fine with that too, but making sure that we are on the
same page. If we do that, we could simply assert that the switch case
doesn't see T_SubqueryScanPath and T_ResultPath.

>
> 0006:
>
> I have some doubts about how stable all of the EXPLAIN outputs are
> going to be on the buildfarm. I'm not sure what we can really do
> about that in advance of trying them, but it's a lot of EXPLAIN
> output. If you have an ideas about how to tighten it up without
> losing test coverage, that would be good. For example, maybe the
> "full outer join" case isn't needed given the following test case
> which is also a full outer join but which covers additional behavior.

Yes, I too am thinking about the same. The only reason I have EXPLAIN
output there is to check whether partition-wise join is being used or
not. The testcase is not interested in the actual shape. It doesn't
make sense to just test the output if partition-wise join is not used.
Maybe a function examining the plan tree would help. The function
will have to handle Result/Sort nodes on top and make sure that Append
has join children. Do you have any other idea to check the shape of
the plan tree without the details? Any EXPLAIN switch, existing
functions etc.?

Removed the extra full outer join testcase.

>
> I think it would be good to have a test case that shows multi-level
> partition-wise join working across multiple levels. I wrote the
> attached test, which you're welcome to use if you like it, adapt if
> you sorta like it, or replace if you dislike it. The table names at
> least should be changed to something less likely to duplicate other
> tests.
>

There are tests for multi-level partitioned table in the file. They
test whole partition hierarchy join, part of it being joined based on
the quals. Search for
--
-- multi-leveled partitions
--

Have you looked at those? They test two-level partitioned tables and
your test tests three-level partitioned table. I can modify the tests
to have three levels of partitions and different partition schemes on
different levels. Is that what you expect?

[1] https://www.postgresql.org/message-id/CAFjFpRcPutbr4nVAsrY-5q%3DwCFrNK25_3MNhHgyYYM0yeOoj%3DQ%40mail.gmail.com

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Attachment	Content-Type	Size
em_is_child_em_relids.patch	text/x-patch	740 bytes

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
Cc:	Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Partition-wise join for join between (declaratively) partitioned tables
Date:	2017-10-04 15:31:21
Message-ID:	CA+TgmoZH_-wQ+3VbES8YYBK6ti_rAayhXaAVX8V9fhkS7Jo8Hg@mail.gmail.com
Lists:	pgsql-hackers

On Thu, Sep 21, 2017 at 8:52 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Thu, Sep 21, 2017 at 8:21 AM, Ashutosh Bapat
> <ashutosh(dot)bapat(at)enterprisedb(dot)com> wrote:
>> About your earlier comment of making build_joinrel_partition_info()
>> simpler. Right now, the code assumes that partexprs or
>> nullable_partexpr can be NULL when either of them is not populated.
>> That may be saves a sizeof(pointer) * (number of keys) byes of memory.
>> Saving that much memory may not be worth the complexity of code. So,
>> we may always allocate memory for those arrays and fill it with NIL
>> values when there are no key expressions to populate those. That will
>> simplify the code. I haven't done that change in this patchset. I was
>> busy debugging the Q7 regression. Let me know your comments about
>> that.
>
> Hmm, I'm not sure that's the best approach, but let me look at it more
> carefully before I express a firm opinion.

Having studied this a bit more, I now think your proposed approach is
a good idea.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
Cc:	Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Partition-wise join for join between (declaratively) partitioned tables
Date:	2017-10-04 15:34:06
Message-ID:	CA+Tgmob+Uw==ybejCpm7dkN9FE_81vdAYNirCEq-Fn+0DKoYJA@mail.gmail.com
Lists:	pgsql-hackers

On Tue, Oct 3, 2017 at 3:27 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> I decided to skip over 0001 for today and spend some time looking at
> 0002-0006.

Back to 0001.

+ Enables or disables the query planner's use of partition-wise join
+ plans. When enabled, it spends time in creating paths for joins between
+ partitions and consumes memory to construct expression nodes to be used
+ for those joins, even if partition-wise join does not result in the
+ cheapest path. The time and memory increase exponentially with the
+ number of partitioned tables being joined and they increase linearly
+ with the number of partitions. The default is <literal>off</>.

I think this is too scary and too much technical detail. I think you
could just say something like: Enables or disables use of
partition-wise join, which allows a join between partitioned tables to
be performed by joining the matching partitions. Partition-wise join
currently applies only when the join conditions include all the
columns of the partition keys, which must be of the same data type and
have exactly matching sets of child partitions. Because
partition-wise join planning can significantly increase CPU time
and memory usage during planning, the default is <literal>off</>.

+partitioned table. The join partners can not be found in other partitions. This
+condition allows the join between partitioned tables to be broken into joins
+between the matching partitions. The resultant join is partitioned in the same

"The join partners can not be found in other partitions." is redundant
with the previous sentence. I suggest deleting it. I also suggest
"This condition allows the join between partitioned tables to be
broken" -> "Because of this, the join between partitioned tables can
be broken".

+relation" for both partitioned table as well as join between partitioned tables
+which can use partition-wise join technique.

for either a partitioned table or a join between compatibly partitioned tables

+Partitioning properties of a partitioned relation are stored in
+PartitionSchemeData structure. Planner maintains a list of canonical partition
+schemes (distinct PartitionSchemeData objects) so that any two partitioned
+relations with same partitioning scheme share the same PartitionSchemeData
+object. This reduces memory consumed by PartitionSchemeData objects and makes
+it easy to compare the partition schemes of joining relations.

Not all of the partitioning properties are stored in the
PartitionSchemeData structure any more. I think this needs some
rethinking and maybe some expansion. As written, each of the first
two sentences needs a "the" at the beginning.

+ /*
+ * Create "append" paths for partitioned joins. Do this before
+ * creating GatherPaths so that partial "append" paths in
+ * partitioned joins will be considered.
+ */

I think you could shorten this to a single-line comment and just keep
the first sentence. Similarly in the other location where you have
the same sort of thing.

+ * child-joins. Otherwise, add_path might delete a path that some "append"
+ * path has reference to.

to which some path generated here has a reference.

Here and elsewhere, you use "append" rather than Append to refer to
the paths added. I suppose that's weasel-wording to work around the
fact that they might be either Append or MergeAppend paths, but I'm
not sure it's really going to convey that to anyone. I suggest
rephrasing those comments more generically, e.g.:

+ /* Add "append" paths containing paths from child-joins. */

You could say: Build additional paths for this rel from child-join paths.

Or something.

+ if (!REL_HAS_ALL_PART_PROPS(rel))
+ return;

Isn't this an unnecessarily expensive test? I mean, it shouldn't be
possible for it to have some arbitrary subset.

+ /*
+ * Every pair of joining relations we see here should have an equi-join
+ * between partition keys if this join has been deemed as a partitioned
+ * join. See build_joinrel_partition_info() for reasons.
+ */
+ Assert(have_partkey_equi_join(rel1, rel2, parent_sjinfo->jointype,
+ parent_restrictlist));

I suggest removing this assertion. Seems like overkill to me.

+ child_sjinfo = build_child_join_sjinfo(root, parent_sjinfo,
+ child_rel1->relids,
+ child_rel2->relids);

It seems like we might end up doing this multiple times for the same
child join, if there are more than 2 tables involved. Not sure if
there's a good way to avoid that. Similarly for child_restrictlist.

+ pk_has_clause = (bool *) palloc0(sizeof(bool) * num_pks);

Just do bool pk_has_clause[PARTITION_MAX_KEYS] instead. Stack
allocation is a lot faster, and then you don't need to pfree it.
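
As a rough illustration of the stack-allocated form, here is a
self-contained sketch. MOCK_PARTITION_MAX_KEYS and
all_keys_have_clause are hypothetical names, and the bound of 32 is an
assumption made for the example rather than a value taken from the
patch. The array is sized for the maximum, zeroed once, and needs no
explicit free.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Assumed upper bound on partition key columns, for illustration only. */
#define MOCK_PARTITION_MAX_KEYS 32

/*
 * Return true when each of the num_pks partition keys is matched by at
 * least one clause. clause_keys[i] is the key index matched by clause i.
 */
static bool
all_keys_have_clause(int num_pks, const int *clause_keys, int num_clauses)
{
    /* Fixed-size stack array instead of a heap allocation of num_pks bools. */
    bool pk_has_clause[MOCK_PARTITION_MAX_KEYS];

    assert(num_pks >= 0 && num_pks <= MOCK_PARTITION_MAX_KEYS);
    memset(pk_has_clause, 0, sizeof(pk_has_clause));

    for (int i = 0; i < num_clauses; i++)
    {
        int key = clause_keys[i];

        if (key >= 0 && key < num_pks)
            pk_has_clause[key] = true;
    }

    for (int i = 0; i < num_pks; i++)
    {
        if (!pk_has_clause[i])
            return false;
    }
    return true;     /* nothing to free on any return path */
}
```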

+ /* Remove the relabel decoration. */

the -> any, decoration -> decorations

+ /*
+ * Replace the Var nodes of parent with those of children in expressions.
+ * This function may be called within a temporary context, but the
+ * expressions will be shallow-copied into the plan. Hence copy those in
+ * the planner's context.
+ */

I can't make heads or tails of this comment.

--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -23,7 +23,9 @@
#include "optimizer/pathnode.h"
#include "optimizer/paths.h"
#include "optimizer/planmain.h"
+#include "optimizer/prep.h"
#include "optimizer/restrictinfo.h"
+#include "optimizer/tlist.h"
#include "optimizer/var.h"
#include "parser/parsetree.h"
#include "utils/lsyscache.h"

Maybe not needed? This is the only hunk in this file? Or should this
be part of one of the later patches?

+ Assert(IS_JOIN_REL(childrel) && IS_JOIN_REL(parentrel));
+
+ /* Ensure child relation is really what it claims to be. */
+ Assert(IS_OTHER_REL(childrel));

I suggest tightening this up a bit by removing the comment and the
blank line that precedes it.

+ foreach(lc, parentrel->reltarget->exprs)
+ {
+ PlaceHolderVar *phv = lfirst(lc);
+
+ if (IsA(phv, PlaceHolderVar))
+ {
+ /*
+ * In case the placeholder Var refers to any of the parent
+ * relations, translate it to refer to the corresponding child.
+ */
+ if (bms_overlap(phv->phrels, parentrel->relids) &&
+ childrel->reloptkind == RELOPT_OTHER_JOINREL)
+ {
+ phv = (PlaceHolderVar *) adjust_appendrel_attrs(root,
+ (Node *) phv,
+ nappinfos,
+ appinfos);
+ }
+
+ childrel->reltarget->exprs = lappend(childrel->reltarget->exprs,
+ phv);
+ phv_added = true;
+ }
+ }

What if the PHV is buried down inside the expression someplace rather
than being at the top level? More generally, why are we not just
applying adjust_appendrel_attrs() to the whole expression?
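
The difference between checking only the top-level node and walking
the whole expression can be shown with a small self-contained sketch.
MockExpr and its tags are hypothetical stand-ins for expression nodes,
and contains_phv is an illustrative recursive walk, not the planner's
own tree walker.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical mock expression node with up to two children. */
typedef enum { EXPR_VAR, EXPR_PHV, EXPR_OP } MockExprTag;

typedef struct MockExpr
{
    MockExprTag      tag;
    struct MockExpr *left;
    struct MockExpr *right;
} MockExpr;

/* What the quoted loop effectively tests: is the target entry itself a placeholder? */
static bool
top_level_is_phv(const MockExpr *expr)
{
    return expr != NULL && expr->tag == EXPR_PHV;
}

/* Recursive walk: finds a placeholder anywhere inside the expression. */
static bool
contains_phv(const MockExpr *expr)
{
    if (expr == NULL)
        return false;
    if (expr->tag == EXPR_PHV)
        return true;
    return contains_phv(expr->left) || contains_phv(expr->right);
}
```

For an operator node whose argument is a placeholder, the top-level
test says no while the walk says yes, which is the case a top-level-only
translation would miss.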

+ /* Adjust the cost and width of child targetlist. */
+ if (phv_added)
+ {
+ childrel->reltarget->cost.startup = parentrel->reltarget->cost.startup;
+ childrel->reltarget->cost.per_tuple = parentrel->reltarget->cost.per_tuple;
+ childrel->reltarget->width = parentrel->reltarget->width;
+ }

Making this conditional on phv_added is probably not saving anything.
Branches are expensive.

/*
* Otherwise, anything in a baserel or joinrel targetlist ought to be
- * a Var. (More general cases can only appear in appendrel child
- * rels, which will never be seen here.)
+ * a Var or ConvertRowtypeExpr. For either of those, find the original
+ * baserel where they originate.
*/

Hmm, but now we could potentially see an appendrel child rel here, so
don't we need to worry about more general cases? If not, let's
explain why not.

+ * if, it's a ConvertRowtypeExpr, it will be computed only for the

American usage does not put a comma after "if" like this (unless you are
writing "if, for example, blah blah blah" -- but there the commas are
to surround "for example", not due to the "if" itself).

+/*
+ * build_joinrel_partition_info
+ * If the join between given partitioned relations is possibly partitioned
+ * set the partitioning scheme and partition keys expressions for the
+ * join.
+ *
+ * If the two relations have same partitioning scheme, their join may be
+ * partitioned and will follow the same partitioning scheme as the joining
+ * relations.
+ */

I think you could drop the primary comment block and use the secondary
block as the primary one. That is, get rid of "If the join
between..." and promote "If the two relations...".

+ * The join is not partitioned, if any of the relations being joined are

Another comma that's not typical of American usage.

+ * For an N-way inner join, where every syntactic inner join has equi-join

has -> has an

+ * For an N-way join with outer joins, where every syntactic join has an
+ * equi-join between partition keys and a matching partitioning scheme,
+ * outer join reordering identities in optimizer/README imply that only
+ * those pairs of join are legal which have an equi-join between partition
+ * keys. Thus every pair of joining relations we see for this join should
+ * have an equi-join between partition keys if this join has been deemed
+ * as a partitioned join.

In line 2, partition keys -> the partition keys
In line 3, outer join -> the outer join

"pairs of join" sounds wrong too, although I'm not sure how to reword it.

More broadly: I don't think I understand this comment. The statement
about "those pairs of join are legal which have an equi-join between
partition keys" doesn't match my understanding. For example, A IJ B ON
A.x = B.x LJ C ON A.x = C.x surely allows a B-C join, but there's no such
clause syntactically.

Maybe you could replace this whole comment block with something like
this: We can only consider this join as an input to further
partition-wise joins if (a) the input relations are partitioned, (b)
the partition schemes match, and (c) we can identify an equi-join
between the partition keys. Note that if it were possible for
have_partkey_equi_join to return different answers for the same
joinrel depending on which join ordering we try first, this logic
would break. That shouldn't happen, though, because of the way the
query planner deduces implied equalities.

+ * Join relation is partitioned using same partitioning scheme as the
+ * joining relations and has same bounds.

the same partitioning scheme

+ * An INNER join between two partitioned relations is partitioned by key
+ * expressions from both the relations. For tables A and B partitioned by
+ * a and b respectively, (A INNER JOIN B ON A.a = B.b) is partitioned by
+ * both A.a and B.b.
+ *
+ * A SEMI/ANTI join only retains data from the outer side and is
+ * partitioned by the partition keys of the outer side.

I would write: An INNER join between two partitioned relations can be
regarded as partitioned by either key expression. For example, A
INNER JOIN B ON A.a = B.b can be regarded as partitioned on A.a or on
B.b; they are equivalent. For a SEMI or ANTI join, the result can
only be regarded as being partitioned in the same manner as the outer
side, since the inner columns are not retained.

+ * An OUTER join like (A LEFT JOIN B ON A.a = B.b) may produce rows with
+ * B.b NULL. These rows may not fit the partitioning conditions imposed on
+ * B.b. Hence, strictly speaking, the join is not partitioned by B.b.

Good.

+ * Strictly speaking, partition keys of an OUTER join should include
+ * partition key expressions from the OUTER side only. Consider a join

I would join this with the previous sentence instead of repeating
strictly speaking: ...and thus the partition keys should include
partition key expressions from the OUTER side only. After that
sentence, I'd skip a lot of the intermediate words here and continue
this way: However, because all commonly-used comparison operators are
strict, the presence of nulls on the outer side doesn't cause any
problem; they can't match anything at future join levels anyway.
Therefore, we track two sets of expressions: those that authentically
partition the relation (partexprs) and those that partition the
relation with the exception that extra nulls may be present
(nullable_partexprs). When the comparison operator is strict, the
latter is just as good as the former.
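
As a rough illustration of the two-set bookkeeping described above, here is a
minimal sketch. It is a toy model under stated assumptions, not the patch's
build_joinrel_partition_info(): key expressions are reduced to bits in a
bitmask, and ToyJoinType, ToyPartKeys and toy_join_part_keys() are
hypothetical names. The FULL case is my reading of the explanation (both
sides can be null-extended, so nothing strictly partitions the result).

```c
#include <assert.h>

/* Hypothetical join types for the toy model. */
typedef enum
{
    TOY_JOIN_INNER,
    TOY_JOIN_LEFT,
    TOY_JOIN_FULL,
    TOY_JOIN_SEMI
} ToyJoinType;

/* Each bit stands for one key expression, e.g. bit 0 = A.a, bit 1 = B.b. */
typedef struct ToyPartKeys
{
    unsigned    partexprs;          /* strictly partition the relation */
    unsigned    nullable_partexprs; /* partition it, modulo extra NULLs */
} ToyPartKeys;

/* Combine the key sets of the outer and inner inputs of a join. */
ToyPartKeys
toy_join_part_keys(ToyJoinType jointype, ToyPartKeys outer, ToyPartKeys inner)
{
    ToyPartKeys result = {0u, 0u};

    switch (jointype)
    {
        case TOY_JOIN_INNER:
            /* Either side's keys partition the result equally well. */
            result.partexprs = outer.partexprs | inner.partexprs;
            result.nullable_partexprs =
                outer.nullable_partexprs | inner.nullable_partexprs;
            break;
        case TOY_JOIN_SEMI:
            /* Inner columns are not retained. */
            result.partexprs = outer.partexprs;
            result.nullable_partexprs = outer.nullable_partexprs;
            break;
        case TOY_JOIN_LEFT:
            /* Inner keys may be NULL in null-extended rows. */
            result.partexprs = outer.partexprs;
            result.nullable_partexprs = outer.nullable_partexprs |
                inner.partexprs | inner.nullable_partexprs;
            break;
        case TOY_JOIN_FULL:
            /* Assumption: both sides may be null-extended. */
            result.nullable_partexprs =
                outer.partexprs | outer.nullable_partexprs |
                inner.partexprs | inner.nullable_partexprs;
            break;
    }
    return result;
}
```

With A.a as bit 0 and B.b as bit 1, A LEFT JOIN B leaves A.a in partexprs and
moves B.b to nullable_partexprs, while A INNER JOIN B keeps both in partexprs.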

Then, I think you can omit the rest of what you have; it should be
clear enough what's going on for the full and right cases given that
explanation.

+ * being joined. partexprs and nullable_partexprs are arrays containing part_scheme->partnatts

Long line, needs reflowing.

I don't think this is too far from being committable. You've done
some nice work here!

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
Cc:	Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Partition-wise join for join between (declaratively) partitioned tables
Date:	2017-10-04 16:00:18
Message-ID:	CA+TgmoafrQewWR1isO6AMTXxbGn8z5OrpAM9+FnMRnEv6S-GQw@mail.gmail.com
Lists:	pgsql-hackers

On Wed, Oct 4, 2017 at 11:34 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> + Enables or disables the query planner's use of partition-wise join
> + plans. When enabled, it spends time in creating paths for joins between
> + partitions and consumes memory to construct expression nodes to be used
> + for those joins, even if partition-wise join does not result in the
> + cheapest path. The time and memory increase exponentially with the
> + number of partitioned tables being joined and they increase linearly
> + with the number of partitions. The default is <literal>off</>.
>
> I think this is too scary and too much technical detail. I think you
> could just say something like: Enables or disables use of
> partition-wise join, which allows a join between partitioned tables to
> be performed by joining the matching partitions. Partition-wise join
> currently applies only when the join conditions include all the
> columns of the partition keys, which must be of the same data type and
> have exactly matching sets of child partitions. Because
> partition-wise join planning can use significantly increase CPU time
> and memory usage during planning, the default is <literal>off</>.

Not enough caffeine, obviously: should have been something like --
Because partition-wise join can significantly increase the CPU and
memory costs of planning...

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
Cc:	Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Partition-wise join for join between (declaratively) partitioned tables
Date:	2017-10-04 18:54:30
Message-ID:	CA+TgmoYd0LZE+GcS1R+zTZowUG7cr7KWZutjfcPA2iHxEROkPw@mail.gmail.com
Lists:	pgsql-hackers

On Wed, Oct 4, 2017 at 8:23 AM, Ashutosh Bapat
<ashutosh(dot)bapat(at)enterprisedb(dot)com> wrote:
> I am not sure whether your assumption that expression with no Vars
> would have em_relids empty is correct. I wonder whether we will add
> any em_is_child members with empty em_relids; looking at
> process_equivalence() those come from RestrictInfo::left/right_relids
> which just indicates the relids at which that particular expression
> can be evaluated. Place holder vars is an example when that can
> happen, but there may be others. To verify this, I tried attached
> patch on master and ran make check. The assertion didn't trip. If
> em_relids is not NULL, bms_is_subset() is fine.

I spent some more time experimenting with this. I found that cases
where an em_is_child equivalence class contains multiple relids are
quite easy to generate, e.g. select * from foo, bar where foo.a +
bar.a = 0, where foo and bar are partitioned. However, I wasn't able
to generate a case where an em_is_child equivalence class has no
relids at all, and I'm out of ideas about how such a thing could
occur. I suspect it can't. I wondered whether there was some problem
with the multiple-relids case, but I can't find an example where that
misbehaves either. So maybe it's fine (or maybe I'm just not smart
enough to find the case where it breaks).

>> I don't think I believe that comment, either. In the case from which
>> that comment was copied (mark_dummy_rel), it was talking about a
>> RelOptInfo, and geqo_eval() takes care to remove any leftover pointers
>> to joinrels created during a GEQO cycle. But there's no similar
>> logic for ppilist, so I think what will happen here is that you'll end
>> up with a freed node in the middle of the list.
>
> In mark_dummy_rel() it's not about RelOptInfo, it's about the pathlist
> with dummy path being created in the same context as the RelOptInfo.
> Same applies here.

Oops. I was thinking that the ppilist was attached to some
planner-global structure, but it's not; it's hanging off the
RelOptInfo. So you're entirely right, and I'm just being dumb.

> We need to reparameterize any path which contains further paths and/or
> contains expressions that point to the parent relation. For a given
> path we need to reparameterize any paths that it contains and
> translate any expressions that are specific to that path. Expressions
> common across the paths are translated after the switch case. I have
> added this rule to the comment just above the switch case
> /*
> * Copy of the given path. Reparameterize any paths referenced by the given
> * path. Replace parent Vars in path specific expressions by corresponding
> * child Vars.
> */
> Does that look fine or we want to add explanation for every node handled here.

No, I don't think we want something for every node, just a general
explanation at the top of the function. Maybe something like this:

Most fields from the original path can simply be flat-copied, but any
expressions must be adjusted to refer to the correct varnos, and any
paths must be recursively reparameterized. Other fields that refer to
specific relids also need adjustment.

>> I don't see much point in the T_SubqueryScanPath and T_ResultPath
>> cases in reparameterize_path_by_child(). It's just falling through to
>> the default case.
>
> I added those cases separately to explain why we should not see those
> cases in that switch case. I think that explanation is important
> (esp. considering your comment above) and associating those comment
> with "case" statement looks better. Are you suggesting that we should
> add that explanation in default case?

Or leave the explanation out altogether.

>> I wonder if reparameterize_path_by_child() ought to default to
>> returning NULL rather than throwing an error; the caller would then
>> have to be prepared for that and skip building the path. But that
>> would be more like what reparameterize_path() does, and it would make
>> failure to include some relevant path type here a corner-case
>> performance bug rather than a correctness issue. It seems like
>> someone adding a new path type could quite easily fail to realize that
>> it might need to be added here, or might be unsure whether it's
>> necessary to add it here.
>
> I am OK with that. However reparameterize_path_by_child() and
> reparameterize_paths_by_child() are callers of
> reparameterize_path_by_child() so they will need to deal with NULL
> return. I am fine with that too, but making sure that we are on the
> same page. If we do that, we could simply assert that the switch case
> doesn't see T_SubqueryScanPath and T_ResultPath.

Or do nothing at all about those cases.

I noticed today that the version of the patchset I have here says in
the header comments for reparameterize_path_by_child() that it returns
NULL if it can't reparameterize, but that's not what it actually does.
If you make this change, the existing comment will become correct.

The problem with the NULL return convention is that it's not very
convenient when this function is recursing. Maybe we should change
this function's signature to be bool
reparameterize_path_by_child(PlannerInfo *root, RelOptInfo *child_rel,
Path **path); then you could do, e.g. if
(!reparameterize_path_by_child(root, child_rel, &bhpath->bitmapqual))
return;

But I don't really like that approach; it's still quite long-winded.
Instead, I suggest Stupid Macro Tricks:

#define ADJUST_CHILD_ATTRS(val) \
val = (List *) adjust_appendrel_attrs_multilevel((Node *) val, \
child_rel->relids, child_rel->top_parent_relids);

#define REPARAMETERIZE_CHILD_PATH(val) \
val = reparameterize_path_by_child(root, val, child_rel); \
if (val == NULL) \
return NULL;

#define REPARAMETERIZE_CHILD_PATH_LIST(val) \
if (val != NIL) \
{ \
val = reparameterize_pathlist_by_child(root, val, child_rel); \
if (val == NIL) \
return NULL; \
}

With that, a complicated case like T_NestPath becomes just:

JoinPath *jpath;

FLAT_COPY_PATH(jpath, path, NestPath);
REPARAMETERIZE_CHILD_PATH(jpath->outerjoinpath);
REPARAMETERIZE_CHILD_PATH(jpath->innerjoinpath);
ADJUST_CHILD_ATTRS(jpath->joinrestrictinfo);
new_path = (Path *) jpath;

Now, I admit that hiding stuff inside the macro definitions like that
is ugly. But I think it's still better than repeating boilerplate
code with finicky internal bits lots of times.
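
For readers unfamiliar with the early-return macro pattern, here is a
self-contained sketch of it on a toy path tree. Everything in it (ToyPath,
toy_reparameterize(), TOY_REPARAMETERIZE_CHILD) is hypothetical and only
mirrors the shape of the macros above; it is not the patch's code. One
deliberate difference: the toy macro is wrapped in do { ... } while (0) so
that it behaves as a single statement, which the macros as written above do
not do.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical path kinds for the toy model. */
typedef enum { TOY_SCAN, TOY_JOIN, TOY_UNSUPPORTED } ToyPathKind;

typedef struct ToyPath
{
    ToyPathKind kind;
    struct ToyPath *outer;      /* used by TOY_JOIN */
    struct ToyPath *inner;      /* used by TOY_JOIN */
    int         reparameterized;
} ToyPath;

ToyPath    *toy_reparameterize(const ToyPath *path);

/*
 * Reparameterize a sub-path in place; if that fails, make the enclosing
 * function return NULL as well.  The hidden "return" is the ugly part.
 */
#define TOY_REPARAMETERIZE_CHILD(val) \
    do { \
        (val) = toy_reparameterize(val); \
        if ((val) == NULL) \
            return NULL; \
    } while (0)

/*
 * Return a reparameterized copy of the path, or NULL if any node in the
 * tree is of an unsupported kind.  The input tree is left untouched.
 * Partial copies are leaked on failure; the real planner would rely on
 * memory contexts instead.
 */
ToyPath *
toy_reparameterize(const ToyPath *path)
{
    ToyPath    *copy;

    assert(path != NULL);
    if (path->kind == TOY_UNSUPPORTED)
        return NULL;

    copy = malloc(sizeof(ToyPath));
    if (copy == NULL)
        return NULL;
    *copy = *path;              /* flat copy of all fields */

    if (copy->kind == TOY_JOIN)
    {
        TOY_REPARAMETERIZE_CHILD(copy->outer);
        TOY_REPARAMETERIZE_CHILD(copy->inner);
    }
    copy->reparameterized = 1;
    return copy;
}
```

The join case stays two lines long, and a failure anywhere below it
propagates to the top-level caller as NULL without any explicit checks at
each call site.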

> Yes, I too am thinking about the same. The only reason I have EXPLAIN
> output there is to check whether partition-wise join is being used or
> not. The testcase is not interested in the actual shape. It doesn't
> make sense to just test the output if partition-wise join is not used.
> May be a function examining the plan tree would help. The function
> will have to handle Result/Sort nodes on top and make sure that Append
> has join children. Do you have any other idea to check the shape of
> the plan tree without the details? Any EXPLAIN switch, existing
> functions etc.?

No, not really. We may just need to be prepared to fix whatever breaks.

>> I think it would be good to have a test case that shows multi-level
>> partition-wise join working across multiple levels. I wrote the
>> attached test, which you're welcome to use if you like it, adapt if
>> you sorta like it, or replace if you dislike it. The table names at
>> least should be changed to something less likely to duplicate other
>> tests.
>>
>
> There are tests for multi-level partitioned table in the file. They
> test whole partition hierarchy join, part of it being joined based on
> the quals. Search for
> --
> -- multi-leveled partitions
> --
>
> Have you looked at those? They test two-level partitioned tables and
> your test tests three-level partitioned table. I can modify the tests
> to have three levels of partitions and different partition schemes on
> different levels. Is that what you expect?

Oops, no, I just missed the test case. I saw the one that said "inner
join, qual covering only top-level partitions" and missed that there
were others later where the quals covered lower levels also.

Instead of "multi-leveled partitions" it might read better to say
"multiple levels of partitioning".

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Partition-wise join for join between (declaratively) partitioned tables
Date:	2017-10-05 09:38:08
Message-ID:	CAFjFpRdtZ1f6eY080LvgXL98cXOw2CFTvrAanA3+y8HRPAN=nQ@mail.gmail.com
Lists:	pgsql-hackers

On Wed, Oct 4, 2017 at 9:01 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Thu, Sep 21, 2017 at 8:52 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>> On Thu, Sep 21, 2017 at 8:21 AM, Ashutosh Bapat
>> <ashutosh(dot)bapat(at)enterprisedb(dot)com> wrote:
>>> About your earlier comment of making build_joinrel_partition_info()
>>> simpler. Right now, the code assumes that partexprs or
>>> nullable_partexpr can be NULL when either of them is not populated.
>>> That maybe saves sizeof(pointer) * (number of keys) bytes of memory.
>>> Saving that much memory may not be worth the complexity of code. So,
>>> we may always allocate memory for those arrays and fill it with NIL
>>> values when there are no key expressions to populate those. That will
>>> simplify the code. I haven't done that change in this patchset. I was
>>> busy debugging the Q7 regression. Let me know your comments about
>>> that.
>>
>> Hmm, I'm not sure that's the best approach, but let me look at it more
>> carefully before I express a firm opinion.
>
> Having studied this a bit more, I now think your proposed approach is
> a good idea.

Thanks. Done.
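
A minimal sketch of the simplification agreed on above, assuming a toy
relation struct: both arrays are always allocated with one slot per partition
key and every slot starts out empty, so later code can index them without
first testing the array pointer. ToyPartRel, toy_init_part_arrays() and
toy_count_populated() are hypothetical names, and plain NULL stands in for
NIL; this is not the patch's code.

```c
#include <assert.h>
#include <stdlib.h>

/* Toy stand-in for the partitioning fields of a relation. */
typedef struct ToyPartRel
{
    int         partnatts;
    void      **partexprs;          /* one slot per key; NULL = no exprs */
    void      **nullable_partexprs; /* one slot per key; NULL = no exprs */
} ToyPartRel;

/* Always allocate both arrays.  Returns 1 on success, 0 on failure. */
int
toy_init_part_arrays(ToyPartRel *rel, int partnatts)
{
    int         i;

    assert(rel != NULL && partnatts > 0);
    rel->partnatts = partnatts;
    rel->partexprs = malloc(partnatts * sizeof(void *));
    rel->nullable_partexprs = malloc(partnatts * sizeof(void *));
    if (rel->partexprs == NULL || rel->nullable_partexprs == NULL)
    {
        free(rel->partexprs);
        free(rel->nullable_partexprs);
        rel->partexprs = NULL;
        rel->nullable_partexprs = NULL;
        return 0;
    }
    for (i = 0; i < partnatts; i++)
    {
        rel->partexprs[i] = NULL;
        rel->nullable_partexprs[i] = NULL;
    }
    return 1;
}

/* Count non-empty slots; note there is no "is the array NULL?" branch. */
int
toy_count_populated(const ToyPartRel *rel)
{
    int         i;
    int         count = 0;

    for (i = 0; i < rel->partnatts; i++)
    {
        if (rel->partexprs[i] != NULL)
            count++;
        if (rel->nullable_partexprs[i] != NULL)
            count++;
    }
    return count;
}
```

The cost is the two small arrays per relation mentioned in the quoted text;
the benefit is that every consumer can loop over both arrays unconditionally.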

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

From:	Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>, Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Partition-wise join for join between (declaratively) partitioned tables
Date:	2017-10-05 13:48:47
Message-ID:	20171005134847.shzldz2ublrb3ny2@alvherre.pgsql
Lists:	pgsql-hackers

Robert Haas wrote:

> Regarding nomenclature and my previous griping about wisdom, I was
> wondering about just calling this a "partition join" like you have in
> the regression test. So the GUC would be enable_partition_join, you'd
> have generate_partition_join_paths(), etc. Basically just delete
> "wise" throughout.

If I understand correctly, what's being used here is the "-wise" suffix,
unrelated to wisdom, which Merriam Webster lists as "adverb combining
form" here https://www.merriam-webster.com/dictionary/wise (though you
have to scroll down a lot), which is defined as

1 a :in the manner of * crabwise * fanwise
b :in the position or direction of * slantwise * clockwise
2 :with regard to :in respect of * dollarwise

According to that, the right way to write this is "partitionwise join"
(no dash), which means "join in respect of partitions", "join with
regard to partitions".

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
Cc:	Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>, Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Partition-wise join for join between (declaratively) partitioned tables
Date:	2017-10-05 13:49:46
Message-ID:	CA+Tgmob7cQrKxTcaUckmj6YSbbqsgOt+5pu5JjTPaKgHyOFFyA@mail.gmail.com
Lists:	pgsql-hackers

On Thu, Oct 5, 2017 at 9:48 AM, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com> wrote:
> Robert Haas wrote:
>> Regarding nomenclature and my previous griping about wisdom, I was
>> wondering about just calling this a "partition join" like you have in
>> the regression test. So the GUC would be enable_partition_join, you'd
>> have generate_partition_join_paths(), etc. Basically just delete
>> "wise" throughout.
>
> If I understand correctly, what's being used here is the "-wise" suffix,
> unrelated to wisdom, which Merriam Webster lists as "adverb combining
> form" here https://www.merriam-webster.com/dictionary/wise (though you
> have to scroll down a lot), which is defined as
>
> 1 a :in the manner of * crabwise * fanwise
> b :in the position or direction of * slantwise * clockwise
> 2 :with regard to :in respect of * dollarwise
>
> According to that, the right way to write this is "partitionwise join"
> (no dash), which means "join in respect of partitions", "join with
> regard to partitions".

I'm fine with that, if others like it.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
To:	Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
Cc:	Robert Haas <robertmhaas(at)gmail(dot)com>, Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Partition-wise join for join between (declaratively) partitioned tables
Date:	2017-10-05 17:03:30
Message-ID:	CAFjFpRcsZnxCen88a-16R5EYqPCwFYnFThM+mjagu=B1QvxPVA@mail.gmail.com
Lists:	pgsql-hackers

On Thu, Oct 5, 2017 at 7:18 PM, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com> wrote:
> Robert Haas wrote:
>
>> Regarding nomenclature and my previous griping about wisdom, I was
>> wondering about just calling this a "partition join" like you have in
>> the regression test. So the GUC would be enable_partition_join, you'd
>> have generate_partition_join_paths(), etc. Basically just delete
>> "wise" throughout.
>
> If I understand correctly, what's being used here is the "-wise" suffix,
> unrelated to wisdom, which Merriam Webster lists as "adverb combining
> form" here https://www.merriam-webster.com/dictionary/wise (though you
> have to scroll down a lot), which is defined as
>
> 1 a :in the manner of * crabwise * fanwise
> b :in the position or direction of * slantwise * clockwise
> 2 :with regard to :in respect of * dollarwise
>

That's right.

> According to that, the right way to write this is "partitionwise join"
> (no dash), which means "join in respect of partitions", "join with
> regard to partitions".

Google lists mostly "partition wise" or "partition-wise" and very
rarely "partitionwise". The first is used in other DBMS literature.
The paper I referred to [1] (there aren't many on this subject) uses
"partition-wise". It made more sense to replace " " or "-" with "_"
when syntax doesn't allow the first two. I am not against
"partitionwise", but I don't see any real reason why we should move
away from the popular usage of this term.

[1] https://users.cs.duke.edu/~shivnath/papers/sigmod295-herodotou.pdf

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

From:	Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Partition-wise join for join between (declaratively) partitioned tables
Date:	2017-10-06 06:48:47
Message-ID:	CAFjFpReEb3MQ3nobZW49vzZz_sYjCeiw+pFDdyNnpXG+StNWCw@mail.gmail.com
Lists:	pgsql-hackers

On Wed, Oct 4, 2017 at 9:04 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Tue, Oct 3, 2017 at 3:27 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>> I decided to skip over 0001 for today and spend some time looking at
>> 0002-0006.
>
> Back to 0001.
>
> + Enables or disables the query planner's use of partition-wise join
> + plans. When enabled, it spends time in creating paths for joins between
> + partitions and consumes memory to construct expression nodes to be used
> + for those joins, even if partition-wise join does not result in the
> + cheapest path. The time and memory increase exponentially with the
> + number of partitioned tables being joined and they increase linearly
> + with the number of partitions. The default is <literal>off</>.
>
> I think this is too scary and too much technical detail. I think you
> could just say something like: Enables or disables use of
> partition-wise join, which allows a join between partitioned tables to
> be performed by joining the matching partitions. Partition-wise join
> currently applies only when the join conditions include all the
> columns of the partition keys, which must be of the same data type and
> have exactly matching sets of child partitions. Because
> partition-wise join planning can use significantly increase CPU time
> and memory usage during planning, the default is <literal>off</>.

Done, with slight changes. "include all the columns of the partition
keys" has a different meaning when the partition key is an expression, so
I have used "include all the partition keys". I also changed the last
sentence to "... can use significantly more CPU time and memory during
planning ...". Please feel free to revert those changes if you don't
like them.

>
> +partitioned table. The join partners can not be found in other partitions. This
> +condition allows the join between partitioned tables to be broken into joins
> +between the matching partitions. The resultant join is partitioned in the same
>
> "The join partners can not be found in other partitions." is redundant
> with the previous sentence. I suggest deleting it. I also suggest
> "This condition allows the join between partitioned tables to be
> broken" -> "Because of this, the join between partitioned tables can
> be broken".

Done.

>
> +relation" for both partitioned table as well as join between partitioned tables
> +which can use partition-wise join technique.
>
> for either a partitioned table or a join between compatibly partitioned tables

Done.

>
> +Partitioning properties of a partitioned relation are stored in
> +PartitionSchemeData structure. Planner maintains a list of canonical partition
> +schemes (distinct PartitionSchemeData objects) so that any two partitioned
> +relations with same partitioning scheme share the same PartitionSchemeData
> +object. This reduces memory consumed by PartitionSchemeData objects and makes
> +it easy to compare the partition schemes of joining relations.
>
> Not all of the partitioning properties are stored in the
> PartitionSchemeData structure any more. I think this needs some
> rethinking and maybe some expansion. As written, each of the first
> two sentences needs a "the" at the beginning.

Changed to

The partitioning properties of a partitioned relation are stored in its
RelOptInfo. The information about data types of partition keys are stored in
PartitionSchemeData structure. The planner maintains a list of canonical
partition schemes (distinct PartitionSchemeData objects) so that RelOptInfo of
any two partitioned relations with same partitioning scheme point to the same
PartitionSchemeData object. This reduces memory consumed by
PartitionSchemeData objects and makes it easy to compare the partition schemes
of joining relations.

Let me know if this looks good.
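
To make the "canonical partition scheme" idea concrete, here is a minimal
sketch of interning, under the assumption that a scheme is fully described by
a strategy and the key data types. ToyScheme and toy_find_or_add_scheme() are
hypothetical names, not the structures in the patch. Because relations with
identical properties receive the same object, comparing two schemes reduces
to comparing two pointers.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define TOY_MAX_KEYS 4

/* Toy stand-in for a partition scheme; fields are illustrative only. */
typedef struct ToyScheme
{
    char        strategy;               /* e.g. 'r' for range, 'l' for list */
    int         partnatts;              /* number of partition keys */
    int         key_types[TOY_MAX_KEYS];    /* type identifier per key */
    struct ToyScheme *next;             /* next entry in canonical list */
} ToyScheme;

/*
 * Return the canonical scheme matching the given properties, adding a new
 * entry to the list if none matches.  Returns NULL only if out of memory.
 */
ToyScheme *
toy_find_or_add_scheme(ToyScheme **schemes, char strategy, int partnatts,
                       const int *key_types)
{
    ToyScheme  *s;

    assert(partnatts > 0 && partnatts <= TOY_MAX_KEYS);

    for (s = *schemes; s != NULL; s = s->next)
    {
        if (s->strategy == strategy &&
            s->partnatts == partnatts &&
            memcmp(s->key_types, key_types, partnatts * sizeof(int)) == 0)
            return s;           /* share the existing object */
    }

    s = malloc(sizeof(ToyScheme));
    if (s == NULL)
        return NULL;
    s->strategy = strategy;
    s->partnatts = partnatts;
    memcpy(s->key_types, key_types, partnatts * sizeof(int));
    s->next = *schemes;
    *schemes = s;
    return s;
}
```

Two relations with the same strategy and key types get back the same
pointer, so a check like "rel1's scheme == rel2's scheme" is a single pointer
comparison, and only one object is kept per distinct scheme.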

>
> + /*
> + * Create "append" paths for partitioned joins. Do this before
> + * creating GatherPaths so that partial "append" paths in
> + * partitioned joins will be considered.
> + */
>
> I think you could shorten this to a single-line comment and just keep
> the first sentence. Similarly in the other location where you have
> the same sort of thing.

Done.

>
> + * child-joins. Otherwise, add_path might delete a path that some "append"
> + * path has reference to.
>
> to which some path generated here has a reference.

Done.

>
> Here and elsewhere, you use "append" rather than Append to refer to
> the paths added. I suppose that's weasel-wording to work around the
> fact that they might be either Append or MergeAppend paths, but I'm
> not sure it's really going to convey that to anyone. I suggest
> rephrasing those comments more generically, e.g.:
>
> + /* Add "append" paths containing paths from child-joins. */
>
> You could say: Build additional paths for this rel from child-join paths.
>
> Or something.

Done. Removed word "append" from the comments in merge_clump(),
standard_join_search() and prologue of
generate_partition_wise_join_paths(). Changed the last comment as per
your suggestion.

>
> + if (!REL_HAS_ALL_PART_PROPS(rel))
> + return;
> Isn't this an unnecessarily expensive test? I mean, it shouldn't be
> possible for it to have some arbitrary subset.

All this function cares about is whether the given relation has any
partitions which can be simply checked by rel->nparts > 0 and
rel->part_rels != NULL. We need to explicitly check part_rels because
an outer join which has empty inner side in every pair will have
part_scheme, partbounds, nparts all set, but not part_rels. See
relevant comments in try_partition_wise_join() for more details. I
have now replaced macro with checks on rel->nparts and rel->part_rels.
This would change with the last patch dealing with partition-wise join
involving dummy relations. Once we have that an outer join like above
will also have part_rels set. But even then I think checking for
part_rels and nparts makes more sense than part_scheme and partbounds.
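
The replacement check can be sketched as follows. ToyJoinRel and
toy_has_usable_partitions() are hypothetical names for illustration; only the
two conditions (nparts > 0 and part_rels != NULL) come from the explanation
above.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for the two fields the check looks at. */
typedef struct ToyJoinRel
{
    int         nparts;     /* number of partitions, 0 if unpartitioned */
    void      **part_rels;  /* per-partition relations, or NULL */
} ToyJoinRel;

/*
 * A relation has partitions to work with only if it reports a positive
 * partition count and the per-partition array has actually been built.
 */
int
toy_has_usable_partitions(const ToyJoinRel *rel)
{
    assert(rel != NULL);
    return rel->nparts > 0 && rel->part_rels != NULL;
}
```

The second condition covers the outer-join case described above, where
nparts is set but part_rels is not.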

>
> + /*
> + * Every pair of joining relations we see here should have an equi-join
> + * between partition keys if this join has been deemed as a partitioned
> + * join. See build_joinrel_partition_info() for reasons.
> + */
> + Assert(have_partkey_equi_join(rel1, rel2, parent_sjinfo->jointype,
> + parent_restrictlist));
>
> I suggest removing this assertion. Seems like overkill to me.

I thought it was good to have there to catch any bug breaking that
rule. But I have removed it as per your suggestion.
Do you think we should remove following assertions as well?
/*
* Since we allow partition-wise join only when the partition bounds of
* the joining relations exactly match, the partition bounds of the join
* should match those of the joining relations.
*/
Assert(partition_bounds_equal(joinrel->part_scheme->partnatts,
joinrel->part_scheme->parttyplen,
joinrel->part_scheme->parttypbyval,
joinrel->boundinfo, rel1->boundinfo));
Assert(partition_bounds_equal(joinrel->part_scheme->partnatts,
joinrel->part_scheme->parttyplen,
joinrel->part_scheme->parttypbyval,
joinrel->boundinfo, rel2->boundinfo));

>
> + child_sjinfo = build_child_join_sjinfo(root, parent_sjinfo,
> + child_rel1->relids,
> + child_rel2->relids);
>
> It seems like we might end up doing this multiple times for the same
> child join, if there are more than 2 tables involved. Not sure if
> there's a good way to avoid that.

IIUC every pair of joining relations will use a different sjinfo. A
LEFT JOIN B LEFT JOIN C will have two sjinfos, one for AB and the other
for BC. For ABC we will use the one for AB to join A with BC and we will
use the one for BC to join AB with C. I agree that we are building sjinfo
for AB twice, once for joining AB and then for A(BC). In order to avoid
that we will have to somehow link the parent sjinfo with the child sjinfo
and avoid translating the parent sjinfo again and again. Maybe the parent
sjinfo can contain a cache of child sjinfos. Do we want to do that in
this patch set? We could avoid translation entirely if we could use the
parent sjinfo for joining children. But that's pretty deep surgery.

> Similarly for child_restrictlist.

Similarly for restrictlist. Every joining pair has a different
restrictlist. Otherwise, we would have saved restrictlist in the
joinrel itself.

>
> + pk_has_clause = (bool *) palloc0(sizeof(bool) * num_pks);
>
> Just do bool pk_has_clause[PARTITION_MAX_KEYS] instead. Stack
> allocation is a lot faster, and then you don't need to pfree it.

That's a good idea. Done.
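
For illustration, here is a minimal sketch of the stack-array pattern suggested above. DEMO_MAX_KEYS stands in for PARTITION_MAX_KEYS (its value here is an assumption), and demo_all_keys_have_clause is a hypothetical helper, not the function from the patch:

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define DEMO_MAX_KEYS 32    /* stand-in for PARTITION_MAX_KEYS (assumed) */

/*
 * Return true if each of the first num_pks partition keys is covered.
 * covered_keys lists the key indexes for which a clause was found.
 */
static bool
demo_all_keys_have_clause(int num_pks, const int *covered_keys, int ncovered)
{
    bool pk_has_clause[DEMO_MAX_KEYS];  /* on the stack: no palloc, no pfree */
    int  i;

    assert(num_pks >= 0 && num_pks <= DEMO_MAX_KEYS);
    memset(pk_has_clause, 0, sizeof(pk_has_clause));

    /* Mark every key for which some clause was seen. */
    for (i = 0; i < ncovered; i++)
    {
        if (covered_keys[i] >= 0 && covered_keys[i] < num_pks)
            pk_has_clause[covered_keys[i]] = true;
    }

    /* Every key must have been marked. */
    for (i = 0; i < num_pks; i++)
    {
        if (!pk_has_clause[i])
            return false;
    }
    return true;
}
```

Because the array has a small compile-time bound, it can simply go out of scope on return, which is what removes the need for an explicit free.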

>
> + /* Remove the relabel decoration. */
>
> the -> any, decoration -> decorations

Done.

>
> + /*
> + * Replace the Var nodes of parent with those of children in expressions.
> + * This function may be called within a temporary context, but the
> + * expressions will be shallow-copied into the plan. Hence copy those in
> + * the planner's context.
> + */
>
> I can't make heads or tails of this comment.

Haha! My bad. The second sentence is something left over from the code where
the child-joins used to be planned in a temporary memory context.
That's not true any more. Removed the entire comment.

>
> --- a/src/backend/optimizer/util/pathnode.c
> +++ b/src/backend/optimizer/util/pathnode.c
> @@ -23,7 +23,9 @@
> #include "optimizer/pathnode.h"
> #include "optimizer/paths.h"
> #include "optimizer/planmain.h"
> +#include "optimizer/prep.h"
> #include "optimizer/restrictinfo.h"
> +#include "optimizer/tlist.h"
> #include "optimizer/var.h"
> #include "parser/parsetree.h"
> #include "utils/lsyscache.h"
>
> Maybe not needed? This is the only hunk in this file? Or should this
> be part of one of the later patches?

I think 0005. Sorry. I will move it there.

>
> + Assert(IS_JOIN_REL(childrel) && IS_JOIN_REL(parentrel));
> +
> + /* Ensure child relation is really what it claims to be. */
> + Assert(IS_OTHER_REL(childrel));
>
> I suggest tightening this up a bit by removing the comment and the
> blank line that precedes it.

Done.

>
> + foreach(lc, parentrel->reltarget->exprs)
> + {
> + PlaceHolderVar *phv = lfirst(lc);
> +
> + if (IsA(phv, PlaceHolderVar))
> + {
> + /*
> + * In case the placeholder Var refers to any of the parent
> + * relations, translate it to refer to the corresponding child.
> + */
> + if (bms_overlap(phv->phrels, parentrel->relids) &&
> + childrel->reloptkind == RELOPT_OTHER_JOINREL)
> + {
> + phv = (PlaceHolderVar *) adjust_appendrel_attrs(root,
> + (Node *) phv,
> + nappinfos,
> + appinfos);
> + }
> +
> + childrel->reltarget->exprs = lappend(childrel->reltarget->exprs,
> + phv);
> + phv_added = true;
> + }
> + }
>
> What if the PHV is buried down inside the expression someplace rather
> than being at the top level?

That can't happen. See add_placeholders_to_joinrel(), which adds these
placeholders to joinrel's target. That function adds PHVs as bare
nodes, not embedded into something else.

> More generally, why are we not just
> applying adjust_appendrel_attrs() to the whole expression?

Usually targetlists of join have Var nodes which bubble up from the
base relations. Even PHVs bubble up from the lowest join where they
can be evaluated. If we translate reltarget, we will allocate new Var
nodes for every join relation consuming more memory and then setrefs
will need to compare the contents of those nodes instead of just
pointer comparison. We use this code and attr_needed to avoid memory
consumption and setref's CPU consumption.

>
> + /* Adjust the cost and width of child targetlist. */
> + if (phv_added)
> + {
> + childrel->reltarget->cost.startup = parentrel->reltarget->cost.startup;
> + childrel->reltarget->cost.per_tuple = parentrel->reltarget->cost.per_tuple;
> + childrel->reltarget->width = parentrel->reltarget->width;
> + }
>
> Making this conditional on phv_added is probably not saving anything.
> Branches are expensive.

Ok.

If there are no PHVs in the query, i.e. when root->placeholders_list
is NIL, we don't need to scan reltarget->exprs. I have added that
optimization.

>
> /*
> * Otherwise, anything in a baserel or joinrel targetlist ought to be
> - * a Var. (More general cases can only appear in appendrel child
> - * rels, which will never be seen here.)
> + * a Var or ConvertRowtypeExpr. For either of those, find the original
> + * baserel where they originate.
> */
>
> Hmm, but now we could potentially see an appendrel child rel here, so
> don't we need to worry about more general cases? If not, let's
> explain why not.

By more general cases, that comment means ConvertRowtypeExpr or
RowExpr, nothing else. A base relation's tlist can have only Var nodes
when it reaches this comment. When a parent Var node is subjected to
adjust_appendrel_attrs() it is translated to a Var node for all
varattnos except 0, which indicates a whole-row var. For a child
table, a whole-row var is always a named row type and hence gets
translated to a ConvertRowtypeExpr. Other kinds of children (subqueries
in union etc.) cannot appear here since they do not participate in a
join directly. So it's really either a Var or a ConvertRowtypeExpr. I
have modified the comment to explain this.

>
> + * if, it's a ConvertRowtypeExpr, it will be computed only for the
>
> American usage does not put a comma after if like this (unless you are
> writing if, for example, blah blah blah -- but there the
> commas are to surround for example, not due to the if itself).

That comma was unintentional. Removed.

>
> +/*
> + * build_joinrel_partition_info
> + * If the join between given partitioned relations is possibly partitioned
> + * set the partitioning scheme and partition keys expressions for the
> + * join.
> + *
> + * If the two relations have same partitioning scheme, their join may be
> + * partitioned and will follow the same partitioning scheme as the joining
> + * relations.
> + */
>
> I think you could drop the primary comment block and use the secondary
> block as the primary one. That is, get rid of "If the join
> between..." and promote "If the two relations...".

Done.

>
> + * The join is not partitioned, if any of the relations being joined are
>
> Another comma that's not typical of American usage.

Done.

>
> + * For an N-way inner join, where every syntactic inner join has equi-join
>
> has -> has an
>
> + * For an N-way join with outer joins, where every syntactic join has an
> + * equi-join between partition keys and a matching partitioning scheme,
> + * outer join reordering identities in optimizer/README imply that only
> + * those pairs of join are legal which have an equi-join between partition
> + * keys. Thus every pair of joining relations we see for this join should
> + * have an equi-join between partition keys if this join has been deemed
> + * as a partitioned join.
>
> In line 2, partition keys -> the partition keys
> In line 3, outer join -> the outer join
>
> "pairs of join" sounds wrong too, although I'm not sure how to reword it.
>
> More broadly: I don't think I understand this comment. The statement
> about "those pairs of join are legal which have an equi-join between
> partition keys" doesn't match my understanding e.g. A IJ B ON A.x =
> B.x LJ C ON A.x = C.x surely allows a B-C join, but there's no such
> clause syntactically.
>
> Maybe you could replace this whole comment block with something like
> this: We can only consider this join as an input to further
> partition-wise joins if (a) the input relations are partitioned, (b)
> the partition schemes match, and (c) we can identify an equi-join
> between the partition keys. Note that if it were possible for
> have_partkey_equi_join to return different answers for the same
> joinrel depending on which join ordering we try first, this logic
> would break. That shouldn't happen, though, because of the way the
> query planner deduces implied equalities.

Hmm. I meant the second para to be read in the context of the first.
Since AB is an inner join, A.x and B.x are replaceable (I forgot the
correct term, identity?) and thus A.x = C.x implies B.x = C.x, which
allows join BC. But I think your version of the comment is easier to
understand. I also think it should refer to the way the planner
reorders joins; that's what causes us to worry about every join order
being partitioned. I think we should redirect a reader who wants to
understand more about implied equalities and join orders to
optimizer/README. So, I have changed the last sentence to read "That
shouldn't happen, though, because of the way the query planner deduces
implied equalities and reorders joins. See optimizer/README for
details." If you don't like my changes, please feel free to drop
those.

In the code block following this comment, I have used shorter variable
names instead of accurate but long ones. E.g. outer_expr should have
been outer_partexpr and outer_null_expr should have been
outer_nullable_partexpr. Please feel free to change those if you don't
like them or let me know if you have any better ideas and I will
update the patch with those ideas.

>
> + * Join relation is partitioned using same partitioning scheme as the
> + * joining relations and has same bounds.
>
> the same partitioning scheme

Done.

>
> + * An INNER join between two partitioned relations is partitioned by key
> + * expressions from both the relations. For tables A and B partitioned by
> + * a and b respectively, (A INNER JOIN B ON A.a = B.b) is partitioned by
> + * both A.a and B.b.
> + *
> + * A SEMI/ANTI join only retains data from the outer side and is
> + * partitioned by the partition keys of the outer side.
>
> I would write: An INNER join between two partitioned relations can be
> regarded as partitioned by either key expression. For example, A
> INNER JOIN B ON A.a = B.b can be regarded as partitioned on A.a or on
> B.b; they are equivalent. For a SEMI or ANTI join, the result can
> only be regarded as being partitioned in the same manner as the outer
> side, since the inner columns are not retained.

Done.

>
> + * An OUTER join like (A LEFT JOIN B ON A.a = B.b) may produce rows with
> + * B.b NULL. These rows may not fit the partitioning conditions imposed on
> + * B.b. Hence, strictly speaking, the join is not partitioned by B.b.
>
> Good.
>
> + * Strictly speaking, partition keys of an OUTER join should include
> + * partition key expressions from the OUTER side only. Consider a join
>
> I would join this with the previous sentence instead of repeating
> strictly speaking: ...and thus the partition keys should include
> partition key expressions from the OUTER side only. After that
> sentence, I'd skip a lot of the intermediate words here and continue
> this way: However, because all commonly-used comparison operators are
> strict, the presence of nulls on the outer side doesn't cause any
> problem; they can't match anything at future join levels anyway.
> Therefore, we track two sets of expressions: those that authentically
> partition the relation (partexprs) and those that partition the
> relation with the exception that extra nulls may be present
> (nullable_partexprs). When the comparison operator is strict, the
> latter is just as good as the former.
>
> Then, I think you can omit the rest of what you have; it should be
> clear enough what's going on for the full and right cases given that
> explanation.

I liked this version. Changed the comments as per your suggestions.

>
> + * being joined. partexprs and nullable_partexprs are arrays containing part_scheme->partnatts
>
> Long line, needs reflowing.

Done. Also fixed a grammatical mistake: contains -> contain in the
last line of that paragraph.

>
> I don't think this is too far from being committable. You've done
> some nice work here!
>

Thanks a lot for your detailed reviews and guidance. I will post the
updated patchset with my next reply.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

From:	Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Partition-wise join for join between (declaratively) partitioned tables
Date:	2017-10-06 11:39:51
Message-ID:	CAFjFpReEiCDi46PaoLpX_Wf6=+4VGbC6B+Hu_r=hXLNKWOVszQ@mail.gmail.com
Lists:	pgsql-hackers

On Thu, Oct 5, 2017 at 12:24 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>
>> We need to reparameterize any path which contains further paths and/or
>> contains expressions that point to the parent relation. For a given
>> path we need to reparameterize any paths that it contains and
>> translate any expressions that are specific to that path. Expressions
>> common across the paths are translated after the switch case. I have
>> added this rule to the comment just above the switch case
>> /*
>> * Copy of the given path. Reparameterize any paths referenced by the given
>> * path. Replace parent Vars in path specific expressions by corresponding
>> * child Vars.
>> */
>> Does that look fine or we want to add explanation for every node handled here.
>
> No, I don't think we want something for every node, just a general
> explanation at the top of the function. Maybe something like this:
>
> Most fields from the original path can simply be flat-copied, but any
> expressions must be adjusted to refer to the correct varnos, and any
> paths must be recursively reparameterized. Other fields that refer to
> specific relids also need adjustment.

Done.

>
>>> I don't see much point in the T_SubqueryScanPath and T_ResultPath
>>> cases in reparameterize_path_by_child(). It's just falling through to
>>> the default case.
>>
>> I added those cases separately to explain why we should not see those
>> cases in that switch case. I think that explanation is important
>> (esp. considering your comment above) and associating those comment
>> with "case" statement looks better. Are you suggesting that we should
>> add that explanation in default case?
>
> Or leave the explanation out altogether.

Ok. Removed the explanation and the cases.

>
>>> I wonder if reparameterize_path_by_child() ought to default to
>>> returning NULL rather than throwing an error; the caller would then
>>> have to be prepared for that and skip building the path. But that
>>> would be more like what reparameterize_path() does, and it would make
>>> failure to include some relevant path type here a corner-case
>>> performance bug rather than a correctness issue. It seems like
>>> someone adding a new path type could quite easily fail to realize that
>>> it might need to be added here, or might be unsure whether it's
>>> necessary to add it here.
>>
>> I am OK with that. However reparameterize_path_by_child() and
>> reparameterize_paths_by_child() are callers of
>> reparameterize_path_by_child() so they will need to deal with NULL
>> return. I am fine with that too, but making sure that we are on the
>> same page. If we do that, we could simply assert that the switch case
>> doesn't see T_SubqueryScanPath and T_ResultPath.
>
> Or do nothing at all about those cases.
>
> I noticed today that the version of the patchset I have here says in
> the header comments for reparameterize_path_by_child() that it returns
> NULL if it can't reparameterize, but that's not what it actually does.
> If you make this change, the existing comment will become correct.
>
> The problem with the NULL return convention is that it's not very
> convenient when this function is recursing. Maybe we should change
> this function's signature to be bool
> reparameterize_path_by_child(PlannerInfo *root, RelOptInfo *child_rel,
> Path **path); then you could do, e.g. if
> (!reparameterize_path_by_child(root, child_rel, &bhpath->bitmapqual))
> return;
>
> But I don't really like that approach; it's still quite long-winded.
> Instead, I suggest Stupid Macro Tricks:
>
> #define ADJUST_CHILD_ATTRS(val) \
> val = (List *) adjust_appendrel_attrs_multilevel((Node *) val, child_rel->relids, child_rel->top_parent_relids);

It so happens that every node we subject to
adjust_appendrel_attrs_multilevel is a List, so this is ok. In case we
need to adjust some other type of node in future, we will pass the node
type too. For now, I have used the macro with (List *) hardcoded
there. Do we write the whole macro on the same line even if it
overflows? I see that being done for CONSIDER_PATH_STARTUP_COST
defined in the same file and you also seem to suggest the same. But
macros at other places are indented. For now, I have indented the
macro on multiple lines.

>
> #define REPARAMETERIZE_CHILD_PATH(val) \
> val = reparameterize_path_by_child(root, val, child_rel); \
> if (val == NULL) \
> return NULL;
>
> #define REPARAMETERIZE_CHILD_PATH_LIST(val) \
> if (val != NIL) \
> { \
> val = reparameterize_pathlist_by_child(root, val, child_rel); \
> if (val == NIL) \
> return NULL; \
> }

I added do { } while (0) around these code blocks like other places.
Please feel free to remove it if you think that's not needed.
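
To show why the do { } while (0) wrapper matters, here is a minimal sketch under simplified assumptions. DemoPath, DemoJoin, demo_reparameterize and DEMO_REPARAMETERIZE are hypothetical stand-ins, not the planner's types or the macros in the patch:

```c
#include <assert.h>
#include <stddef.h>

typedef struct DemoPath { int param; } DemoPath;
typedef struct DemoJoin { DemoPath *outer; DemoPath *inner; } DemoJoin;

/* Hypothetical: gives up (returns NULL) on a NULL or negative-param path. */
static DemoPath *
demo_reparameterize(DemoPath *p)
{
    if (p == NULL || p->param < 0)
        return NULL;
    p->param += 100;
    return p;
}

/*
 * Wrapping the body in do { } while (0) makes the macro behave as a single
 * statement, so "MACRO(x);" is safe even in an unbraced if/else.
 */
#define DEMO_REPARAMETERIZE(val) \
    do { \
        (val) = demo_reparameterize(val); \
        if ((val) == NULL) \
            return NULL; \
    } while (0)

static DemoJoin *
demo_reparameterize_join(DemoJoin *j, int want_inner)
{
    DEMO_REPARAMETERIZE(j->outer);
    if (want_inner)
        DEMO_REPARAMETERIZE(j->inner);  /* would not compile with bare { } */
    else
        j->inner = NULL;
    return j;
}
```

With a bare { ... } body, the semicolon after the macro call would end the if statement and leave the else dangling, which is a compile error.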

>
> With that, a complicated case like T_NestPath becomes just:
>
> JoinPath *jpath;
>
> FLAT_COPY_PATH(jpath, path, NestPath);
> REPARAMETERIZE_CHILD_PATH(jpath->outerjoinpath);
> REPARAMETERIZE_CHILD_PATH(jpath->innerjoinpath);
> ADJUST_CHILD_ATTRS(jpath->joinrestrictinfo);
> new_path = (Path *) jpath;
>
> Now, I admit that hiding stuff inside the macro definitions like that
> is ugly. But I think it's still better than repeating boilerplate
> code with finicky internal bits lots of times.

I too do not like hiding stuff under macros since that makes debugging
hard, but with these macros the code looks really elegant. Thanks for the
suggestion.

Also fixed some lines overflowing character limit.

>
>> Yes, I too am thinking about the same. The only reason I have EXPLAIN
>> output there is to check whether partition-wise join is being used or
>> not. The testcase is not interested in the actual shape. It doesn't
>> make sense to just test the output if partition-wise join is not used.
>> Maybe a function examining the plan tree would help. The function
>> will have to handle Result/Sort nodes on top and make sure that Append
>> has join children. Do you have any other idea to check the shape of
>> the plan tree without the details? Any EXPLAIN switch, existing
>> functions etc.?
>
> No, not really. We may just need to be prepared to fix whatever breaks.

Sure.

>
> Instead of "multi-leveled partitions" it might read better to say
> "multiple levels of partitioning".

Done.

Here's updated set of patches, rebased on top of the latest head.
--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Attachment	Content-Type	Size
pg_dp_join_patches_v36.tar.gz	application/x-gzip	122.7 KB

From:	Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Partition-wise join for join between (declaratively) partitioned tables
Date:	2017-10-06 11:47:41
Message-ID:	CAFjFpRfGG9K3071fHSmv0bHBRbd_RdOWmW9UDjXeRM1vnyiyeg@mail.gmail.com
Lists:	pgsql-hackers

On Fri, Oct 6, 2017 at 5:09 PM, Ashutosh Bapat
<ashutosh(dot)bapat(at)enterprisedb(dot)com> wrote:
>
> Here's updated set of patches, rebased on top of the latest head.

In this patchset reparameterize_pathlist_by_child() ignores NULL
return from reparameterize_path_by_child(). Fixed that in the attached
patchset.
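
The fix amounts to propagating a failure from any single element to the whole list. A minimal sketch of that idea, using a plain array instead of a List and hypothetical names (PlPath, pl_reparameterize, pl_reparameterize_list), none of which come from the patch:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

typedef struct PlPath { int supported; } PlPath;

/* Hypothetical per-path step: NULL means "cannot reparameterize". */
static PlPath *
pl_reparameterize(PlPath *p)
{
    return (p != NULL && p->supported) ? p : NULL;
}

/*
 * Reparameterize n (> 0) paths. Returns a malloc'd result array that the
 * caller must free, or NULL as soon as any element fails. Ignoring a
 * NULL element here is the kind of bug described above.
 */
static PlPath **
pl_reparameterize_list(PlPath **paths, size_t n)
{
    PlPath **result;
    size_t   i;

    assert(n > 0);  /* empty lists are skipped by the caller */
    result = malloc(n * sizeof(*result));
    if (result == NULL)
        return NULL;

    for (i = 0; i < n; i++)
    {
        PlPath *p = pl_reparameterize(paths[i]);

        if (p == NULL)
        {
            free(result);   /* one failure fails the whole list */
            return NULL;
        }
        result[i] = p;
    }
    return result;
}
```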

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Attachment	Content-Type	Size
pg_dp_join_patches_v37.tar.gz	application/x-gzip	122.7 KB

From:	Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Partition-wise join for join between (declaratively) partitioned tables
Date:	2017-10-06 12:40:59
Message-ID:	CAFjFpRcitjfrULr5jfuKWRPsGUX0LQ0k8-yG0Qw2+1LBGNpMdw@mail.gmail.com
Lists:	pgsql-hackers

On Fri, Oct 6, 2017 at 5:17 PM, Ashutosh Bapat
<ashutosh(dot)bapat(at)enterprisedb(dot)com> wrote:
> On Fri, Oct 6, 2017 at 5:09 PM, Ashutosh Bapat
> <ashutosh(dot)bapat(at)enterprisedb(dot)com> wrote:
>>
>> Here's updated set of patches, rebased on top of the latest head.
>
> In this patchset reparameterize_pathlist_by_child() ignores NULL
> return from reparameterize_path_by_child(). Fixed that in the attached
> patchset.
>

Sorry. I sent a wrong file. Here's the real v37.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Attachment	Content-Type	Size
pg_dp_join_patches_v37.tar.gz	application/x-gzip	122.8 KB

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
Cc:	Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Partition-wise join for join between (declaratively) partitioned tables
Date:	2017-10-06 15:15:04
Message-ID:	CA+TgmoZJjFgmoWgO-T1=qcVQJ-_wyet+aUSkxr=04VossEX6Eg@mail.gmail.com
Lists:	pgsql-hackers

On Fri, Oct 6, 2017 at 8:40 AM, Ashutosh Bapat
<ashutosh(dot)bapat(at)enterprisedb(dot)com> wrote:
> Sorry. I sent a wrong file. Here's the real v37.

Committed 0001-0006. I made some assorted comment and formatting
changes and two small substantive changes:

- In try_nestloop_path, bms_free(outerrelids) before returning if we
can't reparameterize.

- Moved the call to try_partition_wise_join inside
populate_joinrel_with_paths, instead of always calling it just after
that function is called.

I think this is very good work and I'm excited about the feature. Now
I'll wait to see whether the buildfarm, or Tom, yell at me for
whatever problems this may still have...
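
The first change is the usual "release temporaries on every exit path" pattern. A rough sketch of that pattern, with a hypothetical heap-allocated set standing in for a Bitmapset (demo_set_make, demo_set_free and demo_try_path are invented names, not the committed code):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

static int demo_live_sets = 0;  /* outstanding allocations, for checking */

/* Hypothetical stand-in for building a relid set. */
static unsigned *
demo_set_make(unsigned bits)
{
    unsigned *s = malloc(sizeof(*s));

    if (s != NULL)
    {
        *s = bits;
        demo_live_sets++;
    }
    return s;
}

/* Hypothetical stand-in for bms_free(); accepts NULL. */
static void
demo_set_free(unsigned *s)
{
    if (s != NULL)
    {
        free(s);
        demo_live_sets--;
    }
}

/* Returns true if a path would have been built. */
static bool
demo_try_path(unsigned outer_bits, bool can_reparameterize)
{
    unsigned *outerrelids = demo_set_make(outer_bits);

    if (outerrelids == NULL)
        return false;

    if (!can_reparameterize)
    {
        demo_set_free(outerrelids); /* free before the early return */
        return false;
    }

    /* ... the path would be built here ... */
    demo_set_free(outerrelids);
    return true;
}
```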

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Partition-wise join for join between (declaratively) partitioned tables
Date:	2017-10-06 15:48:02
Message-ID:	CAFjFpRcrF3RqKO=LnS0u9xqD0Zu3O0OGCUfvvfO8s0m5tuk8fQ@mail.gmail.com
Lists:	pgsql-hackers

On Fri, Oct 6, 2017 at 8:45 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Fri, Oct 6, 2017 at 8:40 AM, Ashutosh Bapat
> <ashutosh(dot)bapat(at)enterprisedb(dot)com> wrote:
>> Sorry. I sent a wrong file. Here's the real v37.
>
> Committed 0001-0006. I made some assorted comment and formatting
> changes and two small substantive changes:
>
> - In try_nestloop_path, bms_free(outerrelids) before returning if we
> can't reparameterize.

Hmm. I missed that.

>
> - Moved the call to try_partition_wise_join inside
> populate_joinrel_with_paths, instead of always calling it just after
> that function is called.

This looks good too.

>
> I think this is very good work and I'm excited about the feature.

Thanks a lot Robert for the detailed review and guidance. Thanks a lot
Rafia for benchmarking the feature with TPCH and esp. a very large scale
database and also for testing and reporting some real issues. Thanks
Rajkumar for testing it with an exhaustive testset. Thanks Amit
Langote, Thomas Munro, Dilip Kumar, Antonin Houska, Alvaro Herrera and
Amit Khandekar for their review comments and suggestions. Thanks
Jeevan Chalke, who used the patchset to implement partition-wise
aggregates and provided some insights offlist. Sorry if I have missed
anybody.

As Robert says in the commit message, there's more to do, but now that
we have the basic feature, improving it incrementally becomes a lot
easier.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

From:	Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Partition-wise join for join between (declaratively) partitioned tables
Date:	2017-10-06 19:07:51
Message-ID:	CAFjFpRcRBqoKLZSNmRsjKr81uEP=ennvqSQaXVCCBTXvJ2rW+Q@mail.gmail.com
Lists:	pgsql-hackers

On Fri, Oct 6, 2017 at 8:45 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>
> I think this is very good work and I'm excited about the feature. Now
> I'll wait to see whether the buildfarm, or Tom, yell at me for
> whatever problems this may still have...
>

Buildfarm animal prion turned red. Before going into that failure,
the good news is that the other animals are green. So the plans are
stable.

prion runs the regression with -DRELCACHE_FORCE_RELEASE, which
destroys a relcache entry as soon as its reference count drops down to
0. This destroys everything that's there in corresponding relcache
entry including partition key information and partition descriptor
information. find_partition_scheme() and set_relation_partition_info()
both assume that the relcache information will survive as long as the
relation lock is held. They do not copy the relevant partitioning
information but just copy the pointers. That assumption is wrong.
Because of -DRELCACHE_FORCE_RELEASE, as soon as the refcount drops to
zero, the data in the partition scheme and partition bounds goes invalid
and various checks to see if partition-wise join is possible fail.
That causes the partition_join test to fail on prion. But I think the bug
could cause a crash as well.

The fix is to copy the relevant partitioning information from relcache
into PartitionSchemeData and RelOptInfo. Here's a quick patch with
that fix.
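
To illustrate the difference between keeping a pointer into the cache and copying the data, here is a minimal sketch. CacheKey and SchemeCopy are hypothetical, cut-down stand-ins (the field names partnatts and parttyplen mirror the ones mentioned in this thread, but the real structures and the real patch differ):

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical cache-owned data; may be freed when the refcount hits 0. */
typedef struct CacheKey
{
    int  partnatts;
    int *parttyplen;
} CacheKey;

/* Hypothetical planner-owned copy that outlives the cache entry. */
typedef struct SchemeCopy
{
    int  partnatts;
    int *parttyplen;
} SchemeCopy;

/* Deep-copy the array instead of storing key->parttyplen itself. */
static SchemeCopy *
copy_scheme_from_cache(const CacheKey *key)
{
    SchemeCopy *scheme;
    size_t      nbytes;

    assert(key->partnatts > 0);
    nbytes = (size_t) key->partnatts * sizeof(int);

    scheme = malloc(sizeof(*scheme));
    if (scheme == NULL)
        return NULL;

    scheme->partnatts = key->partnatts;
    scheme->parttyplen = malloc(nbytes);
    if (scheme->parttyplen == NULL)
    {
        free(scheme);
        return NULL;
    }
    memcpy(scheme->parttyplen, key->parttyplen, nbytes);
    return scheme;
}

static void
free_scheme(SchemeCopy *scheme)
{
    free(scheme->parttyplen);
    free(scheme);
}
```

Once the copy exists, releasing or overwriting the cache-owned array no longer affects the planner's view of the partitioning scheme.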

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Attachment	Content-Type	Size
pwj_copy_partinfo.patch	text/x-patch	5.6 KB

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
Cc:	Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Partition-wise join for join between (declaratively) partitioned tables
Date:	2017-10-06 19:34:05
Message-ID:	CA+TgmoY2g8s=GboXxS3+31_LF+D68FrPm_tF6VVJkfsznbx3EA@mail.gmail.com
Lists:	pgsql-hackers

On Fri, Oct 6, 2017 at 3:07 PM, Ashutosh Bapat
<ashutosh(dot)bapat(at)enterprisedb(dot)com> wrote:
> On Fri, Oct 6, 2017 at 8:45 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>> I think this is very good work and I'm excited about the feature. Now
>> I'll wait to see whether the buildfarm, or Tom, yell at me for
>> whatever problems this may still have...
>
> Buildfarm animal prion turned red. Before going into that failure,
> the good news is that the other animals are green. So the plans are
> stable.
>
> prion runs the regression with -DRELCACHE_FORCE_RELEASE, which
> destroys a relcache entry as soon as its reference count drops down to
> 0. This destroys everything that's there in corresponding relcache
> entry including partition key information and partition descriptor
> information. find_partition_scheme() and set_relation_partition_info()
> both assume that the relcache information will survive as long as the
> relation lock is held. They do not copy the relevant partitioning
> information but just copy the pointers. That assumption is wrong.
> Because of -DRELCACHE_FORCE_RELEASE, as soon as the refcount drops to
> zero, the data in the partition scheme and partition bounds goes invalid
> and various checks to see if partition-wise join is possible fail.
> That causes the partition_join test to fail on prion. But I think the bug
> could cause a crash as well.
>
> The fix is to copy the relevant partitioning information from relcache
> into PartitionSchemeData and RelOptInfo. Here's a quick patch with
> that fix.

Committed. I hope that makes things less red rather than more,
because I'm going to be AFK for a few hours anyway.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Partition-wise join for join between (declaratively) partitioned tables
Date:	2017-10-09 06:05:24
Message-ID:	CAFjFpRcPvT5ay9_p3e-k2Cwu4bW_rypON7ceJVWhsU3Uk4Nmmg@mail.gmail.com
Lists:	pgsql-hackers

On Sat, Oct 7, 2017 at 1:04 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>
> Committed. I hope that makes things less red rather than more,
> because I'm going to be AFK for a few hours anyway.
>

Here's the last patch, dealing with the dummy relations, rebased. With
this fix, every join order of a partitioned join can be considered
partitioned. (This wasn't the case earlier when a dummy relation was
involved.) So we can allocate the child-join RelOptInfo array in
build_joinrel_partition_info(), instead of waiting for an appropriate
pair to arrive in try_partition_wise_join().
--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Attachment	Content-Type	Size
0001-Support-partition-wise-join-for-dummy-partitioned-re.patch	text/x-patch	4.2 KB

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
Cc:	Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Partition-wise join for join between (declaratively) partitioned tables
Date:	2017-10-11 14:17:57
Message-ID:	CA+TgmoYzat-HQ-Qno+n_QgkBwq2GBL=9u5v0uEe1MmJyFPmG+w@mail.gmail.com
Lists:	pgsql-hackers

On Mon, Oct 9, 2017 at 2:05 AM, Ashutosh Bapat
<ashutosh(dot)bapat(at)enterprisedb(dot)com> wrote:
> On Sat, Oct 7, 2017 at 1:04 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>> Committed. I hope that makes things less red rather than more,
>> because I'm going to be AFK for a few hours anyway.
>
> Here's the last patch, dealing with the dummy relations, rebased. With
> this fix every join order of a partitioned join can be considered
> partitioned. (This wasn't the case earlier when dummy relation was
> involved.). So, we can allocate the child-join RelOptInfo array in
> build_joinrel_partition_info(), instead of waiting for an appropriate
> pair to arrive in try_partition_wise_join().

Wouldn't a far more general approach be to allow a partition-wise join
between a partitioned table and an unpartitioned table, considering
the result as partitioned? That seems like it would very often yield
much better query plans than what we have right now, and also make the
need for this particular thing go away.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Partition-wise join for join between (declaratively) partitioned tables
Date:	2017-10-12 02:43:08
Message-ID:	CAFjFpRc+=tk-2Bihjvig8S-=4G8y7eerH6swQz6-NuXmJN0U4Q@mail.gmail.com
Lists:	pgsql-hackers

On Wed, Oct 11, 2017 at 7:47 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Mon, Oct 9, 2017 at 2:05 AM, Ashutosh Bapat
> <ashutosh(dot)bapat(at)enterprisedb(dot)com> wrote:
>> On Sat, Oct 7, 2017 at 1:04 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>>> Committed. I hope that makes things less red rather than more,
>>> because I'm going to be AFK for a few hours anyway.
>>
>> Here's the last patch, dealing with the dummy relations, rebased. With
>> this fix every join order of a partitioned join can be considered
>> partitioned. (This wasn't the case earlier when dummy relation was
>> involved.). So, we can allocate the child-join RelOptInfo array in
>> build_joinrel_partition_info(), instead of waiting for an appropriate
>> pair to arrive in try_partition_wise_join().
>
> Wouldn't a far more general approach be to allow a partition-wise join
> between a partitioned table and an unpartitioned table, considering
> the result as partitioned? That seems like it would very often yield
> much better query plans than what we have right now, and also make the
> need for this particular thing go away.
>

You are suggesting that a dummy partitioned table be treated as an
unpartitioned table and that the optimization suggested above be applied.
A join between a partitioned and an unpartitioned table is partitioned by
the keys of only the partitioned table. An unpartitioned table doesn't
have any keys, so this is fine. But a dummy partitioned table does have
keys. Recording them as keys of the join relation helps when it joins
to other relations. Furthermore, a join between a partitioned and an
unpartitioned table doesn't require any equi-join condition on the
partition keys of the partitioned table, but a join between partitioned
tables is considered to be partitioned by keys on both sides only when
there is an equi-join. So, when implementing a partitioned join
between a partitioned and an unpartitioned table, we will have to make
a special case to record partition keys when the unpartitioned side is
actually a dummy partitioned table. That might be awkward.
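
The key-recording rules stated above can be written down as a rough
standalone sketch. It only encodes the rules as described in this
message; RelKeys, append_keys() and join_partition_keys() are
hypothetical names, and none of this is planner code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_KEYS 8

/* Hypothetical summary of a relation's partition keys. */
typedef struct RelKeys
{
    bool        partitioned;
    int         nkeys;
    const char *keys[MAX_KEYS];
} RelKeys;

/* Append all keys of src to dst. */
void
append_keys(RelKeys *dst, const RelKeys *src)
{
    for (int i = 0; i < src->nkeys; i++)
    {
        assert(dst->nkeys < MAX_KEYS);
        dst->keys[dst->nkeys++] = src->keys[i];
    }
}

/* Decide which partition keys the join relation records. */
RelKeys
join_partition_keys(const RelKeys *outer, const RelKeys *inner,
                    bool equi_join_on_keys)
{
    RelKeys result = { false, 0, { NULL } };

    if (outer->partitioned && inner->partitioned)
    {
        /* Both sides partitioned: an equi-join on the keys is required. */
        if (!equi_join_on_keys)
            return result;
        result.partitioned = true;
        append_keys(&result, outer);
        append_keys(&result, inner);
    }
    else if (outer->partitioned || inner->partitioned)
    {
        /* One side unpartitioned: keys of the partitioned side only. */
        result.partitioned = true;
        append_keys(&result, outer->partitioned ? outer : inner);
    }
    return result;
}
```

If a dummy partitioned table were passed in with partitioned set to
false, its keys would never reach the result, which is the special case
that would need extra handling.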

Because we don't have dummy child relations in all cases, we already
have some awkwardness, like allocating the part_rels array only when we
encounter a join order which has all the children. This patch removes
that.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
Cc:	Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Partition-wise join for join between (declaratively) partitioned tables
Date:	2017-10-12 17:19:35
Message-ID:	CA+TgmoaKHn+X4ui5Q9g+vyz0JR3tX-uJVA6hqPv2KAa84hroZw@mail.gmail.com
Lists:	pgsql-hackers

On Wed, Oct 11, 2017 at 10:43 PM, Ashutosh Bapat
<ashutosh(dot)bapat(at)enterprisedb(dot)com> wrote:
> You are suggesting that a dummy partitioned table be treated as an
> un-partitioned table and apply above suggested optimization. A join
> between a partitioned and unpartitioned table is partitioned by the
> keys of only partitioned table. An unpartitioned table doesn't have
> any keys, so this is fine. But a dummy partitioned table does have
> keys. Recording them as keys of the join relation helps when it joins
> to other relations. Furthermore a join between partitioned and
> unpartitioned table doesn't require any equi-join condition on
> partition keys of partitioned table but a join between partitioned
> tables is considered to be partitioned by keys on both sides only when
> there is an equi-join. So, when implementing a partitioned join
> between a partitioned and an unpartitioned table, we will have to make
> a special case to record partition keys when the unpartitioned side is
> actually a dummy partitioned table. That might be awkward.

It seems to me that what we really need here is to move all of this
stuff into a separate struct:

    /* used for partitioned relations */
    PartitionScheme part_scheme;    /* Partitioning scheme. */
    int         nparts;             /* number of partitions */
    struct PartitionBoundInfoData *boundinfo;   /* Partition bounds */
    struct RelOptInfo **part_rels;  /* Array of RelOptInfos of partitions,
                                     * stored in the same order of bounds */
    List      **partexprs;          /* Non-nullable partition key expressions. */
    List      **nullable_partexprs; /* Nullable partition key expressions. */

...and then have a RelOptInfo carry a pointer to a list of those
structures. That lets us consider multiple possible partition schemes
for the same relation. For instance, suppose that a user joins four
relations, P1, P2, Q1, and Q2. P1 and P2 are compatibly partitioned.
Q1 and Q2 are compatibly partitioned (but not compatible with P1 and
P2).

Furthermore, let's suppose that the optimal join order begins with a
join between P1 and Q1. When we construct the paths for that joinrel,
we can either join all of P1 to all of Q1 (giving up on partition-wise
join), or we can join each partition of P1 to all of Q1 (producing a
result partitioned compatibly with P1 and allowing for a future
partition-wise join to P2), or we can join each partition of Q1 to all
of P1 (producing a result partitioned compatibly with Q1 and allowing
for a future partition-wise join to Q2). Any of those could win
depending on the details. With the data structure as it is today,
we'd have to choose whether to mark the joinrel as partitioned like P1
or like Q1, but that's not really what we need here.
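
To make the idea concrete, here is a minimal standalone sketch of a
relation carrying several alternative partitioning descriptions. It is
an assumption-laden illustration, not a proposed patch: PartInfoSketch,
RelSketch and both functions are hypothetical, and an integer scheme_id
stands in for a real PartitionScheme.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical: one way a relation can be viewed as partitioned. */
typedef struct PartInfoSketch
{
    int                    scheme_id;   /* stands in for PartitionScheme */
    int                    nparts;      /* number of partitions */
    struct PartInfoSketch *next;        /* next alternative, or NULL */
} PartInfoSketch;

/* Hypothetical stand-in for a RelOptInfo. */
typedef struct RelSketch
{
    PartInfoSketch *part_infos;         /* NULL when not partitioned */
} RelSketch;

/* Record one more way in which rel can be considered partitioned. */
void
rel_add_part_info(RelSketch *rel, PartInfoSketch *info)
{
    assert(rel != NULL && info != NULL);
    info->next = rel->part_infos;
    rel->part_infos = info;
}

/* Find the alternative compatible with scheme_id, or NULL if none. */
PartInfoSketch *
rel_find_part_info(const RelSketch *rel, int scheme_id)
{
    assert(rel != NULL);
    for (PartInfoSketch *info = rel->part_infos; info != NULL;
         info = info->next)
    {
        if (info->scheme_id == scheme_id)
            return info;
    }
    return NULL;
}
```

With this shape, the joinrel for P1 and Q1 could hold one entry
partitioned like P1 and another partitioned like Q1, and a later join
to P2 or Q2 would look up whichever entry matches.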

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Partition-wise join for join between (declaratively) partitioned tables
Date:	2017-10-13 02:22:00
Message-ID:	CAFjFpRfz=24xW_p2S3GyECzf0zOLWh5skTAZQN215A8qHUfheQ@mail.gmail.com
Lists:	pgsql-hackers

On Thu, Oct 12, 2017 at 10:49 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Wed, Oct 11, 2017 at 10:43 PM, Ashutosh Bapat
> <ashutosh(dot)bapat(at)enterprisedb(dot)com> wrote:
>> You are suggesting that a dummy partitioned table be treated as an
>> un-partitioned table and apply above suggested optimization. A join
>> between a partitioned and unpartitioned table is partitioned by the
>> keys of only partitioned table. An unpartitioned table doesn't have
>> any keys, so this is fine. But a dummy partitioned table does have
>> keys. Recording them as keys of the join relation helps when it joins
>> to other relations. Furthermore a join between partitioned and
>> unpartitioned table doesn't require any equi-join condition on
>> partition keys of partitioned table but a join between partitioned
>> tables is considered to be partitioned by keys on both sides only when
>> there is an equi-join. So, when implementing a partitioned join
>> between a partitioned and an unpartitioned table, we will have to make
>> a special case to record partition keys when the unpartitioned side is
>> actually a dummy partitioned table. That might be awkward.
>
> It seems to me that what we really need here is to move all of this
> stuff into a separate struct:
>
>     /* used for partitioned relations */
>     PartitionScheme part_scheme;    /* Partitioning scheme. */
>     int         nparts;             /* number of partitions */
>     struct PartitionBoundInfoData *boundinfo;   /* Partition bounds */
>     struct RelOptInfo **part_rels;  /* Array of RelOptInfos of partitions,
>                                      * stored in the same order of bounds */
>     List      **partexprs;          /* Non-nullable partition key expressions. */
>     List      **nullable_partexprs; /* Nullable partition key expressions. */
>

In a very early patch I had PartitionOptInfo to hold all of this.
RelOptInfo then had a pointer to a PartitionOptInfo, if it was
partitioned. When a relation can be partitioned in multiple ways, like
what you describe or because a join by re-partitioning is efficient,
RelOptInfo would have a list of those. But the representation needs to
be thought through. I am wondering whether this should be modelled
like IndexOptInfo. I am not sure. This is a topic for a much larger
discussion.

I think we are digressing. We were discussing my patch to handle a dummy
partitioned relation whose children are not marked dummy and do not
have pathlists set. Do you still think that we should leave that
aside?

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

From:	Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Partition-wise join for join between (declaratively) partitioned tables
Date:	2017-10-16 11:33:07
Message-ID:	CAFjFpReL7+1ien=-21rhjpO3bV7aAm1rQ8XgLVk2csFagSzpZQ@mail.gmail.com
Lists:	pgsql-hackers

On Sat, Oct 7, 2017 at 1:04 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

>>
>> The fix is to copy the relevant partitioning information from relcache
>> into PartitionSchemeData and RelOptInfo. Here's a quick patch with
>> that fix.
>
> Committed. I hope that makes things less red rather than more,
> because I'm going to be AFK for a few hours anyway.
>

set_append_rel_size() crashes when it encounters a partitioned table
with a dropped column. Dropped columns do not have any translations
saved in AppendRelInfo::translated_vars; the corresponding entry is NULL
per make_inh_translation_list().

    att = TupleDescAttr(old_tupdesc, old_attno);
    if (att->attisdropped)
    {
        /* Just put NULL into this list entry */
        vars = lappend(vars, NULL);
        continue;
    }

In set_append_rel_size() we try to compute attr_needed for child tables.
While doing so we try to translate a user attribute number of the parent to
that of a child and crash since the translated Var is NULL. Here's a patch
to fix the crash. The patch also contains a testcase to test dropped
columns in a partitioned table.
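
As a rough standalone illustration of the guard that is needed (this is
a sketch of the idea, not the attached patch): VarSketch and
translate_attr_needed() are hypothetical, with a plain array standing in
for the translated_vars list and integers standing in for the
attr_needed sets.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a translated Var: only the child attno. */
typedef struct VarSketch
{
    int child_attno;    /* 1-based attribute number in the child */
} VarSketch;

/*
 * Copy per-attribute "needed" values from parent to child.
 * translated[i] describes parent attribute i + 1; a NULL entry means the
 * parent column was dropped and has no counterpart in the child.
 * Returns the number of attributes actually translated.
 */
int
translate_attr_needed(VarSketch **translated, int nattrs,
                      const int *parent_needed, int *child_needed)
{
    int ntranslated = 0;

    for (int i = 0; i < nattrs; i++)
    {
        const VarSketch *var = translated[i];

        /* Skip dropped columns instead of dereferencing NULL. */
        if (var == NULL)
            continue;

        assert(var->child_attno >= 1);
        child_needed[var->child_attno - 1] = parent_needed[i];
        ntranslated++;
    }
    return ntranslated;
}
```

Without the NULL check, the dereference of var for a dropped column is
the kind of access that crashes.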

Sorry for not noticing this problem earlier.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Attachment	Content-Type	Size
0001-Ignore-dropped-columns-in-set_append_rel_size.patch	text/x-patch	2.8 KB

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
Cc:	Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Partition-wise join for join between (declaratively) partitioned tables
Date:	2017-10-31 09:15:59
Message-ID:	CA+TgmoZkVte8duqO5QRxp07vbHByUiVmf=MAM8p5BUwG9k-OBA@mail.gmail.com
Lists:	pgsql-hackers

On Mon, Oct 16, 2017 at 5:03 PM, Ashutosh Bapat
<ashutosh(dot)bapat(at)enterprisedb(dot)com> wrote:
> set_append_rel_size() crashes when it encounters a partitioned table
> with a dropped column. Dropped columns do not have any translations
> saved in AppendInfo::translated_vars; the corresponding entry is NULL
> per make_inh_translation_list().
>     att = TupleDescAttr(old_tupdesc, old_attno);
>     if (att->attisdropped)
>     {
>         /* Just put NULL into this list entry */
>         vars = lappend(vars, NULL);
>         continue;
>     }
>
> In set_append_rel_size() we try to attr_needed for child tables. While
> doing so we try to translate a user attribute number of parent to that
> of a child and crash since the translated Var is NULL. Here's patch to
> fix the crash. The patch also contains a testcase to test dropped
> columns in partitioned table.
>
> Sorry for not noticing this problem earlier.

OK, committed. This is a good example of how having good code
coverage doesn't necessarily mean you've found all the bugs. :-)

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>, Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables
Date:	2017-12-01 06:36:00
Message-ID:	CAKcux6=LO-XK9G0yLe634+0SP2UOn5ksVnmF-OntTBOEEaUGTg@mail.gmail.com
Lists:	pgsql-hackers

On Tue, Oct 31, 2017 at 2:45 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> OK, committed. This is a good example of how having good code
> coverage doesn't necessarily mean you've found all the bugs. :-)
>
As of now, partition_join.sql has no test cases covering partitioned
tables that have a default partition. I am attaching a small test
case patch to cover those.

Thanks & Regards,
Rajkumar Raghuwanshi
QMG, EnterpriseDB Corporation

Attachment	Content-Type	Size
partition_wise_join_with_default_partitions.patch	text/x-patch	16.7 KB

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>
Cc:	Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>, Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables
Date:	2017-12-01 17:13:01
Message-ID:	CA+TgmoaT8M4v-68MM0UzWZQL9sBghPsbrccwxNoStjz4L4XRjQ@mail.gmail.com
Lists:	pgsql-hackers

On Fri, Dec 1, 2017 at 1:36 AM, Rajkumar Raghuwanshi
<rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com> wrote:
> On Tue, Oct 31, 2017 at 2:45 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>> OK, committed. This is a good example of how having good code
>> coverage doesn't necessarily mean you've found all the bugs. :-)
>>
> As of now partition_join.sql is not having test cases covering cases
> where partition table have default partition, attaching a small test
> case patch to cover those.

That's not that small, and to me it looks like overkill.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables
Date:	2017-12-04 02:04:43
Message-ID:	CAFjFpRdqVPk+oQ5w8NQuTtDED3nUeX7WEZCEJ_baFA3bh_QQNQ@mail.gmail.com
Lists:	pgsql-hackers

On Sat, Dec 2, 2017 at 2:13 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Fri, Dec 1, 2017 at 1:36 AM, Rajkumar Raghuwanshi
> <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com> wrote:
>> On Tue, Oct 31, 2017 at 2:45 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>>> OK, committed. This is a good example of how having good code
>>> coverage doesn't necessarily mean you've found all the bugs. :-)
>>>
>> As of now partition_join.sql is not having test cases covering cases
>> where partition table have default partition, attaching a small test
>> case patch to cover those.
>
> That's not that small, and to me it looks like overkill.
>

I agree, the patch looks longer than expected. I think it's important
to have some testcases to test partition-wise join with default
partitions. I think we need at least one test for range default
partitions, one test for list partitioning, one for multi-level
partitioning and one negative testcase with the default partition missing
from one side of the join.

Maybe we could reduce the number of SQL commands and queries in the
patch by adding a default partition to every table that participates in
partition-wise join (leaving the tables participating in negative tests
aside). But that's going to increase the size of EXPLAIN outputs and
query results. The negative test may simply drop the default partition
from one of the tables.

For every table being tested, the patch adds two ALTER TABLE commands,
one for detaching an existing partition and another to attach the same as
the default partition. An alternative is to just add a new default
partition without detaching an existing partition. But then the
default partition needs to be populated with some data, which requires at
least one INSERT statement. That doesn't reduce the size of the patch, but
increases the output of the query and the EXPLAIN plan.

Maybe in the case of the multi-level partitioning test, we don't need to
add DEFAULT in every partitioned relation; adding it to one of them would
be enough. Maybe add it to the parent, but that too can be avoided. That
would reduce the size of the patch a bit.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

From:	Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>
To:	Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
Cc:	Robert Haas <robertmhaas(at)gmail(dot)com>, Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables
Date:	2017-12-05 05:34:51
Message-ID:	CAKcux6kOQ85Xtzxu3tM1mR7Vk=7Z2e4rG7dL1iMZqPgLMpxQYg@mail.gmail.com
Lists:	pgsql-hackers

On Mon, Dec 4, 2017 at 7:34 AM, Ashutosh Bapat
<ashutosh(dot)bapat(at)enterprisedb(dot)com> wrote:
> I agree, the patch looks longer than expected. I think, it's important
> to have some testcases to test partition-wise join with default
> partitions. I think we need at least one test for range default
> partitions, one test for list partitioning, one for multi-level
> partitioning and one negative testcase with default partition missing
> from one side of join.
>
> May be we could reduce the number of SQL commands and queries in the
> patch by adding default partition to every table that participates in
> partition-wise join (leave the tables participating in negative tests
> aside.). But that's going to increase the size of EXPLAIN outputs and
> query results. The negative test may simply drop the default partition
> from one of the tables.
>
> For every table being tested, the patch adds two ALTER TABLE commands,
> one for detaching an existing partition and then attach the same as
> default partition. Alternative to that is just add a new default
> partition without detaching and existing partition. But then the
> default partition needs to populated with some data, which requires 1
> INSERT statement at least. That doesn't reduce the size of patch, but
> increases the output of query and EXPLAIN plan.
>
> May be in case of multi-level partitioning test, we don't need to add
> DEFAULT in every partitioned relation; adding to one of them would be
> enough. May be add it to the parent, but that too can be avoided. That
> would reduce the size of patch a bit.

Thanks Ashutosh for suggestions.

I have reduced test cases as suggested. Attaching updated patch.

Thanks & Regards,
Rajkumar Raghuwanshi
QMG, EnterpriseDB Corporation

Attachment	Content-Type	Size
partition_wise_join_with_default_partitions_v1.patch	text/x-patch	13.2 KB

From:	Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>
To:	Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
Cc:	Robert Haas <robertmhaas(at)gmail(dot)com>, Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables
Date:	2017-12-05 07:54:31
Message-ID:	CAKcux6=h6PJMgeNwPScJ5Zz4P4AZWf+nxgn46eTeOb-prQP14w@mail.gmail.com
Lists:	pgsql-hackers

On Tue, Dec 5, 2017 at 11:04 AM, Rajkumar Raghuwanshi
<rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com> wrote:
> On Mon, Dec 4, 2017 at 7:34 AM, Ashutosh Bapat
> <ashutosh(dot)bapat(at)enterprisedb(dot)com> wrote:
>> I agree, the patch looks longer than expected. I think, it's important
>> to have some testcases to test partition-wise join with default
>> partitions. I think we need at least one test for range default
>> partitions, one test for list partitioning, one for multi-level
>> partitioning and one negative testcase with default partition missing
>> from one side of join.
>>
>> May be we could reduce the number of SQL commands and queries in the
>> patch by adding default partition to every table that participates in
>> partition-wise join (leave the tables participating in negative tests
>> aside.). But that's going to increase the size of EXPLAIN outputs and
>> query results. The negative test may simply drop the default partition
>> from one of the tables.
>>
>> For every table being tested, the patch adds two ALTER TABLE commands,
>> one for detaching an existing partition and then attach the same as
>> default partition. Alternative to that is just add a new default
>> partition without detaching and existing partition. But then the
>> default partition needs to populated with some data, which requires 1
>> INSERT statement at least. That doesn't reduce the size of patch, but
>> increases the output of query and EXPLAIN plan.
>>
>> May be in case of multi-level partitioning test, we don't need to add
>> DEFAULT in every partitioned relation; adding to one of them would be
>> enough. May be add it to the parent, but that too can be avoided. That
>> would reduce the size of patch a bit.
>
> Thanks Ashutosh for suggestions.
>
> I have reduced test cases as suggested. Attaching updated patch.
>
Sorry, I attached the wrong patch.

Attaching the correct patch now.

Attachment	Content-Type	Size
partition_wise_join_with_default_partitions_v2.patch	text/x-patch	11.0 KB

From:	Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
To:	Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>
Cc:	Robert Haas <robertmhaas(at)gmail(dot)com>, Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables
Date:	2017-12-20 11:51:47
Message-ID:	CAFjFpRfXqCkxSibu7trrybR9UfQH2hx5BLh_GhtWQPenwwHg9g@mail.gmail.com
Lists:	pgsql-hackers

On Tue, Dec 5, 2017 at 1:24 PM, Rajkumar Raghuwanshi
<rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com> wrote:
> On Tue, Dec 5, 2017 at 11:04 AM, Rajkumar Raghuwanshi
> <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com> wrote:
>> On Mon, Dec 4, 2017 at 7:34 AM, Ashutosh Bapat
>> <ashutosh(dot)bapat(at)enterprisedb(dot)com> wrote:
>>> I agree, the patch looks longer than expected. I think, it's important
>>> to have some testcases to test partition-wise join with default
>>> partitions. I think we need at least one test for range default
>>> partitions, one test for list partitioning, one for multi-level
>>> partitioning and one negative testcase with default partition missing
>>> from one side of join.
>>>
>>> May be we could reduce the number of SQL commands and queries in the
>>> patch by adding default partition to every table that participates in
>>> partition-wise join (leave the tables participating in negative tests
>>> aside.). But that's going to increase the size of EXPLAIN outputs and
>>> query results. The negative test may simply drop the default partition
>>> from one of the tables.
>>>
>>> For every table being tested, the patch adds two ALTER TABLE commands,
>>> one for detaching an existing partition and then attach the same as
>>> default partition. Alternative to that is just add a new default
>>> partition without detaching and existing partition. But then the
>>> default partition needs to populated with some data, which requires 1
>>> INSERT statement at least. That doesn't reduce the size of patch, but
>>> increases the output of query and EXPLAIN plan.
>>>
>>> May be in case of multi-level partitioning test, we don't need to add
>>> DEFAULT in every partitioned relation; adding to one of them would be
>>> enough. May be add it to the parent, but that too can be avoided. That
>>> would reduce the size of patch a bit.
>>
>> Thanks Ashutosh for suggestions.
>>
>> I have reduced test cases as suggested. Attaching updated patch.
>>
> Sorry Attached wrong patch.
>
> attaching correct patch now.

Thanks. Here are some comments:

+-- test default partition behavior for range
+ALTER TABLE prt1 DETACH PARTITION prt1_p3;
+ALTER TABLE prt1 ATTACH PARTITION prt1_p3 DEFAULT;
+ALTER TABLE prt2 DETACH PARTITION prt2_p3;
+ALTER TABLE prt2 ATTACH PARTITION prt2_p3 DEFAULT;

I think we need an ANALYZE here in case the statistics get updated while
DETACH and ATTACH are going on. Other testcases also need to be updated with
ANALYZE, including the negative one.

+-- partition-wise join can not be applied if the only one of joining table have

Correction: ... if only one of the joining tables has ...

Please add the patch to the next commitfest so that it's not
forgotten. I think we can get rid of the multi-level partition-wise
testcase as well. Also, since we are re-attaching existing partition
tables as default partitions, we don't need to check the output
either; just the plan should be enough.
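
A plan-only check could look roughly like the sketch below. The query itself
and the column names (a, b, c) are assumptions for illustration, not taken
from the patch, and it assumes partition-wise join has already been enabled
earlier in the test file.

```sql
-- show only the plan shape; no result rows are compared
EXPLAIN (COSTS OFF)
SELECT t1.a, t1.c, t2.b, t2.c
  FROM prt1 t1 JOIN prt2 t2 ON t1.a = t2.b
 WHERE t1.b = 0
 ORDER BY t1.a, t2.b;
```

COSTS OFF keeps cost estimates out of the expected output, so the expected
file depends only on the plan shape.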

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

From:	Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>
To:	Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
Cc:	Robert Haas <robertmhaas(at)gmail(dot)com>, Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables
Date:	2017-12-22 09:30:28
Message-ID:	CAKcux6nF=wZzztNY=hOcsF5XPvNCB5ryb9A9YvVfycQj47Tk4Q@mail.gmail.com
Lists:	pgsql-hackers

On Wed, Dec 20, 2017 at 5:21 PM, Ashutosh Bapat <
ashutosh(dot)bapat(at)enterprisedb(dot)com> wrote:

> Thanks. Here are some comments
>
Thanks, Ashutosh, for the review and suggestions.

> +-- test default partition behavior for range
> +ALTER TABLE prt1 DETACH PARTITION prt1_p3;
> +ALTER TABLE prt1 ATTACH PARTITION prt1_p3 DEFAULT;
> +ALTER TABLE prt2 DETACH PARTITION prt2_p3;
> +ALTER TABLE prt2 ATTACH PARTITION prt2_p3 DEFAULT;
>
> I think we need an ANALYZE here in case the statistics get updated while
> DETACH and ATTACH are going on. Other testcases also need to be updated with
> ANALYZE, including the negative one.
>
Done.

>
> +-- partition-wise join can not be applied if the only one of joining table have
>
> Correction: ... if only one of the joining tables has ...
>
Done.

> Please add the patch to the next commitfest so that it's not
> forgotten.

Done.
Added to CF: https://commitfest.postgresql.org/16/1426/

> I think we can get rid of the multi-level partition-wise
> testcase as well. Also, since we are re-attaching existing partition
> tables as default partitions, we don't need to check the output
> either; just the plan should be enough.
>
Ok. Done.

updated test patch attached.

Attachment	Content-Type	Size
partition_wise_join_with_default_partitions_v3.patch	text/x-patch	6.3 KB

From:	Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
To:	Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>
Cc:	Robert Haas <robertmhaas(at)gmail(dot)com>, Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables
Date:	2018-02-07 08:30:32
Message-ID:	CAFjFpRf58mV=QUCJnBKmr-Lq7j6qQjascL=AnY2Qz--snDYmLQ@mail.gmail.com
Lists:	pgsql-hackers

On Fri, Dec 22, 2017 at 3:00 PM, Rajkumar Raghuwanshi
<rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com> wrote:
> On Wed, Dec 20, 2017 at 5:21 PM, Ashutosh Bapat
> <ashutosh(dot)bapat(at)enterprisedb(dot)com> wrote:
>>
>> Thanks. Here are some comments
>>
> Thanks Ashutosh for review and suggestions.
>
>>
>> +-- test default partition behavior for range
>> +ALTER TABLE prt1 DETACH PARTITION prt1_p3;
>> +ALTER TABLE prt1 ATTACH PARTITION prt1_p3 DEFAULT;
>> +ALTER TABLE prt2 DETACH PARTITION prt2_p3;
>> +ALTER TABLE prt2 ATTACH PARTITION prt2_p3 DEFAULT;
>>
>> I think we need an ANALYZE here in case the statistics get updated while
>> DETACH and ATTACH are going on. Other testcases also need to be updated with
>> ANALYZE, including the negative one.
>
> Done.
>
>>
>>
>> +-- partition-wise join can not be applied if the only one of joining table have
>>
>> Correction: ... if only one of the joining tables has ...
>
> Done.
>
>>
>> Please add the patch to the next commitfest so that it's not
>> forgotten.
>
> Done.
> Added to CF: https://commitfest.postgresql.org/16/1426/
>
>>
>> I think we can get rid of the multi-level partition-wise
>> testcase as well. Also, since we are re-attaching existing partition
>> tables as default partitions, we don't need to check the output
>> either; just the plan should be enough.
>
> Ok. Done.
>
> updated test patch attached.
>

The patch looks good to me. I don't think we can reduce it further.
But we need some tests to test PWJ with default partitions. Marking
this as ready for committer.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

From:	Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>
To:	Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
Cc:	Robert Haas <robertmhaas(at)gmail(dot)com>, Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables
Date:	2018-03-05 07:13:33
Message-ID:	CAKcux6=8uvNhzHGV+rBs8PWnJ2zmWiOu-HL-1eWRNwyrVzkNvw@mail.gmail.com
Lists:	pgsql-hackers

On Wed, Feb 7, 2018 at 2:00 PM, Ashutosh Bapat <
ashutosh(dot)bapat(at)enterprisedb(dot)com> wrote:

> On Fri, Dec 22, 2017 at 3:00 PM, Rajkumar Raghuwanshi
> <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com> wrote:
> > updated test patch attached.
>
Changed partition-wise statement to partitionwise.
Attached re-based patch.

> The patch looks good to me. I don't think we can reduce it further.
> But we need some tests to test PWJ with default partitions. Marking
> this as ready for committer.
>
Thanks Ashutosh.

Attachment	Content-Type	Size
partitionwise_join_with_default_partitions_v4.patch	text/x-patch	6.6 KB

From:	Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
To:	Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>
Cc:	Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables
Date:	2018-06-06 02:41:13
Message-ID:	CAEepm=2u6yaPhtc_vrZwif6NwyE++NrtHkmpX2egOLr0y6r_0g@mail.gmail.com
Lists:	pgsql-hackers

On Mon, Mar 5, 2018 at 8:13 PM, Rajkumar Raghuwanshi
<rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com> wrote:
> On Wed, Feb 7, 2018 at 2:00 PM, Ashutosh Bapat
> Changed partition-wise statement to partitionwise.
> Attached re-based patch.
>
>> The patch looks good to me. I don't think we can reduce it further.
>> But we need some tests to test PWJ with default partitions. Marking
>> this as ready for committer.

Hi Rajkumar,

partition_join ... FAILED

The regression test currently fails with your v4 patch because a
redundant Result node has been removed from a query plan. That may be
due to commit 11cf92f6 or nearby commits.

--
Thomas Munro
http://www.enterprisedb.com

From:	Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
To:	Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
Cc:	Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables
Date:	2018-06-06 03:51:54
Message-ID:	CAFjFpRevm9kqKzvUCmF5w4TSSCgXG5uD5i4n9cdTjdSBsAxnTg@mail.gmail.com
Lists:	pgsql-hackers

On Wed, Jun 6, 2018 at 8:11 AM, Thomas Munro
<thomas(dot)munro(at)enterprisedb(dot)com> wrote:
> On Mon, Mar 5, 2018 at 8:13 PM, Rajkumar Raghuwanshi
> <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com> wrote:
>> On Wed, Feb 7, 2018 at 2:00 PM, Ashutosh Bapat
>> Changed partition-wise statement to partitionwise.
>> Attached re-based patch.
>>
>>> The patch looks good to me. I don't think we can reduce it further.
>>> But we need some tests to test PWJ with default partitions. Marking
>>> this as ready for committer.
>
> Hi Rajkumar,
>
> partition_join ... FAILED
>

That made my heart stop for a fraction of a second. I thought something
happened which caused the partition_join test to fail in master. But then I
realised you are talking about Rajkumar's patch and the test in that
patch. I think it's better to start a separate thread discussing his
patch, before I lose my heart ;)

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

From:	Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>
To:	Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
Cc:	Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables
Date:	2018-06-06 06:01:39
Message-ID:	CAKcux6mKmU7EZ0tFvErbv2WrcW3h3Qd9FE0UXxHoQNv0YQBP8Q@mail.gmail.com
Lists:	pgsql-hackers

On Wed, Jun 6, 2018 at 9:21 AM, Ashutosh Bapat <
ashutosh(dot)bapat(at)enterprisedb(dot)com> wrote:

> On Wed, Jun 6, 2018 at 8:11 AM, Thomas Munro
> <thomas(dot)munro(at)enterprisedb(dot)com> wrote:
> > On Mon, Mar 5, 2018 at 8:13 PM, Rajkumar Raghuwanshi
> > <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com> wrote:
> >> On Wed, Feb 7, 2018 at 2:00 PM, Ashutosh Bapat
> >> Changed partition-wise statement to partitionwise.
> >> Attached re-based patch.
> >>
> >>> The patch looks good to me. I don't think we can reduce it further.
> >>> But we need some tests to test PWJ with default partitions. Marking
> >>> this as ready for committer.
> >
> > Hi Rajkumar,
> >
> > partition_join ... FAILED
> >
>
Thanks Thomas for patch review.

> That made my heart stop for a fraction of a second. I thought something
> happened which caused the partition_join test to fail in master. But then I
> realised you are talking about Rajkumar's patch and the test in that
> patch. I think it's better to start a separate thread discussing his
> patch, before I lose my heart ;)

Yeah, that would be better.

Here is the new thread with the updated patch.
https://www.postgresql.org/message-id/CAKcux6ky5YeZAY74qSh-ayPZZEQchz092g71iXXbC0%2BE3xoscA%40mail.gmail.com

Thanks & Regards,
Rajkumar Raghuwanshi
QMG, EnterpriseDB Corporation