Re: [HACKERS] UPDATE of partition key

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Amit Khandekar <amitdkhan(dot)pg(at)gmail(dot)com>
Cc: Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
Subject: Re: [HACKERS] UPDATE of partition key
Date: 2017-12-15 21:39:04
Message-ID: CA+TgmoaQR94wpFk2xENWtepQpPnLN3yVBaTo-GB40cK_NgSGeg@mail.gmail.com

On Fri, Dec 15, 2017 at 7:58 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> Reviewing the preparatory patch:

I started another review pass over the main patch, so here are some
comments about that. This is unfortunately not a complete review,
however.

- map = ptr->partition_tupconv_maps[leaf_part_index];
+ map = ptr->parentchild_tupconv_maps[leaf_part_index];

I don't think there's any reason to rename this. In previous patch
versions, you had multiple arrays of tuple conversion maps in this
structure, but the refactoring eliminated that.

Likewise, I'm not sure I get the point of mt_transition_tupconv_maps
-> mt_childparent_tupconv_maps. That seems like it could similarly be
left alone.

+ /*
+ * If transition tables are the only reason we're here, return. As
+ * mentioned above, we can also be here during update tuple routing in
+ * presence of transition tables, in which case this function is called
+ * separately for oldtup and newtup, so either can be NULL, not both.
+ */
if (trigdesc == NULL ||
(event == TRIGGER_EVENT_DELETE && !trigdesc->trig_delete_after_row) ||
(event == TRIGGER_EVENT_INSERT && !trigdesc->trig_insert_after_row) ||
- (event == TRIGGER_EVENT_UPDATE && !trigdesc->trig_update_after_row))
+ (event == TRIGGER_EVENT_UPDATE && !trigdesc->trig_update_after_row) ||
+ (event == TRIGGER_EVENT_UPDATE && ((oldtup == NULL) ^ (newtup == NULL))))

I guess this is correct, but it seems awfully fragile. Can't we have
some more explicit signaling about whether we're only here for
transition tables, rather than deducing it based on exactly one of
oldtup and newtup being NULL?
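
For example (untested, and the flag name is just a placeholder), the caller
could state the intent directly instead of leaving it to be inferred from
which tuple happens to be NULL:

    /*
     * Hypothetical: transition_capture_only is passed in by the update
     * tuple routing path, rather than deduced from exactly one of
     * oldtup/newtup being NULL.
     */
    if (trigdesc == NULL ||
        (event == TRIGGER_EVENT_DELETE && !trigdesc->trig_delete_after_row) ||
        (event == TRIGGER_EVENT_INSERT && !trigdesc->trig_insert_after_row) ||
        (event == TRIGGER_EVENT_UPDATE && !trigdesc->trig_update_after_row) ||
        (event == TRIGGER_EVENT_UPDATE && transition_capture_only))
        return;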

+ /* Initialization specific to update */
+ if (mtstate && mtstate->operation == CMD_UPDATE)
+ {
+ ModifyTable *node = (ModifyTable *) mtstate->ps.plan;
+
+ is_update = true;
+ update_rri = mtstate->resultRelInfo;
+ num_update_rri = list_length(node->plans);
+ }

I guess I don't see why we need a separate "if" block for this.
Neither is_update nor update_rri nor num_update_rri are used until we
get to the block that begins with "if (is_update)". Why not just
change that block to test if (mtstate && mtstate->operation ==
CMD_UPDATE)" and put the rest of these initializations inside that
block?
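
Something like this, perhaps (untested sketch, keeping the variable names
from the patch):

    if (mtstate && mtstate->operation == CMD_UPDATE)
    {
        ModifyTable *node = (ModifyTable *) mtstate->ps.plan;

        is_update = true;
        update_rri = mtstate->resultRelInfo;
        num_update_rri = list_length(node->plans);

        /* ... whatever currently lives under "if (is_update)" ... */
    }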

+ int num_update_rri = 0,
+ update_rri_index = 0;
...
+ update_rri_index = 0;

It's already 0.

+ leaf_part_rri = &update_rri[update_rri_index];
...
+ leaf_part_rri = leaf_part_arr + i;

These are doing the same kind of thing, but using different styles. I
prefer the former style, so I'd change the second one to
&leaf_part_arr[i]. Alternatively, you could change the first one to
update_rri + update_rri_index. But it's strange to see the same
variable initialized in two different ways just a few lines apart.

+ if (!partrel)
+ {
+ /*
+ * We locked all the partitions above including the leaf
+ * partitions. Note that each of the newly opened relations in
+ * *partitions are eventually closed by the caller.
+ */
+ partrel = heap_open(leaf_oid, NoLock);
+ InitResultRelInfo(leaf_part_rri,
+ partrel,
+ resultRTindex,
+ rel,
+ estate->es_instrument);
+ }

Hmm, isn't there a problem here? Before, we opened all the relations
here and the caller closed them all. But now, we're only opening some
of them. If the caller closes them all, then they will be closing
some that we opened and some that we didn't. That seems quite bad,
because the reference counts that are incremented and decremented by
opening and closing should all end up at 0. Maybe I'm confused
because it seems like this would break in any scenario where even 1
relation was already opened and surely you must have tested that
case... but if there's some reason this works, I don't know what it
is, and the comment doesn't tell me.
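
If mixing relations opened here with relations opened elsewhere really is
intended, then some bookkeeping is needed so that every heap_open() gets
paired with exactly one heap_close() and the reference counts come back to
zero. Purely as a hypothetical sketch (the flag array is invented):

    if (!partrel)
    {
        /* We opened it here, so we are responsible for closing it. */
        partrel = heap_open(leaf_oid, NoLock);
        InitResultRelInfo(leaf_part_rri, partrel, resultRTindex, rel,
                          estate->es_instrument);
        opened_by_tuple_routing[i] = true;  /* hypothetical flag array */
    }

The caller's cleanup loop would then skip the entries where that flag isn't
set. Maybe there's a simpler answer, but either way the comment ought to
explain who closes what.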

+static HeapTuple
+ConvertPartitionTupleSlot(ModifyTableState *mtstate,
+ TupleConversionMap *map,
+ HeapTuple tuple,
+ TupleTableSlot *new_slot,
+ TupleTableSlot **p_my_slot)

This function doesn't use the mtstate argument at all.

+ * (Similarly we need to add the deleted row in OLD TABLE). We need to do

The period should be before, not after, the closing parenthesis.

+ * Now that we have already captured NEW TABLE row, any AR INSERT
+ * trigger should not again capture it below. Arrange for the same.

A more American style would be something like "We've already captured
the NEW TABLE row, so make sure any AR INSERT trigger fired below
doesn't capture it again." (Similarly for the other case.)

+ /* The delete has actually happened, so inform that to the caller */
+ if (tuple_deleted)
+ *tuple_deleted = true;

In the US, we inform the caller, not inform that to the caller. In
other words, here the direct object of "inform" is the person or thing
getting the information (in this case, "the caller"), not the
information being conveyed (in this case, "that"). I realize your
usage is probably typical for your country...

+ Assert(mtstate->mt_is_tupconv_perpart == true);

We usually just Assert(thing_that_should_be_true), not
Assert(thing_that_should_be_true == true).

+ * In case this is part of update tuple routing, put this row into the
+ * transition OLD TABLE if we are capturing transition tables. We need to
+ * do this separately for DELETE and INSERT because they happen on
+ * different tables.

Maybe "...OLD table, but only if we are..."

Should it be capturing transition tables or capturing transition
tuples? I'm not sure.

+ * partition, in which case, we should check the RLS CHECK policy just

In the US, the second comma in this sentence is incorrect and should be removed.

+ * When an UPDATE is run with a leaf partition, we would not have
+ * partition tuple routing setup. In that case, fail with

run with -> run on
would not -> will not
setup -> set up

+ * deleted by another transaction), then we should skip INSERT as
+ * well, otherwise, there will be effectively one new row inserted.

skip INSERT -> skip the insert
well, otherwise -> well; otherwise

I would also change "there will be effectively one new row inserted"
to "an UPDATE could cause an increase in the total number of rows
across all partitions, which is clearly wrong".

+ /*
+ * UPDATEs set the transition capture map only when a new subplan
+ * is chosen. But for INSERTs, it is set for each row. So after
+ * INSERT, we need to revert back to the map created for UPDATE;
+ * otherwise the next UPDATE will incorrectly use the one created
+ * for INESRT. So first save the one created for UPDATE.
+ */
+ if (mtstate->mt_transition_capture)
+ saved_tcs_map = mtstate->mt_transition_capture->tcs_map;

UPDATEs -> Updates
INESRT -> INSERT

I wonder if there is some more elegant way to handle this problem.
Basically, the issue is that ExecInsert() is stomping on
mtstate->mt_transition_capture, and your solution is to save and
restore the value you want to have there. But maybe we could instead
find a way to get ExecInsert() not to stomp on that state in the first
place. It seems like the ON CONFLICT stuff handled that by adding a
second TransitionCaptureState pointer to ModifyTable, thus
mt_transition_capture and mt_oc_transition_capture. By that
precedent, we could add mt_utr_transition_capture or similar, and
maybe that's the way to go. It seems a bit unsatisfying, but so does
what you have now.
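
Roughly like this (sketch only; the third field is the one I'm floating
here, not something that exists today):

    /* alongside the existing pair of capture states: */
    TransitionCaptureState *mt_transition_capture;
    TransitionCaptureState *mt_oc_transition_capture;  /* ON CONFLICT DO UPDATE */
    TransitionCaptureState *mt_utr_transition_capture; /* update tuple routing */

ExecInsert() would use mt_utr_transition_capture when it's invoked as part of
update tuple routing, so the tcs_map hanging off mt_transition_capture would
never get clobbered and there'd be nothing to save and restore.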

+ * 2. For capturing transition tables that are partitions. For UPDATEs, we need

This isn't worded well. A transition table is never a partition;
transition tables and partitions are two different kinds of things.

+ * If per-leaf map is required and the map is already created, that map
+ * has to be per-leaf. If that map is per-subplan, we won't be able to
+ * access the maps leaf-partition-wise. But if the map is per-leaf, we
+ * will be able to access the maps subplan-wise using the
+ * subplan_partition_offsets map using function
+ * tupconv_map_for_subplan(). So if the callers might need to access
+ * the map both leaf-partition-wise and subplan-wise, they should make
+ * sure that the first time this function is called, it should be
+ * called with perleaf=true so that the map created is per-leaf, not
+ * per-subplan.

This sounds complicated and fragile. It ends up meaning that
mt_childparent_tupconv_maps is sometimes indexed by subplan number and
sometimes by partition leaf index, which is extremely confusing and
likely to lead to coding errors, either in this patch or in future
ones. Would it be reasonable to just always do this by partition leaf
index, even if we don't strictly need that set of mappings?
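
Concretely, the array would always be indexed by leaf-partition index, and a
caller holding only a subplan index would go through subplan_partition_offsets,
e.g. via the tupconv_map_for_subplan() helper the comment mentions. A rough
sketch (I'm guessing at exactly where the offsets array hangs, so treat the
field references as placeholders):

    static TupleConversionMap *
    tupconv_map_for_subplan(ModifyTableState *mtstate, int whichplan)
    {
        /* translate subplan index -> leaf partition index */
        int     leaf_index = mtstate->mt_subplan_partition_offsets[whichplan];

        return mtstate->mt_childparent_tupconv_maps[leaf_index];
    }

That means building a few conversion maps we might not strictly need, but the
indexing scheme stays uniform, which seems worth it.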

That's all I've got for now.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
