WIP patch for multiple column assignment in UPDATE

Lists: pgsql-hackers
From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql-hackers(at)postgreSQL(dot)org
Subject: WIP patch for multiple column assignment in UPDATE
Date: 2014-06-16 15:17:21
Message-ID: 20930.1402931841@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Attached is a very-much-WIP patch for supporting
UPDATE foo SET ..., (a,b,c) = (select x,y,z from ...), ...

It lacks documentation, ruleutils.c support, or any solution for the
rule NEW.* expansion problem I mentioned Saturday. The reason I'm
posting it now is to get feedback about an implementation choice that
feels a bit klugy to me; but I don't see any clearly better idea.

The post-parse-analysis representation I've chosen here is that the
output columns of the sub-select are represented by PARAM_MULTIEXEC
Params, and the sub-select itself appears in a resjunk entry at the end
of the targetlist; that is, the UPDATE tlist is more or less like

$1, -- to be assigned to a
$2, -- to be assigned to b
$3, -- to be assigned to c
(select x,y,z from ...), -- resjunk entry, value will be discarded

If the sub-select is uncorrelated with the outer query, the planner
turns it into an InitPlan, replacing the resjunk tlist item with a
NULL constant, and then everything happens normally at execution.

But more usually, the sub-select will be correlated with the outer
query. In this case, the subquery turns into a SubPlan tree that
stays in the resjunk item. At the start of execution, the ParamExecData
structs for each of its output Params are marked with execPlan pointers
to the subplan, just as would happen for an InitPlan. This causes the
subplan to get executed when the first of the output Params is evaluated;
it loads the ParamExecData structs for all its output Params, and then
the later Params just take data from the structs. When execution reaches
the MULTIEXEC SubPlan in the resjunk tlist item, no evaluation of the
subplan is needed; but instead we re-mark all the output ParamExecData
structs with non-null execPlan pointers, so that a fresh execution of
the subplan will happen in the next evaluation of the targetlist.

The klugy aspect of this is that it assumes that the SubPlan item will
appear later in the tlist than all the Params referencing it. This is
true at the moment because resjunk tlist items always appear after
non-resjunk ones. There are a few other places that already depend on
this ordering, but we've mostly tried to avoid introducing new
dependencies on it.

The alternative that I'd originally had in mind, before put-it-in-a-
resjunk-item occurred to me, was to invent a new "secondary tlist"
field of Query and of ModifyTable plan nodes, as I sketched back in
http://www.postgresql.org/message-id/1783.1399054541@sss.pgh.pa.us
We'd put the MULTIEXEC SubPlans in this secondary tlist and expect
the executor to evaluate it just before evaluating the main tlist.
However, that way isn't terribly pretty either, because it extends
knowledge of this feature to a *whole lot* of places that don't have
to know about it in the attached approach; in particular, just about
every place that manipulates targetlist contents would have to also
manipulate the secondary tlist.

Another approach is to explicitly identify which of the Params will
be evaluated first and replace it with a node tree that evaluates
the subplan (and sets output Params for the remaining columns).
This is a bit messy because the first-to-be-evaluated is dependent
on the targetlist reordering that the planner does; so we don't want
parse analysis to try to do it. (If we allowed parse analysis to
know that the planner will sort the tlist items into physical column
order, we could do it like that; but then it would break if we ever
get around to allowing ALTER TABLE to change the physical order.)
We could safely have setrefs.c determine the first-to-be-evaluated
Param, though, since it will traverse the tlist in final order.
So if we went with this approach I'd have setrefs.c replace the first
Param with a SubPlan node. That seems a bit of a kluge too though.

Preferences, comments, better ideas?

regards, tom lane

Attachment Content-Type Size
multi-assignment-1.patch text/x-diff 59.6 KB

From: Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP patch for multiple column assignment in UPDATE
Date: 2014-06-19 07:02:34
Message-ID: CAFj8pRBJko7YNNLNzk2MCQ50XJJzNf+1ODcgno4FuRRts9KOyg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hello

I did some tests and It looks so it allows only some form of nested loop.

postgres=# explain (analyze, timing off, buffers) update a1 set b = (select
b from a2 where a1.a = a2.a);
QUERY
PLAN
---------------------------------------------------------------------------------------------------------------
Update on a1 (cost=0.00..8456925.00 rows=1000000 width=10) (actual rows=0
loops=1)
Buffers: shared hit=9017134 read=14376 dirtied=58170 written=1014
-> Seq Scan on a1 (cost=0.00..8456925.00 rows=1000000 width=10)
(actual rows=1000000 loops=1)
Buffers: shared hit=4005465 read=4424 written=971
SubPlan 1
-> Index Scan using a2_a_idx on a2 (cost=0.42..8.44 rows=1
width=4) (actual rows=1 loops=1000000)
Index Cond: (a1.a = a)
Buffers: shared hit=4005464
Planning time: 0.212 ms
Execution time: 30114.101 ms
(10 rows)

do you plan some sophisticated mechanism - like MERGE or some similar?

Regards

Pavel

2014-06-16 17:17 GMT+02:00 Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>:

> Attached is a very-much-WIP patch for supporting
> UPDATE foo SET ..., (a,b,c) = (select x,y,z from ...), ...
>
> It lacks documentation, ruleutils.c support, or any solution for the
> rule NEW.* expansion problem I mentioned Saturday. The reason I'm
> posting it now is to get feedback about an implementation choice that
> feels a bit klugy to me; but I don't see any clearly better idea.
>
> The post-parse-analysis representation I've chosen here is that the
> output columns of the sub-select are represented by PARAM_MULTIEXEC
> Params, and the sub-select itself appears in a resjunk entry at the end
> of the targetlist; that is, the UPDATE tlist is more or less like
>
> $1, -- to be assigned to a
> $2, -- to be assigned to b
> $3, -- to be assigned to c
> (select x,y,z from ...), -- resjunk entry, value will be discarded
>
> If the sub-select is uncorrelated with the outer query, the planner
> turns it into an InitPlan, replacing the resjunk tlist item with a
> NULL constant, and then everything happens normally at execution.
>
> But more usually, the sub-select will be correlated with the outer
> query. In this case, the subquery turns into a SubPlan tree that
> stays in the resjunk item. At the start of execution, the ParamExecData
> structs for each of its output Params are marked with execPlan pointers
> to the subplan, just as would happen for an InitPlan. This causes the
> subplan to get executed when the first of the output Params is evaluated;
> it loads the ParamExecData structs for all its output Params, and then
> the later Params just take data from the structs. When execution reaches
> the MULTIEXEC SubPlan in the resjunk tlist item, no evaluation of the
> subplan is needed; but instead we re-mark all the output ParamExecData
> structs with non-null execPlan pointers, so that a fresh execution of
> the subplan will happen in the next evaluation of the targetlist.
>
> The klugy aspect of this is that it assumes that the SubPlan item will
> appear later in the tlist than all the Params referencing it. This is
> true at the moment because resjunk tlist items always appear after
> non-resjunk ones. There are a few other places that already depend on
> this ordering, but we've mostly tried to avoid introducing new
> dependencies on it.
>
> The alternative that I'd originally had in mind, before put-it-in-a-
> resjunk-item occurred to me, was to invent a new "secondary tlist"
> field of Query and of ModifyTable plan nodes, as I sketched back in
> http://www.postgresql.org/message-id/1783.1399054541@sss.pgh.pa.us
> We'd put the MULTIEXEC SubPlans in this secondary tlist and expect
> the executor to evaluate it just before evaluating the main tlist.
> However, that way isn't terribly pretty either, because it extends
> knowledge of this feature to a *whole lot* of places that don't have
> to know about it in the attached approach; in particular, just about
> every place that manipulates targetlist contents would have to also
> manipulate the secondary tlist.
>
> Another approach is to explicitly identify which of the Params will
> be evaluated first and replace it with a node tree that evaluates
> the subplan (and sets output Params for the remaining columns).
> This is a bit messy because the first-to-be-evaluated is dependent
> on the targetlist reordering that the planner does; so we don't want
> parse analysis to try to do it. (If we allowed parse analysis to
> know that the planner will sort the tlist items into physical column
> order, we could do it like that; but then it would break if we ever
> get around to allowing ALTER TABLE to change the physical order.)
> We could safely have setrefs.c determine the first-to-be-evaluated
> Param, though, since it will traverse the tlist in final order.
> So if we went with this approach I'd have setrefs.c replace the first
> Param with a SubPlan node. That seems a bit of a kluge too though.
>
> Preferences, comments, better ideas?
>
> regards, tom lane
>
>
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers(at)postgresql(dot)org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers
>
>


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP patch for multiple column assignment in UPDATE
Date: 2014-06-19 13:37:21
Message-ID: 28860.1403185041@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com> writes:
> I did some tests and It looks so it allows only some form of nested loop.

[ shrug... ] It's a subplan. One evaluation per outer row is what
people are expecting.

regards, tom lane


From: Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP patch for multiple column assignment in UPDATE
Date: 2014-06-19 21:11:47
Message-ID: CAFj8pRCjZ_6mu_kMKRD8oe8KKPnbNOwCTUBxY2nL=2-HDH2zDQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

2014-06-19 15:37 GMT+02:00 Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>:

> Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com> writes:
> > I did some tests and It looks so it allows only some form of nested loop.
>
> [ shrug... ] It's a subplan. One evaluation per outer row is what
> people are expecting.
>

ok

regards

Pavel

>
> regards, tom lane
>


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP patch for multiple column assignment in UPDATE
Date: 2014-06-23 14:51:28
Message-ID: CA+TgmoYNwPsVwMn4oxcbhGNJ9=S-K+VjH8rkwyEaMkH60au09A@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Jun 19, 2014 at 9:37 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com> writes:
>> I did some tests and It looks so it allows only some form of nested loop.
>
> [ shrug... ] It's a subplan. One evaluation per outer row is what
> people are expecting.

Is it theoretically possible to convert a construct like this to a
semi-join? I notice we don't, even when this new syntax isn't used:

rhaas=# explain select a, (select b from bar where foo.a = bar.a) from foo;
QUERY PLAN
------------------------------------------------------------
Seq Scan on foo (cost=0.00..855145.00 rows=10000 width=4)
SubPlan 1
-> Seq Scan on bar (cost=0.00..85.50 rows=1 width=4)
Filter: (foo.a = a)
Planning time: 0.078 ms

...but I'm wondering if that's an unimplemented feature or if there's
some reason why it's a bad idea.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company