Re: EvalPlanQual seems a tad broken

Lists: pgsql-hackers
From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql-hackers(at)postgreSQL(dot)org
Subject: EvalPlanQual seems a tad broken
Date: 2009-10-12 15:42:08
Message-ID: 15588.1255362128@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

While fooling around with moving FOR UPDATE into a plan node (WIP
version attached), I came across this interesting behavior:

1. Create test tables:

create table t1(f1 int, f2 int);

insert into t1 values (1,1);
insert into t1 values (2,2);

create table t2(f3 int, f4 int);

insert into t2 values (1,111);
insert into t2 values (1,112);
insert into t2 values (2,113);
insert into t2 values (2,114);

2. Execute test query:

select * from t1 join t2 on f1=f3 for update;

f1 | f2 | f3 | f4
----+----+----+-----
1 | 1 | 1 | 111
1 | 1 | 1 | 112
2 | 2 | 2 | 113
2 | 2 | 2 | 114
(4 rows)

3. In another session, execute:

begin;
update t1 set f2=42 where f1=2;

4. In first session, repeat test query:

select * from t1 join t2 on f1=f3 for update;

As expected, it blocks waiting for the second session to commit.

5. Now commit in the second session. First session resumes and prints

f1 | f2 | f3 | f4
----+----+----+-----
1 | 1 | 1 | 111
1 | 1 | 1 | 112
2 | 42 | 2 | 113
2 | 42 | 2 | 114
2 | 42 | 2 | 113
2 | 42 | 2 | 114
(6 rows)

Of course the expected answer is

f1 | f2 | f3 | f4
----+----+----+-----
1 | 1 | 1 | 111
1 | 1 | 1 | 112
2 | 42 | 2 | 113
2 | 42 | 2 | 114
(4 rows)

which is what you'll get if you simply repeat the test query.

This isn't a new bug ... the same behavior can be demonstrated as far
back as 7.0.

The problem is that execMain.c is set up to pull as many rows as it can
from execution of an EvalPlanQual query. Then, once that's exhausted,
it steps to the next result from the original query. So if a row that
requires locking joins to more than one row in some other table, you
get duplicated output rows.

The attached patch alleviates some cases of this by having the new
LockRows plan node lock *all* the target rows, not just one, before
firing the EvalPlanQual query. It doesn't fix the example above
because only one of the rows being joined is seen to need EPQ treatment.
We could improve that by feeding successfully locked rows into the EPQ
machinery as well as ones that were found to be outdated. But that
would still leave us with two failure cases:

1. if some of the tables being joined are not selected FOR UPDATE.

2. if the select involves any set-returning functions in the targetlist.

I think we could get around #1 by having *all* tables in the query
marked FOR UPDATE at least in a dummy form, ie give them entries in
the rowMarks list and create junk tlist entries to report their current
ctid. Then we'd feed all the relevant rows into the EPQ machinery.
We'd just not lock the ones we weren't asked to lock.

I do not see any very good way around #2. I'm tempted to propose
that we just forbid SRFs in the targetlist of a FOR UPDATE query.
This could be justified on the same grounds that we forbid aggregate
functions there, ie, they destroy the one-to-one correspondence between
table rows and SELECT output rows. If you really had to have it you
could do something like

select srf(...) from (select ... for update) ss;

Anyone have a better idea?

regards, tom lane

Attachment Content-Type Size
lockrows-1.patch.gz application/octet-stream 19.3 KB

From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: EvalPlanQual seems a tad broken
Date: 2009-10-12 18:51:20
Message-ID: 1255373480.15590.259.camel@ebony
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


On Mon, 2009-10-12 at 11:42 -0400, Tom Lane wrote:

> The problem is that execMain.c is set up to pull as many rows as it can
> from execution of an EvalPlanQual query. Then, once that's exhausted,
> it steps to the next result from the original query. So if a row that
> requires locking joins to more than one row in some other table, you
> get duplicated output rows.

Surely the SQL Standard recognises such queries as failing test 23b) on
7.12 <query specification> (p379, SQL2008 Foundation). So the query is
not potentially updateable and should give a runtime error using a FOR
UPDATE.

Can we add something to check for uniqueness of the join and throw an
error when we detect this situation? Seems like a better general
solution.

> I do not see any very good way around #2. I'm tempted to propose
> that we just forbid SRFs in the targetlist of a FOR UPDATE query.
> This could be justified on the same grounds that we forbid aggregate
> functions there, ie, they destroy the one-to-one correspondence between
> table rows and SELECT output rows. If you really had to have it you
> could do something like

Until we have a way to specify that the return set from an SRF is
unique, that seems the only way. That would be desirable because it
would allow FunctionScans to be join removed as well.

--
Simon Riggs www.2ndQuadrant.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: EvalPlanQual seems a tad broken
Date: 2009-10-12 19:29:04
Message-ID: 4990.1255375744@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Simon Riggs <simon(at)2ndQuadrant(dot)com> writes:
> On Mon, 2009-10-12 at 11:42 -0400, Tom Lane wrote:
>> The problem is that execMain.c is set up to pull as many rows as it can
>> from execution of an EvalPlanQual query. Then, once that's exhausted,
>> it steps to the next result from the original query. So if a row that
>> requires locking joins to more than one row in some other table, you
>> get duplicated output rows.

> Surely the SQL Standard recognises such queries as failing test 23b) on
> 7.12 <query specification> (p379, SQL2008 Foundation). So the query is
> not potentially updateable and should give a runtime error using a FOR
> UPDATE.

You're confusing our implementation of FOR UPDATE with the spec.
As I said earlier, they are only very loosely related. In particular,
our reading of FOR UPDATE is to lock the referenced rows, not to enforce
that they are referenced only once.

regards, tom lane


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: <pgsql-hackers(at)postgreSQL(dot)org>,"Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: EvalPlanQual seems a tad broken
Date: 2009-10-13 14:38:12
Message-ID: 4AD44A84020000250002B8E6@gw.wicourts.gov
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> 5. Now commit in the second session. First session resumes and
> prints
>
> f1 | f2 | f3 | f4
> ----+----+----+-----
> 1 | 1 | 1 | 111
> 1 | 1 | 1 | 112
> 2 | 42 | 2 | 113
> 2 | 42 | 2 | 114
> 2 | 42 | 2 | 113
> 2 | 42 | 2 | 114
> (6 rows)
>
> Of course the expected answer is
>
> f1 | f2 | f3 | f4
> ----+----+----+-----
> 1 | 1 | 1 | 111
> 1 | 1 | 1 | 112
> 2 | 42 | 2 | 113
> 2 | 42 | 2 | 114
> (4 rows)
>
> which is what you'll get if you simply repeat the test query.

Is this related to issue 4593? (SELECT FOR UPDATE can return results
in a sequence inconsistent with actual result rows and the ORDER BY
clause -- rows are ordered by the pre-UPDATE values, while the results
show the post-UPDATE values.)

http://archives.postgresql.org/pgsql-bugs/2009-01/msg00017.php

On the face of it, it seems very similar. Will the patch address this
anomaly of SELECT FOR UPDATE, too?

-Kevin


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: EvalPlanQual seems a tad broken
Date: 2009-10-13 14:45:03
Message-ID: 29049.1255445103@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

"Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov> writes:
> Is this related to issue 4593?

No, not directly. Now that locking is done in a separate plan node,
we could think about addressing #4593 by switching the order of the
LockRows and Sort plan nodes, but that has nothing to do with how
well EvalPlanQual works.

I was planning to start a separate thread discussing whether to do that,
but it's a bit moot until we have a fix for EvalPlanQual --- at least
one of the possible answers is to throw out EvalPlanQual altogether and
do something else, in which case LockRows might go away again.

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: EvalPlanQual seems a tad broken
Date: 2009-10-22 18:12:16
Message-ID: 3068.1256235136@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

I wrote:
> [ EvalPlanQual does not work well ]

I'm planning to go back to work on this now that we're out of
CommitFest.

> We could improve that by feeding successfully locked rows into the EPQ
> machinery as well as ones that were found to be outdated. But that
> would still leave us with two failure cases:

> 1. if some of the tables being joined are not selected FOR UPDATE.

> 2. if the select involves any set-returning functions in the targetlist.

> I think we could get around #1 by having *all* tables in the query
> marked FOR UPDATE at least in a dummy form, ie give them entries in
> the rowMarks list and create junk tlist entries to report their current
> ctid. Then we'd feed all the relevant rows into the EPQ machinery.
> We'd just not lock the ones we weren't asked to lock.

On further review it seems that a better way to do this is to make
things happen inside the EPQ machinery. We need to "freeze" the rows
returned by *all* scan nodes, not only the ones referencing real tables
--- for example, a join against a VALUES scan node would still be a
problem if we don't lock down the VALUES output, since we could end up
getting multiple join rows out. This means we can't assume that there
is a CTID associated with every scan node that EPQ needs to lock down.
What looks like it would work instead is to pass through the current
scan tuple for every scan plan node, not only the ones that are FOR
UPDATE targets. I'm tempted to try to move the responsibility for this
into execScan.c instead of having all the individual scan plan types
know about it.

> I do not see any very good way around #2. I'm tempted to propose
> that we just forbid SRFs in the targetlist of a FOR UPDATE query.
> This could be justified on the same grounds that we forbid aggregate
> functions there, ie, they destroy the one-to-one correspondence between
> table rows and SELECT output rows. If you really had to have it you
> could do something like
> select srf(...) from (select ... for update) ss;

This still seems like a necessary restriction unfortunately :-(.

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: EvalPlanQual seems a tad broken
Date: 2009-10-23 13:41:33
Message-ID: 9679.1256305293@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

I wrote:
> On further review it seems that a better way to do this is to make
> things happen inside the EPQ machinery. We need to "freeze" the rows
> returned by *all* scan nodes, not only the ones referencing real tables
> --- for example, a join against a VALUES scan node would still be a
> problem if we don't lock down the VALUES output, since we could end up
> getting multiple join rows out. This means we can't assume that there
> is a CTID associated with every scan node that EPQ needs to lock down.
> What looks like it would work instead is to pass through the current
> scan tuple for every scan plan node, not only the ones that are FOR
> UPDATE targets. I'm tempted to try to move the responsibility for this
> into execScan.c instead of having all the individual scan plan types
> know about it.

What I had been thinking of when I wrote that was to pass down the
ScanTupleSlots from the outer query's scan nodes. That codes up
nicely but doesn't work at all :-(. As is obvious in hindsight,
the scan nodes are not necessarily still returning the same tuples
that contribute to the current join tuple --- for instance if you
have a sort-and-mergejoin type of plan, all the scans will be at
EOF by the time the top level sees any tuples.

So we need to be able to recover the original scan tuples from the
join tuple, even for scans that are not to be locked. For real
tables this isn't hard, we can pass up the CTID as a junk column
the same as we do for tables that are to be locked. It's harder
for non-table scans though (VALUES, functions, etc). I can see
two conceivably workable alternatives:

1. Pass up the entire row value as a junk whole-row Var.

2. Invent some equivalent to CTID that would allow the row to be
recovered later --- for instance, the row number in a tuplestore.
One problem with this is that not all those scan types use a tuplestore
now, so we'd be adding significant overhead. Also, I'm not entirely
sure that it can work for scan nodes that get reset and rescanned
repeatedly. Some of them clear and refill the tuplestore when they
do that, and the refill isn't necessarily the same row values.

Fortunately, this case probably doesn't arise that much in practice,
so while it needs to work I doubt that performance is critical.

I'm planning to try alternative #1 next, but I wonder if anyone
has a better idea?

regards, tom lane