Re: 9.2rc1 produces incorrect results

Lists: pgsql-hackers
From: Vik Reykja <vikreykja(at)gmail(dot)com>
To: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: 9.2rc1 produces incorrect results
Date: 2012-09-04 12:38:41
Message-ID: CALDgxVvXrG3gxu0szPQ1XHkNb3bCUUDQCojnW=zg+nSAF6iaQQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hello. It took me a while to get a version of this that was independent of
my data, but here it is. I don't understand what's going wrong but if you
change any part of this query (or at least any part I tried), the correct
result is returned.

This script will reproduce it:

=====

create table t1 (id integer primary key);
create table t2 (id integer primary key references t1 (id));

insert into t1 (id) select generate_series(1, 100000); -- size matters
insert into t2 (id) values (1); -- get a known value in the table
insert into t2 (id) select g from generate_series(2, 100000) g where
random() < 0.01; -- size matters again

analyze t1;
analyze t2;

with
A as (
select t2.id,
t2.id = 1 as is_something
from t2
join t1 on t1.id = t2.id
left join pg_class pg_c on pg_c.relname = t2.id::text -- I haven't
tried on a user table
where pg_c.oid is null
),

B as (
select A.id,
row_number() over (partition by A.id) as order -- this seems
to be important, too
from A
)

select A.id, array(select B.id from B where B.id = A.id) from A where
A.is_something
union all
select A.id, array(select B.id from B where B.id = A.id) from A where
A.is_something;

=====

As you can (hopefully) see, the two UNIONed queries are identical but do
not return the same values. I wish I had the skills to attach a patch to
this message, but alas I do not.


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Vik Reykja <vikreykja(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 9.2rc1 produces incorrect results
Date: 2012-09-04 15:04:39
Message-ID: 17483.1346771079@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Vik Reykja <vikreykja(at)gmail(dot)com> writes:
> Hello. It took me a while to get a version of this that was independent of
> my data, but here it is. I don't understand what's going wrong but if you
> change any part of this query (or at least any part I tried), the correct
> result is returned.

Huh. 9.1 gets the wrong answer too, so this isn't a (very) new bug;
but curiously, 8.4 and 9.0 seem to get it right. I think this is
probably related somehow to Adam Mackler's recent report --- multiple
scans of the same CTE seems to be a bit of a soft spot :-(

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Vik Reykja <vikreykja(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 9.2rc1 produces incorrect results
Date: 2012-09-04 18:52:34
Message-ID: 23021.1346784754@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

I wrote:
> Vik Reykja <vikreykja(at)gmail(dot)com> writes:
>> Hello. It took me a while to get a version of this that was independent of
>> my data, but here it is. I don't understand what's going wrong but if you
>> change any part of this query (or at least any part I tried), the correct
>> result is returned.

> Huh. 9.1 gets the wrong answer too, so this isn't a (very) new bug;
> but curiously, 8.4 and 9.0 seem to get it right. I think this is
> probably related somehow to Adam Mackler's recent report --- multiple
> scans of the same CTE seems to be a bit of a soft spot :-(

No, I'm mistaken: it's a planner bug. The plan looks like this:

QUERY PLAN
-------------------------------------------------------------------------------------------
Result (cost=281.29..290.80 rows=20 width=36)
CTE a
-> Nested Loop (cost=126.96..280.17 rows=19 width=4)
-> Merge Right Join (cost=126.96..220.17 rows=19 width=4)
Merge Cond: (((pg_c.relname)::text) = ((t2.id)::text))
Filter: (pg_c.oid IS NULL)
-> Sort (cost=61.86..63.72 rows=743 width=68)
Sort Key: ((pg_c.relname)::text)
-> Seq Scan on pg_class pg_c (cost=0.00..26.43 rows=743 width=68)
-> Sort (cost=65.10..67.61 rows=1004 width=4)
Sort Key: ((t2.id)::text)
-> Seq Scan on t2 (cost=0.00..15.04 rows=1004 width=4)
-> Index Scan using t1_pkey on t1 (cost=0.00..3.14 rows=1 width=4)
Index Cond: (id = t2.id***)
CTE b
-> WindowAgg (cost=0.78..1.12 rows=19 width=4)
-> Sort (cost=0.78..0.83 rows=19 width=4)
Sort Key: a.id
-> CTE Scan on a (cost=0.00..0.38 rows=19 width=4)
-> Append (cost=0.00..9.51 rows=20 width=36)
-> CTE Scan on a (cost=0.00..4.66 rows=10 width=4)
Filter: is_something
SubPlan 3
-> CTE Scan on b (cost=0.00..0.43 rows=1 width=4)
Filter: (id = a.id***)
-> CTE Scan on a (cost=0.00..4.66 rows=10 width=4)
Filter: is_something
SubPlan 4
-> CTE Scan on b (cost=0.00..0.43 rows=1 width=4)
Filter: (id = a.id***)
(30 rows)

The planner is assigning a single PARAM_EXEC slot for the parameter
passed into the inner indexscan in "CTE a" and for the parameters passed
into the two SubPlans that scan "CTE b" (the items I marked with "***"
above). This is safe enough so far as the two SubPlans are concerned,
because they don't execute concurrently --- but when SubPlan 3 is first
fired, it causes the remainder of CTE a to be computed, so that the
nestloop gets iterated some more times, and that overwrites the value of
"a.id" that already got passed down into SubPlan 3.

The reason this is so hard to replicate is that the PARAM_EXEC slot can
only get re-used for identical-looking Vars (same varno, varlevelsup,
vartype, etc) --- so even granted that you've got the right shape of
plan, minor "unrelated" changes in the query can stop the aliasing
from occurring. Also, inner indexscans weren't using the PARAM_EXEC
mechanism until 9.1, so that's why the example doesn't fail before 9.1.

I've always been a bit suspicious of the theory espoused in
replace_outer_var that aliasing different Params is OK:

* NOTE: in sufficiently complex querytrees, it is possible for the same
* varno/abslevel to refer to different RTEs in different parts of the
* parsetree, so that different fields might end up sharing the same Param
* number. As long as we check the vartype/typmod as well, I believe that
* this sort of aliasing will cause no trouble. The correct field should
* get stored into the Param slot at execution in each part of the tree.

but I've never seen a provably wrong case before. Most likely, this has
been broken since we introduced CTEs in 8.4: it's the decoupled timing
of execution of main query and CTE that's needed to allow execution of
different parts of the plan tree to overlap and thus create the risk.

(I get the impression that only recently have people been writing really
complex CTE queries, since we found another fundamental bug with them
just recently.)

I think probably the best fix is to rejigger things so that Params
assigned by different executions of SS_replace_correlation_vars and
createplan.c can't share PARAM_EXEC numbers. This will result in
rather larger ecxt_param_exec_vals arrays at runtime, but the array
entries aren't very large, so I don't think it'll matter.

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Vik Reykja <vikreykja(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 9.2rc1 produces incorrect results
Date: 2012-09-05 04:09:03
Message-ID: 3521.1346818143@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

I wrote:
> I think probably the best fix is to rejigger things so that Params
> assigned by different executions of SS_replace_correlation_vars and
> createplan.c can't share PARAM_EXEC numbers. This will result in
> rather larger ecxt_param_exec_vals arrays at runtime, but the array
> entries aren't very large, so I don't think it'll matter.

Attached is a draft patch against HEAD for this. I think it makes the
planner's handling of outer-level Params far less squishy than it's ever
been, but it is rather a large change. Not sure whether to risk pushing
it into 9.2 right now, or wait till after we cut 9.2.0 ... thoughts?

regards, tom lane

Attachment Content-Type Size
planner-param-handling-1.patch text/x-patch 47.5 KB

From: Vik Reykja <vikreykja(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 9.2rc1 produces incorrect results
Date: 2012-09-05 07:50:44
Message-ID: CALDgxVuLr6ZVUKupFooQELCV6m8mSGKvR2jCJazgKbywcxfxUQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Sep 5, 2012 at 6:09 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> I wrote:
> > I think probably the best fix is to rejigger things so that Params
> > assigned by different executions of SS_replace_correlation_vars and
> > createplan.c can't share PARAM_EXEC numbers. This will result in
> > rather larger ecxt_param_exec_vals arrays at runtime, but the array
> > entries aren't very large, so I don't think it'll matter.
>
> Attached is a draft patch against HEAD for this. I think it makes the
> planner's handling of outer-level Params far less squishy than it's ever
> been, but it is rather a large change. Not sure whether to risk pushing
> it into 9.2 right now, or wait till after we cut 9.2.0 ... thoughts?
>

I am not in a position to know what's best for the project but my company
can't upgrade (from 9.0) until this is fixed. We'll wait for 9.2.1 if we
have to. After all, we skipped 9.1.


From: Thom Brown <thom(at)linux(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Vik Reykja <vikreykja(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 9.2rc1 produces incorrect results
Date: 2012-09-05 08:48:16
Message-ID: CAA-aLv65fqwCyWCawaQOKAdQ0X4-tXUp-Km1E8LEg6SDjZ8oMA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 5 September 2012 05:09, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> I wrote:
>> I think probably the best fix is to rejigger things so that Params
>> assigned by different executions of SS_replace_correlation_vars and
>> createplan.c can't share PARAM_EXEC numbers. This will result in
>> rather larger ecxt_param_exec_vals arrays at runtime, but the array
>> entries aren't very large, so I don't think it'll matter.
>
> Attached is a draft patch against HEAD for this. I think it makes the
> planner's handling of outer-level Params far less squishy than it's ever
> been, but it is rather a large change. Not sure whether to risk pushing
> it into 9.2 right now, or wait till after we cut 9.2.0 ... thoughts?

Just so someone else has tested the case in question, here's the
result this end:

id | array
----+-------
1 | {1}
1 | {1}
(2 rows)

QUERY PLAN
-------------------------------------------------------------------------------------------
Result (cost=131.45..133.07 rows=8 width=36)
CTE a
-> Nested Loop (cost=87.18..131.09 rows=7 width=4)
-> Merge Right Join (cost=87.18..123.33 rows=7 width=4)
Merge Cond: (((pg_c.relname)::text) = ((t2.id)::text))
Filter: (pg_c.oid IS NULL)
-> Sort (cost=22.82..23.55 rows=291 width=68)
Sort Key: ((pg_c.relname)::text)
-> Seq Scan on pg_class pg_c
(cost=0.00..10.91 rows=291 width=68)
-> Sort (cost=64.36..66.84 rows=993 width=4)
Sort Key: ((t2.id)::text)
-> Seq Scan on t2 (cost=0.00..14.93 rows=993 width=4)
-> Index Only Scan using t1_pkey on t1 (cost=0.00..1.10
rows=1 width=4)
Index Cond: (id = t2.id)
CTE b
-> WindowAgg (cost=0.24..0.36 rows=7 width=4)
-> Sort (cost=0.24..0.26 rows=7 width=4)
Sort Key: a.id
-> CTE Scan on a (cost=0.00..0.14 rows=7 width=4)
-> Append (cost=0.00..1.62 rows=8 width=36)
-> CTE Scan on a (cost=0.00..0.77 rows=4 width=4)
Filter: is_something
SubPlan 3
-> CTE Scan on b (cost=0.00..0.16 rows=1 width=4)
Filter: (id = a.id)
-> CTE Scan on a (cost=0.00..0.77 rows=4 width=4)
Filter: is_something
SubPlan 4
-> CTE Scan on b (cost=0.00..0.16 rows=1 width=4)
Filter: (id = a.id)
(30 rows)

As for shipping without the fix, I'm torn on whether to do so or not.
I imagine most productions will wait for a .1 or .2 release, and use
.0 for migration testing. Plus this bug hasn't been hit (or at least
not noticed) during 5 releases of 9.1, and there isn't enough time
left before shipping to expose the changes to enough testing in the
areas affected, so I'd be slightly inclined to push this into 9.1.6
and 9.2.1.

Regards

Thom


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Thom Brown <thom(at)linux(dot)com>
Cc: Vik Reykja <vikreykja(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 9.2rc1 produces incorrect results
Date: 2012-09-05 14:28:37
Message-ID: 15956.1346855317@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Thom Brown <thom(at)linux(dot)com> writes:
> On 5 September 2012 05:09, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> Attached is a draft patch against HEAD for this. I think it makes the
>> planner's handling of outer-level Params far less squishy than it's ever
>> been, but it is rather a large change. Not sure whether to risk pushing
>> it into 9.2 right now, or wait till after we cut 9.2.0 ... thoughts?

> As for shipping without the fix, I'm torn on whether to do so or not.
> I imagine most productions will wait for a .1 or .2 release, and use
> .0 for migration testing. Plus this bug hasn't been hit (or at least
> not noticed) during 5 releases of 9.1, and there isn't enough time
> left before shipping to expose the changes to enough testing in the
> areas affected, so I'd be slightly inclined to push this into 9.1.6
> and 9.2.1.

Yeah, after sleeping on it that's my feeling as well. The patch needs
some rework for back branches anyway (since a nontrivial part of it
is touching LATERAL support that doesn't exist before HEAD). I'll
push the fix to HEAD but wait till after 9.2.0 wrap for the back
branches.

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Thom Brown <thom(at)linux(dot)com>, Vik Reykja <vikreykja(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 9.2rc1 produces incorrect results
Date: 2012-09-05 15:27:16
Message-ID: 17407.1346858836@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

BTW, after considerable fooling around with Vik's example, I've been
able to produce a regression test case that fails in all PG versions
with WITH:

with
A as ( select q2 as id, (select q1) as x from int8_tbl ),
B as ( select id, row_number() over (partition by id) as r from A ),
C as ( select A.id, array(select B.id from B where B.id = A.id) from A )
select * from C;

The correct answer to this is

id | array
-------------------+-------------------------------------
456 | {456}
4567890123456789 | {4567890123456789,4567890123456789}
123 | {123}
4567890123456789 | {4567890123456789,4567890123456789}
-4567890123456789 | {-4567890123456789}
(5 rows)

as you can soon convince yourself by inspecting the contents of
int8_tbl:

q1 | q2
------------------+-------------------
123 | 456
123 | 4567890123456789
4567890123456789 | 123
4567890123456789 | 4567890123456789
4567890123456789 | -4567890123456789
(5 rows)

I got that answer with patched HEAD, but all the back branches
give me

id | array
-------------------+-------------------------------------
456 | {4567890123456789,4567890123456789}
4567890123456789 | {4567890123456789,4567890123456789}
123 | {123}
4567890123456789 | {4567890123456789,4567890123456789}
-4567890123456789 | {-4567890123456789}
(5 rows)

So this does indeed need to be back-patched as far as 8.4.

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Vik Reykja <vikreykja(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 9.2rc1 produces incorrect results
Date: 2012-09-07 02:54:25
Message-ID: 9813.1346986465@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

I wrote:

> Attached is a draft patch against HEAD for this.

I've finished back-porting this. I'm not going to commit it until 9.2.0
is definitely gold, but attached is the 9.1 version of the patch, if
you'd like to try it and verify that it fixes your original problem.

regards, tom lane

Attachment Content-Type Size
planner-param-handling-9.1.patch.gz application/octet-stream 10.3 KB