BUG #4070: Join more then ~15 tables let postgreSQL produces wrong data

Lists: pgsql-bugspgsql-patches
From: "Marcello Ceschia" <Marcello(dot)Ceschia(at)medizin(dot)uni-leipzig(dot)de>
To: pgsql-bugs(at)postgresql(dot)org
Subject: BUG #4070: Join more then ~15 tables let postgreSQL produces wrong data
Date: 2008-03-31 12:26:54
Message-ID: 200803311226.m2VCQskW015483@wwwmaster.postgresql.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs pgsql-patches


The following bug has been logged online:

Bug reference: 4070
Logged by: Marcello Ceschia
Email address: Marcello(dot)Ceschia(at)medizin(dot)uni-leipzig(dot)de
PostgreSQL version: 8.2.5 and 8.3.0
Operating system: Windows XP
Description: Join more then ~15 tables let postgreSQL produces wrong
data
Details:

We have an situation where we must join ~30 tables to get a general table.
Don't ask why we do this this way.

After an unknown number of columns the joind values are not correct. We get
only the value from column 1.
If I separete the query and generate an temporary table with the first 20
tables and join them later with the rest one, all works fine.
All values in the result table are correct.

If you need more information, contact me I can send some example data.


From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: Marcello Ceschia <Marcello(dot)Ceschia(at)medizin(dot)uni-leipzig(dot)de>
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: BUG #4070: Join more then ~15 tables let postgreSQL produces wrong data
Date: 2008-03-31 12:57:31
Message-ID: 47F0DFBB.7080209@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs pgsql-patches

Marcello Ceschia wrote:
> If you need more information, contact me I can send some example data.

Yes, we need more information. Please send a minimal test case with
CREATE TABLE statements and data required to reproduce the problem.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: "Ceschia, Marcello" <Marcello(dot)Ceschia(at)medizin(dot)uni-leipzig(dot)de>, PostgreSQL Bugs <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: BUG #4070: Join more then ~15 tables let postgreSQL produces wrong data
Date: 2008-03-31 13:12:24
Message-ID: 47F0E338.6070105@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs pgsql-patches

Ceschia, Marcello wrote:
> Can I send a backup with some data?

pg_dump output will do just fine, but please try to reduce the test case
to a simpler one. And send the query that's not behaving as expected as
well, and describe what output you expected.

And please keep the mailing list CC'd so that others can help.

>
> -----Ursprüngliche Nachricht-----
> Von: Heikki Linnakangas [mailto:hlinnaka(at)gmail(dot)com] Im Auftrag von Heikki Linnakangas
> Gesendet: Montag, 31. März 2008 14:58
> An: Ceschia, Marcello
> Cc: pgsql-bugs(at)postgresql(dot)org
> Betreff: Re: [BUGS] BUG #4070: Join more then ~15 tables let postgreSQL produces wrong data
>
> Marcello Ceschia wrote:
>> If you need more information, contact me I can send some example data.
>
> Yes, we need more information. Please send a minimal test case with
> CREATE TABLE statements and data required to reproduce the problem.
>

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: pgsql-patches <pgsql-patches(at)postgresql(dot)org>
Cc: "Ceschia, Marcello" <Marcello(dot)Ceschia(at)medizin(dot)uni-leipzig(dot)de>, PostgreSQL Bugs <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: [BUGS] BUG #4070: Join more then ~15 tables let postgreSQL produces wrong data
Date: 2008-04-03 13:22:34
Message-ID: 47F4DA1A.3000508@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs pgsql-patches

Heikki Linnakangas wrote:
> Ceschia, Marcello wrote:
>> In query "query_not_working" all values from column "136_119" has the
>> value of the first column.
>>
>> Using the splitted query ("working_version") it works.
>>
>> I hope this data will help to find the bug.
>
> Thanks.
>
> Oh, the query actually gives an assertion failure on an
> assertion-enabled build, so this is clearly a bug:
>
> TRAP: FailedAssertion("!(attnum > 0 && attnum <=
> list_length(rte->joinaliasvars))", File: "parse_relation.c", Line: 1697)
>
> gdb tells that attnum is -31393 at that point. That's because
> get_rte_attribute_type() takes an AttrNumber, which is int16, and
> make_var() is trying to pass 34143, so it overflows.
>
> It seems we should extend AttrNumber to int32; we don't use AttrNumber
> in any of the on-disk structs. Though you still couldn't have more than
> MaxHeapAttributeNumber (1600) attributes in a table or
> MaxTupleAttributeNumber (1664) in a result set or intermediate tuples,
> like the output of a sort node, at least you could join ridiculously
> wide tables like that as long as you project out enough columns.

Attached is a self-contained test script to reproduce this. It produces
an assertion failure in 8.1 - CVS HEAD. On 8.0, it runs for ~5 minutes,
and finally produces an "ERROR: invalid varattno -32768" elog. On 7.4,
it runs for even longer, but returns the correct result in the end.
Looking at the code, I believe the same bug is present in 8.0 and 7.4 as
well, but is masked by something else in those releases.

On second thought, expanding AttrNumber to int32, wholesale, might not
be a good idea, because AttrNumber is used in the function signature of
TupleDescInitEntry and some other functions that might be used in
C-language user-defined functions. They would need to be recompiled. Is
this something to worry about?

Another approach is to track down all uses of AttrNumber where it's used
to refer to an entry in a target list (varattno), and change those to
plain ints. Attached is a patch to do that. This seems like a safer
approach, but I'm slightly worried that I might've missed some variables
that need to be changed.

Thoughts?

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

Attachment Content-Type Size
varattno-crash.sql text/x-sql 23.9 KB
varattno-int-1.patch text/x-diff 12.5 KB

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc: pgsql-patches <pgsql-patches(at)postgresql(dot)org>, "Ceschia, Marcello" <Marcello(dot)Ceschia(at)medizin(dot)uni-leipzig(dot)de>, PostgreSQL Bugs <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: [PATCHES] Re: BUG #4070: Join more then ~15 tables let postgreSQL produces wrong data
Date: 2008-04-03 14:14:38
Message-ID: 29119.1207232078@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs pgsql-patches

Heikki Linnakangas <heikki(at)enterprisedb(dot)com> writes:
> On second thought, expanding AttrNumber to int32, wholesale, might not
> be a good idea,

No, it wouldn't. For one thing it'd be a protocol break --- column
numbers are int16 --- and for another, we'd have terrible performance
problems with such wide rows. Actually rows are supposed to be limited
to ~1600 columns, anyway, because of HeapTupleHeader limitations.
Apparently you've found a path where that restriction isn't enforced
correctly, but I haven't seen the referenced message yet ...

regards, tom lane


From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-patches <pgsql-patches(at)postgresql(dot)org>, "Ceschia, Marcello" <Marcello(dot)Ceschia(at)medizin(dot)uni-leipzig(dot)de>, PostgreSQL Bugs <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: [PATCHES] Re: BUG #4070: Join more then ~15 tables let postgreSQL produces wrong data
Date: 2008-04-03 15:00:53
Message-ID: 47F4F125.3050809@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs pgsql-patches

Tom Lane wrote:
> Heikki Linnakangas <heikki(at)enterprisedb(dot)com> writes:
>> On second thought, expanding AttrNumber to int32, wholesale, might not
>> be a good idea,
>
> No, it wouldn't. For one thing it'd be a protocol break --- column
> numbers are int16 ---

I wasn't planning to change that.

> and for another, we'd have terrible performance
> problems with such wide rows.

Yes, we probably would :-). Though if there's any nasty O(n^2) behavior
left in there, we should look at optimizing it anyway to speed up more
reasonably sized queries, in the range of a few hundred columns.

> Actually rows are supposed to be limited
> to ~1600 columns, anyway, because of HeapTupleHeader limitations.

The trick is that that limitation doesn't apply to the intermediate
virtual tuples we move around in the executor. Those are just arrays of
Datums, and can have more than MaxTupleAttributeNumber attributes, as
long as you project away enough attributes, bringing it below that
limit, before returning it to the client or materializing it into a
HeapTuple or MinimalTuple in the executor.

> Apparently you've found a path where that restriction isn't enforced
> correctly, but I haven't seen the referenced message yet ...

Enforcing the limit for virtual tuples as well, and checking for the
limit in the planner is one option, but it would cripple the ability to
join extremely wide tables. For example, if you had 10 tables with 200
columns each, you couldn't join them together even for the purposes of
COUNT(*). Granted, that's not a very common thing to do, this is the
first time this bug is reported after all, but I'd prefer to keep the
capability if possible.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc: pgsql-patches <pgsql-patches(at)postgresql(dot)org>, "Ceschia, Marcello" <Marcello(dot)Ceschia(at)medizin(dot)uni-leipzig(dot)de>, PostgreSQL Bugs <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: [PATCHES] Re: BUG #4070: Join more then ~15 tables let postgreSQL produces wrong data
Date: 2008-04-03 15:18:20
Message-ID: 254.1207235900@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs pgsql-patches

Heikki Linnakangas <heikki(at)enterprisedb(dot)com> writes:
> Tom Lane wrote:
>> Actually rows are supposed to be limited
>> to ~1600 columns, anyway, because of HeapTupleHeader limitations.

> The trick is that that limitation doesn't apply to the intermediate
> virtual tuples we move around in the executor.

I'm really unwilling to design the system in such a way that whether a
query works depends on whether a particular executor node tries to
materialize tuples or not.

> Enforcing the limit for virtual tuples as well, and checking for the
> limit in the planner is one option, but it would cripple the ability to
> join extremely wide tables. For example, if you had 10 tables with 200
> columns each, you couldn't join them together even for the purposes of
> COUNT(*).

Huh? Only if you actually tried to select all the columns.

I still haven't seen the actual bug description come by here, and the
pgsql-bugs archive hasn't got it either.

regards, tom lane


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>, pgsql-patches <pgsql-patches(at)postgresql(dot)org>, "Ceschia, Marcello" <Marcello(dot)Ceschia(at)medizin(dot)uni-leipzig(dot)de>, PostgreSQL Bugs <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: [PATCHES] Re: BUG #4070: Join more then ~15 tables let postgreSQL produces wrong data
Date: 2008-04-03 17:00:50
Message-ID: 20080403170050.GB2528@alvh.no-ip.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs pgsql-patches

Tom Lane wrote:

> I still haven't seen the actual bug description come by here, and the
> pgsql-bugs archive hasn't got it either.

http://archives.postgresql.org/pgsql-bugs/2008-03/msg00351.php

and continues in April here

http://archives.postgresql.org/pgsql-bugs/2008-04/msg00031.php

(apparently some mails on that thread are missing ...)

--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>, pgsql-patches <pgsql-patches(at)postgresql(dot)org>, "Ceschia, Marcello" <Marcello(dot)Ceschia(at)medizin(dot)uni-leipzig(dot)de>, PostgreSQL Bugs <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: [PATCHES] Re: BUG #4070: Join more then ~15 tables let postgreSQL produces wrong data
Date: 2008-04-03 17:41:30
Message-ID: 5462.1207244490@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs pgsql-patches

Alvaro Herrera <alvherre(at)commandprompt(dot)com> writes:
> Tom Lane wrote:
>> I still haven't seen the actual bug description come by here, and the
>> pgsql-bugs archive hasn't got it either.

> (apparently some mails on that thread are missing ...)

That's what I meant. Heikki is quoting himself from a message that
hasn't appeared anywhere public, and he must have had at least one
message from the OP that hasn't appeared either. So the rest of us
are still mostly in the dark about the problem.

regards, tom lane


From: John R Pierce <pierce(at)hogranch(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-patches <pgsql-patches(at)postgresql(dot)org>, "Ceschia, Marcello" <Marcello(dot)Ceschia(at)medizin(dot)uni-leipzig(dot)de>, PostgreSQL Bugs <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: [PATCHES] Re: BUG #4070: Join more then ~15 tables let postgreSQL produces wrong data
Date: 2008-04-03 17:54:10
Message-ID: 47F519C2.1090602@hogranch.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs pgsql-patches

Tom Lane wrote:
> That's what I meant. Heikki is quoting himself from a message that
> hasn't appeared anywhere public, and he must have had at least one
> message from the OP that hasn't appeared either. So the rest of us
> are still mostly in the dark about the problem.
>

I got this one, which appears to be the head of this thread...

-------- Original Message --------
Subject: [BUGS] BUG #4070: Join more then ~15 tables let postgreSQL
produces wrong data
Date: Mon, 31 Mar 2008 12:26:54 GMT
From: Marcello Ceschia <Marcello(dot)Ceschia(at)medizin(dot)uni-leipzig(dot)de>
To: pgsql-bugs(at)postgresql(dot)org

The following bug has been logged online:

Bug reference: 4070
Logged by: Marcello Ceschia
Email address: Marcello(dot)Ceschia(at)medizin(dot)uni-leipzig(dot)de
PostgreSQL version: 8.2.5 and 8.3.0
Operating system: Windows XP
Description: Join more then ~15 tables let postgreSQL produces wrong
data
Details:

We have an situation where we must join ~30 tables to get a general table.
Don't ask why we do this this way.

After an unknown number of columns the joind values are not correct. We get
only the value from column 1.
If I separete the query and generate an temporary table with the first 20
tables and join them later with the rest one, all works fine.
All values in the result table are correct.

If you need more information, contact me I can send some example data.


From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Alvaro Herrera <alvherre(at)commandprompt(dot)com>, pgsql-patches <pgsql-patches(at)postgresql(dot)org>, "Ceschia, Marcello" <Marcello(dot)Ceschia(at)medizin(dot)uni-leipzig(dot)de>, PostgreSQL Bugs <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: [PATCHES] Re: BUG #4070: Join more then ~15 tables let postgreSQL produces wrong data
Date: 2008-04-03 18:22:29
Message-ID: 47F52065.2030206@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs pgsql-patches

Tom Lane wrote:
> Alvaro Herrera <alvherre(at)commandprompt(dot)com> writes:
>> Tom Lane wrote:
>>> I still haven't seen the actual bug description come by here, and the
>>> pgsql-bugs archive hasn't got it either.
>
>> (apparently some mails on that thread are missing ...)
>
> That's what I meant. Heikki is quoting himself from a message that
> hasn't appeared anywhere public, and he must have had at least one
> message from the OP that hasn't appeared either. So the rest of us
> are still mostly in the dark about the problem.

Hmm, strange. Looks like my mail client decided to sent that mail to
pgsql-bugs-owner@ instead of pgsql-bugs@ for some reasone. Here's the
missing mail:

Ceschia, Marcello wrote:
> In query "query_not_working" all values from column "136_119" has the
value of the first column.
>
> Using the splitted query ("working_version") it works.
>
> I hope this data will help to find the bug.

Thanks.

Oh, the query actually gives an assertion failure on an
assertion-enabled build, so this is clearly a bug:

TRAP: FailedAssertion("!(attnum > 0 && attnum <=
list_length(rte->joinaliasvars))", File: "parse_relation.c", Line: 1697)

gdb tells that attnum is -31393 at that point. That's because
get_rte_attribute_type() takes an AttrNumber, which is int16, and
make_var() is trying to pass 34143, so it overflows.

It seems we should extend AttrNumber to int32; we don't use AttrNumber
in any of the on-disk structs. Though you still couldn't have more than
MaxHeapAttributeNumber (1600) attributes in a table or
MaxTupleAttributeNumber (1664) in a result set or intermediate tuples,
like the output of a sort node, at least you could join ridiculously
wide tables like that as long as you project out enough columns.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc: Alvaro Herrera <alvherre(at)commandprompt(dot)com>, pgsql-patches <pgsql-patches(at)postgresql(dot)org>, "Ceschia, Marcello" <Marcello(dot)Ceschia(at)medizin(dot)uni-leipzig(dot)de>, PostgreSQL Bugs <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: [PATCHES] Re: BUG #4070: Join more then ~15 tables let postgreSQL produces wrong data
Date: 2008-04-03 19:32:43
Message-ID: 14497.1207251163@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs pgsql-patches

Heikki Linnakangas <heikki(at)enterprisedb(dot)com> writes:
> Oh, the query actually gives an assertion failure on an
> assertion-enabled build, so this is clearly a bug:
> TRAP: FailedAssertion("!(attnum > 0 && attnum <=
> list_length(rte->joinaliasvars))", File: "parse_relation.c", Line: 1697)

Okay, I looked at this more closely and realized that our earlier
discussion was a bit beside the point. It's true that we can't
support a targetlist within any single plan tree that exceeds 1600
items, but that is not what the problem is here. The problem here
is that the described query generates a JOIN RTE having more than
32K join alias entries, and that means that it's impossible to build
a Var referencing the alias entries that're further down in the list,
because varattno is only int16. This is independent of how many
targetlist entries are actually requested.

I think the only sane approach to closing the bug in the stable branches
is to throw error if there's more than 32K columns in a join RTE.
The question is whether it's really worthwhile to do more than that
in HEAD. I think that people using reasonable table designs are never
going to run into this limitation anyway.

I don't much like the proposed patch --- widening AttrNumber seems
saner, or else splitting it into two types, one for varattno and
one for table column indexes and targetlist indexes. But even
phrasing it that way makes it sound pretty silly. Most Vars will
be referring to things that can't possibly exceed 1600.

I was thinking a day or two ago about fixing the planner's problems with
non-nullable subselect outputs underneath outer joins, and one of the
thoughts there was that we might be able to get rid of join alias vars
entirely if we had a brighter solution. Or at least not build the
entire dang list, but only the entries actually needed in the query.

What I propose we do is throw error for the moment, and make a TODO
note to revisit the question after redesigning outer-join planning.
Which is something I do intend to do for 8.4.

regards, tom lane


From: "Ceschia, Marcello" <Marcello(dot)Ceschia(at)medizin(dot)uni-leipzig(dot)de>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Alvaro Herrera" <alvherre(at)commandprompt(dot)com>, "pgsql-patches" <pgsql-patches(at)postgresql(dot)org>, "PostgreSQL Bugs" <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: [PATCHES] Re: BUG #4070: Join more then ~15 tables let postgreSQL produces wrong data
Date: 2008-04-04 07:20:16
Message-ID: 160E3DD4FB702C4CB860C65186686E690259F5E6@MRZS152229.medizin.uni-leipzig.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs pgsql-patches

-----Ursprüngliche Nachricht-----
Von: Tom Lane [mailto:tgl(at)sss(dot)pgh(dot)pa(dot)us]
Gesendet: Donnerstag, 3. April 2008 21:33
An: Heikki Linnakangas
Cc: Alvaro Herrera; pgsql-patches; Ceschia, Marcello; PostgreSQL Bugs
Betreff: Re: [PATCHES] Re: [BUGS] BUG #4070: Join more then ~15 tables let postgreSQL produces wrong data

> What I propose we do is throw error for the moment, and make a TODO
> note to revisit the question after redesigning outer-join planning.
> Which is something I do intend to do for 8.4.

For me that's a good solution.
My motivation for reporting this bug was to notice the developer about the problem.

Of course the query we did is a strange way to get a result, but it was possible without warning/error. At the moment we use a function to join the tables one by one and for the moment it is working.

Thank you for your helps

Marcello