Re: "large" IN/NOT IN subqueries result in query returning wrong data

Lists: pgsql-sql
From: "George Pavlov" <gpavlov(at)mynewplace(dot)com>
To: <pgsql-sql(at)postgresql(dot)org>
Subject: "large" IN/NOT IN subqueries result in query returning wrong data
Date: 2005-12-27 19:41:33
Message-ID: CCB89282FCE1024EA3DCE687A96A5164039EBD13@ehost010-6.exch010.intermedia.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-sql

The following looks like a bug to me, but please someone review and tell
me what I might be missing. Seems that past a certain result set size a
"[NOT] IN (subquery)" stops behaving as expected and returns 0 matches
even when there should be matches. No errors are returned, just faulty
data. The exact threshholds seem to depend on the data, subquery and
possibly the indexes in place. Nothing in the documentation mentions a
limit to what "IN subquery" can take so I am quite dismayed (if there is
a limit I would expect an ERROR/WARNING/something, but not wrong data).
Read below for details. This has been tried on PG 8.0.4 and 8.1.1 on
WinXP, and 8.0.4 and 8.0.5 on Linux.

I have two tables:
t1 (id, name, a_type)
t2 (id, name)

The t1 table is "big" (483211 rows), the name column contains
duplicates, the a_type has only two values. The t2 table is "small" (40
rows), the name values are unique. Some, but not all, t2.name values are
present in t1.name. To be precise, there are 10 t2.name values that do
not occur in t1.name (this is based on extraneous knowledge). I want to
find out which of the t2.name values are not ever used in t1.name:

select count(*) from t2 where t2.name not in (
select t1.name from t1);
--> 0

This should return 10, instead it returns 0!!! Adding a LIMIT to the
subquery and doing some trial and error produces very interesting
results:

select count(*) from t2 where t2.name not in (
select t1.name from t1 limit 261683)
--> 13
select count(*) from t2 where t2.name not in (
select t1.name from t1 limit 261684)
--> 0

What is so magical about 261683? The JOIN alternative produces the
correct results no matter what:

select count(*)
from t2 left join t1 using (name)
where t1.name is null
--> 10

This pretty much summarizes the issue. Any thoughts greatly appreciated.
Follow a few variations to show how the threshhold varies.

-- restrict to only one a_type
--
select count(*) from t2 where t2.name not in (
select t1.name from t1 where t1.a_type='P')
--> 0
select count(*) from t2 where t2.name not in (
select t1.name from t1 where t1.a_type='P' limit 289014)
--> 11
select count(*) from t2 where t2.name not in (
select t1.name from t1 where t1.a_type='P' limit 289015)
--> 0
select count(*)
from t2 left join (select * from t1 where a_type='P') t1 using (name)
where t1.name is null
--> 11

-- try distinct instead
--
select count(*) from t2 where t2.name not in (
select distinct t1.name from t1)
--> 0
select count(*) from t2 where t2.name not in (
select distinct t1.name from t1 limit 56001)
--> 10
select count(*) from t2 where t2.name not in (
select distinct t1.name from t1 limit 56002)
--> 0

-- distinct + a_type restriction (adding the a_type restriction
-- here _lowers_ the threshhold)
--
select count(*) from t2 where t2.name not in (
select distinct t1.name from t1 where t1.a_type='P')
--> 0
select count(*) from t2 where t2.name not in (
select distinct t1.name from t1 where t1.a_type='P' limit 33620)
--> 11
select count(*) from t2 where t2.name not in (
select distinct t1.name from t1 where t1.a_type='P' limit 33621)
--> 0


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "George Pavlov" <gpavlov(at)mynewplace(dot)com>
Cc: pgsql-sql(at)postgresql(dot)org
Subject: Re: "large" IN/NOT IN subqueries result in query returning wrong data
Date: 2005-12-27 19:44:51
Message-ID: 3697.1135712691@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-sql

"George Pavlov" <gpavlov(at)mynewplace(dot)com> writes:
> select count(*) from t2 where t2.name not in (
> select t1.name from t1 limit 261683)
> --> 13
> select count(*) from t2 where t2.name not in (
> select t1.name from t1 limit 261684)
> --> 0

> What is so magical about 261683?

Most likely, the 261684'th row of t1 has a NULL value of name.
Many people find the behavior of NOT IN with nulls unintuitive,
but it's per SQL spec ...

regards, tom lane


From: John McCawley <nospam(at)hardgeus(dot)com>
To: George Pavlov <gpavlov(at)mynewplace(dot)com>
Cc: pgsql-sql(at)postgresql(dot)org
Subject: Re: "large" IN/NOT IN subqueries result in query returning
Date: 2005-12-27 19:53:30
Message-ID: 43B19BBA.8050403@hardgeus.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-sql

At a glance I would guess that NULL values in one or both of your tables
is the culprit. NULL values always return false.

Example:

A quick test on my database:

select count(*) FROM tbl_employee;

count
-------
2689

select count(*) FROM tbl_employee WHERE username IS NULL;
count
-------
35

So I have 35 null values.

create table tbl_foo (username varchar(32));

insert into tbl_foo (username) values ('poop');

select count(*) FROM tbl_employee WHERE tbl_employee.username NOT IN
(select tbl_foo.username FROM tbl_foo);
count
-------
2654

So I only get 2654 values that are NOT IN the set 'poop'...i.e. the NULL
values are not included when I use the "NOT IN" my query. Nulls can be
confusing.

Hope this helps.

George Pavlov wrote:

>The following looks like a bug to me, but please someone review and tell
>me what I might be missing. Seems that past a certain result set size a
>"[NOT] IN (subquery)" stops behaving as expected and returns 0 matches
>even when there should be matches. No errors are returned, just faulty
>data. The exact threshholds seem to depend on the data, subquery and
>possibly the indexes in place. Nothing in the documentation mentions a
>limit to what "IN subquery" can take so I am quite dismayed (if there is
>a limit I would expect an ERROR/WARNING/something, but not wrong data).
>Read below for details. This has been tried on PG 8.0.4 and 8.1.1 on
>WinXP, and 8.0.4 and 8.0.5 on Linux.
>
>I have two tables:
> t1 (id, name, a_type)
> t2 (id, name)
>
>The t1 table is "big" (483211 rows), the name column contains
>duplicates, the a_type has only two values. The t2 table is "small" (40
>rows), the name values are unique. Some, but not all, t2.name values are
>present in t1.name. To be precise, there are 10 t2.name values that do
>not occur in t1.name (this is based on extraneous knowledge). I want to
>find out which of the t2.name values are not ever used in t1.name:
>
>select count(*) from t2 where t2.name not in (
> select t1.name from t1);
>--> 0
>
>This should return 10, instead it returns 0!!! Adding a LIMIT to the
>subquery and doing some trial and error produces very interesting
>results:
>
>select count(*) from t2 where t2.name not in (
> select t1.name from t1 limit 261683)
>--> 13
>select count(*) from t2 where t2.name not in (
> select t1.name from t1 limit 261684)
>--> 0
>
>What is so magical about 261683? The JOIN alternative produces the
>correct results no matter what:
>
>select count(*)
> from t2 left join t1 using (name)
> where t1.name is null
>--> 10
>
>This pretty much summarizes the issue. Any thoughts greatly appreciated.
>Follow a few variations to show how the threshhold varies.
>
>-- restrict to only one a_type
>
>


From: elein <elein(at)varlena(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: George Pavlov <gpavlov(at)mynewplace(dot)com>, pgsql-sql(at)postgresql(dot)org
Subject: Re: "large" IN/NOT IN subqueries result in query returning wrong data
Date: 2005-12-28 00:12:51
Message-ID: 20051228001251.GB20674@varlena.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-sql

On Tue, Dec 27, 2005 at 02:44:51PM -0500, Tom Lane wrote:
> "George Pavlov" <gpavlov(at)mynewplace(dot)com> writes:
> > select count(*) from t2 where t2.name not in (
> > select t1.name from t1 limit 261683)
> > --> 13
> > select count(*) from t2 where t2.name not in (
> > select t1.name from t1 limit 261684)
> > --> 0
>
> > What is so magical about 261683?
>
> Most likely, the 261684'th row of t1 has a NULL value of name.
> Many people find the behavior of NOT IN with nulls unintuitive,
> but it's per SQL spec ...
>
> regards, tom lane

In 8.0 we get:

elein=# select 1 in (NULL, 1, 2);
?column?
----------
t
(1 row)

elein=# select 3 not in (NULL, 1, 2);
?column?
----------

(1 row)

For consistency, either both should return NULL or
both return true/false.

For completeness testing, the following are correct.
Select NULL in/not in any list returns NULL.
elein=# select NULL in (1,2);
?column?
----------

(1 row)

elein=# select NULL not in (1,2);
?column?
----------

(1 row)

elein=# select NULL in (NULL, 1,2);
?column?
----------

(1 row)

elein=# select NULL not in (NULL, 1,2);
?column?
----------

(1 row)

elein
--------------------------------------------------------------
elein(at)varlena(dot)com Varlena, LLC www.varlena.com
(510)655-2584(o) (510)543-6079(c)

PostgreSQL Consulting, Support & Training

PostgreSQL General Bits http://www.varlena.com/GeneralBits/
--------------------------------------------------------------
AIM: varlenallc Yahoo: AElein Skype: varlenallc
--------------------------------------------------------------
I have always depended on the [QA] of strangers.


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: elein <elein(at)varlena(dot)com>
Cc: George Pavlov <gpavlov(at)mynewplace(dot)com>, pgsql-sql(at)postgresql(dot)org
Subject: Re: "large" IN/NOT IN subqueries result in query returning wrong data
Date: 2005-12-28 00:25:40
Message-ID: 9416.1135729540@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-sql

elein <elein(at)varlena(dot)com> writes:
> In 8.0 we get:

> elein=# select 1 in (NULL, 1, 2);
> ?column?
> ----------
> t
> (1 row)

> elein=# select 3 not in (NULL, 1, 2);
> ?column?
> ----------

> (1 row)

> For consistency, either both should return NULL or
> both return true/false.

The above behavior is correct per spec. Feel free to argue its
consistency with the SQL committee ;-)

Note that the above are not inverses because you changed the lefthand
input. You do get consistent results when you just add or omit NOT:

regression=# select 1 not in (NULL, 1, 2);
?column?
----------
f
(1 row)

regression=# select 3 in (NULL, 1, 2);
?column?
----------

(1 row)

regards, tom lane


From: elein <elein(at)varlena(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: elein <elein(at)varlena(dot)com>, George Pavlov <gpavlov(at)mynewplace(dot)com>, pgsql-sql(at)postgresql(dot)org
Subject: Re: "large" IN/NOT IN subqueries result in query returning wrong data
Date: 2005-12-28 00:35:03
Message-ID: 20051228003503.GE20674@varlena.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-sql

On Tue, Dec 27, 2005 at 07:25:40PM -0500, Tom Lane wrote:
> elein <elein(at)varlena(dot)com> writes:
> > In 8.0 we get:
>
> > elein=# select 1 in (NULL, 1, 2);
> > ?column?
> > ----------
> > t
> > (1 row)
>
> > elein=# select 3 not in (NULL, 1, 2);
> > ?column?
> > ----------
>
> > (1 row)
>
> > For consistency, either both should return NULL or
> > both return true/false.
>
> The above behavior is correct per spec. Feel free to argue its
> consistency with the SQL committee ;-)

Oh, no! Not the committee!

>
> Note that the above are not inverses because you changed the lefthand
> input. You do get consistent results when you just add or omit NOT:
Yes, you are right. I skipped the permutations to get down to the point.

>
> regression=# select 1 not in (NULL, 1, 2);
> ?column?
> ----------
> f
> (1 row)
>
> regression=# select 3 in (NULL, 1, 2);
> ?column?
> ----------
>
> (1 row)
>
> regards, tom lane

Thanks for your clarification.

~elein
elein(at)varlena(dot)com


From: Greg Stark <gsstark(at)mit(dot)edu>
To: elein <elein(at)varlena(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, George Pavlov <gpavlov(at)mynewplace(dot)com>, pgsql-sql(at)postgresql(dot)org
Subject: Re: "large" IN/NOT IN subqueries result in query returning wrong data
Date: 2005-12-28 15:47:33
Message-ID: 878xu55ilm.fsf@stark.xeocode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-sql

elein <elein(at)varlena(dot)com> writes:

> > Note that the above are not inverses because you changed the lefthand
> > input. You do get consistent results when you just add or omit NOT:
> Yes, you are right. I skipped the permutations to get down to the point.

Remember that NULL means "unknown". So "1 IN (1,2,NULL)" *should* be true
because regardless of what that unknown value is it's still obvious that 1
really is in the list. And "3 NOT IN (1,2,NULL)" is unknown because it depends
on whether that unknown quantity is 3 or not.

IN is the same as "= ANY" so "1 IN (1,2,NULL)" is the same as
"1=1 OR 1=2 OR 1=NULL" which is true even though the last of the three is null.

--
greg