Re: BUG #7556: "select not in sub query" plan very poor vs "not exists"

Lists: pgsql-bugs
From: l1t(at)tom(dot)com
To: pgsql-bugs(at)postgresql(dot)org
Subject: BUG #7556: "select not in sub query" plan very poor vs "not exists"
Date: 2012-09-19 05:48:08
Message-ID: E1TED8q-00047D-Va@wrigleys.postgresql.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs

The following bug has been logged on the website:

Bug reference: 7556
Logged by: lt
Email address: l1t(at)tom(dot)com
PostgreSQL version: 9.2.0
Operating system: windows xp
Description:

test=# create table sli_test (id int primary key,info varchar(20));
test=# insert into sli_test select
generate_series(1,1000000),'digoal'||generate_series(1,1000000);
test=# analyze verbose sli_test;
test=# create table sli_test2 (id int not null,info varchar(20));
test=# insert into sli_test2 select
generate_series(1,1000000),'dbase'||generate_series(1,1000000);
test=# analyze verbose sli_test2;
test=# explain select max(a.info)from sli_test a where a.id not in(select
b.id from sli_test2 b where b.id<50000);
QUERY PLAN
---------------------------------------------------------------------------------------
Aggregate (cost=9241443774.00..9241443774.01 rows=1 width=12)


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: <pgsql-bugs(at)postgresql(dot)org>,<l1t(at)tom(dot)com>
Subject: Re: BUG #7556: "select not in sub query" plan very poor vs "not exists"
Date: 2012-09-19 18:06:51
Message-ID: 5059C36B020000250004A505@gw.wicourts.gov
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs

<l1t(at)tom(dot)com> wrote:

> test=# explain select max(a.info)from sli_test a where a.id not
> in(select b.id from sli_test2 b where b.id<50000);

> Aggregate (cost=9241443774.00..9241443774.01 rows=1 width=12)

A slower plan for NOT IN than NOT EXISTS is a fact, but definitely
not a bug. According to the standard, the semantics of NOT IN are
different from NOT EXISTS when there is a possibility of NULLs on
either side. Because of those different semantics, NOT IN cannot
use certain optimizations which are available for NOT EXISTS.

Technically, if a NOT IN case could be analyzed to the point where
it is clear that both sides of the predicate are definitely free of
NULLs, it could be run the same as NOT EXISTS, but that would add
complexity and run-time expense to the optimizer. Some cases are
simple, but some are not at all practical. We have chosen instead
to recommend that people use NOT EXISTS unless they really want the
rather astonishing behavior of NOT IN.

-Kevin


From: Craig Ringer <ringerc(at)ringerc(dot)id(dot)au>
To: l1t(at)tom(dot)com
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: BUG #7556: "select not in sub query" plan very poor vs "not exists"
Date: 2012-09-20 04:39:39
Message-ID: 505A9E0B.3000600@ringerc.id.au
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs

On 09/19/2012 01:48 PM, l1t(at)tom(dot)com wrote:
> The following bug has been logged on the website:
>
> Bug reference: 7556
> Logged by: lt
> Email address: l1t(at)tom(dot)com
> PostgreSQL version: 9.2.0
> Operating system: windows xp
> Description:
>

create table sli_test (id int primary key,info varchar(20));
insert into sli_test select
generate_series(1,1000000),'digoal'||generate_series(1,1000000);
analyze verbose sli_test;
create table sli_test2 (id int not null,info varchar(20));
insert into sli_test2 select
generate_series(1,1000000),'dbase'||generate_series(1,1000000);
analyze verbose sli_test2;

explain select max(a.info)from sli_test a where a.id not in(select
b.id from sli_test2 b where b.id<50000);

> QUERY PLAN
> ---------------------------------------------------------------------------------------
> Aggregate (cost=9241443774.00..9241443774.01 rows=1 width=12)

Here's what I get on 9.1:

regress=# explain select max(a.info)from sli_test a where a.id not in(select
regress(# b.id from sli_test2 b where b.id<50000);
QUERY PLAN

---------------------------------------------------------------------------------
Aggregate (cost=38050.82..38050.83 rows=1 width=12)
-> Seq Scan on sli_test a (cost=18026.82..36800.82 rows=500000
width=12)
Filter: (NOT (hashed SubPlan 1))
SubPlan 1
-> Seq Scan on sli_test2 b (cost=0.00..17906.00 rows=48329
width=4)
Filter: (id < 50000)
(6 rows)

It runs in about 500ms here.

You don't appear to have posted the full query plan, so it's hard to
compare.

In general, `NOT IN` is a poor formulation for a query; you're better
off with a JOIN or with `NOT EXISTS`. See eg

http://stackoverflow.com/questions/12444142/postgresql-how-to-figure-out-missing-numbers-in-a-column-using-generate-series/12444165#12444165

--
Craig Ringer