Re: NOT IN >2hrs vs EXCEPT < 2 sec.

Lists: pgsql-performance
From: Kevin Traster <kevin(at)mffais(dot)com>
To: pgsql-performance(at)postgresql(dot)org
Subject: NOT IN >2hrs vs EXCEPT < 2 sec.
Date: 2009-01-29 07:01:34
Message-ID: 72188cf00901282301x6abba690p38c3dff0df4bc338@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

2 questions:

1) Different costs for same actions. Doing an explain on 2 nearly identical
queries both involving the same Index scan on same table has 2 widely
different costs for same Index scan 303375872.86 vs. 12576.70

2) Simple query using NOT IN (subquery)was killed after 2 hrs, using the
same query (query) except (query) ran in < 2 sec.

Summary:

On devel box (Unix PG version 8.3.5) with no other database activity or
system activity after immediately completing a vacuum analyze.

The original query (below) was running for over 2 hrs and was killed.:

select distinct ciknum into tmpnocikinowner from cik where ciknum not in
(select cik from owner_cik_master);

est total cost: 303375872.86, for Index Scan: 303375616.75

Simple query broken down: explain select distinct ciknum into
tmpnocikinowner from cik ;

est total cost: 12576.70, for Index Scan: 12064.49

and

select cik from owner_cik_master

est total cost: 2587.36, for Index Scan: N/A

Actual time, the query was killed after 2hrs,

However, we ran:

select distinct ciknum into tmpnocikinowner from cik ; - actual time 861.487
ms
(select ciknum from tmpnocikinowner) except (select cik from
owner_cik_master); - actual time 1328.094 ms

##### Console log below with details ######

devel=# explain select distinct ciknum into tmpnocikinowner from cik where
ciknum not in (select cik from owner_cik_master);
QUERY
PLAN
------------------------------------------------------------------------------------------------
Unique (cost=3506.21..303375872.86 rows=71946 width=8)
-> Index Scan using cik_ciknum_idx on cik (cost=3506.21..303375616.75
rows=102444 width=8)
Filter: (NOT (subplan))
SubPlan
-> Materialize (cost=3506.21..6002.40 rows=186019 width=4)
-> Seq Scan on owner_cik_master (cost=0.00..2684.19
rows=186019 width=4)
(6 rows)

Time: 0.723 ms
devel=# explain select ciknum into tmpnocikinowner from cik where ciknum not
in (select cik from owner_cik_master);
QUERY
PLAN
--------------------------------------------------------------------------------------
Seq Scan on cik (cost=3506.21..303367660.13 rows=102444 width=8)
Filter: (NOT (subplan))
SubPlan
-> Materialize (cost=3506.21..6002.40 rows=186019 width=4)
-> Seq Scan on owner_cik_master (cost=0.00..2684.19 rows=186019
width=4)
(5 rows)

Time: 0.588 ms
devel=# explain select ciknum::int into tmpnocikinowner from cik where
ciknum::int not in (select cik::int from owner_cik_master);
QUERY
PLAN
--------------------------------------------------------------------------------------
Seq Scan on cik (cost=3506.21..303368428.46 rows=102444 width=8)
Filter: (NOT (subplan))
SubPlan
-> Materialize (cost=3506.21..6002.40 rows=186019 width=4)
-> Seq Scan on owner_cik_master (cost=0.00..2684.19 rows=186019
width=4)
(5 rows)

Time: 0.918 ms
devel=# explain select ciknum into tmpnocikinowner from cik
;
QUERY PLAN
-----------------------------------------------------------
Seq Scan on cik (cost=0.00..4107.87 rows=204887 width=8)
(1 row)

Time: 0.438 ms
devel=# explain select distinct ciknum into tmpnocikinowner from cik ;
QUERY
PLAN
-----------------------------------------------------------------------------------------
Unique (cost=0.00..12576.70 rows=143891 width=8)
-> Index Scan using cik_ciknum_idx on cik (cost=0.00..12064.49
rows=204887 width=8)
(2 rows)

Time: 0.468 ms
devel=# select distinct ciknum into tmpnocikinowner from cik ;
SELECT
Time: 861.487 ms

devel=# explain select ciknum from tmpnocikinowner where ciknum not in
(select cik from owner_cik_master);
QUERY
PLAN
--------------------------------------------------------------------------------------
Seq Scan on tmpnocikinowner (cost=3506.21..261092922.31 rows=88168
width=8)
Filter: (NOT (subplan))
SubPlan
-> Materialize (cost=3506.21..6002.40 rows=186019 width=4)
-> Seq Scan on owner_cik_master (cost=0.00..2684.19 rows=186019
width=4)
(5 rows)

Time: 0.629 ms

devel=# explain select cik from owner_cik_master;
QUERY PLAN
------------------------------------------------------------------------
Seq Scan on owner_cik_master (cost=0.00..2684.19 rows=186019 width=4)
(1 row)

Time: 0.415 ms
devel=# explain select ciknum from tmpnocikinowner;;
QUERY PLAN
-----------------------------------------------------------------------
Seq Scan on tmpnocikinowner (cost=0.00..2587.36 rows=176336 width=8)
(1 row)

Time: 0.413 ms
devel=# explain (select ciknum from tmpnocikinowner) except (select cik
from owner_cik_master);
QUERY
PLAN
------------------------------------------------------------------------------------------------
SetOp Except (cost=47309.23..49121.00 rows=36236 width=8)
-> Sort (cost=47309.23..48215.12 rows=362355 width=8)
Sort Key: "*SELECT* 1".ciknum
-> Append (cost=0.00..8895.10 rows=362355 width=8)
-> Subquery Scan "*SELECT* 1" (cost=0.00..4350.72
rows=176336 width=8)
-> Seq Scan on tmpnocikinowner (cost=0.00..2587.36
rows=176336 width=8)
-> Subquery Scan "*SELECT* 2" (cost=0.00..4544.38
rows=186019 width=4)
-> Seq Scan on owner_cik_master (cost=0.00..2684.19
rows=186019 width=4)
(8 rows)

Time: 0.625 ms
devel=# (select ciknum from tmpnocikinowner) except (select cik from
owner_cik_master);
ciknum
--------
(0 rows)

Time: 1328.094 ms


From: Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com>
To: Kevin Traster <kevin(at)mffais(dot)com>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: NOT IN >2hrs vs EXCEPT < 2 sec.
Date: 2009-01-29 07:37:20
Message-ID: dcc563d10901282337p51acabfbt7b5f1d31bf77857d@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

On Thu, Jan 29, 2009 at 12:01 AM, Kevin Traster <kevin(at)mffais(dot)com> wrote:
> 2 questions:
>
> 1) Different costs for same actions. Doing an explain on 2 nearly identical
> queries both involving the same Index scan on same table has 2 widely
> different costs for same Index scan 303375872.86 vs. 12576.70

Pretty sure this is a FAQ by now.

not in and except treat nulls differently. If you table has nullable
fields and nulls would break your query, then not in () is a bad
choice. Therefore, effort to optimize had been placed into except,
which is distinctly, symantically different from not in ().

It seems like some shift in the pg community has happened where we're
suddenly getting a lot of folks who came from a database where not in
and except are treated the same, even though they most definitely do
not mean the same thing.


From: Kevin Traster <kevin(at)mffais(dot)com>
To: pgsql-performance(at)postgresql(dot)org
Subject: Re: NOT IN >2hrs vs EXCEPT < 2 sec.
Date: 2009-01-29 07:54:45
Message-ID: 72188cf00901282354w18e0b03bh4b3cc007d2b2531a@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

On Wed, Jan 28, 2009 at 11:37 PM, Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com>wrote:

> On Thu, Jan 29, 2009 at 12:01 AM, Kevin Traster <kevin(at)mffais(dot)com> wrote:
> > 2 questions:
> >
> > 1) Different costs for same actions. Doing an explain on 2 nearly
> identical
> > queries both involving the same Index scan on same table has 2 widely
> > different costs for same Index scan 303375872.86 vs. 12576.70
>
> Pretty sure this is a FAQ by now.
>
> not in and except treat nulls differently. If you table has nullable
> fields and nulls would break your query, then not in () is a bad
> choice. Therefore, effort to optimize had been placed into except,
> which is distinctly, symantically different from not in ().
>
> It seems like some shift in the pg community has happened where we're
> suddenly getting a lot of folks who came from a database where not in
> and except are treated the same, even though they most definitely do
> not mean the same thing.
>


From: Kevin Traster <kevin(at)mffais(dot)com>
To: pgsql-performance(at)postgresql(dot)org
Subject: Re: NOT IN >2hrs vs EXCEPT < 2 sec.
Date: 2009-01-29 07:56:15
Message-ID: 72188cf00901282356x343b1e8fn339b06974c22bb39@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

On Wed, Jan 28, 2009 at 11:37 PM, Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com>wrote:

> On Thu, Jan 29, 2009 at 12:01 AM, Kevin Traster <kevin(at)mffais(dot)com> wrote:
> > 2 questions:
> >
> > 1) Different costs for same actions. Doing an explain on 2 nearly
> identical
> > queries both involving the same Index scan on same table has 2 widely
> > different costs for same Index scan 303375872.86 vs. 12576.70
>
> Pretty sure this is a FAQ by now.
>
> not in and except treat nulls differently. If you table has nullable
> fields and nulls would break your query, then not in () is a bad
> choice. Therefore, effort to optimize had been placed into except,
> which is distinctly, symantically different from not in ().
>
> It seems like some shift in the pg community has happened where we're
> suddenly getting a lot of folks who came from a database where not in
> and except are treated the same, even though they most definitely do
> not mean the same thing.
>

Umm... No. The top of the post you quoted regards the difference between the
query "get ciknum from cik" versus get ciknum from cik where NOT IN.... The
only differene between the two queries is the qualification of "where ciknum
not in ....". It does not involve the difference between NOT IN versus
Except

Both queries do an Index Scan using cik_ciknum_idx and those numbers show
the different costs doing the same task.

In this case, neither table allowes nulls in the columns, both tables have
single indexes on the columns used.

Regarding the previous posts about the same issues of PERFORMENCE between
NOT IN versus EXCEPT. There has not been any answer to explain it - just
talk about the differenences between the two results.

Yes, I can still get the results using EXCEPT but it would be nice to no why
I can't get NOT IN to complete the simple query.


From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: Kevin Traster <kevin(at)mffais(dot)com>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: NOT IN >2hrs vs EXCEPT < 2 sec.
Date: 2009-01-29 13:00:16
Message-ID: 87d4e6qnbz.fsf@oxford.xeocode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Kevin Traster <kevin(at)mffais(dot)com> writes:

> Regarding the previous posts about the same issues of PERFORMENCE between
> NOT IN versus EXCEPT. There has not been any answer to explain it - just
> talk about the differenences between the two results.
>
> Yes, I can still get the results using EXCEPT but it would be nice to no why
> I can't get NOT IN to complete the simple query.

There are two answers here. One you've already been given, that NOT IN has to
handle NULLs specially and that makes these plans not equivalent. The NOT IN
is decidedly harder to solve.

The other answer is that EXCEPT is a set operation which in Postgres uses a
completely different set of logic. Even if you used NOT EXISTS which really is
equivalent to EXCEPT the resulting plans would be different. Which one would
be better would depend on the circumstances. In an ideal world every
equivalent query would generate identical plans. We don't live in an ideal
world and Postgres isn't perfect.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's Slony Replication support!


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Kevin Traster <kevin(at)mffais(dot)com>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: NOT IN >2hrs vs EXCEPT < 2 sec.
Date: 2009-01-29 14:56:41
Message-ID: 28345.1233241001@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Kevin Traster <kevin(at)mffais(dot)com> writes:
> Unique (cost=3506.21..303375872.86 rows=71946 width=8)
> -> Index Scan using cik_ciknum_idx on cik (cost=3506.21..303375616.75
> rows=102444 width=8)
> Filter: (NOT (subplan))
> SubPlan
> -> Materialize (cost=3506.21..6002.40 rows=186019 width=4)
> -> Seq Scan on owner_cik_master (cost=0.00..2684.19
> rows=186019 width=4)

It will help some if you raise work_mem enough so you get a "hashed
subplan" there, assuming the NOT IN is on a hashable datatype.

But as was already noted, more work has been put into optimizing
EXCEPT and NOT EXISTS than NOT IN, because the latter is substantially
less useful due to its unintuitive but spec-mandated handling of NULLs.
(And this disparity will be even larger in 8.4.) We're not going to
apologize for that, and we're not going to regard it as a bug.

regards, tom lane