not in vs not exists - vastly diferent performance

Lists: pgsql-sql
From: "Iain" <iain(at)mst(dot)co(dot)jp>
To: <pgsql-sql(at)postgresql(dot)org>
Subject: not in vs not exists - vastly diferent performance
Date: 2003-12-25 01:50:18
Message-ID: 013d01c3ca89$7307d7d0$7201a8c0@mst1x5r347kymb
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-sql

Hi All,

I found this interesting and thought I'd offer it up for comment.

I have the following delete:

delete from tableB where id not in (select id from tableA);

There are about 100,000 rows in table A, and about 500,000 in table B. id is
indexed on both tables. This is just a development DB, and I wanted to clean
it up so I could put in RI constraints. Somewhere along the line, records
have been deleted from tableA leaving orphans in tableB.

I launched the query yesterday afternoon, and it hadn't returned as of this
morning, 15 hours later. Running top showed that CPU utilization was running
close to 100%, and the disk was not busy at all. Anyway, I killed it and did
some testing:

Analyse revealed this plan, and varying random_page_cost between 1 and 4
didn't affect it:

Seq Scan on tableB (cost=0.00..1003619849.56 rows=251513 width=6)
Filter: (NOT (subplan))
SubPlan
-> Seq Scan on tableA (cost=0.00..3738.64 rows=100664 width=4)

recoding the delete to use not exists as follows:

delete from tableB where not exists (select id from tableA where tableA.id =
tableB.id);

Gave this plan:

Seq Scan on tableB (cost=0.00..719522.41 rows=236131 width=6)
Filter: (NOT (subplan))
SubPlan
-> Index Scan using tableB_pk on tableA(cost=0.00..3.01 rows=2
width=4)
Index Cond: ((id)::integer = ($0)::integer)

This deleted 1200 rows in about 2 seconds. Much better.

Anyway, I was a little surprised that "not in" chose to use a seq scan on
the tableA in this case. I had imagined, given that statistics were up to
date and the size of the table, that the plan would have been similar to
that generated by not exists, or perhaps would have used a hash table based
on the tableA ids.

Something to think about anyway.
Regards
Iain


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Iain" <iain(at)mst(dot)co(dot)jp>
Cc: pgsql-sql(at)postgresql(dot)org
Subject: Re: not in vs not exists - vastly diferent performance
Date: 2003-12-25 16:32:21
Message-ID: 20597.1072369941@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-sql

"Iain" <iain(at)mst(dot)co(dot)jp> writes:
> I found this interesting and thought I'd offer it up for comment.

You didn't say what PG version you are using, but I'd venture to bet
it is pre-7.4.

regards, tom lane


From: Randolf Richardson <rr(at)8x(dot)ca>
To: pgsql-sql(at)postgresql(dot)org
Subject: Re: not in vs not exists - vastly diferent performance
Date: 2004-01-01 04:48:37
Message-ID: Xns9462D1831D7E0rr8xca@200.46.204.72
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-sql

"tgl(at)sss(dot)pgh(dot)pa(dot)us (Tom Lane)" wrote in comp.databases.postgresql.sql:

> "Iain" <iain(at)mst(dot)co(dot)jp> writes:
>
>> I found this interesting and thought I'd offer it up for comment.
>
> You didn't say what PG version you are using, but I'd venture to bet
> it is pre-7.4.

This must've been one of the enhancements. =)

--
Sir Randolf, noble spam fighter - rr(at)8x(dot)ca
Vancouver, British Columbia, Canada

Please do not eMail me directly when responding
to my postings in the newsgroups.


From: "Iain" <iain(at)mst(dot)co(dot)jp>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: <pgsql-sql(at)postgresql(dot)org>
Subject: Re: not in vs not exists - vastly diferent performance
Date: 2004-01-05 01:40:13
Message-ID: 001601c3d32c$dd0267d0$7201a8c0@mst1x5r347kymb
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-sql

Hi Tom,

> You didn't say what PG version you are using, but I'd venture to bet
> it is pre-7.4.

Sorry I didn't give the version number, here is the output from select
version():

PostgreSQL 7.4 on i686-pc-linux-gnu, compiled by GCC 2.95.3

This is the first time that we have compiled the system from source,
previously we used rpm, but it seems very straight forward. I don't know if
there may have been something we did wrong there.

I understand that every DB environment is different, but it sounds like for
the example I gave, you expected that the NOT IN optimization would have
chosen a better plan. As to why it didn't I have no idea at this stage. If
there is anything you want me to do to test it, for example on the
regression test DB, then let me know.

regards
Iain