not in vs not exists - vastly diferent performance

From: "Iain" <iain(at)mst(dot)co(dot)jp>
To: <pgsql-sql(at)postgresql(dot)org>
Subject: not in vs not exists - vastly diferent performance
Date: 2003-12-25 01:50:18
Message-ID: 013d01c3ca89$7307d7d0$7201a8c0@mst1x5r347kymb
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-sql

Hi All,

I found this interesting and thought I'd offer it up for comment.

I have the following delete:

delete from tableB where id not in (select id from tableA);

There are about 100,000 rows in table A, and about 500,000 in table B. id is
indexed on both tables. This is just a development DB, and I wanted to clean
it up so I could put in RI constraints. Somewhere along the line, records
have been deleted from tableA leaving orphans in tableB.

I launched the query yesterday afternoon, and it hadn't returned as of this
morning, 15 hours later. Running top showed that CPU utilization was running
close to 100%, and the disk was not busy at all. Anyway, I killed it and did
some testing:

Analyse revealed this plan, and varying random_page_cost between 1 and 4
didn't affect it:

Seq Scan on tableB (cost=0.00..1003619849.56 rows=251513 width=6)
Filter: (NOT (subplan))
SubPlan
-> Seq Scan on tableA (cost=0.00..3738.64 rows=100664 width=4)

recoding the delete to use not exists as follows:

delete from tableB where not exists (select id from tableA where tableA.id =
tableB.id);

Gave this plan:

Seq Scan on tableB (cost=0.00..719522.41 rows=236131 width=6)
Filter: (NOT (subplan))
SubPlan
-> Index Scan using tableB_pk on tableA(cost=0.00..3.01 rows=2
width=4)
Index Cond: ((id)::integer = ($0)::integer)

This deleted 1200 rows in about 2 seconds. Much better.

Anyway, I was a little surprised that "not in" chose to use a seq scan on
the tableA in this case. I had imagined, given that statistics were up to
date and the size of the table, that the plan would have been similar to
that generated by not exists, or perhaps would have used a hash table based
on the tableA ids.

Something to think about anyway.
Regards
Iain

Responses

Browse pgsql-sql by date

  From Date Subject
Next Message Chris Travers 2003-12-25 02:45:46 Re: Distributed keys
Previous Message A E 2003-12-25 00:17:14 Use of Setof Record Dynamically