Re: Allowing NOT IN to use ANTI joins

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: David Rowley <dgrowleyml(at)gmail(dot)com>
Cc: Jeevan Chalke <jeevan(dot)chalke(at)enterprisedb(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Marti Raudsepp <marti(at)juffo(dot)org>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Allowing NOT IN to use ANTI joins
Date: 2014-07-11 00:17:59
Message-ID: 13766.1405037879@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

David Rowley <dgrowleyml(at)gmail(dot)com> writes:
> Attached is the updated version of the patch.

I spent some time fooling with this patch, cleaning up the join-alias
issue as well as more-cosmetic things. However, while testing it
I realized that the whole thing is based on a false premise: to equate
a NOT IN with an antijoin, we'd have to prove not only that the inner
side is non-nullable, but that the outer comparison values are too.
Here's an example:

regression=# create table zt1 (f1 int);
CREATE TABLE
regression=# insert into zt1 values(1);
INSERT 0 1
regression=# insert into zt1 values(2);
INSERT 0 1
regression=# insert into zt1 values(null);
INSERT 0 1
regression=# create table zt2 (f1 int not null);
CREATE TABLE
regression=# insert into zt2 values(1);
INSERT 0 1

With the patch in place, we get

regression=# explain select * from zt1 where f1 not in (select f1 from zt2);
QUERY PLAN
-------------------------------------------------------------------
Hash Anti Join (cost=64.00..144.80 rows=1200 width=4)
Hash Cond: (zt1.f1 = zt2.f1)
-> Seq Scan on zt1 (cost=0.00..34.00 rows=2400 width=4)
-> Hash (cost=34.00..34.00 rows=2400 width=4)
-> Seq Scan on zt2 (cost=0.00..34.00 rows=2400 width=4)
Planning time: 0.646 ms
(6 rows)

regression=# select * from zt1 where f1 not in (select f1 from zt2);
f1
----
2

(2 rows)

which is the wrong answer, as demonstrated by comparison to the result
without optimization:

regression=# alter table zt2 alter column f1 drop not null;
ALTER TABLE
regression=# explain select * from zt1 where f1 not in (select f1 from zt2);
QUERY PLAN
---------------------------------------------------------------
Seq Scan on zt1 (cost=40.00..80.00 rows=1200 width=4)
Filter: (NOT (hashed SubPlan 1))
SubPlan 1
-> Seq Scan on zt2 (cost=0.00..34.00 rows=2400 width=4)
Planning time: 0.163 ms
(5 rows)

regression=# select * from zt1 where f1 not in (select f1 from zt2);
f1
----
2
(1 row)

We could no doubt fix this by also insisting that the left-side vars
be provably not null, but that's going to make the patch even slower
and even less often applicable. I'm feeling discouraged about whether
this is worth doing in this form.

For the archives' sake, I attach an updated version with the fixes
I'd applied so far.

regards, tom lane

Attachment Content-Type Size
not_in_anti_join_v0.7.patch text/x-diff 34.5 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Michael Paquier 2014-07-11 00:49:55 Re: WAL replay bugs
Previous Message Michael Paquier 2014-07-10 23:12:15 Re: IMPORT FOREIGN SCHEMA statement