Re: Allowing NOT IN to use ANTI joins

From: David Rowley <dgrowleyml(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Allowing NOT IN to use ANTI joins
Date: 2014-06-11 09:03:21
Message-ID: CAApHDvo09A89T-HiL9NoDpWPr6UbVPqcQ0ptozUZKnYHfGU=MQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Jun 10, 2014 at 2:19 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> Jeff Janes <jeff(dot)janes(at)gmail(dot)com> writes:
> > If you are using NOT IN, then once you find a NULL in the outer input (if
> > the outer input is the in-list: clearly you can't reverse the two inputs
> in
> > this case), you don't even need to finish reading the outer input, nor
> > start reading the inner input, because all rows are automatically
> excluded
> > by the weird semantics of NOT IN.
>
> The point I'm trying to make is that the goal of most join types is to
> give an answer without having necessarily read all of either input.
> For instance, if we tried to do this with a mergejoin it wouldn't work
> reliably: it might suppose that it could accept an outer row on the basis
> of no match in a higher-order sort column before it'd reached any nulls
> appearing in lower-order sort columns.
>
> You might be right that we could hot-wire the hash join case in
> particular, but I'm failing to see the point of expending lots of extra
> effort in order to deliver a useless answer faster. If there are NULLs
> in the inner input, then NOT IN is simply the wrong query to make, and
> giving an empty output in a relatively short amount of time isn't going
> to help clue the user in on that. (If the SQL standard would let us do
> so, I'd be arguing for throwing an error if we found a NULL.)
>
>
This got me thinking. It is probably a bit useless and wrong to perform a
NOT IN when the subquery in the IN() clause can have NULL values, so I
guess in any realistic useful case, the user will either have a NOT NULL
constraint on the columns, or they'll do a WHERE col IS NOT NULL, so I
should likely also allow a query such as:

SELECT * FROM a WHERE id NOT IN(SELECT nullable_col FROM b WHERE
nullable_col IS NOT NULL);

to also perform an ANTI JOIN. I think it's just a matter of
changing targetListIsGuaranteedNotToHaveNulls so that if it does not find
the NOT NULL constraint, to check the WHERE clause of the query to see if
there's any not null quals.

I'm about to put this to the test, but if it works then I think it should
cover many more cases for using NOT IN(), I guess we're only leaving out
function calls and calculations in the target list.

Regards

David Rowley

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Marti Raudsepp 2014-06-11 09:17:03 Re: Allowing NOT IN to use ANTI joins
Previous Message David Rowley 2014-06-11 08:53:03 Re: Allowing NOT IN to use ANTI joins