Planning for improved versions of IN/NOT IN

Lists: pgsql-hackers
From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql-hackers(at)postgreSQL(dot)org
Subject: Planning for improved versions of IN/NOT IN
Date: 2002-11-27 23:56:31
Message-ID: 29128.1038441391@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

I've been thinking about how to improve the performance of queries using
"WHERE x IN (subselect)" and "WHERE x NOT IN (subselect)".

In the existing implementation, the subquery result is rescanned to look
for a match to x each time the WHERE clause is executed; this essentially
makes it work like a nestloop join of the stupidest variety. (We do
stick a Materialize node atop the subselect if it looks complicated, but
that's not a big help in typical cases.)

I've thought of three alternative implementations that would perform
better in various scenarios. Each would be relatively simple to
implement; the problem I'm having is figuring out how to get the planner
to choose the best one. The alternatives are basically:

1. Add DISTINCT to the subquery, pull it up into the rangetable (as a subquery
RT entry), and change the = SUBLINK WHERE clause to a simple = against the
subquery output var(s). Essentially this transforms
SELECT ... FROM foo WHERE foo.x IN (SELECT y FROM ...)
to
SELECT ... FROM foo, (SELECT DISTINCT y FROM ...) ss
WHERE foo.x = ss.y

This is not useful for NOT IN, and there are a few restrictions even for
IN (no correlation variables in subquery, no LIMIT, maybe others)?
Also I think it would only work correctly for IN appearing at the top
level of WHERE, though that might be too conservative.
The main case where it could be a win is where the subquery is expected to
produce relatively few output rows, so we could run it as the outer side of
some join plan. In particular, when x is an indexed column in a large
table, fetching x as the inner side of an indexscan nestloop would be much
better than scanning the whole of the outer table.

2. Hash-based implementations: read the subquery once, load its values
into an in-memory hashtable (discarding duplicates), and then probe the
hashtable for each execution of the WHERE clause. This works for both
IN and NOT IN, though we still need an uncorrelated subquery (else the
hashtable can't be reused from row to row). This probably wins for a
moderate number of rows in the subquery result (not too many to hash in
memory) and a fairly large number of outer rows (else building the
hashtable is not repaid).

3. Inner indexscan: essentially, automatically do the IN-to-EXISTS
transform that's presently recommended by the FAQ. This wins if the
subquery result is large but pushing down an equality condition makes
it cheap, and there aren't too many outer rows.

There may also be cases where the existing implementation is still the
best, or perhaps is the only usable one.

The difficulty is that it's not clear how to choose one of these four
ways, short of planning the *entire* query from scratch all four ways :-(.
This seems pretty grim. Approaches #2 and #3 could be handled as local
transformations of the WHERE clause, but we couldn't choose which to use
very well if we don't yet know how many outer rows the WHERE clause will
be executed for. Approach #1 is really planning a completely different
query --- one with an extra FROM-item --- and there's not going to be
all that much commonality in the computations, unless we restrict where
the added FROM-item can be joined to the others, which'd more or less
defeat the purpose.

Anyone see a way around this difficulty?

regards, tom lane


From: "Christopher Kings-Lynne" <chriskl(at)familyhealth(dot)com(dot)au>
To: <pgsql-hackers(at)postgresql(dot)org>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: Planning for improved versions of IN/NOT IN
Date: 2002-11-28 02:31:37
Message-ID: 053e01c29686$46aaadd0$6500a8c0@internal
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> The difficulty is that it's not clear how to choose one of these four
> ways, short of planning the *entire* query from scratch all four ways :-(.
> This seems pretty grim. Approaches #2 and #3 could be handled as local
> transformations of the WHERE clause, but we couldn't choose which to use
> very well if we don't yet know how many outer rows the WHERE clause will
> be executed for. Approach #1 is really planning a completely different
> query --- one with an extra FROM-item --- and there's not going to be
> all that much commonality in the computations, unless we restrict where
> the added FROM-item can be joined to the others, which'd more or less
> defeat the purpose.

What about in the case of a scalar subquery eg. SELECT x IN
(1,2,3,4,54,56,6), when there maybe hundreds of scalars?

Chris


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Christopher Kings-Lynne" <chriskl(at)familyhealth(dot)com(dot)au>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Planning for improved versions of IN/NOT IN
Date: 2002-11-28 03:59:46
Message-ID: 605.1038455986@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

"Christopher Kings-Lynne" <chriskl(at)familyhealth(dot)com(dot)au> writes:
> What about in the case of a scalar subquery eg. SELECT x IN
> (1,2,3,4,54,56,6), when there maybe hundreds of scalars?

Unrelated to my present problem.

regards, tom lane


From: Joe Conway <mail(at)joeconway(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: Planning for improved versions of IN/NOT IN
Date: 2002-11-30 04:32:23
Message-ID: 3DE83F57.10801@joeconway.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Tom Lane wrote:
> I've been thinking about how to improve the performance of queries using
> "WHERE x IN (subselect)" and "WHERE x NOT IN (subselect)".
>
> In the existing implementation, the subquery result is rescanned to look
> for a match to x each time the WHERE clause is executed; this essentially
> makes it work like a nestloop join of the stupidest variety. (We do
> stick a Materialize node atop the subselect if it looks complicated, but
> that's not a big help in typical cases.)
>
> I've thought of three alternative implementations that would perform
> better in various scenarios. Each would be relatively simple to
> implement; the problem I'm having is figuring out how to get the planner
> to choose the best one. The alternatives are basically:
>

[abbreviated]
> 1. Add FROM item
> 2. Hash-based
> 3. Inner indexscan
> 4. the existing implementation

> The difficulty is that it's not clear how to choose one of these four
> ways, short of planning the *entire* query from scratch all four ways :-(.
> This seems pretty grim. Approaches #2 and #3 could be handled as local
> transformations of the WHERE clause, but we couldn't choose which to use
> very well if we don't yet know how many outer rows the WHERE clause will
> be executed for. Approach #1 is really planning a completely different
> query --- one with an extra FROM-item --- and there's not going to be
> all that much commonality in the computations, unless we restrict where
> the added FROM-item can be joined to the others, which'd more or less
> defeat the purpose.
>
> Anyone see a way around this difficulty?

How about starting with a rule-based method to make the choice?

1. If uncorrelated: use hash-based approach - ISTM this might address a large
percentage of the problem cases -- it could even handle the
"IN (list-of-scalars)" case. Could it fall back to a
tuplesort/binary-search for the too many to hash in memory case?
2. If correlated: use an inner indexscan
3. If you come up with a pattern where none of the approaches produce a
correct answer, use the existing implementation

You could always get fancier later if needed, but something along these lines
would be a great start.

Joe


From: Mike Mascari <mascarm(at)mascari(dot)com>
To: Joe Conway <mail(at)joeconway(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Planning for improved versions of IN/NOT IN
Date: 2002-11-30 04:42:29
Message-ID: 3DE841B5.1010401@mascari.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Joe Conway wrote:
> Tom Lane wrote:
>
>> I've been thinking about how to improve the performance of queries using
>> "WHERE x IN (subselect)" and "WHERE x NOT IN (subselect)".
>
> How about starting with a rule-based method to make the choice?
>
> 1. If uncorrelated: use hash-based approach - ISTM this might address a
> large
> percentage of the problem cases -- it could even handle the
> "IN (list-of-scalars)" case. Could it fall back to a
> tuplesort/binary-search for the too many to hash in memory case?
> 2. If correlated: use an inner indexscan
> 3. If you come up with a pattern where none of the approaches produce a
> correct answer, use the existing implementation
>
> You could always get fancier later if needed, but something along these
> lines would be a great start.

I curious if any of the rewriting of EXISTS and NOT EXISTS would
address the problem described by Date:

http://www.firstsql.com/iexist.htm

Mike Mascari
mascarm(at)mascari(dot)com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Mike Mascari <mascarm(at)mascari(dot)com>
Cc: Joe Conway <mail(at)joeconway(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Planning for improved versions of IN/NOT IN
Date: 2002-11-30 05:44:59
Message-ID: 27336.1038635099@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Mike Mascari <mascarm(at)mascari(dot)com> writes:
> I curious if any of the rewriting of EXISTS and NOT EXISTS would
> address the problem described by Date:

> http://www.firstsql.com/iexist.htm

We are not here to redefine the SQL spec ... and especially not to
eliminate its concept of NULL, which is what Date would really like ;-)

The above-quoted screed is based on a claimed logical equivalence
between NOT EXISTS() and NOT IN() that is just plain wrong when you
consider the possibility of NULLs. Rather than "FirstSQL correctly
processes this query", you should read "FirstSQL deliberately violates
the SQL spec". (There may be grounds to argue that the spec behavior
could be improved, but that's an argument to be making to the standards
committee, not here.)

regards, tom lane


From: Mike Mascari <mascarm(at)mascari(dot)com>
To: Joe Conway <mail(at)joeconway(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Planning for improved versions of IN/NOT IN
Date: 2002-11-30 11:37:41
Message-ID: 3DE8A305.2060806@mascari.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Tom Lane wrote:
> Mike Mascari <mascarm(at)mascari(dot)com> writes:
>
>>I curious if any of the rewriting of EXISTS and NOT EXISTS would
>>address the problem described by Date:

That should read "I'm curious"...

>
>
>>http://www.firstsql.com/iexist.htm
>
>
> We are not here to redefine the SQL spec ... and especially not to
> eliminate its concept of NULL, which is what Date would really like ;-)

From what I've read of Date's so far, I think he'd like to junk
SQL altogether.

> The above-quoted screed is based on a claimed logical equivalence
> between NOT EXISTS() and NOT IN() that is just plain wrong when you
> consider the possibility of NULLs. Rather than "FirstSQL correctly
> processes this query", you should read "FirstSQL deliberately violates
> the SQL spec". (There may be grounds to argue that the spec behavior
> could be improved, but that's an argument to be making to the standards
> committee, not here.)

Okay. I knew there was talk in the past that IN be rewritten as
EXISTS, which is not what you propose doing, but would have
exposed the odd behavior NOT EXISTS exhibits according to the
SQL spec. I was also curious to know which path PostgreSQL
development prefers to take when the SQL spec and the Relational
Model part ways, as they often do. Maybe someday RedHat will
have a voting member on the ANSI X3H2/NCITS committee. ;-)

Mike Mascari
mascarm(at)mascari(dot)com