Improving NOT IN

From: "Simon Riggs" <simon(at)2ndquadrant(dot)com>
To: <pgsql-hackers(at)postgresql(dot)org>
Subject: Improving NOT IN
Date: 2007-01-30 22:03:12
Message-ID: 1170194592.3681.271.camel@silverbirch.site
Lists: pgsql-hackers

It's a fairly common case to want to improve a query along the lines of
TableA intersect ~TableB.

We can write this as

select * from tableA
where key not in
(select key from tableB)

or we can get more fancy

select tableA.*
from tableA left outer join tableB
on tableA.key = tableB.key
where tableB.key is null;

I've worked out a new join method that will improve performance over and
above the second case. This is effective where the referenced table
(tableB) is fairly large and the join columns are discrete. Currently
that mostly means they're integers.

The plan seeks to *prove* that there are no matches, rather than taking
the exhaustive join approach taken currently.

First we need to show that the referenced table's PK values are a fully
continuous sequence of integers with no gaps. Once this has been proved,
we can then use that fact to scan the FK table using the values of the
min and max PK to see if any outliers exist. There is no actual
comparison of the values, just a proof that none is required.

I'll describe this using SQL statements, which execute as SeqScans of
the PK and then FK tables. There is no preparatory step - no building a
sort table or preparing a hash table, so these SQL statements always
execute faster than the fastest current plan. Most importantly there is
no step that consumes large amounts of memory, so the case where two
tables are very large performs much, much better.

1. Scan referenced table
a) select max(aid), min(aid) from accounts;
b) select count(*) from accounts;
This is sometimes faster as two separate queries because, when the table
has a PK, the min/max query can use its index.

2. Decision Step
If max - min + 1 == count then we have a contiguous range, and because
we know we have a discrete datatype we can now *prove* that there are no
missing values in the set bounded by the min and the max. We can then
use that fact directly in a new query (see the combined sketch after
step 3):

3.
a) Scan referencing table
select aid from history where aid > ? or aid < ?;
using the max and min from step 1 as parameters

b) otherwise, run the normal query plan
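
Putting steps 1 and 2 together, the contiguity test is a single
aggregate query (a sketch only; as noted in step 1, running min/max
separately lets it use the PK index):

select min(aid) as min_aid,
       max(aid) as max_aid,
       (max(aid) - min(aid) + 1 = count(*)) as is_contiguous
from accounts;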

Steps 1 & 2 can fail to find a contiguous range, in which case we need
to fall back to an existing query plan. So there is only a small
overhead in the case where we run the first query, fail to use the
optimisation and fall back to an existing plan. We can estimate whether
the optimisation is likely to apply by comparing the estimated row count
of the table with the number of values there would be if the whole
min-max range were actually present. The min/max query uses the Primary
Key index (which must always be present) so takes very little time.
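
For example, a rough sketch of that pre-check (pg_class.reltuples is
the planner's stored row-count estimate; accounts/aid as in the test
below):

-- A fill fraction near 1.0 means the min-max range is nearly full,
-- so the contiguity proof is likely to succeed.
select c.reltuples / (a.max_aid - a.min_aid + 1.0) as fill_fraction
from pg_class c,
     (select min(aid) as min_aid, max(aid) as max_aid from accounts) a
where c.relname = 'accounts';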

So overall this looks like a win, in certain common cases, but not a
particular loss in any case.

Try this SQL

1. select max(aid), min(aid) from accounts;
2. select count(*) from accounts;
3. select aid from history where aid > (select max(aid) from accounts)
or aid < (select min(aid) from accounts) limit 1;

against

alter table history add foreign key (aid) references accounts;

I get (1) about 0.2 secs, (2) 6 secs and (3) 9 secs, against 27 secs for
the ALTER TABLE.

Using work_mem = 64MB and data that fits in memory.

We could implement the new SQL directly within ALTER TABLE, or we could
actually create this as a new plan type that would then allow the
existing SQL to perform better in specific cases.

I've not seen such a plan in any other RDBMS and think it might be
completely new; I'm calling it a Proof Join, for want of a better
description. The preparatory steps are completely excluded, hence the x2
speedup. For larger referenced tables the performance improvement could
be much greater.

ISTM that even though this is a special case it is actually a common
one, so it would be worth optimising for.

Ideas stage at the moment: thoughts?

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Simon Riggs" <simon(at)2ndquadrant(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Improving NOT IN
Date: 2007-01-30 22:34:25
Message-ID: 12336.1170196465@sss.pgh.pa.us
Lists: pgsql-hackers

"Simon Riggs" <simon(at)2ndquadrant(dot)com> writes:
> First we need to show that the referenced table's PK values are a fully
> continuous sequence of integers with no gaps.

Since that is unlikely to be the case, I can't see that this is worth
implementing...

> I'll describe this using SQL statements, which execute as SeqScans of
> the PK and then FK tables. There is no preparatory step - no building a
> sort table or preparing a hash table, so these SQL statements always
> execute faster than the fastest current plan.

Except that when you fail to prove it, as you usually will, you have
wasted a complete seqscan of the table, and still have to fall back on
a regular plan. If the thing were amenable to falling out fairly
quickly on proof failure, it would be better, but AFAICS you don't know
anything until you've completed the whole scan.

I think the NOT IN optimization that *would* be of use is to
automatically transform the NOT IN representation to an
outer-join-with-null-test type of operation, so as to give us a wider
choice of join methods. However, I'm not sure about correct handling
of NULLs on the RHS in such a scenario. The existing hashed-IN code
has to jump through some really ugly hoops to give spec-compliant
answers with NULLs.

BTW, your sketch fails in the presence of NULLs on the RHS ...
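
For example, a single NULL on the RHS makes NOT IN return no rows at
all, which no plain outer-join rewrite reproduces (illustrative query
only):

select 1 where 1 not in (select v from (values (2), (null)) s(v));
-- returns zero rows: 1 <> NULL is unknown, so NOT IN is never true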

regards, tom lane


From: "Simon Riggs" <simon(at)2ndquadrant(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Improving NOT IN
Date: 2007-01-30 22:55:29
Message-ID: 1170197729.3681.301.camel@silverbirch.site
Lists: pgsql-hackers

On Tue, 2007-01-30 at 17:34 -0500, Tom Lane wrote:
> "Simon Riggs" <simon(at)2ndquadrant(dot)com> writes:
> > First we need to show that the referenced table's PK values are a fully
> > continuous sequence of integers with no gaps.
>
> Since that is unlikely to be the case, I can't see that this is worth
> implementing...

Integers are typically used as keys...

> > I'll describe this using SQL statements, which execute as SeqScans of
> > the PK and then FK tables. There is no preparatory step - no building a
> > sort table or preparing a hash table, so these SQL statements always
> > execute faster than the fastest current plan.
>
> Except that when you fail to prove it, as you usually will, you have
> wasted a complete seqscan of the table, and still have to fall back on
> a regular plan. If the thing were amenable to falling out fairly
> quickly on proof failure, it would be better, but AFAICS you don't know
> anything until you've completed the whole scan.

Have some faith, please. It's fairly straightforward to estimate
whether the row count is close enough to the min-max range for the scan
to be worthwhile. On large queries it seems worth the risk; we might even
store the answer as part of stats, so we'd know not to bother with the
test in the future.

> BTW, your sketch fails in the presence of NULLs on the RHS ...

Certainly does, but the typical query has a PK there, so no NULLs. One of
the main use cases is the ALTER TABLE ... ADD FK case. As I said, we
could just code that with altered SQL, or we could add a new plan.

Anyway, it seemed like the right time to log the thought.

> I think the NOT IN optimization that *would* be of use is to
> automatically transform the NOT IN representation to an
> outer-join-with-null-test type of operation, so as to give us a wider
> choice of join methods. However, I'm not sure about correct handling
> of NULLs on the RHS in such a scenario. The existing hashed-IN code
> has to jump through some really ugly hoops to give spec-compliant
> answers with NULLs.

Yeh, NOT IN with NULLs is..... bizarre.

What would be wrong with checking for a NOT NULL constraint? That's how
other planners cope with it. Or are you thinking about lack of plan
invalidation?

ISTM straightforward to do a search for an ANDed set of IS NOT NULL
constraints. I've not found another server that does that, even though
it seems like a straightforward win.

Let me think on that.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Simon Riggs" <simon(at)2ndquadrant(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Improving NOT IN
Date: 2007-01-30 23:06:11
Message-ID: 13113.1170198371@sss.pgh.pa.us
Lists: pgsql-hackers

"Simon Riggs" <simon(at)2ndquadrant(dot)com> writes:
> On Tue, 2007-01-30 at 17:34 -0500, Tom Lane wrote:
>> Since that is unlikely to be the case, I can't see that this is worth
>> implementing...

> Integers are typically used as keys...

Yeah, in the form of sequences, so you have a hole for every failed
insert. If the key isn't coming from a sequence then there's still
not any very good reason to suppose it's exactly contiguous. People
do delete entries.

> What would be wrong with checking for a NOT NULL constraint? That's how
> other planners cope with it. Or are you thinking about lack of plan
> invalidation?

Yup, without that, depending on constraints for plan correctness is
pretty risky.

Basically what I see here is a whole lot of work and new executor
infrastructure for something that will be a win in a very narrow
use-case and a significant loss the rest of the time. I think there
are more productive ways to spend our development effort.

regards, tom lane


From: "Simon Riggs" <simon(at)2ndquadrant(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Improving NOT IN
Date: 2007-01-30 23:24:40
Message-ID: 1170199480.3681.319.camel@silverbirch.site
Lists: pgsql-hackers

On Tue, 2007-01-30 at 18:06 -0500, Tom Lane wrote:
> "Simon Riggs" <simon(at)2ndquadrant(dot)com> writes:
> > What would be wrong with checking for a NOT NULL constraint? That's how
> > other planners cope with it. Or are you thinking about lack of plan
> > invalidation?
>
> Yup, without that, depending on constraints for plan correctness is
> pretty risky.
>
> Basically what I see here is a whole lot of work and new executor
> infrastructure for something that will be a win in a very narrow
> use-case and a significant loss the rest of the time. I think there
> are more productive ways to spend our development effort.

For that part of the email, I was talking about your ideas on NOT IN.

Checking for the explicit exclusion of NULLs is worthwhile with/without
plan invalidation.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com


From: Jens-Wolfhard Schicke <ml+pgsql-hackers(at)asco(dot)de>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Improving NOT IN
Date: 2007-01-31 08:28:42
Message-ID: BF40A06E14817846EE9CDD4F@[192.168.1.72]
Lists: pgsql-hackers

--On Tuesday, January 30, 2007 23:24:40 +0000 Simon Riggs
<simon(at)2ndquadrant(dot)com> wrote:
>> Basically what I see here is a whole lot of work and new executor
>> infrastructure for something that will be a win in a very narrow
>> use-case and a significant loss the rest of the time. I think there
>> are more productive ways to spend our development effort.

Maybe one should not aim for the special case of contiguous sequences.
It might be better to have a fast look-up data structure for row
existence. The advantage over the usual indexes is that only existence,
and no other information, must be stored, so a bit-field becomes
feasible. Even 100 million rows would fit in roughly 12 MB.

So, instead of trying to find a sequence, find the minimum (or guess it
and correct the bitfield later), and then set the bits as you encounter
rows. During the join, test whether the bit for the value you want to
join to is set, and voila: depending on whether you used IN or NOT IN,
decide what to do.

This data structure could be used wherever only existence is important
and no columns of the table are selected.

Possibly, the bit-field should allow large gaps to be represented more
efficiently: if you have a 32-bit index column, make a 256-entry
top-level array pointing to bitfields covering the ranges
0x00000000-0x00ffffff, 0x01000000-0x01ffffff, and so on. Each such
bitfield would need 2 MB, and the pointers are negligible. Now large
holes in the sequence don't waste too much space, and the minimum need
not be known.

Regards,
Jens Schicke
--
Jens Schicke j(dot)schicke(at)asco(dot)de
asco GmbH http://www.asco.de
Mittelweg 7 Tel 0531/3906-127
38106 Braunschweig Fax 0531/3906-400


From: "Simon Riggs" <simon(at)2ndquadrant(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Improving NOT IN
Date: 2007-02-01 11:49:55
Message-ID: 1170330596.3681.514.camel@silverbirch.site
Lists: pgsql-hackers

On Tue, 2007-01-30 at 17:34 -0500, Tom Lane wrote:

> I think the NOT IN optimization that *would* be of use is to
> automatically transform the NOT IN representation to an
> outer-join-with-null-test type of operation, so as to give us a wider
> choice of join methods. However, I'm not sure about correct handling
> of NULLs on the RHS in such a scenario. The existing hashed-IN code
> has to jump through some really ugly hoops to give spec-compliant
> answers with NULLs.

ISTM that we can handle this neatly by looking for a WHERE clause that
specifically excludes NULLs in the NOT IN.

i.e. a query of the form

select ...
from LHS
where key NOT IN
(select key
from RHS
where key is not null)

can be optimised to

select ...
from LHS left outer join RHS
on LHS.key = RHS.key
where RHS.key IS NULL;
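
A quick check on throwaway data (hypothetical temp tables, purely to
illustrate the equivalence):

create temp table lhs (key int);
create temp table rhs (key int);
insert into lhs values (1), (2), (3);
insert into rhs values (2), (null);

-- original form, with the NULL-excluding predicate
select * from lhs
where key not in (select key from rhs where key is not null);

-- rewritten form
select lhs.*
from lhs left outer join rhs on lhs.key = rhs.key
where rhs.key is null;

Both forms return keys 1 and 3; the RHS NULL is filtered out by the IS
NOT NULL predicate, so the rewrite is safe.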

This rewards people who understand the spec-compliant behaviour and
ensure their coding is watertight in the presence of NULLs.

We can extend that behaviour later when we have plan invalidation to
make it also pick up NOT NULL constraints on the keys of the RHS table.

Doing it this way round is more useful, since not all tables have NOT
NULL constraints on their join columns, so it applies slightly more
widely than just checking constraints. It's also very annoying, in this
specific case, to have no way for the SQL developer to pass information
to the planner - and I don't mean hints.

This would be similar to pull_up_IN_clauses().

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com