IN vs EXISTS equivalence

Lists: pgsql-hackers
From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: <pgsql-hackers(at)postgresql(dot)org>
Subject: IN vs EXISTS equivalence
Date: 2007-10-22 14:31:23
Message-ID: 471C6DEB.EE98.0025.0@wicourts.gov
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

I've requested this before without response, but I'm asking again
because it just caused me pain again: could we get a TODO added to
have the planner recognize equivalent IN and EXISTS constructs and
have them compete on cost estimates? I know it's not a trivial
improvement, but if it's on the list maybe someone will pick it up,
and I see it as the single biggest weakness in PostgreSQL
performance.

I don't need help resolving this particular case, because the fix is
always blinding obvious when we hit this, and it doesn't even break
portability because no other database we've tested fails to recognize
these equivalent cases.

step=# explain DELETE FROM "Body" WHERE "bodySeqNo" NOT IN (SELECT "bodySeqNo" FROM "Message");
QUERY PLAN
----------------------------------------------------------------------------------
Seq Scan on "Body" (cost=90277.43..285235351699.39 rows=3313379 width=6)
Filter: (NOT (subplan))
SubPlan
-> Materialize (cost=90277.43..159793.40 rows=6627957 width=11)
-> Seq Scan on "Message" (cost=0.00..80413.07 rows=6627957 width=11)
(5 rows)

step=# explain DELETE FROM "Body" WHERE NOT EXISTS (SELECT * FROM "Message" m WHERE m."bodySeqNo" = "Body"."bodySeqNo");
QUERY PLAN
--------------------------------------------------------------------------------------------
Seq Scan on "Body" (cost=0.00..3401760.88 rows=3313416 width=6)
Filter: (NOT (subplan))
SubPlan
-> Index Scan using "Message_Body" on "Message" m (cost=0.00..0.49 rows=1 width=136)
Index Cond: (("bodySeqNo")::numeric = ($0)::numeric)
(5 rows)

The bodySeqNo column is NOT NULL in both tables, and is the primary
key in the Body table. The Message table has a non-unique index on
it. (\d lists will follow at the bottom.)

I cancelled the first query after it had been running for 54 hours
over our slowest hours (the weekend). The second form ran in four
minutes in competition with peak time queries.

-Kevin


step=# \d "Body"
Table "public.Body"
Column | Type | Modifiers
-------------+------------------------+-----------
bodySeqNo | "SequenceT" | not null
contentType | character varying(255) | not null
encoding | character varying(255) |
body | "BodyT" |
Indexes:
"Body_pkey" PRIMARY KEY, btree ("bodySeqNo")

step=# \d "Message"
Table "public.Message"
Column | Type | Modifiers
-----------------+--------------------------+-----------
messageId | "SequenceT" | not null
clientMessageId | "ClientMessageIdT" | not null
correlationId | "SequenceT" |
destQueue | "QueueNameT" | not null
replyToQueue | "QueueNameT" | not null
typeCode | character(2) |
expiration | timestamp with time zone |
priority | smallint | not null
status | character(2) | not null
created | timestamp with time zone | not null
lastModified | timestamp with time zone | not null
bodySeqNo | "SequenceT" | not null
messageIdSearch | "PrioritySequenceT" | not null
Indexes:
"Message_pkey" PRIMARY KEY, btree ("messageId")
"MessageIndex2" UNIQUE, btree ("destQueue", "clientMessageId")
"Message_MessageIdSearch" UNIQUE, btree ("destQueue", status, "messageIdSearch") CLUSTER
"Message_Body" btree ("bodySeqNo")
"Message_Created" btree ("destQueue", status, created)
"Message_Created2" btree ("destQueue", created)
"Message_Expiration" btree (expiration)
"Message_LastModified" btree ("destQueue", "lastModified")
"Message_ReplyToQueue" btree ("replyToQueue")
Foreign-key constraints:
"Message_fk1" FOREIGN KEY ("destQueue") REFERENCES "Queue"(name)
"Message_fk2" FOREIGN KEY ("replyToQueue") REFERENCES "Queue"(name)


From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: IN vs EXISTS equivalence
Date: 2007-10-22 18:30:31
Message-ID: 1193077831.4319.61.camel@ebony.site
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, 2007-10-22 at 09:31 -0500, Kevin Grittner wrote:
> I've requested this before without response, but I'm asking again
> because it just caused me pain again: could we get a TODO added to
> have the planner recognize equivalent IN and EXISTS constructs and
> have them compete on cost estimates? I know it's not a trivial
> improvement, but if it's on the list maybe someone will pick it up,
> and I see it as the single biggest weakness in PostgreSQL
> performance.

I'll pick it up as a default unless someone requests they have it from
me.

--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "Simon Riggs" <simon(at)2ndquadrant(dot)com>
Cc: <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: IN vs EXISTS equivalence
Date: 2007-10-22 21:37:18
Message-ID: 471CD1BE.EE98.0025.0@wicourts.gov
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

>>> On Mon, Oct 22, 2007 at 1:30 PM, in message
<1193077831(dot)4319(dot)61(dot)camel(at)ebony(dot)site>, Simon Riggs <simon(at)2ndquadrant(dot)com>
wrote:
> On Mon, 2007-10-22 at 09:31 -0500, Kevin Grittner wrote:
>> I've requested this before without response, but I'm asking again
>> because it just caused me pain again: could we get a TODO added to
>> have the planner recognize equivalent IN and EXISTS constructs and
>> have them compete on cost estimates? I know it's not a trivial
>> improvement, but if it's on the list maybe someone will pick it up,
>> and I see it as the single biggest weakness in PostgreSQL
>> performance.
>
> I'll pick it up as a default unless someone requests they have it from
> me.

Thanks, Simon.

One more logically equivalent, PostgreSQL-specific form which
costs out even better was suggested off-list:

step=# explain DELETE FROM "Body" USING "Message" WHERE "Message"."bodySeqNo" = "Body"."bodySeqNo";
QUERY PLAN
--------------------------------------------------------------------------------------------------
Merge Join (cost=0.00..696766.20 rows=4048543 width=6)
Merge Cond: (("Body"."bodySeqNo")::numeric = ("Message"."bodySeqNo")::numeric)
-> Index Scan using "Body_pkey" on "Body" (cost=0.00..326108.11 rows=4048543 width=18)
-> Index Scan using "Message_Body" on "Message" (cost=0.00..310085.16 rows=4048847 width=12)
(4 rows)

If both of the other syntaxes could compete against that, it would
be fantastic. (If that's feasible.)

-Kevin


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "Simon Riggs" <simon(at)2ndquadrant(dot)com>, "Kevin Grittner" <Kgrittn(dot)CCAP(dot)Courts(at)wicourts(dot)gov>
Cc: <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: IN vs EXISTS equivalence
Date: 2007-10-22 22:04:26
Message-ID: 471CD819.EE98.0025.0@wicourts.gov
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

>>> On Mon, Oct 22, 2007 at 4:37 PM, in message
<471CD1BE(dot)EE98(dot)0025(dot)0(at)wicourts(dot)gov>, "Kevin Grittner"
<Kevin(dot)Grittner(at)wicourts(dot)gov> wrote:

> One more logically equivalent, PostgreSQL-specific form which
> costs out even better was suggested off-list:

Oops. That is not logically equivalent. We want to delete WHERE NOT
EXISTS; the logic of that suggestion is backwards.

Disregard that last post, please.

-Kevin


From: "Pavel Stehule" <pavel(dot)stehule(at)gmail(dot)com>
To: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: "Simon Riggs" <simon(at)2ndquadrant(dot)com>, "Kevin Grittner" <Kgrittn(dot)CCAP(dot)Courts(at)wicourts(dot)gov>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: IN vs EXISTS equivalence
Date: 2007-10-23 06:04:17
Message-ID: 162867790710222304s7a3518e6j9369d8ae4f90e79b@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

2007/10/23, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>:
> >>> On Mon, Oct 22, 2007 at 4:37 PM, in message
> <471CD1BE(dot)EE98(dot)0025(dot)0(at)wicourts(dot)gov>, "Kevin Grittner"
> <Kevin(dot)Grittner(at)wicourts(dot)gov> wrote:
>
> > One more logically equivalent, PostgreSQL-specific form which
> > costs out even better was suggested off-list:
>
> Oops. That is not logically equivalent. We want to delete WHERE NOT
> EXISTS; the logic of that suggestion is backwards.
>
> Disregard that last post, please.
>
> -Kevin
>
>
my mistake, sorry

Pavel


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "Simon Riggs" <simon(at)2ndquadrant(dot)com>
Cc: <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: IN vs EXISTS equivalence
Date: 2007-10-23 16:17:31
Message-ID: 471DD84B.EE98.0025.0@wicourts.gov
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

>>> On Mon, Oct 22, 2007 at 5:04 PM, in message
<471CD819(dot)EE98(dot)0025(dot)0(at)wicourts(dot)gov>, "Kevin Grittner"

> Oops. That is not logically equivalent. We want to delete WHERE NOT
> EXISTS; the logic of that suggestion is backwards.
>
> Disregard that last post, please.

Maybe that last post shouldn't be totally disregarded -- it wouldn't
be a bad idea to support a Merge NOT IN Join if it the effort isn't
out of line with the benefit.

Pavel suggested a clever kludge to accomplish this, which costs out
better than anything else I've tried:

step=# explain DELETE FROM "Body"
step-# WHERE "bodySeqNo" IN (SELECT "Body"."bodySeqNo"
step(# FROM "Body"
step(# LEFT JOIN "Message"
step(# ON "Body"."bodySeqNo" = "Message"."bodySeqNo"
step(# WHERE "Message"."bodySeqNo" IS NULL);
QUERY PLAN
--------------------------------------------------------------------------------------------------------------
Merge IN Join (cost=825315.30..1265285.81 rows=2010418 width=6)
Merge Cond: ((public."Body"."bodySeqNo")::numeric = (public."Body"."bodySeqNo")::numeric)
-> Index Scan using "Body_pkey" on "Body" (cost=0.00..383702.32 rows=4020835 width=18)
-> Materialize (cost=825315.30..846401.18 rows=2010418 width=12)
-> Merge Left Join (cost=0.00..822323.18 rows=2010418 width=12)
Merge Cond: ((public."Body"."bodySeqNo")::numeric = ("Message"."bodySeqNo")::numeric)
Filter: ("Message"."bodySeqNo" IS NULL)
-> Index Scan using "Body_pkey" on "Body" (cost=0.00..383702.32 rows=4020835 width=12)
-> Index Scan using "Message_Body" on "Message" (cost=0.00..378901.17 rows=4021733 width=12)
(9 rows)

Just some ideas to look at while you're "in the neighborhood."

-Kevin


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: <pgsql-hackers(at)postgresql(dot)org>
Cc: <Simon Riggs <simon(at)2ndquadrant(dot)com>
Subject: Re: IN vs EXISTS equivalence
Date: 2008-08-04 22:35:33
Message-ID: 48973DE5.EE98.0025.0@wicourts.gov
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

>>> On Mon, Oct 22, 2007 at 1:30 PM, Simon Riggs wrote:
> On Mon, 2007-10-22 at 09:31 -0500, Kevin Grittner wrote:
>> I've requested this before without response, but I'm asking again
>> because it just caused me pain again: could we get a TODO added to
>> have the planner recognize equivalent IN and EXISTS constructs and
>> have them compete on cost estimates? I know it's not a trivial
>> improvement, but if it's on the list maybe someone will pick it up,
>> and I see it as the single biggest weakness in PostgreSQL
>> performance.
>
> I'll pick it up as a default unless someone requests they have it
from
> me.

Since Simon's focus has shifted to other issues, I'm hoping this can
go onto the TODO list.

I'm adding some NOT EXISTS examples to the thread for completeness of
what someone might want to address while working on it. For two
queries which can easily be shown (to a human viewer, anyway) to
return identical results, I see performance differences of over five
orders of magnitude. (Six if you compare to the LEFT JOIN ... WHERE
not_null_right_column IS NULL trick.)

Below are the cost estimates of the three techniques for a
medium-sized table joining to a large table, and for a large table
joining to a small table. The IN behavior has the worst worst-case
behavior, at least in queries that I've run, although many people
report that it is usually faster. The technique of doing an existence
test with a LEFT JOIN and then checking whether a NOT NULL column from
the right-hand table is null is often faster than either technique,
and seldom much worse than the best technique for any given test.

Queries and plans attached. Summary of costs below, in millions of
cost units. (Fractions of a million discarded.)

NOT IN (independent_subquery)
19745843, 5

WHERE NOT EXISTS
74, 318

LEFT JOIN WHERE not_null_right_column IS NULL
10, 17

These cost estimates tend to come out in pretty consistent ratio to
the actual run times.

-Kevin

Attachment Content-Type Size
not-exists-timings.txt text/plain 6.3 KB

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: pgsql-hackers(at)postgresql(dot)org, "<Simon Riggs" <simon(at)2ndquadrant(dot)com>
Subject: Re: IN vs EXISTS equivalence
Date: 2008-08-04 23:48:12
Message-ID: 15026.1217893692@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

"Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov> writes:
> I'm adding some NOT EXISTS examples to the thread for completeness of
> what someone might want to address while working on it. For two
> queries which can easily be shown (to a human viewer, anyway) to
> return identical results, I see performance differences of over five
> orders of magnitude.

Could we see EXPLAIN ANALYZE not just EXPLAIN for these? When people
are complaining of bad planner behavior, I don't find bare EXPLAIN
output to be very convincing.

regards, tom lane


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "<Simon Riggs" <simon(at)2ndquadrant(dot)com>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: IN vs EXISTS equivalence
Date: 2008-08-05 13:44:58
Message-ID: 4898130A.EE98.0025.0@wicourts.gov
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

>>> On Mon, Aug 4, 2008 at 6:48 PM, in message
<15026(dot)1217893692(at)sss(dot)pgh(dot)pa(dot)us>,
Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov> writes:
>> I'm adding some NOT EXISTS examples to the thread for completeness
of
>> what someone might want to address while working on it. For two
>> queries which can easily be shown (to a human viewer, anyway) to
>> return identical results, I see performance differences of over
five
>> orders of magnitude.
>
> Could we see EXPLAIN ANALYZE not just EXPLAIN for these? When
people
> are complaining of bad planner behavior, I don't find bare EXPLAIN
> output to be very convincing.

I'll give it a shot. I've never had the patience to let the one with
the cost five or six orders of magnitude higher than the others run to
completion, but I've started the lot of 'em.

-Kevin


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "<Simon Riggs" <simon(at)2ndquadrant(dot)com>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: IN vs EXISTS equivalence
Date: 2008-08-05 16:33:59
Message-ID: 48983AA7.EE98.0025.0@wicourts.gov
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

>>> On Mon, Aug 4, 2008 at 6:48 PM, in message
<15026(dot)1217893692(at)sss(dot)pgh(dot)pa(dot)us>,
Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov> writes:
>> I'm adding some NOT EXISTS examples to the thread for completeness
of
>> what someone might want to address while working on it. For two
>> queries which can easily be shown (to a human viewer, anyway) to
>> return identical results, I see performance differences of over
five
>> orders of magnitude.
>
> Could we see EXPLAIN ANALYZE not just EXPLAIN for these? When
people
> are complaining of bad planner behavior, I don't find bare EXPLAIN
> output to be very convincing.

The other five queries have a cost to millisecond ratio of between 9.8
and 267. If the expensive one falls in the same range, it will run
for 2.3 to 64 years. I know I have left it running for days before
without completion. I don't think I can devote the resources to it.
Attached are the EXPLAIN ANALYZE output for the other five.

-Kevin

Attachment Content-Type Size
not-exists-timings2.out application/octet-stream 4.9 KB

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: "Simon Riggs" <simon(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: IN vs EXISTS equivalence
Date: 2008-08-08 20:23:57
Message-ID: 17519.1218227037@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

"Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov> writes:
>>> I'm adding some NOT EXISTS examples to the thread for completeness
>>> of what someone might want to address while working on it. For two
>>> queries which can easily be shown (to a human viewer, anyway) to
>>> return identical results, I see performance differences of over
>>> five orders of magnitude.

I've been looking at this a bit and have an action plan worked out.

I believe that the optimizable cases for EXISTS are those where the
EXISTS() is either at the top level of WHERE, or just underneath a NOT,
and the contained subselect:

* has no set operations (UNION etc), grouping, set-returning functions
in the SELECT list, LIMIT, or a few other funny cases

* references outer-level variables only in its WHERE clause.

Given these conditions, an EXISTS is equivalent to a standard semijoin
between the outer relations used in its WHERE clause and the sub-select
with the WHERE removed, where the join condition is just the WHERE
clause. (A semijoin returns one copy of each LHS row for which there
exists at least one RHS row satisfying the join condition.)

Similarly, a NOT EXISTS is equivalent to a standard anti-semijoin.
(An anti-semijoin returns one copy of each LHS row for which there
exists no RHS row satisfying the join condition.)

So what I think we should do is implement JOIN_SEMI and JOIN_ANTI
as variant outer-join types in the planner and executor, and convert
EXISTS into those. JOIN_SEMI would replace the existing JOIN_IN support.
(It's effectively the same thing so far as the executor is concerned,
but there are some places in the planner that assume an IN join condition
consists of one or more btree equality operators. I'm envisioning folding
the current planner support for IN into the more general outer-join
planning infrastructure, and fixing it so that it doesn't assume the
join clause must be of that form.)

Explicit support for JOIN_ANTI is probably not strictly necessary:
we could represent it using the "LEFT JOIN WHERE rhs-variable IS NULL"
hack. However this only works if the join's ON-condition can be proved
strict for at least one RHS variable, which is an assumption I'd rather
not require for optimizing EXISTS. Also, the planner should be able to
make much better estimates of the size of an antijoin result if it has an
explicit concept that that's what's happening. (The existing kluge in
nulltestsel() is not only ugly but has little hope of ever giving an
accurate estimate.) So I'd prefer to go the other way: make the planner
recognize the IS NULL trick and remove the IS NULL clause in favor of
marking the LEFT JOIN as being really an antijoin.

As far as IN goes: IN at the top level of WHERE can still be optimized
to a semijoin under the same conditions as now. NOT IN is a lot trickier,
for the same reason that typically trips up novices who try to use it:
if any row of the subselect produces a NULL comparison result, then it is
impossible for the NOT IN to result in TRUE, which means that it does not
function as a standard antijoin. I thought about optimizing it only in
the case where we can prove that the subselect outputs and the compared-to
values are known NOT NULL (which in typical cases we could prove by
looking for NOT NULL constraints on those table columns). The trouble
with this is that that's not a sufficient condition: you must also assume
that the comparison operator involved never yields NULL for non-null
inputs. That might be okay for btree comparison functions but it's not a
very comfy assumption in general; we certainly haven't got any explicit
knowledge that any functions are guaranteed to act that way. So this
case might be worth doing later but I'm not feeling excited about it.
We generally tell people to avoid NOT IN and I'm happy to keep on
saying that.

Comments?

regards, tom lane


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Simon Riggs" <simon(at)2ndquadrant(dot)com>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: IN vs EXISTS equivalence
Date: 2008-08-08 22:25:16
Message-ID: 489C817C.EE98.0025.0@wicourts.gov
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

>>> Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> I believe that the optimizable cases for EXISTS are those where the
> EXISTS() is either at the top level of WHERE, or just underneath a
NOT,

The rest of the plan makes sense to me, but this part seems narrow.
There's probably a good reason for that which is beyond my depth, but
attached is a view that is used for calculating statistics for a
database which is primarily used for "case management" purposes. If
EXISTS could also be optimized in the contexts used there, it would be
great.

For context, this view references three other non-trivial views
(MatterHistSearch, MatterHistStage, & MatterHistStatus), and does
perform acceptably, so it's not a matter of complaining if it can't
work here, just providing a real-world example of other useful
contexts for EXISTS which might be worth covering if possible.

This view is used in a large number of queries, mostly either
selecting a single date with other selection criteria or counting rows
which match a set of matterNo values selected on complex criteria.

By the way, this view was totally unusable under 8.2. Showing how it
worked under 8.3 was all it took to get them to expedite the upgrade.
These views, possible because of improvements in 8.3, saved "countless
programmer days" (to quote one of the programmers).

-Kevin

Attachment Content-Type Size
MatterDateStat.txt text/plain 2.9 KB

From: Decibel! <decibel(at)decibel(dot)org>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>, "Simon Riggs" <simon(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: IN vs EXISTS equivalence
Date: 2008-08-11 15:19:21
Message-ID: DD52A3AC-9ED8-47F4-89D8-9D70A8B2DB87@decibel.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Aug 8, 2008, at 3:23 PM, Tom Lane wrote:
> * has no set operations (UNION etc), grouping, set-returning functions
> in the SELECT list, LIMIT, or a few other funny cases

Couldn't union/union all be treated as

EXISTS(a)
OR EXISTS(b)
...

Or am I missing some detail with NULLS?

Personally, I'd rather write it as separate EXISTS clauses rather
than using UNION, but perhaps others have a different preference...
--
Decibel!, aka Jim C. Nasby, Database Architect decibel(at)decibel(dot)org
Give your computer some brain candy! www.distributed.net Team #1828


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: "Simon Riggs" <simon(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: IN vs EXISTS equivalence
Date: 2008-08-11 16:59:27
Message-ID: 6531.1218473967@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

"Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov> writes:
> Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> I believe that the optimizable cases for EXISTS are those where the
>> EXISTS() is either at the top level of WHERE, or just underneath a
>> NOT,

> The rest of the plan makes sense to me, but this part seems narrow.
> There's probably a good reason for that which is beyond my depth, but
> attached is a view that is used for calculating statistics for a
> database which is primarily used for "case management" purposes. If
> EXISTS could also be optimized in the contexts used there, it would be
> great.

I chewed on that for awhile. We can certainly optimize EXISTS that's
appearing in the ON-condition of a regular JOIN, because that's not
really semantically different from a WHERE-condition. But I don't think
it's going to be reasonable to improve EXISTS in outer-JOIN ON
conditions. There are a couple of problems. Consider

t1 LEFT JOIN t2
ON (t1.f1 = t2.f2 AND
EXISTS(SELECT 1 FROM t3 WHERE t3.f3 = t1.fx AND t3.f4 = t2.fy))

To implement this with the correct semantics, we'd have to have the
t1/t2 join and the t1/t2/t3 join going on in the same execution node,
with two different join behaviors (LEFT and SEMI). There isn't any
way AFAICS to factor it into two separate steps. That's unreasonably
complicated, and it's not clear that you'd get any better performance
anyway than the current implementation (which'd treat the EXISTS as
a subplan).

Even worse, let the EXISTS be a degenerate case:

t1 LEFT JOIN t2
ON (t1.f1 = t2.f2 AND
EXISTS(SELECT 1 FROM t3 WHERE t3.f3 = t1.fx));

We can't actually treat this EXISTS as a semijoin between t1 and t3,
either before or after the LEFT JOIN; because then the behavior would be
to drop t1 rows that have no t3 match, which is not what this query
specifies.

(Note: the other degenerate case, where the EXISTS depends only on
t2, *could* be optimized since we could just let the semijoin be
performed within the RHS of the LEFT JOIN.)

So this is not something I'm going to tackle; at least not this
devel cycle.

One small step we can take in this direction, though, is to improve the
planner's internal handling of the qual conditions for IN and EXISTS.
Right now the process is just to throw the sub-select into the main
range table and put the IN join conditions into the same place in WHERE
that the IN-clause was to start with. The trouble with this is that the
distribute_quals_to_rels processing has no idea that there's anything
special about the IN join conditions. We got away with that for the
limited case of IN clauses at the top level of WHERE, but it's become
clear to me over the weekend that this has no hope of working for NOT
EXISTS --- since that's effectively an outer join, it absolutely has to
have the same kinds of qual-scheduling constraints as ordinary outer
joins do. So we need a data structure that distribute_quals_to_rels can
work with. What I think needs to happen is that the initial pass that
pulls up optimizable IN/EXISTS sub-selects should not merge the
SubLink's replacement qual clauses seamlessly, but put them underneath a
new node type, say "FlattenedSubLink", that retains knowledge of the
join it's representing. The FlattenedSubLink would survive only as far
as distribute_quals_to_rels, which would distribute out the contained
qual conditions instead of the FlattenedSubLink itself --- but only
after marking them properly for the outer-join restrictions. This
representation would make it feasible to work with IN/EXISTS that are
inside JOIN ON conditions, which the present representation using a
single in_info_list really can't do. The semantic issues are still
there but at least the representation isn't getting in the way ...

regards, tom lane


From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: "Decibel!" <decibel(at)decibel(dot)org>
Cc: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>, "Simon Riggs" <simon(at)2ndquadrant(dot)com>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: IN vs EXISTS equivalence
Date: 2008-08-11 20:40:14
Message-ID: 87ej4vxp3l.fsf@oxford.xeocode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

"Decibel!" <decibel(at)decibel(dot)org> writes:

> On Aug 8, 2008, at 3:23 PM, Tom Lane wrote:
>> * has no set operations (UNION etc), grouping, set-returning functions
>> in the SELECT list, LIMIT, or a few other funny cases
>
>
> Couldn't union/union all be treated as
>
> EXISTS(a)
> OR EXISTS(b)

Kind of confused by what you mean here. Can you give an example?

The usual transformation to consider with UNION is to transform

SELECT ... WHERE x OR y

into

SELECT ...
WHERE x
UNION ALL
SELECT ...
WHERE y AND NOT x

(modulo handling NULLs properly)

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's RemoteDBA services!


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Decibel! <decibel(at)decibel(dot)org>
Cc: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>, "Simon Riggs" <simon(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: IN vs EXISTS equivalence
Date: 2008-08-12 01:45:45
Message-ID: 24375.1218505545@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Decibel! <decibel(at)decibel(dot)org> writes:
> On Aug 8, 2008, at 3:23 PM, Tom Lane wrote:
>> * has no set operations (UNION etc), grouping, set-returning functions
>> in the SELECT list, LIMIT, or a few other funny cases

> Couldn't union/union all be treated as
> EXISTS(a)
> OR EXISTS(b)

Perhaps, but that would end up defeating the optimization anyway,
because as soon as the EXISTS is underneath an OR, it's no longer
representing a potential join clause.

regards, tom lane


From: Decibel! <decibel(at)decibel(dot)org>
To: Gregory Stark <stark(at)enterprisedb(dot)com>
Cc: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>, "Simon Riggs" <simon(at)2ndquadrant(dot)com>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: IN vs EXISTS equivalence
Date: 2008-08-12 23:01:18
Message-ID: DB80258E-D1FB-4AFF-B99F-3FFFAC0AD22F@decibel.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Aug 11, 2008, at 3:40 PM, Gregory Stark wrote:
> "Decibel!" <decibel(at)decibel(dot)org> writes:
>> On Aug 8, 2008, at 3:23 PM, Tom Lane wrote:
>>> * has no set operations (UNION etc), grouping, set-returning
>>> functions
>>> in the SELECT list, LIMIT, or a few other funny cases
>>
>>
>> Couldn't union/union all be treated as
>>
>> EXISTS(a)
>> OR EXISTS(b)
>
> Kind of confused by what you mean here. Can you give an example?
>
> The usual transformation to consider with UNION is to transform
>
> SELECT ... WHERE x OR y
>
> into
>
> SELECT ...
> WHERE x
> UNION ALL
> SELECT ...
> WHERE y AND NOT x

It was a case of the union being inside the EXISTS subquery, ie:

WHERE EXISTS (SELECT * FROM a UNION SELECT * FROM b)

AFAIK that's identical to

WHERE EXISTS (SELECT * FROM a) OR EXISTS (SELECT * FROM b)

But as Tom mentioned, as soon as you have it you can't answer it with
a simple join, so it doesn't matter...
--
Decibel!, aka Jim C. Nasby, Database Architect decibel(at)decibel(dot)org
Give your computer some brain candy! www.distributed.net Team #1828


From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: IN vs EXISTS equivalence
Date: 2008-08-14 17:50:09
Message-ID: 1218736209.5343.537.camel@ebony.2ndQuadrant
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


On Fri, 2008-08-08 at 16:23 -0400, Tom Lane wrote:

> NOT IN is a lot trickier,
> for the same reason that typically trips up novices who try to use it:
> if any row of the subselect produces a NULL comparison result, then it
> is impossible for the NOT IN to result in TRUE, which means that it
> does not function as a standard antijoin. I thought about optimizing
> it only in the case where we can prove that the subselect outputs and
> the compared-to values are known NOT NULL (which in typical cases we
> could prove by looking for NOT NULL constraints on those table
> columns). The trouble with this is that that's not a sufficient
> condition: you must also assume that the comparison operator involved
> never yields NULL for non-null inputs. That might be okay for btree
> comparison functions but it's not a very comfy assumption in general;
> we certainly haven't got any explicit knowledge that any functions are
> guaranteed to act that way. So this case might be worth doing later
> but I'm not feeling excited about it. We generally tell people to
> avoid NOT IN and I'm happy to keep on saying that.

Just found this comment, after reading what you said on other thread
about NOT IN.

NOT IN is a serious performance issue for most people. We simply can't
say to people "you were told not to".

If we can fix it easily for the majority of cases, we should. We can't
let the "it won't work in certain cases" reason prevent various
optimizations from going in. There are tons of places where we say "XXX
needs later improvement" in code comments. So lets do that here also. It
certainly wouldn't be the first optimization/feature that went into code
in a restricted way that didn't work for all cases: hash joins, ANALYZE,
partial indexes etc..

Anybody that is writing complex SQL with user defined operators knows
enough to re-write their queries correctly, so there will be almost no
negative effect from making the NOT IN optimisation a special case. And
if there is an effect, the people effected can fix the problem.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support


From: daveg <daveg(at)sonic(dot)net>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: IN vs EXISTS equivalence
Date: 2008-09-03 06:17:24
Message-ID: 20080903061724.GV2648@sonic.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Aug 14, 2008 at 06:50:09PM +0100, Simon Riggs wrote:
>
> On Fri, 2008-08-08 at 16:23 -0400, Tom Lane wrote:
>
> > NOT IN is a lot trickier,
> > condition: you must also assume that the comparison operator involved
> > never yields NULL for non-null inputs. That might be okay for btree
> > comparison functions but it's not a very comfy assumption in general;
> > we certainly haven't got any explicit knowledge that any functions are
> > guaranteed to act that way. So this case might be worth doing later
...
> Just found this comment, after reading what you said on other thread
> about NOT IN.
>
> NOT IN is a serious performance issue for most people. We simply can't
> say to people "you were told not to".
>
> If we can fix it easily for the majority of cases, we should. We can't
> let the "it won't work in certain cases" reason prevent various

A suggestion: what about adding an attribute to functions to declare that
they never return null?

declare foo(int, int) returns int immutable not null as ...

-dg

--
David Gould daveg(at)sonic(dot)net 510 536 1443 510 282 0869
If simplicity worked, the world would be overrun with insects.


From: "Asko Oja" <ascoja(at)gmail(dot)com>
To: daveg <daveg(at)sonic(dot)net>
Cc: "Simon Riggs" <simon(at)2ndquadrant(dot)com>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: IN vs EXISTS equivalence
Date: 2008-09-03 08:27:55
Message-ID: ecd779860809030127x4219ec05y3f8dfa28d71dadc1@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Sep 3, 2008 at 9:17 AM, daveg <daveg(at)sonic(dot)net> wrote:

> On Thu, Aug 14, 2008 at 06:50:09PM +0100, Simon Riggs wrote:
> >
> > On Fri, 2008-08-08 at 16:23 -0400, Tom Lane wrote:
> >
> > > NOT IN is a lot trickier,
> > > condition: you must also assume that the comparison operator involved
> > > never yields NULL for non-null inputs. That might be okay for btree
> > > comparison functions but it's not a very comfy assumption in general;
> > > we certainly haven't got any explicit knowledge that any functions are
> > > guaranteed to act that way. So this case might be worth doing later
> ...
> > Just found this comment, after reading what you said on other thread
> > about NOT IN.
> >
> > NOT IN is a serious performance issue for most people. We simply can't
> > say to people "you were told not to".
> >
> > If we can fix it easily for the majority of cases, we should. We can't
> > let the "it won't work in certain cases" reason prevent various
>
> A suggestion: what about adding an attribute to functions to declare that
> they never return null?
>
And if function still returns null then error will be raised?
Then you will end up adding NOT NULL also to IN and OUT parameters.
IIRC it was possible in Oracle to declare local variables NOT NULL.

>
> declare foo(int, int) returns int immutable not null as ...
>
>
> -dg
>
>
> --
> David Gould daveg(at)sonic(dot)net 510 536 1443 510 282 0869
> If simplicity worked, the world would be overrun with insects.
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers(at)postgresql(dot)org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers
>