EXISTS optimization

Lists: pgsql-hackerspgsql-performance
From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: <pgsql-hackers(at)postgresql(dot)org>,<pgsql-performance(at)postgresql(dot)org>
Cc: "Andrea Olson" <Andrea(dot)Olson(at)wicourts(dot)gov>, "Bill Severson" <Bill(dot)Severson(at)wicourts(dot)gov>, "John Hutchins" <John(dot)Hutchins(at)wicourts(dot)gov>, "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>, "Randy Peterson" <Randy(dot)Peterson(at)wicourts(dot)gov>, "Shannon Spranger" <Shannon(dot)Spranger(at)wicourts(dot)gov>
Subject: EXISTS optimization
Date: 2007-03-23 17:01:40
Message-ID: 4603C1A3.EE98.0025.0@wicourts.gov
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-performance

I'm posting this to performance in case our workaround may be of benefit to someone with a similar issue. I'm posting to hackers because I hope we can improve our planner in this area so that a workaround is not necessary. (It might make sense to reply to one group or the other, depending on reply content.)

We are converting from a commercial database (which shall remain unnamed here, due to license restrictions on publishing benchmarks). Most queries run faster on PostgreSQL; a small number choose very poor plans and run much longer. This particular query runs on the commercial product in 6.1s first time, 1.4s cached. In PostgreSQL it runs in about 144s both first time and cached. I was able to use an easy but fairly ugly rewrite (getting duplicate rows and eliminating them with DISTINCT) which runs on the commercial product in 9.2s/3.0s and in PostgreSQL in 2.0s/0.7s.

Here are the tables:

Table "public.TranHeader"
Column | Type | Modifiers
---------------+------------------+-----------
tranNo | "TranNoT" | not null
countyNo | "CountyNoT" | not null
acctPd | "DateT" | not null
date | "DateT" | not null
isComplete | boolean | not null
tranId | "TranIdT" | not null
tranType | "TranTypeT" | not null
userId | "UserIdT" | not null
workstationId | "WorkstationIdT" | not null
time | "TimeT" |
Indexes:
"TranHeader_pkey" PRIMARY KEY, btree ("tranNo", "countyNo")
"TranHeader_TranAcctPeriod" UNIQUE, btree ("acctPd", "tranNo", "countyNo")
"TranHeader_TranDate" UNIQUE, btree (date, "tranNo", "countyNo")

Table "public.TranDetail"
Column | Type | Modifiers
-----------------+--------------------+-----------
tranNo | "TranNoT" | not null
tranDetailSeqNo | "TranDetailSeqNoT" | not null
countyNo | "CountyNoT" | not null
acctCode | "AcctCodeT" | not null
amt | "MoneyT" | not null
assessNo | "TranIdT" |
caseNo | "CaseNoT" |
citnNo | "CitnNoT" |
citnViolDate | "DateT" |
issAgencyNo | "IssAgencyNoT" |
partyNo | "PartyNoT" |
payableNo | "PayableNoT" |
rcvblNo | "RcvblNoT" |
Indexes:
"TranDetail_pkey" PRIMARY KEY, btree ("tranNo", "tranDetailSeqNo", "countyNo")
"TranDetail_TranDetCaseNo" UNIQUE, btree ("caseNo", "tranNo", "tranDetailSeqNo", "countyNo")
"TranDetail_TranDetPay" UNIQUE, btree ("payableNo", "tranNo", "tranDetailSeqNo", "countyNo")
"TranDetail_TranDetRcvbl" UNIQUE, btree ("rcvblNo", "tranNo", "tranDetailSeqNo", "countyNo")
"TranDetail_TranDetAcct" btree ("acctCode", "citnNo", "countyNo")

Table "public.Adjustment"
Column | Type | Modifiers
-----------------+-----------------------+-----------
adjustmentNo | "TranIdT" | not null
countyNo | "CountyNoT" | not null
date | "DateT" | not null
isTranVoided | boolean | not null
reasonCode | "ReasonCodeT" | not null
tranNo | "TranNoT" | not null
adjustsTranId | "TranIdT" |
adjustsTranNo | "TranNoT" |
adjustsTranType | "TranTypeT" |
explanation | character varying(50) |
Indexes:
"Adjustment_pkey" PRIMARY KEY, btree ("adjustmentNo", "countyNo")
"Adjustment_AdjustsTranId" btree ("adjustsTranId", "adjustsTranType", "tranNo", "countyNo")
"Adjustment_AdjustsTranNo" btree ("adjustsTranNo", "tranNo", "countyNo")
"Adjustment_Date" btree (date, "countyNo")

Admittedly, the indexes are optimized for our query load under the commercial product, which can use the "covering index" optimization.

explain analyze
SELECT "A"."adjustmentNo", "A"."tranNo", "A"."countyNo", "H"."date", "H"."userId", "H"."time"
FROM "Adjustment" "A"
JOIN "TranHeader" "H" ON ("H"."tranId" = "A"."adjustmentNo" AND "H"."countyNo" = "A"."countyNo" AND "H"."tranNo" = "A"."tranNo")
WHERE "H"."tranType" = 'A'
AND "A"."date" > DATE '2006-01-01'
AND "H"."countyNo" = 66
AND "A"."countyNo" = 66
AND EXISTS
(
SELECT 1 FROM "TranDetail" "D"
WHERE "D"."tranNo" = "H"."tranNo"
AND "D"."countyNo" = "H"."countyNo"
AND "D"."caseNo" LIKE '2006TR%'
)
;

Nested Loop (cost=182.56..72736.37 rows=1 width=46) (actual time=6398.108..143631.427 rows=2205 loops=1)
Join Filter: (("H"."tranId")::bpchar = ("A"."adjustmentNo")::bpchar)
-> Bitmap Heap Scan on "Adjustment" "A" (cost=182.56..1535.69 rows=11542 width=22) (actual time=38.098..68.324 rows=12958 loops=1)
Recheck Cond: (((date)::date > '2006-01-01'::date) AND (("countyNo")::smallint = 66))
-> Bitmap Index Scan on "Adjustment_Date" (cost=0.00..179.67 rows=11542 width=0) (actual time=32.958..32.958 rows=12958 loops=1)
Index Cond: (((date)::date > '2006-01-01'::date) AND (("countyNo")::smallint = 66))
-> Index Scan using "TranHeader_pkey" on "TranHeader" "H" (cost=0.00..6.15 rows=1 width=46) (actual time=11.073..11.074 rows=0 loops=12958)
Index Cond: ((("H"."tranNo")::integer = ("A"."tranNo")::integer) AND (("H"."countyNo")::smallint = 66))
Filter: ((("tranType")::bpchar = 'A'::bpchar) AND (subplan))
SubPlan
-> Index Scan using "TranDetail_TranDetCaseNo" on "TranDetail" "D" (cost=0.00..4.73 rows=1 width=0) (actual time=11.038..11.038 rows=0 loops=12958)
Index Cond: ((("caseNo")::bpchar >= '2006TR'::bpchar) AND (("caseNo")::bpchar < '2006TS'::bpchar) AND (("tranNo")::integer = ($0)::integer) AND (("countyNo")::smallint = ($1)::smallint))
Filter: (("caseNo")::bpchar ~~ '2006TR%'::text)
Total runtime: 143633.838 ms

The commercial product scans the index on caseNo in TranDetail to build a work table of unique values, then uses indexed access to the TranHeader and then to Adjustment. I was able to get approximately the same plan (except the duplicates are eliminated at the end) in PostgreSQL by rewriting to this:

SELECT DISTINCT "A"."adjustmentNo", "A"."tranNo", "A"."countyNo", "H"."date", "H"."userId", "H"."time"
FROM "Adjustment" "A"
JOIN "TranHeader" "H" ON ("H"."tranId" = "A"."adjustmentNo" AND "H"."countyNo" = "A"."countyNo" AND "H"."tranNo" = "A"."tranNo")
JOIN "TranDetail" "D" ON ("D"."tranNo" = "H"."tranNo" AND "D"."countyNo" = "H"."countyNo" AND "D"."caseNo" LIKE '2006TR%')
WHERE "H"."tranType" = 'A'
AND "A"."date" > DATE '2006-01-01'
AND "H"."countyNo" = 66
AND "A"."countyNo" = 66
;

Unique (cost=130.96..130.98 rows=1 width=46) (actual time=694.591..715.008 rows=2205 loops=1)
-> Sort (cost=130.96..130.96 rows=1 width=46) (actual time=694.586..701.808 rows=16989 loops=1)
Sort Key: "A"."adjustmentNo", "A"."tranNo", "A"."countyNo", "H".date, "H"."userId", "H"."time"
-> Nested Loop (cost=0.00..130.95 rows=1 width=46) (actual time=0.157..636.779 rows=16989 loops=1)
Join Filter: (("H"."tranNo")::integer = ("A"."tranNo")::integer)
-> Nested Loop (cost=0.00..113.76 rows=4 width=50) (actual time=0.131..452.544 rows=16989 loops=1)
-> Index Scan using "TranDetail_TranDetCaseNo" on "TranDetail" "D" (cost=0.00..27.57 rows=20 width=6) (actual time=0.049..83.005 rows=46293 loops=1)
Index Cond: ((("caseNo")::bpchar >= '2006TR'::bpchar) AND (("caseNo")::bpchar < '2006TS'::bpchar) AND (66 = ("countyNo")::smallint))
Filter: (("caseNo")::bpchar ~~ '2006TR%'::text)
-> Index Scan using "TranHeader_pkey" on "TranHeader" "H" (cost=0.00..4.30 rows=1 width=46) (actual time=0.006..0.007 rows=0 loops=46293)
Index Cond: ((("D"."tranNo")::integer = ("H"."tranNo")::integer) AND (("H"."countyNo")::smallint = 66))
Filter: (("tranType")::bpchar = 'A'::bpchar)
-> Index Scan using "Adjustment_pkey" on "Adjustment" "A" (cost=0.00..4.28 rows=1 width=22) (actual time=0.007..0.008 rows=1 loops=16989)
Index Cond: ((("H"."tranId")::bpchar = ("A"."adjustmentNo")::bpchar) AND (("A"."countyNo")::smallint = 66))
Filter: ((date)::date > '2006-01-01'::date)
Total runtime: 715.932 ms

I can't see any reason that PostgreSQL can't catch up to the other product on this optimization issue. This usage of DISTINCT seems a bit sloppy; I usually try to dissuade the application programmers from accumulating duplicates during the joins and then eliminating them in this way.

-Kevin


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: pgsql-hackers(at)postgresql(dot)org, pgsql-performance(at)postgresql(dot)org, "Andrea Olson" <Andrea(dot)Olson(at)wicourts(dot)gov>, "Bill Severson" <Bill(dot)Severson(at)wicourts(dot)gov>, "John Hutchins" <John(dot)Hutchins(at)wicourts(dot)gov>, "Randy Peterson" <Randy(dot)Peterson(at)wicourts(dot)gov>, "Shannon Spranger" <Shannon(dot)Spranger(at)wicourts(dot)gov>
Subject: Re: EXISTS optimization
Date: 2007-03-23 21:49:42
Message-ID: 25339.1174686582@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-performance

"Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov> writes:
> explain analyze
> SELECT "A"."adjustmentNo", "A"."tranNo", "A"."countyNo", "H"."date", "H"."userId", "H"."time"
> FROM "Adjustment" "A"
> JOIN "TranHeader" "H" ON ("H"."tranId" = "A"."adjustmentNo" AND "H"."countyNo" = "A"."countyNo" AND "H"."tranNo" = "A"."tranNo")
> WHERE "H"."tranType" = 'A'
> AND "A"."date" > DATE '2006-01-01'
> AND "H"."countyNo" = 66
> AND "A"."countyNo" = 66
> AND EXISTS
> (
> SELECT 1 FROM "TranDetail" "D"
> WHERE "D"."tranNo" = "H"."tranNo"
> AND "D"."countyNo" = "H"."countyNo"
> AND "D"."caseNo" LIKE '2006TR%'
> )
> ;

> The commercial product scans the index on caseNo in TranDetail to build a work table of unique values, then uses indexed access to the TranHeader and then to Adjustment.

If you want that, try rewriting the EXISTS to an IN:

AND ("H"."tranNo", "H"."countyNo") IN
(
SELECT "D"."tranNo", "D"."countyNo" FROM "TranDetail" "D"
WHERE "D"."caseNo" LIKE '2006TR%'
)

We don't currently try to flatten EXISTS into a unique/join plan as we
do for IN. I seem to recall not doing so when I rewrote IN planning
because I didn't think it would be exactly semantically equivalent,
but that was awhile ago. Right at the moment it seems like it ought
to be equivalent as long as the comparison operators are strict.

regards, tom lane


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: <pgsql-hackers(at)postgresql(dot)org>,<pgsql-performance(at)postgresql(dot)org>, "Andrea Olson" <Aolson(dot)CCAP(dot)Courts(at)wicourts(dot)gov>, "Bill Severson" <BSEVERS(dot)CCAP(dot)Courts(at)wicourts(dot)gov>, "John Hutchins" <jhutchi(dot)CCAP(dot)Courts(at)wicourts(dot)gov>, "Randy Peterson" <RPETERS(dot)CCAP(dot)Courts(at)wicourts(dot)gov>, "Shannon Spranger" <ssprang(dot)CCAP(dot)Courts(at)wicourts(dot)gov>
Subject: Re: EXISTS optimization
Date: 2007-03-23 22:26:04
Message-ID: 46040DAC.EE98.0025.0@wicourts.gov
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-performance

>>> On Fri, Mar 23, 2007 at 4:49 PM, in message <25339(dot)1174686582(at)sss(dot)pgh(dot)pa(dot)us>,
Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov> writes:
>> explain analyze
>> SELECT "A"."adjustmentNo", "A"."tranNo", "A"."countyNo", "H"."date",
> "H"."userId", "H"."time"
>> FROM "Adjustment" "A"
>> JOIN "TranHeader" "H" ON ("H"."tranId" = "A"."adjustmentNo" AND
> "H"."countyNo" = "A"."countyNo" AND "H"."tranNo" = "A"."tranNo")
>> WHERE "H"."tranType" = 'A'
>> AND "A"."date" > DATE '2006- 01- 01'
>> AND "H"."countyNo" = 66
>> AND "A"."countyNo" = 66
>> AND EXISTS
>> (
>> SELECT 1 FROM "TranDetail" "D"
>> WHERE "D"."tranNo" = "H"."tranNo"
>> AND "D"."countyNo" = "H"."countyNo"
>> AND "D"."caseNo" LIKE '2006TR%'
>> )
>> ;
>
>> The commercial product scans the index on caseNo in TranDetail to build a
> work table of unique values, then uses indexed access to the TranHeader and
> then to Adjustment.
>
> If you want that, try rewriting the EXISTS to an IN:
>
> AND ("H"."tranNo", "H"."countyNo") IN
> (
> SELECT "D"."tranNo", "D"."countyNo" FROM "TranDetail" "D"
> WHERE "D"."caseNo" LIKE '2006TR%'
> )

Nice. I get this:

explain analyze
SELECT "A"."adjustmentNo", "A"."tranNo", "A"."countyNo", "H"."date", "H"."userId", "H"."time"
FROM "Adjustment" "A"
JOIN "TranHeader" "H" ON ("H"."tranId" = "A"."adjustmentNo" AND "H"."countyNo" = "A"."countyNo" AND "H"."tranNo" = "A"."tranNo")
WHERE "H"."tranType" = 'A'
AND "A"."date" > DATE '2006- 01- 01'
AND "H"."countyNo" = 66
AND "A"."countyNo" = 66
AND ("H"."tranNo", "H"."countyNo") IN
(
SELECT "D"."tranNo", "D"."countyNo" FROM "TranDetail" "D"
WHERE "D"."caseNo" LIKE '2006TR%'
)
;

Nested Loop (cost=27.76..36.38 rows=1 width=46) (actual time=92.999..200.398 rows=2209 loops=1)
Join Filter: (("H"."tranNo")::integer = ("A"."tranNo")::integer)
-> Nested Loop (cost=27.76..32.08 rows=1 width=50) (actual time=92.970..176.472 rows=2209 loops=1)
-> HashAggregate (cost=27.76..27.77 rows=1 width=6) (actual time=92.765..100.810 rows=9788 loops=1)
-> Index Scan using "TranDetail_TranDetCaseNo" on "TranDetail" "D" (cost=0.00..27.66 rows=20 width=6) (actual time=0.059..60.967 rows=46301 loops=1)
Index Cond: ((("caseNo")::bpchar >= '2006TR'::bpchar) AND (("caseNo")::bpchar < '2006TS'::bpchar) AND (("countyNo")::smallint = 66))
Filter: (("caseNo")::bpchar ~~ '2006TR%'::text)
-> Index Scan using "TranHeader_pkey" on "TranHeader" "H" (cost=0.00..4.30 rows=1 width=46) (actual time=0.006..0.006 rows=0 loops=9788)
Index Cond: ((("H"."tranNo")::integer = ("D"."tranNo")::integer) AND (("H"."countyNo")::smallint = 66))
Filter: (("tranType")::bpchar = 'A'::bpchar)
-> Index Scan using "Adjustment_pkey" on "Adjustment" "A" (cost=0.00..4.28 rows=1 width=22) (actual time=0.008..0.009 rows=1 loops=2209)
Index Cond: ((("H"."tranId")::bpchar = ("A"."adjustmentNo")::bpchar) AND (("A"."countyNo")::smallint = 66))
Filter: ((date)::date > '2006-01-01'::date)
Total runtime: 201.306 ms

That's the good news. The bad news is that I operate under a management portability dictate which doesn't currently allow that syntax, since not all of the products they want to cover support it. I tried something which seems equivalent, but it is running for a very long time. I'll show it with just the explain while I wait to see how long the explain analyze takes.

explain
SELECT "A"."adjustmentNo", "A"."tranNo", "A"."countyNo", "H"."date", "H"."userId", "H"."time"
FROM "Adjustment" "A"
JOIN "TranHeader" "H" ON ("H"."tranId" = "A"."adjustmentNo" AND "H"."countyNo" = "A"."countyNo" AND "H"."tranNo" = "A"."tranNo")
WHERE "H"."tranType" = 'A'
AND "A"."date" > DATE '2006- 01- 01'
AND "H"."countyNo" = 66
AND "A"."countyNo" = 66
AND "H"."tranNo" IN
(
SELECT "D"."tranNo" FROM "TranDetail" "D"
WHERE "D"."caseNo" LIKE '2006TR%'
AND "D"."countyNo" = "H"."countyNo"
)
;

Nested Loop (cost=0.00..181673.08 rows=1 width=46)
Join Filter: (("H"."tranId")::bpchar = ("A"."adjustmentNo")::bpchar)
-> Seq Scan on "Adjustment" "A" (cost=0.00..2384.27 rows=11733 width=22)
Filter: (((date)::date > '2006-01-01'::date) AND (("countyNo")::smallint = 66))
-> Index Scan using "TranHeader_pkey" on "TranHeader" "H" (cost=0.00..15.27 rows=1 width=46)
Index Cond: ((("H"."tranNo")::integer = ("A"."tranNo")::integer) AND (("H"."countyNo")::smallint = 66))
Filter: ((("tranType")::bpchar = 'A'::bpchar) AND (subplan))
SubPlan
-> Index Scan using "TranDetail_TranDetCaseNo" on "TranDetail" "D" (cost=0.00..27.66 rows=20 width=4)
Index Cond: ((("caseNo")::bpchar >= '2006TR'::bpchar) AND (("caseNo")::bpchar < '2006TS'::bpchar) AND (("countyNo")::smallint = ($0)::smallint))
Filter: (("caseNo")::bpchar ~~ '2006TR%'::text)

> We don't currently try to flatten EXISTS into a unique/join plan as we
> do for IN. I seem to recall not doing so when I rewrote IN planning
> because I didn't think it would be exactly semantically equivalent,
> but that was awhile ago. Right at the moment it seems like it ought
> to be equivalent as long as the comparison operators are strict.

There are a great many situations where they are exactly semantically equivalent. In fact, the commercial database product usually generates an identical plan. I could try to work out (or better yet find) a formal description of when that equivalence holds, if someone would be up for implementing it. Barring that, I could see if management would approve some time for me to look at submitting a patch, but I haven't looked at the code involved, so I have no idea of the scale of effort involved yet.

-Kevin


From: Martijn van Oosterhout <kleptog(at)svana(dot)org>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, pgsql-hackers(at)postgresql(dot)org, pgsql-performance(at)postgresql(dot)org, Andrea Olson <Andrea(dot)Olson(at)wicourts(dot)gov>, Bill Severson <Bill(dot)Severson(at)wicourts(dot)gov>, John Hutchins <John(dot)Hutchins(at)wicourts(dot)gov>, Randy Peterson <Randy(dot)Peterson(at)wicourts(dot)gov>, Shannon Spranger <Shannon(dot)Spranger(at)wicourts(dot)gov>
Subject: Re: EXISTS optimization
Date: 2007-03-23 22:26:41
Message-ID: 20070323222641.GD9623@svana.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-performance

On Fri, Mar 23, 2007 at 05:49:42PM -0400, Tom Lane wrote:
> We don't currently try to flatten EXISTS into a unique/join plan as we
> do for IN. I seem to recall not doing so when I rewrote IN planning
> because I didn't think it would be exactly semantically equivalent,
> but that was awhile ago. Right at the moment it seems like it ought
> to be equivalent as long as the comparison operators are strict.

Wasn't it due to the fact that IN needs to scan through all
possibilites anyway because of its interaction with NULL, whereas
EXISTS can stop at the first row?

That would mean the subquery to be materialised would not be equivalent
if it called any non-immutable functions. It's also much less clear to
be a win in the EXISTs case. But then, that's a costs issue the planner
can deal with...

Have a nice day,
--
Martijn van Oosterhout <kleptog(at)svana(dot)org> http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Martijn van Oosterhout" <kleptog(at)svana(dot)org>
Cc: <pgsql-hackers(at)postgresql(dot)org>,<pgsql-performance(at)postgresql(dot)org>, "Andrea Olson" <Aolson(dot)CCAP(dot)Courts(at)wicourts(dot)gov>, "Bill Severson" <BSEVERS(dot)CCAP(dot)Courts(at)wicourts(dot)gov>, "John Hutchins" <jhutchi(dot)CCAP(dot)Courts(at)wicourts(dot)gov>, "Randy Peterson" <RPETERS(dot)CCAP(dot)Courts(at)wicourts(dot)gov>, "Shannon Spranger" <ssprang(dot)CCAP(dot)Courts(at)wicourts(dot)gov>
Subject: Re: EXISTS optimization
Date: 2007-03-23 22:30:27
Message-ID: 46040EB2.EE98.0025.0@wicourts.gov
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-performance

I don't understand -- TRUE OR UNKNOWN evaluates to TRUE, so why would the IN need to continue? I'm not quite following the rest; could you elaborate or give an example? (Sorry if I'm lagging behind the rest of the class here.)

-Kevin


>>> Martijn van Oosterhout <kleptog(at)svana(dot)org> 03/23/07 5:26 PM >>>
On Fri, Mar 23, 2007 at 05:49:42PM -0400, Tom Lane wrote:
> We don't currently try to flatten EXISTS into a unique/join plan as we
> do for IN. I seem to recall not doing so when I rewrote IN planning
> because I didn't think it would be exactly semantically equivalent,
> but that was awhile ago. Right at the moment it seems like it ought
> to be equivalent as long as the comparison operators are strict.

Wasn't it due to the fact that IN needs to scan through all
possibilites anyway because of its interaction with NULL, whereas
EXISTS can stop at the first row?

That would mean the subquery to be materialised would not be equivalent
if it called any non-immutable functions. It's also much less clear to
be a win in the EXISTs case. But then, that's a costs issue the planner
can deal with...

Have a nice day,
--
Martijn van Oosterhout <kleptog(at)svana(dot)org> http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Kevin Grittner" <Kgrittn(dot)CCAP(dot)Courts(at)wicourts(dot)gov>
Cc: <pgsql-hackers(at)postgresql(dot)org>,<pgsql-performance(at)postgresql(dot)org>, "Andrea Olson" <Aolson(dot)CCAP(dot)Courts(at)wicourts(dot)gov>, "Bill Severson" <BSEVERS(dot)CCAP(dot)Courts(at)wicourts(dot)gov>, "John Hutchins" <jhutchi(dot)CCAP(dot)Courts(at)wicourts(dot)gov>, "Randy Peterson" <RPETERS(dot)CCAP(dot)Courts(at)wicourts(dot)gov>, "Shannon Spranger" <ssprang(dot)CCAP(dot)Courts(at)wicourts(dot)gov>
Subject: Re: [PERFORM] EXISTS optimization
Date: 2007-03-23 22:37:16
Message-ID: 4604104C.EE98.0025.0@wicourts.gov
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-performance

>>> On Fri, Mar 23, 2007 at 5:26 PM, in message
<46040DAC(dot)EE98(dot)0025(dot)0(at)wicourts(dot)gov>, "Kevin Grittner"
<Kevin(dot)Grittner(at)wicourts(dot)gov> wrote:

> I tried something which seems
> equivalent, but it is running for a very long time. I'll show it with just
> the explain while I wait to see how long the explain analyze takes.
>
> explain
> SELECT "A"."adjustmentNo", "A"."tranNo", "A"."countyNo", "H"."date",
> "H"."userId", "H"."time"
> FROM "Adjustment" "A"
> JOIN "TranHeader" "H" ON ("H"."tranId" = "A"."adjustmentNo" AND
> "H"."countyNo" = "A"."countyNo" AND "H"."tranNo" = "A"."tranNo")
> WHERE "H"."tranType" = 'A'
> AND "A"."date" > DATE '2006- 01- 01'
> AND "H"."countyNo" = 66
> AND "A"."countyNo" = 66
> AND "H"."tranNo" IN
> (
> SELECT "D"."tranNo" FROM "TranDetail" "D"
> WHERE "D"."caseNo" LIKE '2006TR%'
> AND "D"."countyNo" = "H"."countyNo"
> )
> ;

explain analyze results:

Nested Loop (cost=0.00..181673.08 rows=1 width=46) (actual time=42224.077..964266.969 rows=2209 loops=1)
Join Filter: (("H"."tranId")::bpchar = ("A"."adjustmentNo")::bpchar)
-> Seq Scan on "Adjustment" "A" (cost=0.00..2384.27 rows=11733 width=22) (actual time=15.355..146.620 rows=13003 loops=1)
Filter: (((date)::date > '2006-01-01'::date) AND (("countyNo")::smallint = 66))
-> Index Scan using "TranHeader_pkey" on "TranHeader" "H" (cost=0.00..15.27 rows=1 width=46) (actual time=74.141..74.141 rows=0 loops=13003)
Index Cond: ((("H"."tranNo")::integer = ("A"."tranNo")::integer) AND (("H"."countyNo")::smallint = 66))
Filter: ((("tranType")::bpchar = 'A'::bpchar) AND (subplan))
SubPlan
-> Index Scan using "TranDetail_TranDetCaseNo" on "TranDetail" "D" (cost=0.00..27.66 rows=20 width=4) (actual time=0.039..58.234 rows=42342 loops=13003)
Index Cond: ((("caseNo")::bpchar >= '2006TR'::bpchar) AND (("caseNo")::bpchar < '2006TS'::bpchar) AND (("countyNo")::smallint = ($0)::smallint))
Filter: (("caseNo")::bpchar ~~ '2006TR%'::text)
Total runtime: 964269.555 ms


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: pgsql-hackers(at)postgresql(dot)org, pgsql-performance(at)postgresql(dot)org, "Andrea Olson" <Aolson(dot)CCAP(dot)Courts(at)wicourts(dot)gov>, "Bill Severson" <BSEVERS(dot)CCAP(dot)Courts(at)wicourts(dot)gov>, "John Hutchins" <jhutchi(dot)CCAP(dot)Courts(at)wicourts(dot)gov>, "Randy Peterson" <RPETERS(dot)CCAP(dot)Courts(at)wicourts(dot)gov>, "Shannon Spranger" <ssprang(dot)CCAP(dot)Courts(at)wicourts(dot)gov>
Subject: Re: EXISTS optimization
Date: 2007-03-23 23:04:12
Message-ID: 26175.1174691052@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-performance

"Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov> writes:
> Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> If you want that, try rewriting the EXISTS to an IN:
>>
>> AND ("H"."tranNo", "H"."countyNo") IN
>> (
>> SELECT "D"."tranNo", "D"."countyNo" FROM "TranDetail" "D"
>> WHERE "D"."caseNo" LIKE '2006TR%'
>> )

> That's the good news. The bad news is that I operate under a
> management portability dictate which doesn't currently allow that
> syntax, since not all of the products they want to cover support it.

Which part of it don't they like --- the multiple IN-comparisons?

> I tried something which seems equivalent, but it is running for a very
> long time.
> AND "H"."tranNo" IN
> (
> SELECT "D"."tranNo" FROM "TranDetail" "D"
> WHERE "D"."caseNo" LIKE '2006TR%'
> AND "D"."countyNo" = "H"."countyNo"
> )

No, that's not gonna accomplish a darn thing, because you've still got
a correlated subquery (ie, a reference to outer "H") and so turning the
IN into a join doesn't work.

regards, tom lane


From: "Peter Kovacs" <peter(dot)kovacs(dot)1(dot)0rc(at)gmail(dot)com>
To: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org, pgsql-performance(at)postgresql(dot)org, "Andrea Olson" <Aolson(dot)CCAP(dot)Courts(at)wicourts(dot)gov>, "Bill Severson" <BSEVERS(dot)CCAP(dot)Courts(at)wicourts(dot)gov>, "John Hutchins" <jhutchi(dot)CCAP(dot)Courts(at)wicourts(dot)gov>, "Randy Peterson" <RPETERS(dot)CCAP(dot)Courts(at)wicourts(dot)gov>, "Shannon Spranger" <ssprang(dot)CCAP(dot)Courts(at)wicourts(dot)gov>
Subject: Re: [PERFORM] EXISTS optimization
Date: 2007-03-23 23:04:44
Message-ID: b6e8f2e80703231604v72b9dc4dr51eebd62274d53ec@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-performance

On 3/23/07, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov> wrote:
[...]
> That's the good news. The bad news is that I operate under a management portability dictate which doesn't currently allow that syntax, since not all of the products they want to

It doesn't really touch the substance, but I am curious: are you not
even allowed to discriminate between products in your code like:
if db is 'postresql' then
...
else
...
?

What would be the rationale for that?

Thanks
Peter

cover support it. I tried something which seems equivalent, but it is
running for a very long time. I'll show it with just the explain
while I wait to see how long the explain analyze takes.
>
[...]


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "Peter Kovacs" <peter(dot)kovacs(dot)1(dot)0rc(at)gmail(dot)com>
Cc: <pgsql-hackers(at)postgresql(dot)org>,<pgsql-performance(at)postgresql(dot)org>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Andrea Olson" <Aolson(dot)CCAP(dot)Courts(at)wicourts(dot)gov>, "Bill Severson" <BSEVERS(dot)CCAP(dot)Courts(at)wicourts(dot)gov>, "John Hutchins" <jhutchi(dot)CCAP(dot)Courts(at)wicourts(dot)gov>, "Randy Peterson" <RPETERS(dot)CCAP(dot)Courts(at)wicourts(dot)gov>, "Shannon Spranger" <ssprang(dot)CCAP(dot)Courts(at)wicourts(dot)gov>
Subject: Re: [PERFORM] EXISTS optimization
Date: 2007-03-24 02:28:36
Message-ID: 46044683.EE98.0025.0@wicourts.gov
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-performance

>>> On Fri, Mar 23, 2007 at 6:04 PM, in message
<b6e8f2e80703231604v72b9dc4dr51eebd62274d53ec(at)mail(dot)gmail(dot)com>, "Peter Kovacs"
<peter(dot)kovacs(dot)1(dot)0rc(at)gmail(dot)com> wrote:
> On 3/23/07, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov> wrote:
> [...]
>> That's the good news. The bad news is that I operate under a management
> portability dictate which doesn't currently allow that syntax, since not all
> of the products they want to
>
> It doesn't really touch the substance, but I am curious: are you not
> even allowed to discriminate between products in your code like:
> if db is 'postresql' then
> ...
> else
> ...
> ?
>
> What would be the rationale for that?

Anybody who's not curious about that should skip the rest of this email.

Management has simply given a mandate that the software be independent of OS and database vendor, and to use Java to help with the OS independence. I have to admit that I am the architect of the database independence solution that was devised. (The choice of Java for the OS independence has been very successful. We have run our bytecode on HP-UX, Windows, Sun Solaris, and various flavors of Linux without having to compile different versions of the bytecode. Other than when people get careless with case sensitivity on file names or with path separators, it just drops right in and runs.

For the data side, we write all of our queries in ANSI SQL in our own query tool, parse it, and generate Java classes to run it. The ANSI source is broken down to "lowest common denominator" queries, with all procedural code covered in the Java query classes. So we have stored procedures which can be called, triggers that fire, etc. in Java, issuing SELECT, INSERT, UPDATE, DELETE statements to the database. This allows us to funnel all DML through a few "primitive" routines which capture before and after images and save them in our own transaction image tables. We use this to replicate from our 72 county databases, which are the official court record, to multiple central databases, and a transaction repository, used for auditing case activity and assisting with failure recovery.

The problem with burying 'if db is MaxDB', 'if db is SQLServer', 'if db is PostgreSQL' everywhere is that you have no idea what to do when you then want to drop in some different product. We have a plugin layer to manage known areas of differences which aren't handled cleanly by JDBC, where the default behavior is ANSI-compliant, and a few dozen to a few hundred lines need to be written to modify that default support a new database product. (Of course, each one so far has brought in a few surprises, making the plugin layer just a little bit thicker.)

So, to support some new syntax, we have to update our parser, and have a way to generate code which runs on all the candidate database products, either directly or through a plugin layer. If any of the products don't support multi-value row value constructors, I have a hard time seeing a good way to cover that with the plugin. On the subject issue, I'm pretty sure it would actually be less work for me to modify the PostgreSQL optimizer to efficiently handle the syntax we do support than to try to bend row value constructors to a syntax that is supported on other database products.

And, by the way, I did take a shot on getting them to commit to PostgreSQL as the long-term solution, and relax the portability rules. No sale. Perhaps when everything is converted to PostgreSQL and working for a while they may reconsider.

-Kevin


From: Martijn van Oosterhout <kleptog(at)svana(dot)org>
To: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org, pgsql-performance(at)postgresql(dot)org, Andrea Olson <Aolson(dot)CCAP(dot)Courts(at)wicourts(dot)gov>, Bill Severson <BSEVERS(dot)CCAP(dot)Courts(at)wicourts(dot)gov>, John Hutchins <jhutchi(dot)CCAP(dot)Courts(at)wicourts(dot)gov>, Randy Peterson <RPETERS(dot)CCAP(dot)Courts(at)wicourts(dot)gov>, Shannon Spranger <ssprang(dot)CCAP(dot)Courts(at)wicourts(dot)gov>
Subject: Re: EXISTS optimization
Date: 2007-03-24 13:07:59
Message-ID: 20070324130759.GA1089@svana.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-performance

On Fri, Mar 23, 2007 at 05:30:27PM -0500, Kevin Grittner wrote:
> I don't understand -- TRUE OR UNKNOWN evaluates to TRUE, so why would
> the IN need to continue? I'm not quite following the rest; could you
> elaborate or give an example? (Sorry if I'm lagging behind the rest
> of the class here.)

You're right, I'm getting confused with the interaction of NULL and NOT
IN.

The multiple evaluation thing still applies, but that's minor.

Have a nice day,
--
Martijn van Oosterhout <kleptog(at)svana(dot)org> http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.


From: "Craig A(dot) James" <cjames(at)modgraph-usa(dot)com>
To: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: [HACKERS] EXISTS optimization
Date: 2007-04-03 21:47:30
Message-ID: 4612CB72.6010804@modgraph-usa.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-performance

Kevin Grittner wrote:
> Management has simply given a mandate that the software be independent
> of OS and database vendor, and to use Java to help with the OS independence.
> ... we write all of our queries in ANSI SQL in our own query tool, parse it,
> and generate Java classes to run it.

A better solution, and one I've used for years, is to use OS- or database-specific features, but carefully encapsulate them in a single module, for example, "database_specific.java".

For example, when I started supporting both Oracle and Postgres, I encountered the MAX() problem, which (at the time) was very slow in Postgres, but could be replaced by "select X from MYTABLE order by X desc limit 1". So I created a function, "GetColumnMax()" that encapsulates the database-specific code for this. Similar functions encapsulate and a number of other database-specific optimizations.

Another excellent example: I have a function called "TableExists(name)". To the best of my knowledge, there simply is no ANSI SQL for this, so what do you do? Encapsulate it in one place.

The result? When I port to a new system, I know exactly where to find all of the non-ANSI SQL. I started this habit years ago with C/C++ code, which has the same problem: System calls are not consistent across the varients of Unix, Windows, and other OS's. So you put them all in one file called "machine_dependent.c".

Remember the old adage: There is no such thing as portable code, only code that has been ported.

Cheers,
Craig