Re: Regression in IN( field, field, field ) performance

Lists: pgsql-hackers
From: Jim 'Decibel!' Nasby <jnasby(at)cashnetusa(dot)com>
To: Postgres Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Regression in IN( field, field, field ) performance
Date: 2008-10-21 15:45:43
Message-ID: D8ED6CE5-3A4B-496B-A47B-91A0ABF10EFE@cashnetusa.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

WHERE '12814474045' IN (people.home_phone, people.work_phone,
people.mobile_phone)

Yeah, not exactly a common case, but at least in 8.1 this was turned
into a set of ORs. Starting in 8.2 and in current HEAD, the planner
turns that into:

Filter: ('12814474045'::text = ANY ((ARRAY[home_phone, mobile_phone,
work_phone])::text[]))

Which means automatic seqscan. Would it be difficult to teach the
planner to handle this case differently? I know it's probably not
terribly common, but it is very useful.
--
Decibel! jnasby(at)cashnetusa(dot)com (512) 569-9461


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Jim 'Decibel!' Nasby" <jnasby(at)cashnetusa(dot)com>
Cc: Postgres Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Regression in IN( field, field, field ) performance
Date: 2008-10-21 17:06:05
Message-ID: 27464.1224608765@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

"Jim 'Decibel!' Nasby" <jnasby(at)cashnetusa(dot)com> writes:
> WHERE '12814474045' IN (people.home_phone, people.work_phone,
> people.mobile_phone)

> Yeah, not exactly a common case, but at least in 8.1 this was turned
> into a set of ORs. Starting in 8.2 and in current HEAD, the planner
> turns that into:

> Filter: ('12814474045'::text = ANY ((ARRAY[home_phone, mobile_phone,
> work_phone])::text[]))

> Which means automatic seqscan.

It means no such thing.

regards, tom lane


From: Decibel! <decibel(at)decibel(dot)org>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Postgres Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Regression in IN( field, field, field ) performance
Date: 2008-10-21 18:30:53
Message-ID: 75B003A6-F923-4DFD-AA32-6AA6A35D1D2C@decibel.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Oct 21, 2008, at 12:06 PM, Tom Lane wrote:
> "Jim 'Decibel!' Nasby" <jnasby(at)cashnetusa(dot)com> writes:
>> WHERE 'xxxxxxxxxxx' IN (people.home_phone, people.work_phone,
>> people.mobile_phone)
>
>> Yeah, not exactly a common case, but at least in 8.1 this was turned
>> into a set of ORs. Starting in 8.2 and in current HEAD, the planner
>> turns that into:
>
>> Filter: ('xxxxxxxxxxx'::text = ANY ((ARRAY[home_phone, mobile_phone,
>> work_phone])::text[]))
>
>> Which means automatic seqscan.
>
> It means no such thing.

It won't use an index scan on this query while it's in that form
(even with enable_seqscan=off), but if I change it to a bunch of OR'd
conditions it will switch to bitmap scans. The estimated cost with
the seqscans is about 2x more expensive.
--
Decibel!, aka Jim C. Nasby, Database Architect decibel(at)decibel(dot)org
Give your computer some brain candy! www.distributed.net Team #1828


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Decibel! <decibel(at)decibel(dot)org>
Cc: Postgres Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Regression in IN( field, field, field ) performance
Date: 2008-10-21 20:47:26
Message-ID: 8930.1224622046@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Decibel! <decibel(at)decibel(dot)org> writes:
> On Oct 21, 2008, at 12:06 PM, Tom Lane wrote:
>> "Jim 'Decibel!' Nasby" <jnasby(at)cashnetusa(dot)com> writes:
>>> Filter: ('xxxxxxxxxxx'::text = ANY ((ARRAY[home_phone, mobile_phone,
>>> work_phone])::text[]))
>>
> Which means automatic seqscan.
>>
>> It means no such thing.

> It won't use an index scan on this query while it's in that form
> (even with enable_seqscan=off), but if I change it to a bunch of OR'd
> conditions it will switch to bitmap scans.

Works fine for me, eg

regression=# explain select * from tenk1 a, tenk1 b where
regression-# b.unique2 = any(array[a.unique1,a.ten,a.hundred]);
QUERY PLAN

--------------------------------------------------------------------------------
--
Nested Loop (cost=0.79..49047.50 rows=29997 width=488)
-> Seq Scan on tenk1 a (cost=0.00..458.00 rows=10000 width=244)
-> Bitmap Heap Scan on tenk1 b (cost=0.79..4.82 rows=3 width=244)
Recheck Cond: (b.unique2 = ANY (ARRAY[a.unique1, a.ten, a.hundred]))
-> Bitmap Index Scan on tenk1_unique2 (cost=0.00..0.79 rows=3 width=0
)
Index Cond: (b.unique2 = ANY (ARRAY[a.unique1, a.ten, a.hundred])
)
(6 rows)

You'll need to provide a concrete test case if you think there's
something broken here.

regards, tom lane


From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Decibel! <decibel(at)decibel(dot)org>, Postgres Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Regression in IN( field, field, field ) performance
Date: 2008-10-21 22:54:20
Message-ID: 87wsg1sgo3.fsf@oxford.xeocode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:

> Works fine for me, eg
...
> -> Bitmap Heap Scan on tenk1 b (cost=0.79..4.82 rows=3 width=244)
> Recheck Cond: (b.unique2 = ANY (ARRAY[a.unique1, a.ten, a.hundred]))
> -> Bitmap Index Scan on tenk1_unique2 (cost=0.00..0.79 rows=3 width=0
> )
> Index Cond: (b.unique2 = ANY (ARRAY[a.unique1, a.ten, a.hundred])

But that's an index on the lhs of the =ANY which in his example was just a
constant.

> You'll need to provide a concrete test case if you think there's
> something broken here.

I think he's looking for something like:

5 IN (col1,col2,col3)

resulting in a bitmap or of three index scans of three different indexes on
col1, col2, and col3.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's RemoteDBA services!


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Gregory Stark <stark(at)enterprisedb(dot)com>
Cc: Decibel! <decibel(at)decibel(dot)org>, Postgres Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Regression in IN( field, field, field ) performance
Date: 2008-10-23 16:16:16
Message-ID: 10073.1224778576@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Gregory Stark <stark(at)enterprisedb(dot)com> writes:
> Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:
>> Works fine for me, eg

> I think he's looking for something like:
> 5 IN (col1,col2,col3)
> resulting in a bitmap or of three index scans of three different indexes on
> col1, col2, and col3.

Ah, I see. It would be easy to make transformAExprIn() generate an OR
tree instead of = ANY(ARRAY[]), if we could figure out the conditions
where an OR tree is superior. I'm not sure it's easy to tell though.
Is it sufficient to do this when there are Vars on the right side and
none on the left?

regards, tom lane


From: Decibel! <decibel(at)decibel(dot)org>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Gregory Stark <stark(at)enterprisedb(dot)com>, Postgres Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Regression in IN( field, field, field ) performance
Date: 2008-10-25 05:32:30
Message-ID: D28FCC7D-D2EA-441D-8921-8214EC8B3BC7@decibel.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Oct 23, 2008, at 11:16 AM, Tom Lane wrote:
> Gregory Stark <stark(at)enterprisedb(dot)com> writes:
>> Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:
>>> Works fine for me, eg
>
>> I think he's looking for something like:
>> 5 IN (col1,col2,col3)
>> resulting in a bitmap or of three index scans of three different
>> indexes on
>> col1, col2, and col3.
>
> Ah, I see. It would be easy to make transformAExprIn() generate an OR
> tree instead of = ANY(ARRAY[]), if we could figure out the conditions
> where an OR tree is superior. I'm not sure it's easy to tell though.
> Is it sufficient to do this when there are Vars on the right side and
> none on the left?

There's 6 cases here, in a 2x3 array. In one dimension, the LHS can
be either a Var or a fixed value. In the other dimension, the three
possibilities are 1: everything on the RHS is a fixed value, 2: some
fixed, some not, 3: everything on the RHS is a variable:

1 2 3
------ Right Hand Side -------
A: LHS fixed All fixed Mixture All var.
B: LHS var. All fixed Mixture All var.

For A2 and A3, an OR is probably best. There's no way I can think of
to optimize A3 with an array, and with A2 you could get lucky and hit
something like 1 = 1. Hopefully the planner would check all the fixed
cases first.

For A1, an array might be best; it depends on if it's cheaper to
build a huge OR clause and evaluate, or to iterate through the array,
and that could depend on the number of terms.

B1 might actually be similar to A1... was testing done to see if ORs
were faster for a small number of elements?

For B3, the only use-case I can think of is comparing fields within a
record, and I can't see that resulting in a really large number of
terms (which would presumabbly favor an array). But if you turned it
into ORs, the planner could decide that it's better to use an index
on some/all of the terms on the RHS. That could end up being far
faster than using an array. An example would be field_in_small_table
IN ( field_a_in_large_table, field_b_in_large_table,
field_c_in_large_table ).

One final note: A2 and B2 could be treated as a combination. Treat
all the RHS fixed values as you would A1/B1, treat all the RHS
variables as you would A3/B3, and OR the results.

Ideally, the planner would understand the costs associated with how
many terms are involved and would act accordingly. But I don't know
that we can make it accurate enough to do that.

I think that the A3 and B3 cases should always be OR'd. Treating as
an array just ties the planner's hands too much.

Presumably A1/B1 should be done with arrays, otherwise we wouldn't
have moved away from ORs to begin with.

That leaves the mixed RHS case. If it's cheap to just split things
into two piles (fixed RHS vs variable RHS) then that's probably the
way to go. Ideally, each condition would then be estimated
separately, and the executor would favor executing the cheaper one
first.
--
Decibel!, aka Jim C. Nasby, Database Architect decibel(at)decibel(dot)org
Give your computer some brain candy! www.distributed.net Team #1828


From: "Robert Haas" <robertmhaas(at)gmail(dot)com>
To: Decibel! <decibel(at)decibel(dot)org>
Cc: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Gregory Stark" <stark(at)enterprisedb(dot)com>, "Postgres Hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Regression in IN( field, field, field ) performance
Date: 2008-10-25 18:31:06
Message-ID: 603c8f070810251131m4e9cf4efl8c6e877d1fbd2022@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> There's 6 cases here, in a 2x3 array. In one dimension, the LHS can be
> either a Var or a fixed value. In the other dimension, the three
> possibilities are 1: everything on the RHS is a fixed value, 2: some fixed,
> some not, 3: everything on the RHS is a variable:
[...lengthy discussion of cases...]

It seems like you're saying that the only time the array wins here is
when you're comparing an expression against a whole bunch of
constants. Given that, would it make any sense to put all the
constants in an ARRAY and use OR for any variables? i.e. given

Whatever IN (Const1, Var1, Const2, Var2, Const3, Var3)

Generate:

Whatever = ANY (ARRAY[Const1, Const2, Const3]) OR Whatever = Var1 OR
Whatever = Var2 OR Whatever = Var3

...Robert


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Robert Haas" <robertmhaas(at)gmail(dot)com>
Cc: Decibel! <decibel(at)decibel(dot)org>, "Gregory Stark" <stark(at)enterprisedb(dot)com>, "Postgres Hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Regression in IN( field, field, field ) performance
Date: 2008-10-26 01:55:47
Message-ID: 1218.1224986147@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

"Robert Haas" <robertmhaas(at)gmail(dot)com> writes:
> Given that, would it make any sense to put all the
> constants in an ARRAY and use OR for any variables? i.e. given

> Whatever IN (Const1, Var1, Const2, Var2, Const3, Var3)

> Generate:

> Whatever = ANY (ARRAY[Const1, Const2, Const3]) OR Whatever = Var1 OR
> Whatever = Var2 OR Whatever = Var3

Yeah, I like this --- it seems a bit more principled/easier to
understand than just arbitrarily switching between an all-ARRAY
and a no-ARRAY formulation, which is what I put in earlier today.
I'll go change that ...

regards, tom lane