Re: slow IN() clause for many cases

From: "Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>
To: Ilia Kantor <ilia(at)obnovlenie(dot)ru>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: slow IN() clause for many cases
Date: 2005-10-12 19:29:27
Message-ID: 36e682920510121229x2685b577y59ffb31b70fffd65@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

IMHO, we should first look at whether the O(N^2) behavior is needed.

On 10/12/05, Ilia Kantor <ilia(at)obnovlenie(dot)ru> wrote:
>
>
> 1) merge join is faster for me then hash join (6 vs 3-4ms):
>
> explain analyze select * from objects_hier where id in (select
>
> (array[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,
> 27,28,29,30])[i] from generate_series(1,30) s(i));
>
> Hash Join (cost=17.50..162.93 rows=223 width=600) (actual
> time=1.148..32.654 rows=138 loops=1)
> Hash Cond: ("outer".id =
>
> ('{1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,2
> 8,29,30}'::integer[])["inner".i])
> -> Seq Scan on objects_hier (cost=0.00..134.80 rows=1680 width=600)
> (actual time=0.235..12.271 rows=1680 loops=1)
> -> Hash (cost=17.00..17.00 rows=200 width=4) (actual time=0.893..0.893
> rows=30 loops=1)
> -> HashAggregate (cost=15.00..17.00 rows=200 width=4) (actual
> time=0.161..0.830 rows=30 loops=1)
> -> Function Scan on generate_series s (cost=0.00..12.50
> rows=1000 width=4) (actual time=0.037..0.097 rows=30 loops=1)
>
> 2) Is it possible to make the planner plan that better, e.g use Bitmap-OR
> for <20numbers, use some other method for >20 numbers (actual number
> depends
> on costs) ?
>
> Maybe that should be added to TODO ?
>
>
>
>
> -----Original Message-----
> From: pgsql-hackers-owner(at)postgresql(dot)org
> [mailto:pgsql-hackers-owner(at)postgresql(dot)org] On Behalf Of Andrew -
> Supernews
> Sent: Wednesday, October 12, 2005 1:41 PM
> To: pgsql-hackers(at)postgresql(dot)org
> Subject: Re: [HACKERS] slow IN() clause for many cases
>
> On 2005-10-11, "Ilia Kantor" <ilia(at)obnovlenie(dot)ru> wrote:
> > When in clause becomes large enough (>20-30 cases),
> > It is much better to use "join" way of processing..
>
> or even a different way of writing the IN clause.
>
> This one is one I've used after considerable research:
>
> select * from table
> where field in (select (some_array_of_N_items)[i]
> from generate_series(1,N) as s(i));
>
> This generally plans as a nestloop, with a HashAggregate of the function
> scan (of generate_series) on the outer path, and index lookups on the
> inner
> path.
>
> It's worth noting that EXPLAIN ANALYZE doesn't tell the whole story when
> comparing queries of this kind. The IN (1,2,...30) form is much slower to
> plan, and usually can't usefully be used in prepared form (you'd need to
> prepare it separately for every different number of items); in contrast,
> the array version can be prepared once and reused.
>
> As the number of items in the IN clause increases, the planning time grows
> rather radically. For example with 1000 items (here stashed in a psql
> convenience variable for brevity), using 8.1beta2:
>
> test=# prepare tstplan1 as select * from test where id in (:v);
> Time: 4881.702 ms
>
> compare:
>
> test=# prepare tstplan2 as select * from test where id in
> (select (ARRAY[:v])[i] from generate_series(1,1000) s(i));
> Time: 10.889 ms
>
> (on my machine the break-even point for these two is less than 20 items,
> or even less if the array is passed in as a literal or a single parameter
> rather than constructed with ARRAY[].)
>
> The actual execution time of these two is very close, with the second
> being about 10% slower on my system (31ms vs 34ms, based on \timing values
> from psql and averaged over several goes). However, the timings returned
> from EXPLAIN ANALYZE are much more skewed: 42ms vs 66ms as reported in the
> "total runtime" line. So not only is the planning time different, but also
> the instrumentation overhead of EXPLAIN ANALYZE is wildly different
> between
> the two forms.
>
> What this means is that unless you're going to prepare in advance every
> possible number of parameters to IN that your app is ever going to use,
> the only way to get useful performance for IN queries with more than a
> handful of literal values is to use an array method, in spite of the fact
> that the bitmap-OR execution plan is actually at least as fast.
>
> --
> Andrew, Supernews
> http://www.supernews.com - individual and corporate NNTP services
>
> ---------------------------(end of broadcast)---------------------------
> TIP 3: Have you checked our extensive FAQ?
>
> http://www.postgresql.org/docs/faq
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 6: explain analyze is your friend
>

--
Respectfully,

Jonah H. Harris, Database Internals Architect
EnterpriseDB Corporation
http://www.enterprisedb.com/

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Kevin Grittner 2005-10-12 19:54:15 Are cost estimates based on asserts?
Previous Message Emil Briggs 2005-10-12 19:07:23 Re: Spinlocks, yet again: analysis and proposed patches