Re: slow IN() clause for many cases

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: andrew(at)supernews(dot)com
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: slow IN() clause for many cases
Date: 2005-10-14 22:55:10
Message-ID: 7976.1129330510@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Andrew - Supernews <andrew+nonews(at)supernews(dot)com> writes:
> With everything in cache, selecting 1000 random rows from a 200k row table,
> I get:
>
> for IN (list): planning time 47ms, execution time 27ms
> array/nestloop: planning time 8ms, execution time 34ms

The reason for the slow planning time is that IN (list) is currently
converted (by the grammar) into an OR structure:

(x = val1) OR (x = val2) OR (x = val3) OR (x = val4) OR ...

which means that the planner has to consider each subclause separately:
discover that it can be matched to an index, build the indexscan plan,
etc. Even just looking up all the "=" operators during parsing takes a
fair amount of time :-( The large number of plan nodes also imply
relatively slow executor startup.

It strikes me that now that we have the bitmap indexscan mechanism,
it wouldn't be hard to do better. I'm thinking that IN should be
converted to a ScalarArrayOpExpr, ie

x = ANY (ARRAY[val1,val2,val3,val4,...])

and then we could treat both OpExpr and ScalarArrayOpExpr as matching
an index when the LHS is the index key and the operator is in the
index's opclass. This wouldn't fit comfortably into a plain indexscan,
but a bitmap indexscan has already got the mechanism for ORing together
the results of several primitive indexscans (one per array element).
You just do the scans and insert all the results into your output
bitmap.

This approach would reduce the planner and executor-startup costs of
even a long IN list to be pretty comparable to a single index key
comparison. The actual runtime of the plan wouldn't change much,
I think, but it's the overhead that's hurting us here.

This also solves the problem mentioned earlier in the thread that
you can't make a prepared plan for varying lengths of IN-lists:
instead of using IN, you do something like
x = ANY($1::int[])
and then ship your lookup keys as a single array parameter instead of
multiple scalars.

The main objection I can see is that this technique requires the
elements of the IN list to be converted to a common datatype (ie, the
element datatype of the ARRAY construct). Historically our code has
made no such assumption. However, I believe that the SQL standard
expects that conversion to happen: "x IN (list)" is defined as
"x = ANY (VALUES list)" and <table value constructor> expects all its
rows to be converted to a common rowtype. So we could point to the
standard if anyone complains that the behavior changed.

Obviously this is not happening for 8.1, but it seems very do-able for
8.2.

Comments?

regards, tom lane

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2005-10-14 23:09:17 Re: slow IN() clause for many cases
Previous Message Neil Conway 2005-10-14 21:42:44 Re: Open items