Quick Links

Re: Optimizer choosing smaller index instead of right one

Lists:	pgsql-general

From:	Ian Ribas <ian(dot)ribas(at)digitro(dot)com(dot)br>
To:	pgsql-general(at)postgresql(dot)org
Subject:	Optimizer choosing smaller index instead of right one
Date:	2004-04-28 12:11:36
Message-ID:	408F9F78.7000508@digitro.com.br
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-general

Hello,

This is probably a common problem, but I couldn't really find a direct answer in
the archives (or maybe just couldn't find one that satisfied me ;-). I created
an index specifically to help a query and the optimizer does not use it. It
prefers an older index that has one less column, but that yields much poorer
performance.

The query get the smallest date in a period, for some criteria. The one table
used in the query has 604494 rows. The index I created to try to optimize it is
"xie3cham_servico", and has all the three columns of the table that are used in
the query. But the index being used is actually xie2cham_servico (as can be seen
below) which has only two columns, one of them completely useless for the query!
I'm probably missing some setting, but I couldn't figure out which.

One last comment, if I test this on a server all by myself, after some test
(when, I imagine, all related disk pages are on cache) the time difference
between the two situations is much smaller (4secs against 2 secs), but, of
course, this doesn't help me much ...

Here is the associated information:

explain analyze SELECT dt_finalizacao FROM cham_servico WHERE id_chave_grupo =
'7458' AND cod_bxs = 1 AND dt_finalizacao > to_timestamp
('2004-04-20','YYYY-MM-DD') AND dt_finalizacao < now() ORDER BY dt_finalizacao
ASC LIMIT 1;

QUERY PLAN

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=15663.77..15663.77 rows=1 width=8) (actual time=14002.66..14002.66
rows=1 loops=1)
-> Sort (cost=15663.77..15663.83 rows=26 width=8) (actual
time=14002.65..14002.65 rows=2 loops=1)
Sort Key: dt_finalizacao
-> Index Scan using xie2cham_servico on cham_servico
(cost=0.00..15663.15 rows=26 width=8) (actual time=13604.37..13980.16 rows=3173
loops=1)
Index Cond: (id_chave_grupo = '7458'::character varying)
Filter: ((cod_bxs = 1) AND ((dt_finalizacao)::timestamp with time
zone > to_timestamp('2004-04-20'::text, 'YYYY-MM-DD'::text)) AND
((dt_finalizacao)::timestamp with time zone < now()))
Total runtime: 45014.06 msec
(7 rows)

Time: 45105,73 ms

If I "drop index xie2cham_servico;", I get :

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=17600.92..17600.92 rows=1 width=8) (actual time=8750.64..8750.64
rows=1 loops=1)
-> Sort (cost=17600.92..17600.99 rows=26 width=8) (actual
time=8750.63..8750.63 rows=2 loops=1)
Sort Key: dt_finalizacao
-> Index Scan using xie3cham_servico on cham_servico
(cost=0.00..17600.30 rows=26 width=8) (actual time=8432.37..8743.29 rows=3176
loops=1)
Index Cond: ((id_chave_grupo = '7458'::character varying) AND
(cod_bxs = 1))
Filter: (((dt_finalizacao)::timestamp with time zone >
to_timestamp('2004-04-20'::text, 'YYYY-MM-DD'::text)) AND
((dt_finalizacao)::timestamp with time zone < now()))
Total runtime: 8751.50 msec
(7 rows)

Time: 8758,36 ms

Some more environment info:

Postgres 7.3.4 on Linux Redhat 9.

show all:

From:	Ian Ribas <ian(dot)ribas(at)digitro(dot)com(dot)br>
To:	pgsql-general(at)postgresql(dot)org
Subject:	Re: Optimizer choosing smaller index instead of right one
Date:	2004-04-28 12:14:03
Message-ID:	408FA00B.5070407@digitro.com.br
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-general

Please ignore my previous message, its a duplicate.Sorry for the
inconvenience!

Ian.
Ian Ribas wrote:

> ....

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Ian Ribas <ian(dot)ribas(at)digitro(dot)com(dot)br>
Cc:	pgsql-general(at)postgresql(dot)org
Subject:	Re: Optimizer choosing smaller index instead of right one
Date:	2004-04-30 00:15:32
Message-ID:	18452.1083284132@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-general

Ian Ribas <ian(dot)ribas(at)digitro(dot)com(dot)br> writes:
> The index I created to try to optimize it is "xie3cham_servico", and
> has all the three columns of the table that are used in the query. But
> the index being used is actually xie2cham_servico (as can be seen
> below) which has only two columns, one of them completely useless for
> the query! I'm probably missing some setting, but I couldn't figure
> out which.

The optimizer's cost equations do say that a smaller index should be
preferred over a larger one if the number of rows obtainable from the
index is comparable (which I trust you will agree is reasonable).
I think the real issue here is a datatype conflict. Look closely
at the filter conditions:

> -> Index Scan using xie2cham_servico on cham_servico (cost=0.00..15663.15 rows=26 width=8) (actual time=13604.37..13980.16 rows=3173 loops=1)
> Index Cond: (id_chave_grupo = '7458'::character varying)
> Filter: ((cod_bxs = 1) AND ((dt_finalizacao)::timestamp with time zone > to_timestamp('2004-04-20'::text, 'YYYY-MM-DD'::text)) AND ((dt_finalizacao)::timestamp with time zone < now()))

The two comparisons on dt_finalizacao are unindexable because they
coerce it to a different datatype; the "<" and ">" are the operators on
timestamp with time zone and so they are not relevant to an index on
timestamp without time zone.

Very likely your column really ought to be of type timestamp with time
zone. If you are convinced it should be timestamp without time zone
then you need to fix the values you are comparing to.

I'm also a bit disturbed by the fact that the rows estimates are off by
more than two orders of magnitude (26 est vs 3173 actual). It would
perhaps help if you wrote the boundary date as a simple constant:

... AND dt_finalizacao > '2004-04-20' AND ...

so that the planner had some chance of comparing the constant to its
statistics about the distribution of dt_finalizacao. to_timestamp is
not a constant-foldable function (because it depends on the current
TimeZone setting) and so the query as you've written it looks like

... AND dt_finalizacao > unpredictable_expression AND ...

as far as the planner knows. (I'm not sure this will help though;
it may be that the big problem is that the three columns are not
independent, which is something the planner doesn't know because it has
no cross-column statistics.)

regards, tom lane

From:	Ian Ribas <ian(dot)ribas(at)digitro(dot)com(dot)br>
To:	pgsql-general(at)postgresql(dot)org
Subject:	Re: Optimizer choosing smaller index instead of right one
Date:	2004-04-30 17:08:08
Message-ID:	409287F8.5040209@digitro.com.br
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-general

Effectively, I hadn't seen the cast being done on the field
dt_finalizacao! The query actually runs inside a pgplsql procedure, so
the parameters are unknown constants (variables), but the situation got
much much better when I forced a cast on the constant part of the
condition (as follows), so allowing the planner to correctly choose an
index :

explain analyze SELECT dt_finalizacao
FROM cham_servico
WHERE id_chave_grupo = '7458' AND cod_bxs = 1
AND dt_finalizacao > CAST(to_timestamp ('2004-04-20','YYYY-MM-DD') AS
TIMESTAMP(0) WITHOUT TIME ZONE)
AND dt_finalizacao < CAST(now() AS TIMESTAMP(0) WITHOUT TIME ZONE)
ORDER BY dt_finalizacao ASC LIMIT 1;

QUERY
PLAN

------------------------------------------------------------------------------------------------------------
Limit (cost=0.00..364.98 rows=1 width=8) (actual time=52.73..52.90
rows=1 loops=1)
-> Index Scan using xie1cham_servico on cham_servico
(cost=0.00..9756.63 rows=27 width=8) (actual time=52.73..52.89 rows=2
loops=1)
Index Cond: ((dt_finalizacao >
(to_timestamp('2004-04-20'::text, 'YYYY-MM-DD'::text))::timestamp(0)
without time zone) AND (dt_finalizacao < (now())::timestamp(0) without
time zone))
Filter: ((id_chave_grupo = '7458'::character varying) AND
(cod_bxs = 1))
Total runtime: 53.11 msec
(5 rows)

Unfortunately, I really must use timestamp without time zone for some
other unrelated reasons ...

Thank you very much for your attention

Best wishes,
Ian.

Tom Lane wrote:

>Ian Ribas <ian(dot)ribas(at)digitro(dot)com(dot)br> writes:
>
>
>>The index I created to try to optimize it is "xie3cham_servico", and
>>has all the three columns of the table that are used in the query. But
>>the index being used is actually xie2cham_servico (as can be seen
>>below) which has only two columns, one of them completely useless for
>>the query! I'm probably missing some setting, but I couldn't figure
>>out which.
>>
>>
>
>The optimizer's cost equations do say that a smaller index should be
>preferred over a larger one if the number of rows obtainable from the
>index is comparable (which I trust you will agree is reasonable).
>I think the real issue here is a datatype conflict. Look closely
>at the filter conditions:
>
>
>
>> -> Index Scan using xie2cham_servico on cham_servico (cost=0.00..15663.15 rows=26 width=8) (actual time=13604.37..13980.16 rows=3173 loops=1)
>> Index Cond: (id_chave_grupo = '7458'::character varying)
>> Filter: ((cod_bxs = 1) AND ((dt_finalizacao)::timestamp with time zone > to_timestamp('2004-04-20'::text, 'YYYY-MM-DD'::text)) AND ((dt_finalizacao)::timestamp with time zone < now()))
>>
>>
>
>The two comparisons on dt_finalizacao are unindexable because they
>coerce it to a different datatype; the "<" and ">" are the operators on
>timestamp with time zone and so they are not relevant to an index on
>timestamp without time zone.
>
>Very likely your column really ought to be of type timestamp with time
>zone. If you are convinced it should be timestamp without time zone
>then you need to fix the values you are comparing to.
>
>I'm also a bit disturbed by the fact that the rows estimates are off by
>more than two orders of magnitude (26 est vs 3173 actual). It would
>perhaps help if you wrote the boundary date as a simple constant:
>
> ... AND dt_finalizacao > '2004-04-20' AND ...
>
>so that the planner had some chance of comparing the constant to its
>statistics about the distribution of dt_finalizacao. to_timestamp is
>not a constant-foldable function (because it depends on the current
>TimeZone setting) and so the query as you've written it looks like
>
> ... AND dt_finalizacao > unpredictable_expression AND ...
>
>as far as the planner knows. (I'm not sure this will help though;
>it may be that the big problem is that the three columns are not
>independent, which is something the planner doesn't know because it has
>no cross-column statistics.)
>
> regards, tom lane
>
>
>