Re: BUG #7495: chosen wrong index

Lists: pgsql-bugs
From: psql(at)elbrief(dot)de
To: pgsql-bugs(at)postgresql(dot)org
Subject: BUG #7495: chosen wrong index
Date: 2012-08-15 13:52:29
Message-ID: E1T1e1N-0004mk-2z@wrigleys.postgresql.org

The following bug has been logged on the website:

Bug reference: 7495
Logged by: Andreas
Email address: psql(at)elbrief(dot)de
PostgreSQL version: 9.1.4
Operating system: Debian Linux
Description:

Hello.

create table bla ( a int , b int ) ;

insert into bla ( a , b ) select a , a from generate_series( 1 , 1000000 )
as a ( a ) ;

create index bla_a on bla ( a ) ;

create index bla_b on bla ( b ) ;

explain analyze select * from bla where b > 990000 limit 10 ;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..0.27 rows=10 width=8) (actual time=0.150..0.173 rows=10 loops=1)
   ->  Index Scan using bla_b on bla  (cost=0.00..265.29 rows=10000 width=8) (actual time=0.147..0.159 rows=10 loops=1)
         Index Cond: (b > 990000)
 Total runtime: 0.226 ms

explain analyze select * from bla where b > 990000 order by a limit 10 ;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..26.32 rows=10 width=8) (actual time=991.096..991.113 rows=10 loops=1)
   ->  Index Scan using bla_a on bla  (cost=0.00..26322.29 rows=10000 width=8) (actual time=991.093..991.103 rows=10 loops=1)
         Filter: (b > 990000)
 Total runtime: 991.164 ms

explain analyze select * from ( select * from bla where b > 990000 union
select * from bla where b < 0 ) a order by a limit 10 ;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=835.76..835.78 rows=10 width=8) (actual time=51.551..51.571 rows=10 loops=1)
   ->  Sort  (cost=835.76..860.76 rows=10001 width=8) (actual time=51.547..51.548 rows=10 loops=1)
         Sort Key: wasnoch.bla.a
         Sort Method: top-N heapsort  Memory: 17kB
         ->  HashAggregate  (cost=419.62..519.63 rows=10001 width=8) (actual time=32.061..42.544 rows=10000 loops=1)
               ->  Append  (cost=0.00..369.62 rows=10001 width=8) (actual time=0.037..19.857 rows=10000 loops=1)
                     ->  Index Scan using bla_b on bla  (cost=0.00..265.29 rows=10000 width=8) (actual time=0.035..11.538 rows=10000 loops=1)
                           Index Cond: (b > 990000)
                     ->  Index Scan using bla_b on bla  (cost=0.00..4.31 rows=1 width=8) (actual time=0.012..0.012 rows=0 loops=1)
                           Index Cond: (b < 0)
 Total runtime: 51.678 ms

seq_page_cost = 1.0
random_page_cost = 20.0
restart server

explain analyze select * from bla where b > 997400 order by a limit 10 ;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=253.37..253.40 rows=10 width=8) (actual time=3.642..3.653 rows=10 loops=1)
   ->  Sort  (cost=253.37..259.87 rows=2600 width=8) (actual time=3.639..3.643 rows=10 loops=1)
         Sort Key: a
         Sort Method: top-N heapsort  Memory: 17kB
         ->  Index Scan using bla_b on bla  (cost=0.00..197.19 rows=2600 width=8) (actual time=0.041..2.155 rows=2600 loops=1)
               Index Cond: (b > 997400)
 Total runtime: 3.698 ms

seq_page_cost = 1.0
random_page_cost = 2.0
restart server

explain analyze select * from bla where b > 997400 order by a limit 10 ;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..101.24 rows=10 width=8) (actual time=726.649..726.667 rows=10 loops=1)
   ->  Index Scan using bla_a on bla  (cost=0.00..26322.29 rows=2600 width=8) (actual time=726.642..726.652 rows=10 loops=1)
         Filter: (b > 997400)
 Total runtime: 726.731 ms

explain analyze select * from bla where b > 997699 order by a limit 10 ;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=114.29..114.31 rows=10 width=8) (actual time=4.009..4.020 rows=10 loops=1)
   ->  Sort  (cost=114.29..120.04 rows=2301 width=8) (actual time=4.007..4.011 rows=10 loops=1)
         Sort Key: a
         Sort Method: top-N heapsort  Memory: 17kB
         ->  Index Scan using bla_b on bla  (cost=0.00..64.56 rows=2301 width=8) (actual time=0.068..2.448 rows=2301 loops=1)
               Index Cond: (b > 997699)
 Total runtime: 4.073 ms

I have also played with cpu_tuple_cost, cpu_index_tuple_cost,
and cpu_operator_cost, but I did not find a setting that makes
the planner choose index bla_b below b > 996000, even though up
to b > 900000 it is faster to use bla_b than bla_a.
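
Incidentally, all of the planner cost settings mentioned here are ordinary run-time parameters, so experiments like these can be run with per-session SET commands instead of editing the config file and restarting the server. A minimal sketch (the values are just examples):

```sql
-- Try alternative planner costs for this session only;
-- no server restart is needed.
SET random_page_cost = 2.0;
SET cpu_tuple_cost = 0.01;

EXPLAIN ANALYZE SELECT * FROM bla WHERE b > 997400 ORDER BY a LIMIT 10;

-- Return to the server defaults for this session.
RESET random_page_cost;
RESET cpu_tuple_cost;
```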

I think the planner estimates the costs incorrectly.

best regards,
Andreas


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: <psql(at)elbrief(dot)de>,<pgsql-bugs(at)postgresql(dot)org>
Subject: Re: BUG #7495: chosen wrong index
Date: 2012-08-15 15:08:21
Message-ID: 502B75150200002500049751@gw.wicourts.gov

<psql(at)elbrief(dot)de> wrote:

> insert into bla ( a , b )
> select a , a
> from generate_series( 1 , 1000000 ) as a ( a ) ;

> explain analyze select * from bla
> where b > 990000 order by a limit 10 ;
> [uses index on b and has a long run time]

The problem is that PostgreSQL doesn't have any sense of the
correlation between columns a and b (i.e., they are always equal)
and assumes that it will find enough matching rows soon enough on
the scan of the index on a to make that cheaper than sorting the
results of finding all rows that match the predicate. Try your test
suite again with the only change being the insert statement:

insert into bla ( a , b )
select a , floor(random() * 1000000) + 1
from generate_series( 1 , 1000000 ) as a ( a ) ;

On my machine, with that data, all of the queries run fast.

We've been looking at ways to develop statistics on multiple
columns, so that correlations like that don't confuse the optimizer,
and at ways to evaluate the "risk" of a query taking a long time
based on unexpected correlations.

Not really a bug; more like a recognized opportunity to improve the
optimizer.

-Kevin


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: <psql(at)elbrief(dot)de>,<pgsql-bugs(at)postgresql(dot)org>, "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
Subject: Re: BUG #7495: chosen wrong index
Date: 2012-08-15 19:38:54
Message-ID: 502BB47E020000250004975C@gw.wicourts.gov

"Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov> wrote:
> <psql(at)elbrief(dot)de> wrote:
>
>> insert into bla ( a , b )
>> select a , a
>> from generate_series( 1 , 1000000 ) as a ( a ) ;
>
>> explain analyze select * from bla
>> where b > 990000 order by a limit 10 ;
>> [uses index on b and has a long run time]
>
> The problem is that PostgreSQL doesn't have any sense of the
> correlation between columns a and b (i.e., they are always equal)

A belated thought on this -- you went right from your load to
running queries without leaving any time for autovacuum to kick in
and generate statistics. I recommend that somewhere after you
insert the data and before you run your selects, you run:

VACUUM ANALYZE bla;

This will get you to something which more closely resembles a
"steady state" environment.

-Kevin


From: psql(at)elbrief(dot)de
To: pgsql-bugs(at)postgresql(dot)org
Subject: Re-2: BUG #7495: chosen wrong index
Date: 2012-08-16 10:12:55
Message-ID: 1345111975.8Bc0D0.11866@debian2

"Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov> wrote:
> insert into bla ( a , b )
> select a , floor(random() * 1000000) + 1
> from generate_series( 1 , 1000000 ) as a ( a ) ;
>
> On my machine, with that data, all of the queries run fast.

Yes, that runs fast for me too, but there is no correlation
between a and b in that data. With my data, PostgreSQL has to
scan almost all of the rows to find the matches.

This is only an example. In my live environment I have a table
with a boolean column that is usually true. The column is false
for new or changed entries, and when I select the false rows
ordered by the primary key I get slow queries. The table has many
millions of rows and only a very small number of rows where the
column is false. The boolean indicates that the row has an entry
in a second table, so I can select the rows without a join, which
would also be expensive.

BTW: it would be nice to have an index which works for
select * from a left join b on b.id = a.id where b.id is null.
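
A common rewrite of that anti-join (my sketch; the table and column names are the illustrative ones from the query above) uses NOT EXISTS, which the planner can usually execute as an anti join driven by the index on b.id:

```sql
-- Rows of "a" that have no matching row in "b";
-- semantically the same as the LEFT JOIN ... IS NULL form.
SELECT a.*
FROM a
WHERE NOT EXISTS (SELECT 1 FROM b WHERE b.id = a.id);
```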

explain select * from bla where b > 990000 order by a limit 10 ;
QUERY PLAN
-------------------------------------------------------------------------------
 Limit  (cost=0.00..30.75 rows=10 width=8)
   ->  Index Scan using bla_a on bla  (cost=0.00..30747.29 rows=10000 width=8)
         Filter: (b > 990000)

drop index bla_a ;

explain select * from bla where b > 990000 order by a limit 10 ;
QUERY PLAN
-----------------------------------------------------------------------------------
 Limit  (cost=633.50..633.52 rows=10 width=8)
   ->  Sort  (cost=633.50..658.50 rows=10000 width=8)
         Sort Key: a
         ->  Index Scan using bla_b on bla  (cost=0.00..417.40 rows=10000 width=8)
               Index Cond: (b > 990000)

In the first explain, the limit reduces the cost by a factor of
10/10000. That is fine for data with no correlation between a and
b, but with my data the actual cost is about 1000 times higher.

BTW: the first explain is not really an index lookup; it is a
sequential scan over the index with a filter. It would be nice if
explain made that visible.

Perhaps the cost reduction the limit applies to such a sequential
scan over an index should not be linear.

Best regards,
Andreas


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: <psql(at)elbrief(dot)de>,<pgsql-bugs(at)postgresql(dot)org>
Subject: Re: Re-2: BUG #7495: chosen wrong index
Date: 2012-08-16 16:46:30
Message-ID: 502CDD9602000025000497B7@gw.wicourts.gov

<psql(at)elbrief(dot)de> wrote:

> In my live environment i have a table with a boolean where the
> boolean is usually true. The boolean is false on new or changed
> entrys and if i select the false-rows order by primary key i get
> slow querys. The table has a lot of million rows and a very small
> amount of rows with false.

That sure sounds like a situation where you should use a partial
index.

http://www.postgresql.org/docs/current/interactive/indexes-partial.html
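
A minimal sketch of that idea for the case described (table and column names are assumed, not from the original mail):

```sql
-- Index only the rare "false" rows, keyed by the primary key,
-- so "WHERE NOT flag ORDER BY id LIMIT n" can walk it directly.
-- The index stays tiny because it excludes the millions of
-- "true" rows.
CREATE INDEX big_table_unprocessed
    ON big_table (id)
    WHERE NOT flag;
```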

-Kevin