Query with order by and limit is very slow - wrong index used

From: Nowak Michał <michal(dot)nowak(at)me(dot)com>
To: pgsql-performance(at)postgresql(dot)org
Subject: Query with order by and limit is very slow - wrong index used
Date: 2011-10-03 09:44:21
Message-ID: 6B8FADC8-9692-439F-8AA4-0CAA287FCEB0@me.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

Since it's my first on this list, I'd like to say "Hi guys" :)

Here is definition of my table:
a9-dev=> \d records;
Table "public.records"
Column | Type | Modifiers
--------------------------------------+-----------------------------+-----------
id | bigint | not null
checksum | character varying(32) | not null
data | text | not null
delete_date | timestamp without time zone |
last_processing_date | timestamp without time zone |
object_id | character varying(255) | not null
processing_path | character varying(255) | not null
schema_id | character varying(255) | not null
source_id | character varying(255) | not null
source_object_last_modification_date | timestamp without time zone | not null
Indexes:
"records_pkey" PRIMARY KEY, btree (id)
"unq_records_0" UNIQUE, btree (object_id, schema_id, source_id, processing_path)
"length_processing_path_id_idx" btree (length(processing_path::text), id)
"length_processing_path_idx" btree (length(processing_path::text))
"object_id_id_idx" btree (object_id, id)
"schema_id_id_idx" btree (schema_id, id)
"schema_id_idx" btree (schema_id)
"source_id_id_idx" btree (source_id, id)
"source_id_idx" btree (source_id)
"source_object_last_modification_date_id_idx" btree (source_object_last_modification_date, id)
"source_object_last_modification_date_idx" btree (source_object_last_modification_date)

Average length of value of "data" column = 2991.7947061626100466

When I perform query such as this: "select * from records where source_id = 'XXX' order by id limit 200;" I expect DB to use index source_id_id_idx with XXX as filter. It is true for all but one values of XXX - when I ask for records with most common source_id, records_pkey index is used instead and performance is terrible! Explain analyze results below.

a9-dev=> explain analyze select * from records where source_id ='http://ebuw.uw.edu.pl/dlibra/oai-pmh-repository.xml' order by id limit 200;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.00..755.61 rows=200 width=1127) (actual time=75.292..684.582 rows=200 loops=1)
-> Index Scan using source_id_id_idx on records (cost=0.00..1563542.89 rows=413849 width=1127) (actual time=75.289..684.495 rows=200 loops=1)
Index Cond: ((source_id)::text = 'http://ebuw.uw.edu.pl/dlibra/oai-pmh-repository.xml'::text)
Total runtime: 690.358 ms
(4 rows)

a9-dev=> explain analyze select * from records where source_id ='http://www.wbc.poznan.pl/dlibra/oai-pmh-repository.xml' order by id limit 200;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.00..466.22 rows=200 width=1127) (actual time=124093.485..124095.540 rows=200 loops=1)
-> Index Scan using records_pkey on records (cost=0.00..2333280.84 rows=1000937 width=1127) (actual time=124093.484..124095.501 rows=200 loops=1)
Filter: ((source_id)::text = 'http://www.wbc.poznan.pl/dlibra/oai-pmh-repository.xml'::text)
Total runtime: 124130.247 ms
(4 rows)

Some info about data distrubution:

a9-dev=> select min(id) from records;
min
--------
190830
(1 row)

a9-dev=> select min(id), max(id) from records where source_id='http://ebuw.uw.edu.pl/dlibra/oai-pmh-repository.xml';
min | max
---------+---------
1105217 | 3811326
(1 row)
a9-dev=> select min(id), max(id) from records where source_id='http://www.wbc.poznan.pl/dlibra/oai-pmh-repository.xml';
min | max
---------+---------
1544991 | 3811413
(1 row)

a9-dev=> select min(id), max(id) from (select id from records where source_id = 'http://ebuw.uw.edu.pl/dlibra/oai-pmh-repository.xml' order by id limit 200) as a;
min | max
---------+---------
1105217 | 1105416
(1 row)

a9-dev=> select min(id), max(id) from (select id from records where source_id = 'http://www.wbc.poznan.pl/dlibra/oai-pmh-repository.xml' order by id limit 200) as a;
min | max
---------+---------
1544991 | 1545190
(1 row)

a9-dev=> select source_id, count(*) from records where source_id = 'http://ebuw.uw.edu.pl/dlibra/oai-pmh-repository.xml' or source_id = 'http://www.wbc.poznan.pl/dlibra/oai-pmh-repository.xml' group by source_id;
source_id | count
--------------------------------------------------------+--------
http://ebuw.uw.edu.pl/dlibra/oai-pmh-repository.xml | 427254
http://www.wbc.poznan.pl/dlibra/oai-pmh-repository.xml | 989184
(2 rows)

a9-dev=> select count(*) from records;
count
---------
3620311
(1 row)

DB settings:

a9-dev=> SELECT
a9-dev-> 'version'::text AS "name",
a9-dev-> version() AS "current_setting"
a9-dev-> UNION ALL
a9-dev-> SELECT
a9-dev-> name,current_setting(name)
a9-dev-> FROM pg_settings
a9-dev-> WHERE NOT source='default' AND NOT name IN
a9-dev-> ('config_file','data_directory','hba_file','ident_file',
a9-dev(> 'log_timezone','DateStyle','lc_messages','lc_monetary',
a9-dev(> 'lc_numeric','lc_time','timezone_abbreviations',
a9-dev(> 'default_text_search_config','application_name',
a9-dev(> 'transaction_deferrable','transaction_isolation',
a9-dev(> 'transaction_read_only');
name | current_setting
--------------------------+-----------------------------------------------------------------------------------------------------------------
version | PostgreSQL 9.0.4 on x86_64-redhat-linux-gnu, compiled by GCC gcc (GCC) 4.6.0 20110530 (Red Hat 4.6.0-9), 64-bit
lc_collate | en_US.UTF-8
lc_ctype | en_US.UTF-8
listen_addresses | *
log_rotation_age | 1d
log_rotation_size | 0
log_truncate_on_rotation | on
logging_collector | on
max_connections | 100
max_stack_depth | 2MB
port | 5432
server_encoding | UTF8
shared_buffers | 24MB
TimeZone | Poland
(14 rows)

This query was always slow. Autovacuum is on, and I ran VACUUM ANALYZE manually few minutes before writing this email.

Please help me with my problem. I'll be happy to provide any additional information if needed.
Michal Nowak

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Gregg Jaskiewicz 2011-10-03 10:02:26 Re: Query with order by and limit is very slow - wrong index used
Previous Message David Boreham 2011-10-03 05:57:42 Re: Suggestions for Intel 710 SSD test