How can an index be larger than a table

From: "David Roussel" <pgsql-performance(at)diroussel(dot)xsmail(dot)com>
To: pgsql-performance(at)postgresql(dot)org
Subject: How can an index be larger than a table
Date: 2005-04-21 10:13:10
Message-ID: 1114078390.26441.232391808@webmail.messagingengine.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

Hi,

I have a series of tables with identical structure. Some contain a few
thousand rows and some contain 3,000,000 rows. Another applicate writes
the rows and my applicate reads then just by selecting where pk >
last_seen_pk limit 2000.

I've found that one of the tables, when selecting from it that one of
the tables is many times slower than the others.

For instance when reading data in batches of 2000 rows, it seems to take
26 seconds to query from dave_data_update_events with 1593600, but only
1 or two seconds to query from jane_data_update_events with 3100000
rows!

This is ther SQL used....

|
|select
| events.event_id, ctrl.real_name, events.tsds, events.value,
| events.lds, events.correction, ctrl.type, ctrl.freq
|from dave_data_update_events events, dave_control ctrl
|where events.obj_id = ctrl.obj_id and
|events.event_id > 32128893::bigint
|order by events.event_id
|limit 2000
|

Here is the structure of the tables...

|
|CREATE TABLE dave_control (
| obj_id numeric(6,0) NOT NULL,
| real_name character varying(64) NOT NULL,
| "type" numeric(2,0) NOT NULL,
| freq numeric(2,0) NOT NULL
|);
|
|CREATE TABLE dave_data_update_events (
| lds numeric(13,0) NOT NULL,
| obj_id numeric(6,0) NOT NULL,
| tsds numeric(13,0) NOT NULL,
| value character varying(22) NOT NULL,
| correction numeric(1,0) NOT NULL,
| delta_lds_tsds numeric(13,0) NOT NULL,
| event_id bigserial NOT NULL
|);
|
|CREATE UNIQUE INDEX dave_control_obj_id_idx ON dave_control USING btree
(obj_id);
|ALTER TABLE dave_control CLUSTER ON dave_control_obj_id_idx;
|
|CREATE UNIQUE INDEX dave_control_real_name_idx ON dave_control USING
btree (real_name);
|
|CREATE INDEX dave_data_update_events_lds_idx ON dave_data_update_events
USING btree (lds);
|
|CREATE INDEX dave_data_update_events_obj_id_idx ON
dave_data_update_events USING btree (obj_id);
|
|ALTER TABLE ONLY dave_control
| ADD CONSTRAINT dave_control_obj_id_key UNIQUE (obj_id);
|
|ALTER TABLE ONLY dave_control
| ADD CONSTRAINT dave_control_real_name_key UNIQUE (real_name);
|
|ALTER TABLE ONLY dave_data_update_events
| ADD CONSTRAINT dave_data_update_events_event_id_key UNIQUE
(event_id);
|

There are several pairs of tables, but with names like rod, jane,
fredie, etc.. instead of dave.
The first thing to note about the scheme (not designed by me) is that
the control table is clustered on obj_id, but the data_update_events
table is not clustered. Does that mean the rows will be stored in order
of insert? That might be ok, because data_update_events table is like a
queue and I read it in the order the new rows are inserted.

What also seems weird to me is that the control table has some unique
indexes created on it, but the data_upate_events table just has a unique
constraint. Will postgres use an index in the background to enforce
this constraint?

When looking at the indexes on the all the tables in DbVisualiser my
colleague noticed that the cardinality of the indexes on the rod, jane
and fredie tables was consistent, but for dave the cardinality was
strange...

|
|SELECT relname, relkind, reltuples, relpages FROM pg_class WHERE
relname LIKE 'dave_data%';
|
|relname relkind reltuples relpages
|======================================= ======= ========= ========
|dave_data_update_events r 1593600.0 40209
|dave_data_update_events_event_id_keyi1912320.0 29271
|dave_data_update_events_event_id_seqS 1.0 1
|dave_data_update_events_lds_idx i 1593600.0 6139
|dave_data_update_events_obj_id_idx i 1593600.0 6139
|iso_pjm_data_update_events_obj_id_idxi1593600.0 6139
|

Note that there are only 1593600 rows in the table, so why the 1912320
figure?

Of course I checked that the row count was correct...

|
|EXPLAIN ANALYZE
|select count(*) from iso_pjm_data_update_events
|
|QUERY PLAN
|Aggregate (cost=60129.00..60129.00 rows=1 width=0) (actual
time=35933.292..35933.293 rows=1 loops=1)
| -> Seq Scan on iso_pjm_data_update_events (cost=0.00..56145.00
rows=1593600 width=0) (actual time=0.213..27919.497 rows=1593600
loops=1)
|Total runtime: 35933.489 ms
|

and...

|
|select count(*) from iso_pjm_data_update_events
|
|count
|1593600
|

so it's not that there are any undeleted rows lying around.

So any comments on the index structure? Any ideas why the cardinality
of the index is greater than the number of rows in the table? Was it
because the table used to be larger?

Also any ideas on how else to track down the big performance difference
between tables of the same structure?

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Shoaib Burq (VPAC) 2005-04-21 11:49:53 two queries and dual cpu (perplexed)
Previous Message Richard van den Berg 2005-04-21 08:15:40 Re: When are index scans used over seq scans?