Again on index correlation

From: Manfred Koizar <mkoi-pg(at)aon(dot)at>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Again on index correlation
Date: 2003-08-20 19:21:14
Message-ID: dhd7kvs4niqijnerr9mi38oeih1o7j2s28@4ax.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Recent discussion of index cost estimation ("[HACKERS] Correlation in
cost_index()" ca. two weeks ago) has lead to the conclusion that the
column correlation calculated by VACUUM does not always help when we
want to find out how well index access order corresponds to physical
tuple position. Most problems arise when dealing with multi-column or
functional indices. Another case is when there are many equal values
in the first index column.

What we really want is not column correlation, but index correlation
which turns out to be surprisingly easy to calculate. There's no need
to look up comparison functions and handle different datatypes;
there's no need to look at the key values at all: Just read the index
items in index order, sort them by heap page, and compute the Spearman
Rho function. This even works for non-btree indices.

Try it yourself:
. download http://www.pivot.at/pg/contrib_icorrel.tgz
. unpack
. make
. make install
. psql
. \i path/to/share/postgresql/contrib/icorrel.sql
. SELECT icorrel('myindex'::regclass);

This should work with 7.3.x and 7.4Beta.

How could the planner make use of index correlation? Here
(http://www.pivot.at/pg/22-IndexCorrel_74b1.diff) is an experimental
patch that introduces a new system table pg_indexstat. If the GUC
variable enable_indexstat is set to on, genericcostestimate tries to
get index correlation from pg_indexstat; if there is none,
btcostestimate falls back to the old method.

Compatibility: Although there is a new catalog table, initdb is not
required. A patched postmaster still works with an old cluster, even
without creating pg_catalog.pg_indexstat. Before you can make use of
pg_indexstat you have to create it via a standalone backend:

$ bin/postgres -D data -O template1

backend> CREATE TABLE pg_catalog.pg_indexstat( \
istindex oid NOT NULL, \
istcorrel float4 NOT NULL) WITHOUT OIDS;
backend> CREATE UNIQUE INDEX pg_indexstat_index_index \
ON pg_indexstat(istindex);

Repeat this for each database.

Usage example (using Sean's data):

psql testdb
testdb=# \d rucc
Table "public.rucc"
Column | Type | Modifiers
---------------+--------------------------+-----------
user_id | integer | not null
category_id | integer | not null
img_bytes | bigint | not null
img_hits | integer | not null
html_bytes | bigint | not null
html_hits | integer | not null
unknown_bytes | bigint | not null
unknown_hits | integer | not null
utc_date | timestamp with time zone | not null
time_interval | interval | not null
Indexes:
"rucc_htmlbytes_idx" btree (html_bytes),
"rucc_id_date_idx" btree (user_id, utc_date)

testdb=# SELECT 'rucc_id_date_idx'::regclass::oid;
oid
---------
1281422
(1 row)

testdb=# set enable_seqscan = off;
SET

testdb=# set enable_indexstat = on;
SET

testdb=# INSERT INTO pg_indexstat VALUES (1281422, 0.0001);
INSERT 0 1
testdb=# EXPLAIN SELECT * FROM rucc WHERE user_id < 1000;
QUERY PLAN
--------------------------------------------------------------------------------
Index Scan using rucc_id_date_idx on rucc (cost=0.00..634342.50
rows=139802 width=64)
Index Cond: (user_id < 1000)
(2 rows)

testdb=# UPDATE pg_indexstat SET istcorrel=0.1 WHERE istindex=1281422;
testdb=# EXPLAIN SELECT ...

istcorrel | cost
----------+----------
0.0001 | 634342.50
0.1 | 514678.48
0.2 | 407497.85
0.5 | 161612.89
0.9 | 10299.07
1.0 | 3994.32

Actually the table is clustered on rucc_id_date_idx, so index
correlation is 1.0, but there is no way to know that, when we only
have the column correlations for user_id (1.0) and utc_date (0.59).
The current code guesses the index correlation to be 0.5 which gives a
cost estimate that is far too high.

For comparison:
seq scan estimated cost ~ 21000, actual ~ 11500,
index scan actual ~ 4000

If you are going to test this patch, please be aware that I created it
on top of another one of my experimental patches
(http://www.pivot.at/pg/16d-correlation_74b1.diff). If you don't want
to apply this one, one hunk of the IndexCorrel patch will fail in
selfuncs.c. Should be no problem to apply it manually.

And those who are still experimenting with 7.3.4 performance, can use
http://www.pivot.at/pg/16d-correlation_734.diff and
http://www.pivot.at/pg/22-IndexCorrel_74b1.diff.

ToDo:

.. Move get_index_correlation from contrib into the backend.

.. ANALYZE table computes index correlation for all indexes.

.. New command ANALYZE index?

.. System cache invalidation?
syscache.c: reloidattr = Anum_pg_indexstat_istindex ?

.. Dependency?

.. Remove GUC variable enable_indexstat

.. Remove old method in btcostestimate()

.. Automatic catalog upgrade

Servus
Manfred

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2003-08-20 19:39:26 Re: Buglist
Previous Message Robert Treat 2003-08-20 18:57:10 Re: Need concrete "Why Postgres not MySQL" bullet