Statistics and selectivity estimation for ranges

From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Statistics and selectivity estimation for ranges
Date: 2012-08-04 09:31:07
Message-ID: CAPpHfdtbdtM1_rhta+2cNdB4Pj7MH8EATQZHDPzK3s64qS61eg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hackers,

attached patch is for collecting statistics and selectivity estimation for
ranges.

In order to make our estimations accurate for every distribution of
ranges, we would collect 2d-distribution of lower and upper bounds of range
into some kind of 2d-histogram. However, this patch use some simplification
and assume distribution of lower bound and distribution of length to be
independent. We can get distribution of lower bound from standard scalar
statistics and thi patch additionally collect statistics for range length.
This patch includes selectivity estimations for "&&", "@>" and "<@"
operators on ranges. Linear interpolation is used in order to get more
accurate results.

Some examples with test dataset where left bound of range is distributed by
gaussian distribution and length of range is distributed by exponential
distribution.

test=# CREATE TABLE range_test as (SELECT int4range(lb, lb + len) AS r FROM
(SELECT (sqrt(-2*ln(random())) * sin(2*pi()*random()) * 1000000)::int as
lb, (-10000*ln(1.0 - random()) + 1)::int as len FROM
generate_series(1,1000000)) x);
SELECT 1000000

test=# ANALYZE range_test;
ANALYZE

test=# EXPLAIN ANALYZE SELECT * FROM range_test WHERE r &&
int4range(700000,710000);
QUERY PLAN

-----------------------------------------------------------------------------------------------------------------
Seq Scan on range_test (cost=0.00..17906.00 rows=7535 width=14) (actual
time=0.119..403.494 rows=6138 loops=1)
Filter: (r && '[700000,710000)'::int4range)
Rows Removed by Filter: 993862
Total runtime: 403.945 ms
(4 rows)

test=# EXPLAIN ANALYZE SELECT * FROM range_test WHERE r &&
int4range(200000,300000);
QUERY PLAN

-------------------------------------------------------------------------------------------------------------------
Seq Scan on range_test (cost=0.00..17906.00 rows=*42427* width=14)
(actual time=0.100..401.079 rows=*42725* loops=1)
Filter: (r && '[200000,300000)'::int4range)
Rows Removed by Filter: 957275
Total runtime: 403.055 ms
(4 rows)

test=# EXPLAIN ANALYZE SELECT * FROM range_test WHERE r <@
int4range(100000,150000);
QUERY PLAN

-------------------------------------------------------------------------------------------------------------------
Seq Scan on range_test (cost=0.00..17906.00 rows=*15341* width=14)
(actual time=0.129..382.114 rows=*16014* loops=1)
Filter: (r <@ '[100000,150000)'::int4range)
Rows Removed by Filter: 983986
Total runtime: 382.985 ms
(4 rows)

test=# EXPLAIN ANALYZE SELECT * FROM range_test WHERE r <@
int4range(600000,603000);
QUERY PLAN

---------------------------------------------------------------------------------------------------------------
Seq Scan on range_test (cost=0.00..17906.00 rows=*122* width=14) (actual
time=1.527..383.511 rows=*127* loops=1)
Filter: (r <@ '[600000,603000)'::int4range)
Rows Removed by Filter: 999873
Total runtime: 383.586 ms
(4 rows)

test=# EXPLAIN ANALYZE SELECT * FROM range_test WHERE r @>
int4range(100000,100400);
QUERY PLAN

-----------------------------------------------------------------------------------------------------------------
Seq Scan on range_test (cost=0.00..17906.00 rows=*5166* width=14) (actual
time=0.238..377.712 rows=*3909* loops=1)
Filter: (r @> '[100000,100400)'::int4range)
Rows Removed by Filter: 996091
Total runtime: 378.018 ms
(4 rows)

test=# EXPLAIN ANALYZE SELECT * FROM range_test WHERE r @>
int4range(500000,530000);
QUERY PLAN

----------------------------------------------------------------------------------------------------------------
Seq Scan on range_test (cost=0.00..17906.00 rows=*342* width=14) (actual
time=11.796..382.986 rows=*171* loops=1)
Filter: (r @> '[500000,530000)'::int4range)
Rows Removed by Filter: 999829
Total runtime: 383.066 ms
(4 rows)

------
With best regards,
Alexander Korotkov.

Attachment Content-Type Size
range_stat-0.2.patch.gz application/x-gzip 9.7 KB

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Heikki Linnakangas 2012-08-04 14:21:06 Re: [WIP] Performance Improvement by reducing WAL for Update Operation
Previous Message Amit Kapila 2012-08-04 08:01:29 [WIP] Performance Improvement by reducing WAL for Update Operation