Re: GiST for range types (was Re: Range Types - typo + NULL string constructor)

From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: GiST for range types (was Re: Range Types - typo + NULL string constructor)
Date: 2011-11-09 16:24:00
Message-ID: CAPpHfds5M21x4idTOO4gYyPEu3B0JBtUqX2srTDPa4dDNT3UyQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

New version of GiST for range types patch is here. This version seems to be
complete and ready for review.

Main features of this patch are:
1) Different handling of empty ranges. Problem of "column <@ const"
acceleration is that all empty ranges satisfies to it. It's because empty
range can be anywhere in the tree, because empty range is contained in any
range. My solution is to use additional bit in range flags especially for
GiST. This bit means that internal range have underlying empty ranges. This
approach allows to accelerate "column <@ const" case and also fast search
for empty ranges.
2) Penalty and picksplit methods are trying not to mix ordinal ranges with
special cases ((-inf;x), (x; +inf), (-inf, +inf) and empty). Picksplit
divides ranges to classes (see get_gist_range_class for details). If
several classes are presented in input vector, split will be between
classes. If we have only one class, picksplit method splits ranges using
double sorting split for ordinal ranges, and simple sorting split for
(-inf,x) and (x,+inf) cases.

Questions:
1) I'm not sure about whether we need support of <> in GiST, because it
always produces full index scan (except search for non-empty ranges).
2) With no selectivity estimation functions we have no GiST index usage for
&&, <@, @>. Possible temporarty solution is to create lower constant
selectivity estimation like we have for geometric datatypes.

A little testing:
Let's create dataset with 1000 empty ranges, 1000 (-inf, x) ranges, 1000
(x, inf) ranges, 1000 (-inf, + inf) ranges and 1000000 ordinal ranges and
mix them.

create table temp (r int4range);
insert into temp (r) (select int4range() from generate_series(1,1000));
insert into temp (r) (select int4range(NULL, NULL) from
generate_series(1,1000));
insert into temp (r) (select int4range((random()*1000000)::int, NULL) from
generate_series(1,1000));
insert into temp (r) (select int4range(NULL,(random()*1000000)::int) from
generate_series(1,1000));
insert into temp (select int4range(v,v+1+(random()*100)::int) from (select
(random()*1000000)::int v from generate_series(1,1000000)) x);
create table range_test as (select * from temp order by random());
drop table temp;

There are some results without patch.

test=# create index range_idx on range_test using gist (r);
CREATE INDEX
Time: 174578,153 ms

test=# explain (analyze, buffers) select * from range_test where r @>
int4range(500000,500010);
QUERY PLAN

--------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on range_test (cost=17659.03..29840.03 rows=502000
width=17) (actual time=87.961..116.074 rows=2028 loops=1)
Recheck Cond: (r @> '[500000,500010)'::int4range)
Buffers: shared hit=192 read=1856 written=886
-> Bitmap Index Scan on range_idx (cost=0.00..17533.53 rows=502000
width=0) (actual time=87.591..87.591 rows=2028 loops=1)
Index Cond: (r @> '[500000,500010)'::int4range)
Buffers: shared hit=185 read=139
Total runtime: 118.804 ms
(7 rows)

test=# explain (analyze, buffers) select * from range_test where r <@
int4range(500000,501000);
QUERY PLAN

------------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on range_test (cost=17659.03..29840.03 rows=502000
width=17) (actual time=1725.249..1744.327 rows=1994 loops=1)
Recheck Cond: (r <@ '[500000,501000)'::int4range)
Buffers: shared hit=3 read=8604
-> Bitmap Index Scan on range_idx (cost=0.00..17533.53 rows=502000
width=0) (actual time=1724.834..1724.834 rows=1994 loops=1)
Index Cond: (r <@ '[500000,501000)'::int4range)
Buffers: shared hit=3 read=6880
Total runtime: 1747.007 ms
(7 rows)

test=# explain (analyze, buffers) select * from range_test where r &&
int4range(500000,501000);
QUERY PLAN

--------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on range_test (cost=17659.03..29840.03 rows=502000
width=17) (actual time=88.196..105.423 rows=3082 loops=1)
Recheck Cond: (r && '[500000,501000)'::int4range)
Buffers: shared hit=1183 read=1545
-> Bitmap Index Scan on range_idx (cost=0.00..17533.53 rows=502000
width=0) (actual time=87.638..87.638 rows=3082 loops=1)
Index Cond: (r && '[500000,501000)'::int4range)
Buffers: shared hit=44 read=286
Total runtime: 109.486 ms
(7 rows)

And results for same queries with patch.

test=# create index range_idx on range_test using gist (r);
CREATE INDEX
Time: 79220,888 ms

test=# explain (analyze, buffers) select * from range_test where r @>
int4range(500000,500010);
QUERY PLAN

------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on range_test (cost=15954.99..28135.99 rows=502000
width=36) (actual time=9.699..36.352 rows=2028 loops=1)
Recheck Cond: (r @> '[500000,500010)'::int4range)
Buffers: shared hit=8 read=1734 written=556
-> Bitmap Index Scan on range_idx (cost=0.00..15829.49 rows=502000
width=0) (actual time=9.297..9.297 rows=2028 loops=1)
Index Cond: (r @> '[500000,500010)'::int4range)
Buffers: shared hit=8 read=10
Total runtime: 39.229 ms
(7 rows)

test=# explain (analyze, buffers) select * from range_test where r <@
int4range(500000,501000);
QUERY PLAN

--------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on range_test (cost=15954.99..28135.99 rows=502000
width=17) (actual time=15.886..25.571 rows=1994 loops=1)
Recheck Cond: (r <@ '[500000,501000)'::int4range)
Buffers: shared hit=1199 read=554
-> Bitmap Index Scan on range_idx (cost=0.00..15829.49 rows=502000
width=0) (actual time=15.518..15.518 rows=1994 loops=1)
Index Cond: (r <@ '[500000,501000)'::int4range)
Buffers: shared hit=23 read=6
Total runtime: 28.243 ms
(7 rows)

test=# explain (analyze, buffers) select * from range_test where r &&
int4range(500000,501000);
QUERY PLAN

------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on range_test (cost=15954.99..28135.99 rows=502000
width=17) (actual time=7.845..19.686 rows=3082 loops=1)
Recheck Cond: (r && '[500000,501000)'::int4range)
Buffers: shared hit=2389 read=33
-> Bitmap Index Scan on range_idx (cost=0.00..15829.49 rows=502000
width=0) (actual time=7.282..7.282 rows=3082 loops=1)
Index Cond: (r && '[500000,501000)'::int4range)
Buffers: shared hit=24
Total runtime: 23.728 ms
(7 rows)

We can see pretty dramatical acceleration. Also index build becomes faster,
actually I don't know why :)

------
With best regards,
Alexander Korotkov.

Attachment Content-Type Size
rangetypegist-0.2.patch.zip application/zip 10.6 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Kevin Grittner 2011-11-09 16:28:55 Re: const correctness
Previous Message Andrew Dunstan 2011-11-09 16:19:46 parallel make failure