WIP: 2d-mapping based indexing for ranges

From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: WIP: 2d-mapping based indexing for ranges
Date: 2012-05-28 19:42:19
Message-ID: CAPpHfdsCtD_5wfKW8AO4_jvroOt0eY_rqu0kAON-J_xEH25g7g@mail.gmail.com
Lists: pgsql-hackers

Hackers,

The attached patch implements another approach to indexing ranges: mapping
the lower and upper bounds to 2d coordinates and using the same indexing
approaches as for 2d points. The patch provides a range_ops2 operator class
which implements this approach.
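To illustrate the idea behind the mapping (a hypothetical Python sketch with closed intervals, not the patch's C code): a range becomes the point (lower, upper), and range predicates become region predicates over those points, so 2d indexing techniques apply directly.

```python
# Hypothetical sketch: a range is represented as the 2d point (lower, upper).
# Range operators then reduce to simple region tests on those points.

def overlaps(a, b):
    # a && b: the point (b.lower, b.upper) lies in the intersection of the
    # half-planes b.lower <= a.upper and b.upper >= a.lower
    return b[0] <= a[1] and b[1] >= a[0]

def contained_by(a, q):
    # a <@ q: the point (a.lower, a.upper) lies inside the finite square
    # [q.lower, q.upper] x [q.lower, q.upper]
    return q[0] <= a[0] and a[1] <= q[1]

r = (10, 20)
assert overlaps(r, (15, 30))
assert not overlaps(r, (21, 30))
assert contained_by((12, 18), r)
assert not contained_by((12, 25), r)
```

The point is that "contained by" maps to a bounded rectangle, which a 2d index can localize, while a single-axis representation can bound only one coordinate at a time.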

I did some tests on a real-life dataset (attached periods.sql.gz). This
dataset contains date ranges from real-life data provided by Benedikt
Grundmann.

Index build is slightly slower.

test=# create index period_idx on period_test using gist (period);
CREATE INDEX
Time: 48493,241 ms

test=# create index period_idx2 on period_test2 using gist (period
range_ops2);
CREATE INDEX
Time: 54351,251 ms

This seems inevitable, because more complicated calculations are required.
For example, the picksplit algorithm works along two axes instead of one.

"Overlaps" queries seem to be slightly faster.

test=# explain (analyze, buffers) select * from period_test where period &&
daterange('2011-08-10', '2011-08-11');
                                                           QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on period_test  (cost=471.65..13716.47 rows=11866 width=32) (actual time=107.260..147.541 rows=335190 loops=1)
   Recheck Cond: (period && '[2011-08-10,2011-08-11)'::daterange)
   Buffers: shared read=8563
   ->  Bitmap Index Scan on period_idx  (cost=0.00..468.68 rows=11866 width=0) (actual time=106.600..106.600 rows=335190 loops=1)
         Index Cond: (period && '[2011-08-10,2011-08-11)'::daterange)
         Buffers: shared read=3878
 Total runtime: 160.210 ms
(7 rows)

test=# explain (analyze, buffers) select * from period_test2 where period &&
daterange('2011-08-10', '2011-08-11');
                                                           QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on period_test2  (cost=553.17..16034.26 rows=11866 width=32) (actual time=76.386..122.193 rows=335190 loops=1)
   Recheck Cond: (period && '[2011-08-10,2011-08-11)'::daterange)
   Buffers: shared read=9837
   ->  Bitmap Index Scan on period_idx2  (cost=0.00..550.21 rows=11866 width=0) (actual time=75.448..75.448 rows=335190 loops=1)
         Index Cond: (period && '[2011-08-10,2011-08-11)'::daterange)
         Buffers: shared read=3495
 Total runtime: 134.942 ms
(7 rows)

However, the difference is small, so more comprehensive testing is required
before drawing a conclusion.

"Contained by" queries with small selectivity are dramatically faster, as
expected.

test=# explain (analyze, buffers) select * from period_test where period <@
daterange('2011-08-10', '2011-08-11');
                                                           QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on period_test  (cost=102.08..6140.68 rows=2373 width=32) (actual time=88.899..89.160 rows=619 loops=1)
   Recheck Cond: (period <@ '[2011-08-10,2011-08-11)'::daterange)
   Buffers: shared read=3957
   ->  Bitmap Index Scan on period_idx  (cost=0.00..101.49 rows=2373 width=0) (actual time=88.878..88.878 rows=619 loops=1)
         Index Cond: (period <@ '[2011-08-10,2011-08-11)'::daterange)
         Buffers: shared read=3878
 Total runtime: 89.246 ms
(7 rows)

test=# explain (analyze, buffers) select * from period_test2 where period
<@ daterange('2011-08-10', '2011-08-11');
                                                           QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on period_test2  (cost=119.60..6497.06 rows=2373 width=32) (actual time=3.696..4.709 rows=619 loops=1)
   Recheck Cond: (period <@ '[2011-08-10,2011-08-11)'::daterange)
   Buffers: shared read=165
   ->  Bitmap Index Scan on period_idx2  (cost=0.00..119.01 rows=2373 width=0) (actual time=3.644..3.644 rows=619 loops=1)
         Index Cond: (period <@ '[2011-08-10,2011-08-11)'::daterange)
         Buffers: shared read=77
 Total runtime: 4.881 ms
(7 rows)
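The large speedup is consistent with the geometry of the mapping: for a narrow query range, all "contained by" matches fall into a tiny square, while "overlaps" matches fill a much larger region. A hypothetical simulation with uniformly random ranges (not the periods.sql.gz data) shows the contrast in how much of the point space each predicate touches:

```python
import random

random.seed(0)
# Hypothetical uniform sample of 10000 ranges over [0, 1000),
# each represented as the 2d point (lower, upper).
points = []
for _ in range(10000):
    lo, hi = sorted(random.uniform(0, 1000) for _ in range(2))
    points.append((lo, hi))

q = (400, 420)  # a narrow query range, like the one-day daterange above

# a <@ q: both coordinates confined to the small square [400, 420]^2
contained = [p for p in points if q[0] <= p[0] and p[1] <= q[1]]
# a && q: only one half-plane constraint per coordinate
overlapping = [p for p in points if p[0] <= q[1] and q[0] <= p[1]]

# A 2d index can discard far more of the space for <@ than for &&.
print(len(contained), len(overlapping))
```

This matches the plans above: the contained-by scan reads only 77 index buffers with range_ops2, versus 3878 with range_ops.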

I failed to construct "contains" queries with really low selectivity on
this dataset, because of its specific distribution. On "contains" queries
the new indexing method is also faster.

test=# explain (analyze, buffers) select * from period_test where period @>
daterange('2011-08-10', '2011-12-31');
                                                           QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on period_test  (cost=102.08..6140.68 rows=2373 width=32) (actual time=59.015..76.653 rows=118097 loops=1)
   Recheck Cond: (period @> '[2011-08-10,2011-12-31)'::daterange)
   Buffers: shared read=4533
   ->  Bitmap Index Scan on period_idx  (cost=0.00..101.49 rows=2373 width=0) (actual time=58.621..58.621 rows=118097 loops=1)
         Index Cond: (period @> '[2011-08-10,2011-12-31)'::daterange)
         Buffers: shared read=1799
 Total runtime: 81.238 ms
(7 rows)

test=# explain (analyze, buffers) select * from period_test2 where period
@> daterange('2011-08-10', '2011-12-31');
                                                           QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on period_test2  (cost=119.60..6497.06 rows=2373 width=32) (actual time=36.657..58.656 rows=118097 loops=1)
   Recheck Cond: (period @> '[2011-08-10,2011-12-31)'::daterange)
   Buffers: shared read=5494
   ->  Bitmap Index Scan on period_idx2  (cost=0.00..119.01 rows=2373 width=0) (actual time=36.123..36.123 rows=118097 loops=1)
         Index Cond: (period @> '[2011-08-10,2011-12-31)'::daterange)
         Buffers: shared read=1450
 Total runtime: 63.400 ms
(7 rows)

Again, the advantage is not large enough to draw a conclusion from a single
dataset.

Now, let's talk about problems.

1) Handling infinities. While working on range_ops2 I wanted to avoid the
rather bloated logic that I implemented in range_ops. I tried to handle
infinities like just very large values. However, this approach turns out to
be much less effective: instead of keeping infinities separately, the index
produces elongated areas which are bad for performance. It seems that I
have to implement similar handling of key "classes", but even more
complicated because of the larger number of "classes".
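The key "classes" idea could look roughly like this (a hypothetical Python sketch, not the range_ops C code): classify each bound before any coordinate arithmetic, so that infinite bounds are grouped by class rather than folded into the coordinates as very large values that stretch bounding areas.

```python
from enum import IntEnum

# Hypothetical bound classes: an infinite bound is kept as a class tag
# instead of being treated as a "very big value", which would produce
# the elongated bounding areas described above.
class BoundClass(IntEnum):
    NEG_INF = 0
    FINITE = 1
    POS_INF = 2

def classify(bound):
    if bound == float("-inf"):
        return BoundClass.NEG_INF
    if bound == float("inf"):
        return BoundClass.POS_INF
    return BoundClass.FINITE

def key_class(lower, upper):
    # Each range falls into one of up to 3 x 3 = 9 classes; index keys of
    # the same class can then be grouped together, and coordinate math is
    # applied only to the finite components.
    return (classify(lower), classify(upper))

assert key_class(1, 5) == (BoundClass.FINITE, BoundClass.FINITE)
assert key_class(1, float("inf")) == (BoundClass.FINITE, BoundClass.POS_INF)
```

With two bounds per range the number of classes grows compared to the one-axis case, which is where the extra complexity mentioned above comes from.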

2) Limitations of the "penalty" expressive power. The GiST penalty function
returns just one float value. This is a limitation even for geometrical
datatypes. Some papers on R-trees recommend inserting the new entry into
the box with minimal area when several boxes have the same extension area
(quite possible if the extension area is zero). We can't express that using
a single float value.
For 2d-mapping range indexing the situation is even harder. Imagine that
some set of ranges shares the same lower or upper bound. In 2d space such
ranges are united into lines. But we can't express the increase of a line's
length in the penalty, because the area of a line is always zero
independently of its length. Actually, a similar problem exists for
geometrical datatypes, but it seems to me this problem is more urgent for
ranges, because I believe that in real-life data a large set of ranges
could really share the same lower or upper bound.
Probably, we would like to rank key extension cases as follows or similar:
a) extension of a finite dimension to infinite
b) extension in the dimension perpendicular to an infinite one
c) extension with finite area increment
d) extension with zero area increment (extension of line length)
So, it's impossible to fit all of these into one float, and it's hard to
choose what to neglect and how.
I think there are at least two ways to make the GiST interface sensitive
enough to these things:
a) Make the penalty function return more than one float. The question is
how many floats we allow.
b) Create an alternative "choose" interface function which would return the
set of most preferred keys for insertion. Usage of such a function could
cause a slowdown, because currently GiST stops the scan when it meets zero
penalty.
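Option (a) could, for instance, mean comparing a small vector of floats lexicographically. A hypothetical sketch (not a concrete GiST API proposal): a two-component penalty of (area increment, perimeter increment) already distinguishes the zero-area "line" keys described above, which a single area-based float cannot.

```python
# Hypothetical multi-component penalty, compared lexicographically.
# Rectangles are (x1, y1, x2, y2); degenerate rectangles (lines) arise
# when many ranges share the same lower or upper bound.

def rect_union(a, b):
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))

def area(r):
    return (r[2] - r[0]) * (r[3] - r[1])

def perimeter(r):
    return 2 * ((r[2] - r[0]) + (r[3] - r[1]))

def penalty(key, new):
    u = rect_union(key, new)
    # First component: area increment (what a single float expresses today).
    # Second component: perimeter increment, breaking ties among zero-area
    # extensions such as lengthening a line.
    return (area(u) - area(key), perimeter(u) - perimeter(key))

line_a = (0, 0, 0, 10)    # zero-width key spanning y in [0, 10]
line_b = (0, 0, 0, 100)   # zero-width key spanning y in [0, 100]
p = (0, 15, 0, 15)        # new degenerate entry at y = 15

# Both unions still have zero area, so an area-only penalty sees no
# difference; the perimeter component prefers line_b, which needs no growth.
assert penalty(line_a, p) == (0, 10)
assert penalty(line_b, p) == (0, 0)
assert penalty(line_b, p) < penalty(line_a, p)
```

Additional leading components could encode the infinite-extension cases a) and b) above; the open question remains how many components the interface should allow.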

3) It appears that with this patch we have three distinct implementations
of the double-sorting split algorithm. It seems that I should generalize
them behind some interface.

------
With best regards,
Alexander Korotkov.

Attachment Content-Type Size
2d_rangegist-0.3.patch.gz application/x-gzip 14.6 KB
periods.sql.gz application/x-gzip 802.1 KB
