Testing of various opclasses for ranges

Lists: pgsql-hackers
From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>
Subject: Testing of various opclasses for ranges
Date: 2012-07-09 23:33:03
Message-ID: CAPpHfduAJ_pX_v-MFraE7x2sgMSJ6jEty=s5eCFGT8d1axvzOA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hackers,

I've tested various opclasses for ranges (including currently in-core one
and my patches). I've looked into scholar papers for which datasets they
are using for testing. The lists below show kinds of datasets used in
papers.

1) "Advanced Indexing Technique for Temporal Data", "Indexing Valid Time
Intervals" uses two kinds of datasets:
a) uniformly distributed start of range and uniformly distributed size of
range
b) uniformly distributed start of range and exponentially distributed
size of range

2) "The Time index+: An incremental access structure for temporal
databases" uses dataset produced by simulating of objects "life". Each
object consists of number of versions which lifetime ranges are adjuncted.
Start of object life is uniformly distributed. Object lifetime is normally
distributed. Version time is exponentially distributed.

3) "Top-k Queries on Temporal Data"
a) Datasets of clusters. Each cluster center is normally distributed.
Offset of range inside cluster is also normally distributed. Range size is
distributed exponentially.
b) Real-life datasets http://www.cs.ucr.edu/~eamonn/time_series_data/.
Unfortunately I can't get password for them ((((

4) "Indexing Valid Time Intervals" uses two kinds of datasets
a) uniformly distributed start of range and exponentially distributed
size of range
b) uniformly distributed start of range and normally distributed size of
range

5) "Managing intervals efficiently in object-relational databases",
"Segment indexes: dynamic indexing techniques for multi-dimensional
interval data"
a) uniformly distributed start of range and exponentially distributed
size of range
b) uniformly distributed start of range and uniformly distributed size of
range
a) exponentially distributed start of range and exponentially distributed
size of range
b) exponentially distributed start of range and uniformly distributed
size of range

Therefore 3 basic distributions are used in synthetic datasets:
1) uniform
2) exponential
3) normal

Datasets can be classified into 3 kinds:
1) simple: ranges are distributed independently
2) clusters: ranges are grouped into clusters
3) lifetime: ranges are produced by life simulation

Each kind of dataset require some distributions for generation:
1) simple: range start and range size
2) clusters: cluster center, range offset, range size
3) lifetime: object lifetime start, object lifetime length, version length

In my testsuite each of 3 distribution is used in each slot. Additionally
mean size of range was varied. See attached range_test.php and
range_test_schema.sql.

I've merged all 3 patches into 1 (see 2d_map_range_indexing.patch). In this
patch following opclasses are available for ranges:
1) range_ops - currently in-core GiST opclass
2) range_ops2 - GiST opclass based on 2d-mapping
3) range_ops_quad - SP-GiST quad tree based opclass
4) range_ops_kd - SP-GiST k-d tree based opclass

There is average results of index build time depending on used operator
class.

test=# select opclass, avg(buildtime) from indexes group by opclass order
by opclass;
opclass | avg
----------------+------------------
range_ops | 16.1305697569772
range_ops2 | 21.6557774392386
range_ops_quad | 6.1000143980223
range_ops_kd | 4.97456835754334
(4 rows)

2d-mapping GiST is longer for build than 1d GiST. It seems to be inevitable
because 2d-mapping GiST has to try split in both dimensions and it has more
complicated calculations in penalty too. K-d tree is faster for build than
quad tree, I don't have convincing explanation why.

There is average number of page hits and average execution in milliseconds
for test queries depending on operator and opclass. We can see that
2d-mapping GiST use about 2 times less pages for "<@" operators. However,
it use a little more pages in "@>" and "&&" while I expected superiority
also for "@>". But average time of query execution is lower for all
operators. Probably, it's because consistent function appears to be more
efficiently implemented.
SP-GiST uses more pages than GiST. Especially k-d tree. However, some pages
could be counted more than once, but I think it's OK to compare SP-GiST
opclasses by this parameter between themselves. Also, k-d tree appears to
be slower than quad tree.

test=# select tr.operator, opclass, avg(tr.hits), avg(tr.time) from
test_results tr join indexes i on i.id = tr.index_id where tr.count > 0
group by tr.operator, i.opclass order by tr.operator, i.opclass;
operator | opclass | avg | avg
----------+----------------+----------------------+------------------
<@ | range_ops | 104.6797374137490417 | 5.63374670968584
<@ | range_ops2 | 57.6100418476871965 | 4.38479106503973
<@ | range_ops_kd | 362.4359706427293637 | 5.42760708296065
<@ | range_ops_quad | 185.4336426654740608 | 4.41001823648733
@> | range_ops | 102.7120835405271009 | 3.95064963751995
@> | range_ops2 | 112.8144023659347274 | 3.64920412729987
@> | range_ops_kd | 318.5045800727577272 | 3.54148674396084
@> | range_ops_quad | 118.8078201470857651 | 2.52750201523206
&& | range_ops | 104.6941111111111111 | 7.07480315079371
&& | range_ops2 | 106.7263531746031746 | 6.45640819841263
&& | range_ops_kd | 426.3542380952380952 | 7.41615567063479
&& | range_ops_quad | 247.2961468253968254 | 6.17403029761907
(12 rows)

In order to find why test results don't always meet my expectations I
filter tests to show only low-selectivity queries. Now, we can see
2d-mapping GiST is much more dramatically faster than 1d GiST on "<@" and
use slightly less amount of pages than 1d GiST. k-d tree appears to be much
slower than quad tree and use dramatically more pages. Probably, it is
caused by some bug. I will examine it.

test=# select tr.operator, opclass, avg(tr.hits), avg(tr.time) from
test_results tr join indexes i on i.id = tr.index_id where tr.count > 0 and
tr.count < 100 group by tr.operator, i.opclass order by tr.operator,
i.opclass;
operator | opclass | avg | avg
----------+----------------+----------------------+-------------------
<@ | range_ops | 111.1455328421708000 | 2.36361217183771
<@ | range_ops2 | 6.2042648127010480 | 0.181787537615442
<@ | range_ops_kd | 259.1292933485524541 | 1.39777430735706
<@ | range_ops_quad | 54.7487288575282764 | 0.318179931513958
@> | range_ops | 10.6586927630784101 | 0.309286698079016
@> | range_ops2 | 8.9114413434819379 | 0.215462788449922
@> | range_ops_kd | 286.6884740848133382 | 1.54235544279329
@> | range_ops_quad | 39.8035520115984052 | 0.231929382626555
&& | range_ops | 3.6001236093943140 | 0.217711372064277
&& | range_ops2 | 3.7082818294190358 | 0.18267737948084
&& | range_ops_kd | 299.8071693448702101 | 2.02300927070457
&& | range_ops_quad | 24.7459826946847960 | 0.183641532756489
(12 rows)

You can download full results of testing at
http://test.i-gene.ru/uploads/range_test.sql.gz.

------
With best regards,
Alexander Korotkov.

Attachment Content-Type Size
2d_map_range_indexing.patch application/octet-stream 121.5 KB
range_test.php application/x-httpd-php 7.9 KB
range_test_schema.sql application/octet-stream 7.5 KB

From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Jeff Davis <pgsql(at)j-davis(dot)com>
Subject: Re: Testing of various opclasses for ranges
Date: 2012-07-10 09:38:52
Message-ID: 4FFBF82C.4080102@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 10.07.2012 02:33, Alexander Korotkov wrote:
> Hackers,
>
> I've tested various opclasses for ranges (including currently in-core one
> and my patches). I've looked into scholar papers for which datasets they
> are using for testing. The lists below show kinds of datasets used in
> papers.

Great! That's a pretty comprehensive suite of datasets.

> I've merged all 3 patches into 1 (see 2d_map_range_indexing.patch). In this
> patch following opclasses are available for ranges:
> 1) range_ops - currently in-core GiST opclass
> 2) range_ops2 - GiST opclass based on 2d-mapping
> 3) range_ops_quad - SP-GiST quad tree based opclass
> 4) range_ops_kd - SP-GiST k-d tree based opclass

I think the ultimate question is, which ones of these should we include
in core? We cannot drop the existing range_ops opclass, if only because
that would break pg_upgrade. However, range_ops2 seems superior to it,
so I think we should make that the default for new indexes.

For SP-GiST, I don't think we need to include both quad and k-d tree
implementations. They have quite similar characteristics, so IMHO we
should just pick one. Which one would you prefer? Is there any
difference in terms of code complexity between them? Looking at the
performance test results, quad tree seems to be somewhat slower to
build, but is faster to query. Based on that, I think we should pick the
quad tree, query performance seems more important.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Jeff Davis <pgsql(at)j-davis(dot)com>
Subject: Re: Testing of various opclasses for ranges
Date: 2012-07-10 22:36:45
Message-ID: CAPpHfdtFWs3R6DiqXwW=3f=DVOxxwZmh92mOihu02DEug6yr5Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Jul 10, 2012 at 1:38 PM, Heikki Linnakangas <
heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:

> I think the ultimate question is, which ones of these should we include in
> core? We cannot drop the existing range_ops opclass, if only because that
> would break pg_upgrade. However, range_ops2 seems superior to it, so I
> think we should make that the default for new indexes.
>

Actually, I'm not fully satisfied with range_ops2. I expect it could be
recommend for all cases, but actually it builds significantly slower and
sometimes requires more pages for search. Likely, we have to write some
recommedation in docs about which opclass to use in particular.
Additionally, somebody could think GiST range indexing becoming tangled.

For SP-GiST, I don't think we need to include both quad and k-d tree
> implementations. They have quite similar characteristics, so IMHO we should
> just pick one. Which one would you prefer? Is there any difference in terms
> of code complexity between them? Looking at the performance test results,
> quad tree seems to be somewhat slower to build, but is faster to query.
> Based on that, I think we should pick the quad tree, query performance
> seems more important.

Agree, I think we should stay at quad tree implemetation.

------
With best regards,
Alexander Korotkov.