From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Using multidimensional indexes in ordinal queries
Date: 2010-06-06 20:04:10
Message-ID: AANLkTimhFaq6hCibRnk0tlcQMIyhYWHwAQ2ZD87wbH86@mail.gmail.com
Lists: pgsql-hackers

Hello hackers,

I would like to share some thoughts about the use of multidimensional
indexes for queries that deal with ordinal one-dimensional data types. I
think that gist indexes (especially with knngist) can produce a great
benefit for complex multi-criterion queries.
Let's consider an example. I use postgresql-9.0beta1 with the knngist
patch. I have also created a simple patch that allows knngist to be used
for ordinal sorting in the cube extension (patch attached). My patch
introduces the "<*>" operator. The first operand is a cube and the second
operand is a number n. If n = 2*k, ascending ordering by dimension k
occurs; if n = 2*k + 1, descending ordering by dimension k occurs. For now
this operator has a limitation: it works only with non-negative coordinate
values.
Let's create a table with 3 floating-point columns and fill it with 10M rows:

create table test (id serial primary key, v1 double precision, v2 double
precision, v3 double precision);
insert into test (v1,v2,v3) (select random()*1000, random()*1000,
random()*1000 from generate_series(1,10000000,1));
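
After the bulk load it may help to run ANALYZE so that the planner has
fresh statistics (an assumed step, not shown in the original session):

analyze test;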

Now, let's create 3 separate btree indexes and one gist cube index.

create index test_v1_idx on test(v1);
create index test_v2_idx on test(v2);
create index test_v3_idx on test(v3);
create index test_cube_idx on test using gist(cube(ARRAY[v1,v2,v3]));
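
With the table and gist index in place, the operator encoding described
above works out as follows. Ascending order by v3 means dimension k = 2,
so n = 2*k = 4; the descending variant with n = 2*k + 1 = 5 is shown only
to illustrate the rule and is not run in the session below:

select * from test order by cube(array[v1,v2,v3]) <*> 4 limit 10; -- ascending by v3
select * from test order by cube(array[v1,v2,v3]) <*> 5 limit 10; -- descending by v3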

Let's consider a complex query with filtering, ordering, and a limit.

test=# select * from test where v1 between 480 and 500 and v2 between 480 and 500 order by v3 limit 10;
    id    |        v1        |        v2        |        v3
----------+------------------+------------------+-------------------
 12283631 | 485.982828773558 | 496.795611456037 | 0.213871244341135
  4936086 | 497.239370364696 | 491.878624074161 |  1.26481195911765
  8963067 | 484.963194001466 | 497.094289399683 |  1.30057940259576
 12435440 | 498.670902103186 | 498.667187988758 |  1.33110675960779
 11667415 | 494.398592971265 | 497.440234292299 |  1.44533207640052
  8530558 |  482.85893118009 | 496.267838869244 |  1.48530444130301
  4004942 | 483.679085504264 | 489.547223784029 |  1.57393841072917
 14897796 |  491.37338064611 | 487.475242000073 |  1.81775307282805
  4105759 | 489.506138022989 |  486.91446846351 |  1.94038823246956
 12895656 | 499.508572742343 | 487.065799534321 |  2.34963605180383
(10 rows)

test=# explain analyze select * from test where v1 between 480 and 500 and v2 between 480 and 500 order by v3 limit 10;
                                                                             QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=22786.73..22786.75 rows=10 width=28) (actual time=3242.135..3242.162 rows=10 loops=1)
   ->  Sort  (cost=22786.73..22797.59 rows=4345 width=28) (actual time=3242.131..3242.144 rows=10 loops=1)
         Sort Key: v3
         Sort Method:  top-N heapsort  Memory: 25kB
         ->  Bitmap Heap Scan on test  (cost=8755.91..22692.83 rows=4345 width=28) (actual time=1281.030..3234.934 rows=4027 loops=1)
               Recheck Cond: ((v1 >= 480::double precision) AND (v1 <= 500::double precision) AND (v2 >= 480::double precision) AND (v2 <= 500::double precision))
               ->  BitmapAnd  (cost=8755.91..8755.91 rows=4345 width=0) (actual time=1280.783..1280.783 rows=0 loops=1)
                     ->  Bitmap Index Scan on test_v1_idx  (cost=0.00..4243.12 rows=202177 width=0) (actual time=644.702..644.702 rows=200715 loops=1)
                           Index Cond: ((v1 >= 480::double precision) AND (v1 <= 500::double precision))
                     ->  Bitmap Index Scan on test_v2_idx  (cost=0.00..4510.37 rows=214902 width=0) (actual time=630.085..630.085 rows=200200 loops=1)
                           Index Cond: ((v2 >= 480::double precision) AND (v2 <= 500::double precision))
 Total runtime: 3242.253 ms
(12 rows)

This query can be rewritten in order to let the planner use the gist cube index.

test=# select * from test where cube(array[v1,v2,v3]) <@ cube(array[480,480,'-Infinity'::float],array[500,500,'+Infinity'::float]) order by cube(array[v1,v2,v3]) <*> 4 limit 10;
    id    |        v1        |        v2        |        v3
----------+------------------+------------------+-------------------
 12283631 | 485.982828773558 | 496.795611456037 | 0.213871244341135
  4936086 | 497.239370364696 | 491.878624074161 |  1.26481195911765
  8963067 | 484.963194001466 | 497.094289399683 |  1.30057940259576
 12435440 | 498.670902103186 | 498.667187988758 |  1.33110675960779
 11667415 | 494.398592971265 | 497.440234292299 |  1.44533207640052
  8530558 |  482.85893118009 | 496.267838869244 |  1.48530444130301
  4004942 | 483.679085504264 | 489.547223784029 |  1.57393841072917
 14897796 |  491.37338064611 | 487.475242000073 |  1.81775307282805
  4105759 | 489.506138022989 |  486.91446846351 |  1.94038823246956
 12895656 | 499.508572742343 | 487.065799534321 |  2.34963605180383
(10 rows)

test=# explain analyze select * from test where cube(array[v1,v2,v3]) <@ cube(array[480,480,'-Infinity'::float],array[500,500,'+Infinity'::float]) order by cube(array[v1,v2,v3]) <*> 4 limit 10;
                                                              QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..39.10 rows=10 width=28) (actual time=5.186..20.047 rows=10 loops=1)
   ->  Index Scan using test_cube_idx on test  (cost=0.00..39100.88 rows=10000 width=28) (actual time=5.177..20.012 rows=10 loops=1)
         Index Cond: (cube(ARRAY[v1, v2, v3]) <@ '(480, 480, -inf),(500, 500, inf)'::cube)
         Sort Cond: (cube(ARRAY[v1, v2, v3]) <*> 4)
 Total runtime: 20.145 ms
(5 rows)

So, the result is the same, but the query runs about 150 times faster.
I think that using a multidimensional index on ordinal data types can
produce the following benefits:
1) It will be possible to create one multidimensional index instead of
many unidimensional ones. (Of course, it will be slower on simple
queries.)
2) On some complex queries (like the one above) it can produce a great
performance benefit.

And now my question: how large a change to the planner would be needed to
make it choose a multidimensional index freely, without rewriting the
query in this special manner? In my example I replaced "v1 between 480 and
500 and v2 between 480 and 500" with "cube(array[v1,v2,v3]) <@
cube(array[480,480,'-Infinity'::float],array[500,500,'+Infinity'::float])"
and "order by v3" with "order by cube(array[v1,v2,v3]) <*> 4", but it
would be a great option if the planner could consider a plan with the gist
cube index for the original query.

With best regards,
Alexander Korotkov.

Attachment Content-Type Size
cube.gz application/x-gzip 1.5 KB

From: Thom Brown <thombrown(at)gmail(dot)com>
To: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Using multidimensional indexes in ordinal queries
Date: 2010-06-21 09:19:49
Message-ID: AANLkTilXl0dbV9v4aLN66iYKu5Q1ff6jvCZhp7X5ETAq@mail.gmail.com
Lists: pgsql-hackers

On 6 June 2010 21:04, Alexander Korotkov <aekorotkov(at)gmail(dot)com> wrote:
> [ original message quoted in full; trimmed ]

I can't answer this, but is anyone else able to provide Alexander some feedback?

Thom


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Thom Brown <thombrown(at)gmail(dot)com>
Cc: Alexander Korotkov <aekorotkov(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Using multidimensional indexes in ordinal queries
Date: 2010-06-21 13:42:47
Message-ID: AANLkTimd77AEV3-4eyh86ugvC0vVjMaoQApXWKEs5c6w@mail.gmail.com
Lists: pgsql-hackers

On Mon, Jun 21, 2010 at 5:19 AM, Thom Brown <thombrown(at)gmail(dot)com> wrote:
> I can't answer this, but is anyone else able to provide Alexander some feedback?

It seems like you can get more or less the same benefit from a
multicolumn btree index. On my system, with the individual btree
indices, the query ran in 7625 ms; with an additional index on (v1,
v2, v3), it ran in 94 ms. I didn't get the same plans as Alexander
did, though, so it may not really be apples to apples. See attached
session trace.
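
For reference, a minimal version of that comparison might look like the
sketch below; the index name is made up here, and the timings quoted above
come from the attached session, not from this sketch:

create index test_v1_v2_v3_idx on test (v1, v2, v3);
explain analyze select * from test where v1 between 480 and 500 and v2 between 480 and 500 order by v3 limit 10;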

Having said that, I'm very interested in hearing what other ideas
people have for using indices to speed up "ORDER BY" operations.
Currently, we support only "ORDER BY <indexed-value>". KNNGIST will
allow "ORDER BY <indexed-value> <op> <constant>", but why stop there?
In theory, an index might know how to order the data by any arbitrary
expression the user might happen to enter. If the user asks for
"ORDER BY (<indexed-value> <op1> <constan1t>) <op2> <constan2t>", who
is to say that we can't use an index scan to get that ordering
quickly? (As a trivial case, suppose both ops are +, but there could
easily be more interesting ones.) Or what about "ORDER BY
somefunc(<indexed-value>)"? The trouble is that it's hard to think of
a way of teaching the planner about these cases without hard-coding
lots and lots of special-case kludges into the planner. Still, if
someone has a clever idea...
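
To make those cases concrete, here are hypothetical examples against the
test table from upthread; only the first form is something KNNGIST would
handle, and none of the others are index-assisted today:

select * from test order by cube(array[v1,v2,v3]) <*> 4 limit 10;  -- ORDER BY <indexed-value> <op> <constant>
select * from test order by (v3 + 10) + 5 limit 10;                -- (<indexed-value> <op1> <constant1>) <op2> <constant2>, both ops +
select * from test order by abs(v3 - 500) limit 10;                -- ORDER BY somefunc(<indexed-value>)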

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

Attachment Content-Type Size
multicolumn-btree-index.txt text/plain 3.7 KB

From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Thom Brown <thombrown(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Using multidimensional indexes in ordinal queries
Date: 2010-06-21 21:20:31
Message-ID: AANLkTim8yryTqBV7rVucvN1h9SpIMgt8fFEmXJSO1_sE@mail.gmail.com
Lists: pgsql-hackers

On Mon, Jun 21, 2010 at 5:42 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

> It seems like you can get more or less the same benefit from a
> multicolumn btree index. On my system, with the individual btree
> indices, the query ran in 7625 ms; with an additional index on (v1,
> v2, v3), it ran in 94 ms. I didn't get the same plans as Alexander
> did, though, so it may not really be apples to apples. See attached
> session trace.
>
The benefit of the multicolumn btree index was more or less the same as
the cube benefit because of very bad picksplit behavior in this case. I
have attached a patch which significantly improves cube index search
performance:

test=# explain (analyze, buffers) select * from test where cube(ARRAY[v1,v2,v3]) <@ cube(ARRAY[480,480,'-inf'::float8], ARRAY[500,500,'+inf'::float8]) order by cube(ARRAY[v1,v2,v3]) <*> 4 LIMIT 10;
                                                             QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..38.07 rows=10 width=28) (actual time=0.495..0.570 rows=10 loops=1)
   Buffers: shared hit=21
   ->  Index Scan using test_cube_idx on test  (cost=0.00..38064.52 rows=10000 width=28) (actual time=0.489..0.537 rows=10 loops=1)
         Index Cond: (cube(ARRAY[v1, v2, v3]) <@ '(480, 480, -inf),(500, 500, inf)'::cube)
         Sort Cond: (cube(ARRAY[v1, v2, v3]) <*> 4)
         Buffers: shared hit=21
 Total runtime: 0.659 ms
(7 rows)

At present this patch greatly increases tree construction time, but I
believe a picksplit implementation that is good enough for both tree
search and tree construction can be found.

> The trouble is that it's hard to think of
> a way of teaching the planner about these cases without hard-coding
> lots and lots of special-case kludges into the planner. Still, if
> someone has a clever idea...

I think that two things can be done to improve the situation:
1) Make knngist deal with negative values. I think this would make it
easier to use knngist for plain sorting, not only for k-neighbor
searching.
2) Let gist interface methods take care of multicolumn indexes. I think
that if the cube index from the example above were constructed on the
separate columns v1, v2, v3, it would be easier for the planner to use the
cube index for queries with filters on these columns (see the sketch
below). I don't know exactly how to do that.
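
As a rough illustration of idea 2, the index would be declared on the
separate columns rather than on an expression -- this is hypothetical
target syntax; with today's code a multicolumn gist index over plain
float8 columns would need the btree_gist opclasses and would not behave
like the cube index:

create index test_multicol_idx on test using gist (v1, v2, v3);

The goal would be for the original query ("where v1 between 480 and 500
and v2 between 480 and 500 order by v3") to match such an index without
any cube(...) rewrite.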

Attachment Content-Type Size
cube.gz application/x-gzip 4.4 KB

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc: Thom Brown <thombrown(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Using multidimensional indexes in ordinal queries
Date: 2010-06-21 21:58:44
Message-ID: AANLkTikBHy_ilFO-19q_CMavsLjCWSUHDQEZhpHO5CXj@mail.gmail.com
Lists: pgsql-hackers

On Mon, Jun 21, 2010 at 5:20 PM, Alexander Korotkov
<aekorotkov(at)gmail(dot)com> wrote:
> 1) Make knngist deal with negative values. I think this will make easier
> using knngist just for sorting, not only k-neighbor searching.

It doesn't? I didn't think it was making any assumptions about the
ordering data type beyond the fact that it had a default btree
opclass.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company


From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Thom Brown <thombrown(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Using multidimensional indexes in ordinal queries
Date: 2010-06-22 19:35:45
Message-ID: AANLkTimVLJxvr5kcQFCiRHWBqQkSwyLZj-ZD-HjN_rMk@mail.gmail.com
Lists: pgsql-hackers

On Tue, Jun 22, 2010 at 1:58 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

> It doesn't? I didn't think it was making any assumptions about the
> ordering data type beyond the fact that it had a default btree
> opclass.
>
Actually, the return type of the consistent method was replaced by
float8. Negative values are used for the "inconsistent" state, and
non-negative values are used both for "consistent" and for ordering.