Re: Cube extension improvement, GSoC

From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Jay Levitt <jay(dot)levitt(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Stas Kelvich <stanconn(at)gmail(dot)com>
Subject: Re: Cube extension improvement, GSoC
Date: 2013-03-30 11:55:33
Message-ID: CAPpHfdtsKPx7Kxy8eAAdD=H1V5Eh5H+TEz5sxUvF=twURgBAnw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi, Jey!

I remember I've started couple of threads related to cube extension:
http://www.postgresql.org/message-id/4F30616D.3030509@gmail.com
http://www.postgresql.org/message-id/4F3C16E9.90808@gmail.com
Could you provide some feedback to GSoC proposal of Stas?

On Sat, Mar 23, 2013 at 3:10 AM, Stas Kelvich <stanconn(at)gmail(dot)com> wrote:

> Hello,
>
> some time ago I started working on the data search system (about 100-200M
> of records) with queries consisted of several diapason and equality
> conditions, e.g.:
>
> WHERE dim1 BETWEEN 128 AND 137 AND
> WHERE dim2 BETWEEN 4815 AND 162342 AND
> WHERE dim3 = 42
> ORDER BY dim1 ASC
>
> There are 6 or 7 search criteria in my task. In order to avoid full table
> scan I started using R-Tree over cube data type:
>
> CREATE INDEX ON my_table USING GiST(cube(array[dim1, dim2, dim3]))
>
> For fast sorting I used Alexander Korotkov's patch for knngist (
> http://www.mail-archive.com/pgsql-hackers(at)postgresql(dot)org/msg153676.html),
> which changes metric for nearest neighbors search and allows to obtain
> cubes ordered by a specific coordinate. Having changed some data types and
> function/operator definitions I ported this to 9.2 (
> https://github.com/kelvich/postgres/commit/bb372). Then, after this I
> could select ordered records right from the index:
>
> SELECT * FROM my_table
> WHERE cube(array[dim1, dim2, dim3]) <@ cube
> '(128,4815,42),(137,162342,42)'
> ORDER BY cube(array[dim1, dim2, dim3]) <*> 5;
>
> The main issue of such approach is the index size. In my case it was about
> 100GB for 100M of records. Therefore index building becomes very expensive
> disk-related operation. For the same reason reading a large number of
> records from the index is slow too.
>
> I came to Oleg Bartunov, Theodor Sigaev and after a while to Alexander
> Korotkov for any help. (I'm very thankful to them and glad that they agreed
> to meet, listen to me and give useful advices). Having discussed it we
> decided that there was several possible improvements for the cube extension:
>
> * Adding point data type support to the cube extension in order to avoid
> storing of coincident upper left and lower right vertices, which may reduce
> the volume that leaf nodes occupy almost twice.
> * Checking the split algorithm with big datasets and, if possible,
> improving it.
> * Learning cube extension to store dimensions with different data types.
> Such index would be good alternative to compound key B-Tree multi-index
> (suitable for diapason queries and data ordering).
> * Providing support for KNN with metrics induced by the different norms:
> euclidean, taxicab norm, p-norm. This can be useful in fields where we can
> extract signature: similar images search, similar audio search.
>
> I'd like to participate in GSoC with this improvements, and I'm very
> interested in any comments or suggestions about this feature list.
>

------
With best regards,
Alexander Korotkov.

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message 赖文豫 2013-03-30 14:08:44 By now, why PostgreSQL 9.2 don't support SSDs?
Previous Message Alexander Korotkov 2013-03-30 08:29:52 Re: Getting to 9.3 beta