Re: Cube extension kNN support

Lists: pgsql-hackers
From: Stas Kelvich <stas(dot)kelvich(at)gmail(dot)com>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Cube extension kNN support
Date: 2013-09-22 23:38:48
Message-ID: 9E07E159-E405-41E2-9889-A04F534FC257@gmail.com

Hello, hackers.

Here is the patch that introduces kNN search for cubes with euclidean, taxicab and chebyshev distances.

The following distance operators are introduced:

<#> taxicab distance
<-> euclidean distance
<=> chebyshev distance

For example:
SELECT * FROM objects ORDER BY objects.coord <-> '(137,42,314)'::cube LIMIT 10;
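
For readers unfamiliar with the three metrics, here is a minimal Python sketch of what each one computes for two points. The function names are illustrative and not taken from the patch, and the sketch only handles points; the patch works on cubes (boxes), whose extents this sketch ignores.

```python
import math

def taxicab(a, b):
    """Sum of absolute per-coordinate differences (L1 distance)."""
    return sum(abs(x - y) for x, y in zip(a, b))

def euclidean(a, b):
    """Straight-line (L2) distance."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def chebyshev(a, b):
    """Largest absolute per-coordinate difference (L-infinity distance)."""
    return max(abs(x - y) for x, y in zip(a, b))

p = (1.0, 1.0)
q = (4.0, 5.0)
print(taxicab(p, q), euclidean(p, q), chebyshev(p, q))  # 7.0 5.0 4.0
```

The per-coordinate differences here are 3 and 4, so the three metrics give 3+4, sqrt(9+16) and max(3,4) respectively.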

There is also an operator "->" for selecting ordered rows directly from the index.
This query selects rows in ascending order of the 3rd coordinate:

SELECT * FROM objects ORDER BY objects.coord->3 LIMIT 10;

For descending order, the suggested syntax is a minus sign before the coordinate number.
This query selects rows in descending order of the 4th coordinate:

SELECT * FROM objects ORDER BY objects.coord->-4 LIMIT 10;
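
The intended result of the two queries above can be modelled in a few lines of Python. This is only an in-memory sketch of the ordering semantics as described in this message (1-based coordinate number, minus sign for descending); the helper name is hypothetical, and the point of the operator is that the index returns rows already in this order, without a sort.

```python
import heapq

def order_by_coordinate(rows, k, limit):
    """Return `limit` rows ordered by the |k|-th coordinate (1-based).

    Positive k orders ascending, negative k descending.
    `rows` is a list of coordinate tuples.
    """
    if k == 0:
        raise ValueError("coordinate number is 1-based and must be non-zero")
    idx = abs(k) - 1
    if k > 0:
        return heapq.nsmallest(limit, rows, key=lambda r: r[idx])
    return heapq.nlargest(limit, rows, key=lambda r: r[idx])

rows = [(1, 9, 3, 7), (2, 8, 1, 5), (3, 7, 2, 6)]
print(order_by_coordinate(rows, 3, 2))   # ascending by 3rd coordinate
print(order_by_coordinate(rows, -4, 2))  # descending by 4th coordinate
```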

Stas Kelvich.

Attachment Content-Type Size
distances.patch application/octet-stream 10.0 KB

From: Oleg Bartunov <obartunov(at)gmail(dot)com>
To: Stas Kelvich <stas(dot)kelvich(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Cube extension kNN support
Date: 2013-09-23 06:26:51
Message-ID: CAF4Au4ybJ_7Kr7whJVHunWjjjyTwZYaPuLr0Ny5ZfFschc8naA@mail.gmail.com

Do you have any benchmarks?


From: Peter Eisentraut <peter_e(at)gmx(dot)net>
To: Stas Kelvich <stas(dot)kelvich(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Cube extension kNN support
Date: 2013-09-25 13:25:54
Message-ID: 5242E462.40100@gmx.net

On 9/22/13 7:38 PM, Stas Kelvich wrote:
> Here is the patch that introduces kNN search for cubes with euclidean, taxicab and chebyshev distances.

cube and earthdistance regression tests fail.


From: Gordon Mohr <gojomo-pgsql(at)xavvy(dot)com>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: high-dimensional knn-GIST tests (was Re: Cube extension kNN support)
Date: 2013-10-24 01:50:34
Message-ID: 52687CEA.9060903@xavvy.com

On 9/22/13 4:38 PM, Stas Kelvich wrote:
> Hello, hackers.
>
> Here is the patch that introduces kNN search for cubes with
> euclidean, taxicab and chebyshev distances.

Thanks for this! I decided to give the patch a try at the bleeding edge
with some high-dimensional vectors, specifically the 1.4 million
1000-dimensional Freebase entity vectors from the Google 'word2vec' project:

https://code.google.com/p/word2vec/#Pre-trained_entity_vectors_with_Freebase_naming

Unfortunately, here's what I found:

* with 1000-dimensional vectors, the index size on disk grows to many
times (1000x or more) the size of the indexed data, making other tests
of the index's helpfulness impractical. (Perhaps, other parameter-tuning
can help?)

* with 500-dimensional or 100-dimensional vectors, the index size is
more manageable -- 2x to 6x the data size -- but using the index
significantly hurts performance on top-10 nearest-neighbor searches,
making it much slower than a full table scan. (The planner still tries
to use the index, even though it's hurting.)

Much more detail for the curious is below...

Regarding the dataset & hoped-for results:

The word2vec project's example scripts do their synonym/analogy
demonstrations by loading the entire 5GB+ dataset into main memory
(~3min), then doing a full scan of all vectors (~40sec) to find those
nearest a target vector.

My motivating question was: could the data be loaded into PostgreSQL using
the cube type, and kNN-GIST indexed using this patch, to then do speedy
index-assisted ranked-nearest-neighbor queries from the database?

(While the patch's distance_euclid is not the same cosine-distance the
word2vec examples use, the freebase entity vectors are all unit vectors,
and even additions of vectors can be scaled to unit length. My intuition
is that euclidean-distances on the unit sphere will be in the same rank
order as cosine-distance, so the cube distance_euclid/<-> should enable
the same sort of synonym/analogy demos.)
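
The intuition has a short justification: for unit vectors a and b, |a - b|^2 = 2 - 2*cos(a, b), so euclidean distance falls as cosine similarity rises. A minimal Python sketch that checks this on random unit vectors (the vectors, dimension and seed are made up for illustration, not taken from the dataset):

```python
import math
import random

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine_similarity(a, b):
    # For unit vectors the dot product is the cosine of the angle.
    return sum(x * y for x, y in zip(a, b))

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

rng = random.Random(0)
dim = 50
target = normalize([rng.gauss(0, 1) for _ in range(dim)])
others = [normalize([rng.gauss(0, 1) for _ in range(dim)]) for _ in range(200)]

# Rank by decreasing cosine similarity and by increasing euclidean distance.
by_cosine = sorted(range(len(others)), key=lambda i: -cosine_similarity(target, others[i]))
by_euclid = sorted(range(len(others)), key=lambda i: euclidean(target, others[i]))
print(by_cosine == by_euclid)
```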

Regarding the 1000-dimensional case:

It was necessary to change three compiled-in implementation limits, in
four steps, listed here in the order in which the need for each change
became evident:

(1) in contrib/cube/cubedata.h, increase CUBE_MAX_DIM (to accept vectors
with more than 100 dimensions)

(2) in src/include/pg_config.h, increase BLCKSZ to 16384 (otherwise 1000
64-bit floats in a single column gave an <ERROR: row is too big>,
mentioning the 8160 limit - large cube values not TOASTable?)

(3) in src/include/access/itup.h, increase INDEX_SIZE_MASK to 0x3fff
(otherwise encountering an <ERROR: index row requires 16016 bytes,
maximum size is 8191> when attempting to create the index)

(4) in src/include/pg_config.h, again increase BLCKSZ, now to 32768
(otherwise encountering an <ERROR: index row size 16416 exceeds maximum
5440 for index "pg_class_relname_nsp_index"> when attempting to create
the index)

With the cube-kNN patch applied and these other changes, I was able to
import the 1.4M freebase vectors and do a full-scan nearest-neighbors
query. (My starting postgresql codebase was the github mirror of 9.4dev
as of about a week ago.)

The psql transcript:

word2vec=# \timing
Timing is on.
word2vec=# CREATE EXTENSION cube;
CREATE EXTENSION
Time: 42.141 ms
word2vec=# CREATE TABLE word2vec ( word varchar(200), vec cube );
CREATE TABLE
Time: 8.532 ms
word2vec=# COPY word2vec FROM PROGRAM 'zcat /tmp/pgvectors/freebase-vectors-skipgram1000.pgtxt.gz';
COPY 1422903
Time: 12399065.498 ms
word2vec=# SELECT word, dist FROM (SELECT word, distance_euclid(vec,(SELECT vec FROM word2vec WHERE word='geoffrey hinton')) AS dist FROM word2vec) AS subquery ORDER BY dist LIMIT 11;
          word           |       dist
-------------------------+------------------
 geoffrey hinton         |                0
 marvin minsky           | 1.03892498287268
 paul corkum             | 1.05221701690288
 william richard peltier | 1.06244397334495
 brenda milner           | 1.06387762685894
 john charles polanyi    | 1.07444446452295
 leslie valiant          | 1.07735786596934
 hava siegelmann         | 1.08148623006629
 hans moravec            |  1.0862034591185
 david rumelhart         | 1.08748431130477
 godel prize             | 1.08774264379264
(11 rows)

Time: 310753.976 ms

That's 3.5 hours to do the import and 5 minutes to do the query; this is
on a 2009 MacBook Pro with 8GB RAM and SSD.

Confirming the intuition above, these 10-nearest are the same entities
in the same order as on the word2vec project page's example output,
though the euclidean distances are of course different than the cosine
distances.

The DATA directory is 23GB after the import of 1.4 million rows. In the
word2vec uncompressed binary format, this dataset is about 5.4GB, so
this word2vec-table cube-column representation involves about 4X expansion.
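
For anyone wanting to re-check the rounded figures, the conversions can be reproduced from the numbers reported above. This is just arithmetic on values from this message; the variable names are mine.

```python
import_ms = 12399065.498   # COPY time reported by psql
query_ms = 310753.976      # full-scan nearest-neighbor query time
data_dir_gb = 23.0         # DATA directory after the import
binary_gb = 5.4            # same dataset in word2vec's uncompressed binary format

import_hours = import_ms / 1000 / 3600
query_minutes = query_ms / 1000 / 60
expansion = data_dir_gb / binary_gb

print(f"import: {import_hours:.2f} h")
print(f"query: {query_minutes:.2f} min")
print(f"expansion: {expansion:.1f}x")
```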

So, to the main question: can that query be sped up by building a
kNN-GIST index? Here the problems start.

word2vec=# CREATE INDEX word2vec_index ON word2vec USING gist (vec);

This attempt ran for hours, consuming another 80GB+ before failing due
to disk full.

In fact, the largest table with 1000-dimensional vectors for which I was
able to build a gist index was a mere 100 rows. That index-build took
about 14 minutes and grew the DATA directory like so:

73MB pgdata # empty word2vec table
75MB pgdata # after COPY FROM of 100 vectors
8.4GB pgdata # after CREATE INDEX

Of course with just 100 rows the index isn't practically needed or
helpful for query speed-up.

Even trying just 500 rows, the CREATE INDEX command ran for hours before
failing by consuming all available disk space (about 90GB more). It
seems the knn-GIST index overhead for 1000-dimensional cubes grows
faster than linearly in the number of rows.

I was able to complete index builds with fewer dimensions.

100-dimensions, most-frequent 850K entities:

The dataset was trimmed to the first (most-frequent) 850K entities and
each vector truncated to its first 100-dimensions. (Even though this
would fit in the usual CUBE_MAX_DIM/BLCKSZ/INDEX_SIZE_MASK limits, I
still used the customized build with limits expanded for the 1000d-case.)

Observed data sizes and operation times were:

73MB pgdata # empty word2vec table
2.6GB pgdata # after COPY FROM of 850K vectors, taking 216s

before indexing:
nearest-11 by distance_euclid(): ~2s
nearest-11 by <-> operator: ~2s

5.0GB pgdata # after CREATE INDEX, taking 1344s

after indexing:
nearest-11 by distance_euclid(): ~2s
nearest-11 by <-> operator: ~57s # "Index Scan using word2vec_index…"

So the availability of the index causes a significant slowdown... and
the planner does not learn to choose the faster full sequential-scan.

500-dimensions, most-frequent 100K entities:

The dataset was trimmed to the first (most-frequent) 100K entities and
each vector truncated to its first 500-dimensions. (Again, still using
the custom build with upped CUBE_MAX_DIM/BLCKSZ/INDEX_SIZE_MASK.)

Observed data sizes and operation times were:

73MB pgdata # empty word2vec table
1.6GB pgdata # after COPY FROM of 100K vectors, taking 266s

before indexing:
nearest-11 by distance_euclid(): ~2s
nearest-11 by <-> operator: ~2s

4.8GB pgdata # after CREATE INDEX, taking 977s

after indexing:
nearest-11 by distance_euclid(): ~2s
nearest-11 by <-> operator: ~46s # "Index Scan using word2vec_index…"

Dropping the index makes the <-> query fast again.

Open questions and tentative conclusions:

The massive knn-GIST index overhead for 1000-dimensional vectors makes
it hard to evaluate whether a fully-built index could be useful on large
datasets.

Perhaps, some aspects of the knn-GIST support implicitly assume a
low-dimensionality (2-4) in the data, and large numbers of dimensions
cause pathological index sizes?

Or, something specific about this dataset (all vectors on the unit
sphere) is a challenging case?

In the truncated 100d or 500d cases, the index can be built, but slows
rather than speeds nearest-neighbor queries that use the index.

The evaluated patch's definitions of distance_euclid and other support
functions seem straightforward, and give proper results on simple test
cases... so the problem, if any, would appear to be in the general
distance-driven knn-GIST indexing and query-planning.

If there is a kind of data, represented with the cube type, where this
indexing helps, it may only be with far fewer dimensions than 100,
and/or far more rows than 100K/850K.

If a knn-GIST implementor/expert has suggestions for tuning the index
overhead and behavior -- perhaps a different penalty or picksplit
function? -- I'd be happy to try those out and report back.

Otherwise, I hope this write-up will help evaluate the patch or save
time for others tempted to try similar knn-GIST indexing of
higher-dimensional data.

- Gordon Mohr


From: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
To: Gordon Mohr <gojomo-pgsql(at)xavvy(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: high-dimensional knn-GIST tests (was Re: Cube extension kNN support)
Date: 2013-10-24 04:05:29
Message-ID: 20131024040529.GB6832@eldon.alvh.no-ip.org

Gordon Mohr wrote:

> Thanks for this! I decided to give the patch a try at the bleeding
> edge with some high-dimensional vectors, specifically the 1.4
> million 1000-dimensional Freebase entity vectors from the Google
> 'word2vec' project:
>
> https://code.google.com/p/word2vec/#Pre-trained_entity_vectors_with_Freebase_naming
>
> Unfortunately, here's what I found:

I wonder if these results would improve with this patch:
http://www.postgresql.org/message-id/EFEDC2BF-AB35-4E2C-911F-FC88DA6473D7@gmail.com

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Gordon Mohr <gojomo(at)xavvy(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: high-dimensional knn-GIST tests (was Re: Cube extension kNN support)
Date: 2013-10-26 22:14:15
Message-ID: 526C3EB7.3080002@xavvy.com

On 10/23/13 9:05 PM, Alvaro Herrera wrote:
> Gordon Mohr wrote:
>
>> Thanks for this! I decided to give the patch a try at the bleeding
>> edge with some high-dimensional vectors, specifically the 1.4
>> million 1000-dimensional Freebase entity vectors from the Google
>> 'word2vec' project:
>>
>> https://code.google.com/p/word2vec/#Pre-trained_entity_vectors_with_Freebase_naming
>>
>> Unfortunately, here's what I found:
>
> I wonder if these results would improve with this patch:
> http://www.postgresql.org/message-id/EFEDC2BF-AB35-4E2C-911F-FC88DA6473D7@gmail.com

Thanks for the pointer; I'd missed that relevant update from Stas
Kelvich. I applied that patch, and reindexed.

On the 100-dimension, 850K vector set:

indexing: 1137s (vs. 1344s)
DATA size: 4.7G (vs 5.0G)
top-11-nearest-neighbor query: 32s (vs ~57s)

On the 500-dimension, 100K vector set:

indexing: 756s (vs. 977s)
DATA size: 4.5G (vs. 4.8G)
top-11-nearest-neighbor query: 18s (vs ~46s)

So, moderate (5-20%) improvements in indexing time and size, and larger
(40-60%) speedups in index-assisted (<->) queries... but those
index-assisted queries are still ~10X+ slower than the sequential-scan
(distance_euclid()) queries, so the existence of the knn-GIST index is
still harming rather than helping performance.
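
The percentages can be recomputed from the before/after pairs listed above. A small sketch, taking ~2s as the sequential-scan query time from the earlier measurements; the dictionary layout and names are only for illustration.

```python
def reduction(before, after):
    """Percentage by which `after` is smaller than `before`."""
    return 100.0 * (before - after) / before

# (before, after) pairs copied from the two result sets above.
results = {
    "100d/850K": {"index_s": (1344, 1137), "size_gb": (5.0, 4.7), "query_s": (57, 32)},
    "500d/100K": {"index_s": (977, 756), "size_gb": (4.8, 4.5), "query_s": (46, 18)},
}
seq_scan_s = 2  # approximate full-scan query time reported earlier

for name, r in results.items():
    for metric, (before, after) in r.items():
        print(f"{name} {metric}: {reduction(before, after):.1f}% lower")
    ratio = r["query_s"][1] / seq_scan_s
    print(f"{name} index scan vs seq scan: {ratio:.0f}x slower")
```

Since the ~2s figure is itself approximate, the slowdown ratios should be read as rough.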

Will update if my understanding changes; still interested to hear if
I've missed a key factor/switch needed for these indexes to work well.

- Gordon Mohr


From: Marcin Mańk <marcin(dot)mank(at)gmail(dot)com>
To: Gordon Mohr <gojomo-pgsql(at)xavvy(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: high-dimensional knn-GIST tests (was Re: Cube extension kNN support)
Date: 2013-10-27 20:43:54
Message-ID: CAK61fk4gh8qRc_0+yig4VnjCPpizUt-dq=dguxUVQ-D=Ztx_Ng@mail.gmail.com

On Thu, Oct 24, 2013 at 3:50 AM, Gordon Mohr <gojomo-pgsql(at)xavvy(dot)com> wrote:

> On 9/22/13 4:38 PM, Stas Kelvich wrote:
>
>> Hello, hackers.
>>
>> Here is the patch that introduces kNN search for cubes with
>> euclidean, taxicab and chebyshev distances.
>>
>
> Thanks for this! I decided to give the patch a try at the bleeding edge
> with some high-dimensional vectors, specifically the 1.4 million
> 1000-dimensional Freebase entity vectors from the Google 'word2vec' project:
>

I believe the curse of dimensionality is affecting you here. I think it is
impossible to get an improvement over sequential scan for 1000 dimensional
vectors. Read here:

http://en.wikipedia.org/wiki/Curse_of_dimensionality#k-nearest_neighbor_classification
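
One common way to see the effect is to compare the nearest and farthest distances from a query point as the number of dimensions grows. The sketch below uses uniformly random points in the unit hypercube, which is an assumption made for illustration and not the word2vec data. When the ratio approaches 1, nearly every point is about equally far away, so a tree index has little basis for pruning subtrees.

```python
import math
import random

def contrast(dim, n_points=500, seed=0):
    """Ratio of nearest to farthest distance from a random query point.

    Points are drawn uniformly from the unit hypercube; a ratio near 1
    means all points are almost equally far from the query.
    """
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.dist(query, point))
    return min(dists) / max(dists)

for dim in (2, 10, 100, 1000):
    print(dim, round(contrast(dim), 3))
```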

Regards
Marcin Mańk


From: Stas Kelvich <stas(dot)kelvich(at)gmail(dot)com>
To: Peter Eisentraut <peter_e(at)gmx(dot)net>, obartunov(at)gmail(dot)com
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Cube extension kNN support
Date: 2013-12-03 23:10:37
Message-ID: F1F7C601-8848-46DF-9E2D-A9004A7B8033@gmail.com

Hi.

> cube and earthdistance regression tests fail.

Code updated to work with current HEAD. Also added tests to cover new functionality.

> Do you have any benchmarks ?

This patch just introduces the functionality of calculating distances between cubes, so this code doesn't interfere much with kNN search speed. I think it's better to publish such benchmarks in the neighboring patch about the split algorithm.
Anyway, we can compare kNN with B-tree and a full scan:

create table test(a1 float, a2 float, a3 float);
insert into test (select 100*random(), 100*random(), 100*random() from generate_series(1,1000000) as s(a));
create index on test using gist(cube(array[a1,a2,a3]));
select * from test order by a1 limit 15; -- 227.658 ms
select * from test order by cube(array[a1,a2,a3])->1 limit 15; -- 1.275 ms
create index on test(a1);
select * from test order by a1 limit 15; -- 0.103 ms

As we can see, kNN ordering is about 10 times slower than B-tree (on a query that is silly for an R-tree, just as an example), but still 100+ times faster than a full scan on this table.

Stas.


Attachment Content-Type Size
distances.patch application/octet-stream 45.6 KB

From: Peter Eisentraut <peter_e(at)gmx(dot)net>
To: Stas Kelvich <stas(dot)kelvich(at)gmail(dot)com>, obartunov(at)gmail(dot)com
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Cube extension kNN support
Date: 2013-12-11 22:08:03
Message-ID: 52A8E243.2030503@gmx.net

cube.c: In function ‘g_cube_distance’:
cube.c:1453:2: warning: ‘retval’ may be used uninitialized in this function [-Wmaybe-uninitialized]


From: Sergey Konoplev <gray(dot)ru(at)gmail(dot)com>
To: Stas Kelvich <stas(dot)kelvich(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Cube extension kNN support
Date: 2014-03-27 22:26:48
Message-ID: CAL_0b1tPiXjeFZ+_1aizE4gTg0pF7iEHqjArL3ibRj9ANnKnkA@mail.gmail.com

Hi everyone,

On Sun, Sep 22, 2013 at 4:38 PM, Stas Kelvich <stas(dot)kelvich(at)gmail(dot)com> wrote:
> Here is the patch that introduces kNN search for cubes with euclidean, taxicab and chebyshev distances.

What is the status of this patch?

--
Kind regards,
Sergey Konoplev
PostgreSQL Consultant and DBA

http://www.linkedin.com/in/grayhemp
+1 (415) 867-9984, +7 (901) 903-0499, +7 (988) 888-1979
gray(dot)ru(at)gmail(dot)com


From: Sergey Konoplev <gray(dot)ru(at)gmail(dot)com>
To: Stas Kelvich <stas(dot)kelvich(at)gmail(dot)com>, Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>, aekorotkov(at)gmail(dot)com
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Cube extension kNN support
Date: 2014-03-31 18:01:56
Message-ID: CAL_0b1uaGVBnsqx-G6G2Q+N6eGppG89ry6FF0n=yeL+yq9aBoQ@mail.gmail.com

On Thu, Mar 27, 2014 at 3:26 PM, Sergey Konoplev <gray(dot)ru(at)gmail(dot)com> wrote:
> On Sun, Sep 22, 2013 at 4:38 PM, Stas Kelvich <stas(dot)kelvich(at)gmail(dot)com> wrote:
>> Here is the patch that introduces kNN search for cubes with euclidean, taxicab and chebyshev distances.
>
> What is the status of this patch?

Referring to our private conversation with Alexander Korotkov, the
patch is in WIP state currently, and, hopefully, will be ready by 9.5.
I'm ready to actively participate in its testing on a real world
production set of data.

I'm not sure if it is doable at all, but is there any possibility of
implementing a Hamming-distance-based kNN search here or, what would be
just great, are there any ready or half-ready solutions for it?

--
Kind regards,
Sergey Konoplev


From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Sergey Konoplev <gray(dot)ru(at)gmail(dot)com>
Cc: Stas Kelvich <stas(dot)kelvich(at)gmail(dot)com>, Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Cube extension kNN support
Date: 2014-03-31 19:09:13
Message-ID: CAPpHfdswD4vxFboXg99AWuqgWbFoXed7b+V8pw-vYx4VgUxk8g@mail.gmail.com

On Mon, Mar 31, 2014 at 10:01 PM, Sergey Konoplev <gray(dot)ru(at)gmail(dot)com> wrote:

> On Thu, Mar 27, 2014 at 3:26 PM, Sergey Konoplev <gray(dot)ru(at)gmail(dot)com> wrote:
> > On Sun, Sep 22, 2013 at 4:38 PM, Stas Kelvich <stas(dot)kelvich(at)gmail(dot)com> wrote:
> >> Here is the patch that introduces kNN search for cubes with euclidean, taxicab and chebyshev distances.
> >
> > What is the status of this patch?
>
> Referring to our private conversation with Alexander Korotkov, the
> patch is in WIP state currently, and, hopefully, will be ready by 9.5.
> I'm ready to actively participate in its testing on a real world
> production set of data.
>
> I'm not sure if it is doable at all, but is there any possibility of
> implementing a Hamming-distance-based kNN search here or, what would be
> just great, are there any ready or half-ready solutions for it?

Cube deals with float8 numbers. There is another patch that makes it work
with other number types. But Hamming distance is for bit vectors, isn't it?

----
With best regards,
Alexander Korotkov.


From: Sergey Konoplev <gray(dot)ru(at)gmail(dot)com>
To: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc: Stas Kelvich <stas(dot)kelvich(at)gmail(dot)com>, Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Cube extension kNN support
Date: 2014-03-31 19:45:37
Message-ID: CAL_0b1uJ=v-eeq+zzJg0XgKvG-zNw+7--rvWMqjSSDomd=N6Uw@mail.gmail.com

On Mon, Mar 31, 2014 at 12:09 PM, Alexander Korotkov
<aekorotkov(at)gmail(dot)com> wrote:
>> I'm not sure if it is doable at all, but is there any possibility of
>> implementing a Hamming-distance-based kNN search here or, what would be
>> just great, are there any ready or half-ready solutions for it?
>
> Cube deals with float8 numbers. There is another patch that makes it work with
> other number types. But Hamming distance is for bit vectors, isn't it?

It can be generalized to character vectors. Though, I agree, that
was an off-topic question to some extent. I was wondering whether there
have been any Postgres-related experiments, thoughts, or discussions on
indexable Hamming/Manhattan distance, and whether kNN can be used here,
because from my understanding it can be represented as spatial (I
might be very wrong here).

--
Kind regards,
Sergey Konoplev


From: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
To: Sergey Konoplev <gray(dot)ru(at)gmail(dot)com>
Cc: Stas Kelvich <stas(dot)kelvich(at)gmail(dot)com>, Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>, aekorotkov(at)gmail(dot)com, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Cube extension kNN support
Date: 2015-05-09 02:20:49
Message-ID: 20150509022049.GR2523@alvh.no-ip.org

Sergey Konoplev wrote:
> On Thu, Mar 27, 2014 at 3:26 PM, Sergey Konoplev <gray(dot)ru(at)gmail(dot)com> wrote:
> > On Sun, Sep 22, 2013 at 4:38 PM, Stas Kelvich <stas(dot)kelvich(at)gmail(dot)com> wrote:
> >> Here is the patch that introduces kNN search for cubes with euclidean, taxicab and chebyshev distances.
> >
> > What is the status of this patch?
>
> Referring to our private conversation with Alexander Korotkov, the
> patch is in WIP state currently, and, hopefully, will be ready by 9.5.
> I'm ready to actively participate in its testing on a real world
> production set of data.

This patch doesn't seem to have received an updated version. Should we
just punt on it? The assumption would be that Stas or Alexander will be
re-submitting this for 9.6.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


From: Stas Kelvich <stas(dot)kelvich(at)gmail(dot)com>
To: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
Cc: Sergey Konoplev <gray(dot)ru(at)gmail(dot)com>, Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>, aekorotkov(at)gmail(dot)com, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Cube extension kNN support
Date: 2015-05-09 03:53:22
Message-ID: AE03FA8B-BF45-4523-8CEC-E93F4A01E337@gmail.com

Hi!

The patch is pretty much ready; the last issue was the changed extension interface, so there should be a migration script and a version bump.
Attaching a version with all the migration stuff.

Attachment Content-Type Size
distances2r4.patch application/octet-stream 66.2 KB
unknown_filename text/plain 995 bytes

From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Stas Kelvich <stas(dot)kelvich(at)gmail(dot)com>
Cc: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Sergey Konoplev <gray(dot)ru(at)gmail(dot)com>, Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Cube extension kNN support
Date: 2015-07-09 13:40:19
Message-ID: CAPpHfdsN4kkhgmbe4DsLHRHhCohBEK5SnUrtX2EB4FSrstiLUA@mail.gmail.com

Hi!

On Sat, May 9, 2015 at 6:53 AM, Stas Kelvich <stas(dot)kelvich(at)gmail(dot)com> wrote:

> The patch is pretty much ready; the last issue was the changed extension
> interface, so there should be a migration script and a version bump.
> Attaching a version with all the migration stuff.
>

I can't see cube--1.0--1.1.sql in the patch. Did you forget to include it?

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


From: Stas Kelvich <s(dot)kelvich(at)postgrespro(dot)ru>
To: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Sergey Konoplev <gray(dot)ru(at)gmail(dot)com>, Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Cube extension kNN support
Date: 2015-10-30 01:51:05
Message-ID: 14876F35-0A31-444C-9914-E4C1978BDD80@postgrespro.ru

Hello.

Here is an updated version of the patch with proper update scripts.

Also, I've noticed that the documentation states the wrong thing:

“It does not matter which order the opposite corners of a cube are entered in. The cube functions automatically swap values if needed to create a uniform "lower left — upper right" internal representation."

But in practice cubes are stored "as is", and that leads to problems with getting cubes sorted along a specific dimension directly from the index.
As the simplest workaround, I've deleted that sentence from the docs and implemented two coordinate getters, -> and ~>. The first one returns
the coordinate of the cube as it is stored, and the second returns the coordinate of the cube normalised to (LL,UR) form.

Another way to fix this is to force 'normalization' while creating the cube. But that can produce wrong sorts with already existing data.
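
To make the "as stored" versus "normalised" distinction concrete, here is a minimal Python sketch. It models a cube as a pair of opposite corners; the function name is hypothetical, and the exact argument conventions of -> and ~> are not modelled here, only the difference between reading a stored coordinate and reading it after (LL,UR) normalisation.

```python
def normalize_corners(corner_a, corner_b):
    """Return (lower_left, upper_right) for a box given two opposite corners."""
    lower_left = tuple(min(a, b) for a, b in zip(corner_a, corner_b))
    upper_right = tuple(max(a, b) for a, b in zip(corner_a, corner_b))
    return lower_left, upper_right

# A cube entered with its corners "the wrong way round" in the first dimension.
stored = ((5.0, 1.0), (2.0, 3.0))
print(stored[0][0])                      # 5.0: first coordinate as stored
print(normalize_corners(*stored)[0][0])  # 2.0: first coordinate after normalisation
```

Sorting on the stored value would place this cube by 5.0, while sorting on the normalised value would place it by 2.0, which is why the two getters can give different orderings for the same data.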


Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachment Content-Type Size
distances.patch application/octet-stream 68.9 KB

From: Teodor Sigaev <teodor(at)sigaev(dot)ru>
To: Stas Kelvich <s(dot)kelvich(at)postgrespro(dot)ru>, Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Sergey Konoplev <gray(dot)ru(at)gmail(dot)com>, Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Cube extension kNN support
Date: 2015-12-01 14:52:39
Message-ID: 565DB437.1070606@sigaev.ru

The patch looks good, but there are some review notes:
1 gmake installcheck fails:
*** /.../pgsql/contrib/cube/expected/cube_1.out 2015-12-01 17:49:01.768764000 +0300
--- /.../pgsql/contrib/cube/results/cube.out 2015-12-01 17:49:12.190818000 +0300
***************
*** 1382,1388 ****
(1 row)

-- Test of distances
! --
SELECT cube_distance('(1,1)'::cube, '(4,5)'::cube);
cube_distance
---------------
--- 1382,1388 ----
(1 row)

-- Test of distances
! --
SELECT cube_distance('(1,1)'::cube, '(4,5)'::cube);
cube_distance

It seems there is an extra space at the end of the line.

2 Please don't use magic constants like 'case 16:' in C code (see
g_cube_distance()). Use macros to give them human-readable names.

3 Switch in g_cube_distance(): the default branch should raise an error. That
simplifies debugging, possibly in the future.

4 Docs: please don't use lines of unlimited length.
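
Notes 2 and 3 could look roughly like the sketch below. It is standalone and is not the patch's code: the macro names, the value 17, and the mapping of 16 to the taxicab metric are assumptions made for illustration, and abort() stands in for the elog(ERROR, ...) call that backend code would normally use.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical names; only the literal 16 appears in the review. */
#define CUBE_STRATEGY_TAXICAB    16
#define CUBE_STRATEGY_CHEBYSHEV  17

/* Absolute difference without pulling in libm. */
static double
abs_diff(double x, double y)
{
    return x > y ? x - y : y - x;
}

/* Distance between two 2-D points under the metric selected by strategy. */
static double
point_distance(int strategy, const double a[2], const double b[2])
{
    double dx;
    double dy;

    assert(a != NULL && b != NULL);
    dx = abs_diff(a[0], b[0]);
    dy = abs_diff(a[1], b[1]);

    switch (strategy)
    {
        case CUBE_STRATEGY_TAXICAB:
            return dx + dy;
        case CUBE_STRATEGY_CHEBYSHEV:
            return dx > dy ? dx : dy;
        default:
            /* Stand-in for elog(ERROR, ...): fail loudly on an unknown strategy. */
            fprintf(stderr, "unrecognized strategy number: %d\n", strategy);
            abort();
    }
}
```

With named cases, a strategy number that is added to the operator class but forgotten in the switch fails immediately instead of silently returning garbage.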

Stas Kelvich wrote:
> Hello.
>
> Here is an updated version of the patch with proper update scripts.
>
> Also, I’ve noticed that the documentation states the wrong thing:
>
> “It does not matter which order the opposite corners of a cube are entered in. The cube functions automatically swap values if needed to create a uniform "lower left — upper right" internal representation."
>
> But in practice cubes are stored "as is", and that leads to problems with getting cubes sorted along a specific dimension directly from the index.
> As the simplest workaround I’ve deleted that sentence from the docs and implemented two coordinate getters, -> and ~>. The first one returns
> the coordinate of the cube as it is stored, and the second returns the coordinate of the cube normalised to (LL,UR) form.
>
> Another way to fix this is to force ’normalization’ when creating a cube. But that can produce wrong sort orders with already existing data.
>
>> On 09 Jul 2015, at 16:40, Alexander Korotkov <aekorotkov(at)gmail(dot)com> wrote:
>>
>> Hi!
>>
>> On Sat, May 9, 2015 at 6:53 AM, Stas Kelvich <stas(dot)kelvich(at)gmail(dot)com> wrote:
>> Patch is pretty ready, last issue was about changed extension interface, so there should be migration script and version bump.
>> Attaching a version with all migration stuff.
>>
>> I can't see cube--1.0--1.1.sql in the patch. Did you forget to include it?
>>
>> ------
>> Alexander Korotkov
>> Postgres Professional: http://www.postgrespro.com
>> The Russian Postgres Company
>
> Stas Kelvich
> Postgres Professional: http://www.postgrespro.com
> The Russian Postgres Company

--
Teodor Sigaev E-mail: teodor(at)sigaev(dot)ru
WWW: http://www.sigaev.ru/


From: Stas Kelvich <s(dot)kelvich(at)postgrespro(dot)ru>
To: Teodor Sigaev <teodor(at)sigaev(dot)ru>
Cc: Alexander Korotkov <aekorotkov(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Sergey Konoplev <gray(dot)ru(at)gmail(dot)com>, Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Cube extension kNN support
Date: 2015-12-07 14:47:54
Message-ID: 149476ED-5CAA-40AE-8A0C-DE1BEA7B293E@postgrespro.ru
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hello, fixed.

Attachment Content-Type Size
cube_distances.diff application/octet-stream 70.0 KB
unknown_filename text/plain 3.0 KB

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Cube extension kNN support
Date: 2015-12-15 18:46:03
Message-ID: 56705FEB.7080607@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi,

On 12/07/2015 03:47 PM, Stas Kelvich wrote:
> Hello, fixed.

I've looked at the patch today, seems mostly fine to me.

Three comments though:

1) (nitpicking) There seem to be some minor whitespace issues, i.e.
trailing spaces, empty lines being added/removed, etc.

2) one of the regression tests started to fail

SELECT '-1e-700'::cube AS cube;

This used to return (0) but now I get (-0). As this query existed in
1.0, it's probably due to a change in the C code. Not sure where.

3) I wonder why the docs were changed like this:

<para>
- It does not matter which order the opposite corners of a cube are
- entered in. The <type>cube</> functions
- automatically swap values if needed to create a uniform
- <quote>lower left &mdash; upper right</> internal representation.
+ When corners coincide cube stores only one corner along with a
special flag in order to reduce size wasted.
</para>

Was the old behavior removed? I don't think so - it seems to behave
as before, so why to remove this information? Maybe it's not useful?
But then why add the bit about optimizing storage of points?

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


From: Stas Kelvich <stas(dot)kelvich(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Teodor Sigaev <teodor(at)sigaev(dot)ru>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Cube extension kNN support
Date: 2015-12-16 12:26:35
Message-ID: 32A833C3-9690-45CA-92AB-A16858F8A98A@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi, thanks for the review.

> 1) (nitpicking) There seem to be some minor whitespace issues, i.e.
> trailing spaces, empty lines being added/removed, etc.

Fixed, I think

> 2) one of the regression tests started to fail
>
> SELECT '-1e-700'::cube AS cube;
>
> This used to return (0) but now I get (-0).

Actually, that problem emerged because of the first one. I had extra whitespace in the sql file and removed that whitespace from one of the answer files (cube_1.sql), so the diffs against both cube.sql and cube_1.sql were one line long, and you saw the diff against cube.sql.
On all systems available to me (osx/linux/freebsd) the right answer file is cube_1.sql. But on other OSes you can get +/- 0 or e27/e027. I edited those answer files manually, so there may be some other typos.
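
The +/- 0 difference comes down to the sign bit of a floating-point zero: a value that underflows, such as -1e-700, may end up as negative zero, which compares equal to 0 but can print as -0. A minimal standalone sketch of that distinction follows. The helper name is hypothetical and nothing here is taken from the cube code; what a given platform returns when parsing '-1e-700' is not shown because it varies.

```c
#include <assert.h>
#include <math.h>

/*
 * Return 1 if x is negative zero, 0 otherwise.
 * Negative zero compares equal to 0.0, so only the sign bit tells them apart.
 */
static int
is_negative_zero(double x)
{
    return x == 0.0 && signbit(x) != 0;
}
```

Because the two zeros compare equal, a regression test that compares printed output is the only place the difference shows up, which is why alternative answer files exist.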

> 3) I wonder why the docs were changed like this:
>
> <para>
> - It does not matter which order the opposite corners of a cube are
> - entered in. The <type>cube</> functions
> - automatically swap values if needed to create a uniform
> - <quote>lower left &mdash; upper right</> internal representation.
> + When corners coincide cube stores only one corner along with a
> special flag in order to reduce size wasted.
> </para>
>
> Was the old behavior removed? I don't think so - it seems to behave
> as before, so why to remove this information? Maybe it's not useful?
> But then why add the bit about optimizing storage of points?

I’ve edited it because the statement was misleading (or at least ambiguous): the cube_in function doesn’t swap coordinates.
A simple way to see it:
> select '(1,3),(3,1)'::cube;
cube
---------------
(1, 3),(3, 1)

But the LowerLeft-UpperRight representation should be (1,1),(3,3).

Updated patch attached.

Attachment Content-Type Size
cube_distances.patch application/octet-stream 66.4 KB
unknown_filename text/plain 1.6 KB

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Stas Kelvich <stas(dot)kelvich(at)gmail(dot)com>, Teodor Sigaev <teodor(at)sigaev(dot)ru>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Cube extension kNN support
Date: 2015-12-16 13:46:52
Message-ID: 56716B4C.7000001@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi,

On 12/16/2015 01:26 PM, Stas Kelvich wrote:
> Hi, thanks for the review.
>
>> 1) (nitpicking) There seem to be some minor whitespace issues, i.e.
>> trailing spaces, empty lines being added/removed, etc.
>
>
> Fixed, I think
>
>> 2) one of the regression tests started to fail
>>
>> SELECT '-1e-700'::cube AS cube;
>>
>> This used to return (0) but now I get (-0).
>
> Actually, that problem emerged because of the first one. I had extra
> whitespace in the sql file and removed that whitespace from one of the
> answer files (cube_1.sql), so the diffs against both cube.sql and
> cube_1.sql were one line long, and you saw the diff against cube.sql.
> On all systems available to me (osx/linux/freebsd) the right answer
> file is cube_1.sql. But on other OSes you can get +/- 0 or e27/e027.
> I edited those answer files manually, so there may be some other typos.

Ah! So that's why I couldn't quickly find the issue in the C code ...

>
>> 3) I wonder why the docs were changed like this:
>>
>> <para>
>> - It does not matter which order the opposite corners of a cube are
>> - entered in. The <type>cube</> functions
>> - automatically swap values if needed to create a uniform
>> - <quote>lower left &mdash; upper right</> internal representation.
>> + When corners coincide cube stores only one corner along with a
>> special flag in order to reduce size wasted.
>> </para>
>>
>> Was the old behavior removed? I don't think so - it seems to behave
>> as before, so why to remove this information? Maybe it's not useful?
>> But then why add the bit about optimizing storage of points?
>
> I’ve edited it because the statement was misleading (or at least ambiguous): the cube_in function doesn’t swap coordinates.
> A simple way to see it:
>> select '(1,3),(3,1)'::cube;
> cube
> ---------------
> (1, 3),(3, 1)
>
> But LowerLeft-UpperRight representation should be (1,1),(3,3)

I don't think that's what the comment says, actually. It rather refers
to code like this:

result = Min(LL_COORD(c, n - 1), UR_COORD(c, n - 1));

i.e. if you specifically ask for a particular corner (ll, in this case),
you'll get the proper value.
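
The distinction can be illustrated with a small standalone sketch. It assumes a hypothetical Box2D struct in place of the real cube type and the LL_COORD/UR_COORD macros: the corners are kept exactly as entered, and the lower-left and upper-right values are computed at access time with a minimum or maximum.

```c
#include <assert.h>

/* Hypothetical 2-D box that stores both corners exactly as entered. */
typedef struct
{
    double first[2];    /* first corner as typed, e.g. (1,3) */
    double second[2];   /* second corner as typed, e.g. (3,1) */
} Box2D;

/* Lower-left coordinate along dim (0-based), regardless of input order. */
static double
box_ll(const Box2D *box, int dim)
{
    assert(dim >= 0 && dim < 2);
    return box->first[dim] < box->second[dim] ? box->first[dim] : box->second[dim];
}

/* Upper-right coordinate along dim (0-based), regardless of input order. */
static double
box_ur(const Box2D *box, int dim)
{
    assert(dim >= 0 && dim < 2);
    return box->first[dim] > box->second[dim] ? box->first[dim] : box->second[dim];
}
```

For the box entered as (1,3),(3,1), the stored first corner is still (1,3), while box_ll() yields 1 and box_ur() yields 3 on both dimensions, i.e. the (1,1),(3,3) form. Sorting on the stored value and sorting on the normalised value can therefore give different orders.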

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


From: Stas Kelvich <stas(dot)kelvich(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: Teodor Sigaev <teodor(at)sigaev(dot)ru>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Cube extension kNN support
Date: 2015-12-16 14:31:33
Message-ID: 75E00538-B701-4CAD-95F2-69A9FBD78B0F@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> I don't think that's what the comment says, actually. It rather refers to code like this:
>
> result = Min(LL_COORD(c, n - 1), UR_COORD(c, n - 1));
>
> i.e. if you specifically ask for a particular corner (ll, in this case), you'll get the proper value.

Hmm, I was confused by the phrase “create a uniform _internal_ representation”, since internally the cube is actually stored “as is”. But probably I just misinterpreted that.
So here is the updated version with the old documentation restored.

Attachment Content-Type Size
cube_distances.patch application/octet-stream 66.3 KB
unknown_filename text/plain 2.4 KB