Re: KNN searches support for SP-GiST [GSOC'14]

Lists: pgsql-hackers
From: Vladislav Sterzhanov <gliderok(at)gmail(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: KNN searches support for SP-GiST [GSOC'14]
Date: 2014-08-20 00:35:54
Message-ID: CAK2aJC0-Jhrb3UjOZL+arBCHoTVwbeJVpRN-JOfEnN0vA6+MZQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi there, pg-Hackers!
Here I go with the patch which brings up the possibility to perform
nearest-neighbour searches on SP-GiSTs (as of now includes implementation
for quad and kd trees). Pre-reviewed by my GSoC mentor Alexander Korotkov.
Sample benchmarking script included in the attachment (dumps the current
geonames archive and runs several searches on the (latitude, longitude)
points), which demonstrates the dramatic improvements against plain
searches and sorting. Regression tests included, compiles and runs
successfully under both of my Ubuntu 12.04 Server and 08/2014 Arch Linux.

Vlad Sterzhanov / Quadrocube

Attachment Content-Type Size
knn_spgist.patch text/x-patch 68.8 KB
knn_spgist_benchmark.sh application/x-sh 4.9 KB

From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Vladislav Sterzhanov <gliderok(at)gmail(dot)com>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: KNN searches support for SP-GiST [GSOC'14]
Date: 2014-08-20 07:09:00
Message-ID: 53F4498C.2010404@vmware.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 08/20/2014 03:35 AM, Vladislav Sterzhanov wrote:
> Hi there, pg-Hackers!
> Here I go with the patch which brings up the possibility to perform
> nearest-neighbour searches on SP-GiSTs (as of now includes implementation
> for quad and kd trees). Pre-reviewed by my GSoC mentor Alexander Korotkov.
> Sample benchmarking script included in the attachment (dumps the current
> geonames archive and runs several searches on the (latitude, longitude)
> points), which demonstrates the dramatic improvements against plain
> searches and sorting.

Cool!

I think this needs a high-level description in access/spgist/README on
how this works. You can probably copy-paste the similar description from
gist's README, because it's the same design. If I understand correctly,
the support functions return distances between the nodes and the query,
and the SP-GiST code uses those distances to return the rows in order.
An important detail there is that the distance returned for an inner
node is a lower bound of the distance of any node in that branch.

A binary heap would be a better data structure than a red-black tree for
the queue of nodes to visit/return. I know that GiST also uses a
red-black for the same thing, but that's no reason not to do better
here. There is a binary heap implementation in the backend too (see
src/backend/lib/binaryheap.c), so it's not any more difficult to use.

Does the code handle ties correctly? The distance function is just an
approximation, so it's possible that there are two items with same
distance, but are not equal according to the ordering operator. As an
extreme example, if you hack the distance function to always return 0.0,
do you still get correct results (albeit slowly)?

The new suppLen field in spgConfigOut is not mentioned in the docs. Why
not support pass-by-value supplementary values?

A couple of quick comments on the code:

> @@ -137,8 +138,8 @@ spg_quad_picksplit(PG_FUNCTION_ARGS)
> {
> spgPickSplitIn *in = (spgPickSplitIn *) PG_GETARG_POINTER(0);
> spgPickSplitOut *out = (spgPickSplitOut *) PG_GETARG_POINTER(1);
> - int i;
> - Point *centroid;
> + int i;
> + Point *centroid;
>
> #ifdef USE_MEDIAN
> /* Use the median values of x and y as the centroid point */

Please remove all unnecessary whitespace changes like this.

> @@ -213,14 +215,19 @@ spg_quad_inner_consistent(PG_FUNCTION_ARGS)
> }
>
> Assert(in->nNodes == 4);
> +
> + if (in->norderbys > 0)
> + {
> + out->distances = palloc(in->nNodes * sizeof (double *));
> + }

I think that should be "sizeof(double)".

> + *distances = (double *) malloc(norderbys * sizeof (double *));

No mallocs allowed in the backend, should be palloc.

- Heikki


From: Thom Brown <thom(at)linux(dot)com>
To: Vladislav Sterzhanov <gliderok(at)gmail(dot)com>
Cc: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: KNN searches support for SP-GiST [GSOC'14]
Date: 2014-10-27 11:37:17
Message-ID: CAA-aLv5RmDXUGNf-q5sf9iM_jwDNsM19UV5C=_OmUXkmUroWMA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 20 August 2014 08:09, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com> wrote:

> On 08/20/2014 03:35 AM, Vladislav Sterzhanov wrote:
>
>> Hi there, pg-Hackers!
>> Here I go with the patch which brings up the possibility to perform
>> nearest-neighbour searches on SP-GiSTs (as of now includes implementation
>> for quad and kd trees). Pre-reviewed by my GSoC mentor Alexander Korotkov.
>> Sample benchmarking script included in the attachment (dumps the current
>> geonames archive and runs several searches on the (latitude, longitude)
>> points), which demonstrates the dramatic improvements against plain
>> searches and sorting.
>>
>
> Cool!
>
> I think this needs a high-level description in access/spgist/README on how
> this works. You can probably copy-paste the similar description from gist's
> README, because it's the same design. If I understand correctly, the
> support functions return distances between the nodes and the query, and the
> SP-GiST code uses those distances to return the rows in order. An important
> detail there is that the distance returned for an inner node is a lower
> bound of the distance of any node in that branch.
>
> A binary heap would be a better data structure than a red-black tree for
> the queue of nodes to visit/return. I know that GiST also uses a red-black
> for the same thing, but that's no reason not to do better here. There is a
> binary heap implementation in the backend too (see
> src/backend/lib/binaryheap.c), so it's not any more difficult to use.
>
> Does the code handle ties correctly? The distance function is just an
> approximation, so it's possible that there are two items with same
> distance, but are not equal according to the ordering operator. As an
> extreme example, if you hack the distance function to always return 0.0, do
> you still get correct results (albeit slowly)?
>
> The new suppLen field in spgConfigOut is not mentioned in the docs. Why
> not support pass-by-value supplementary values?
>
> A couple of quick comments on the code:
>
> @@ -137,8 +138,8 @@ spg_quad_picksplit(PG_FUNCTION_ARGS)
>> {
>> spgPickSplitIn *in = (spgPickSplitIn *) PG_GETARG_POINTER(0);
>> spgPickSplitOut *out = (spgPickSplitOut *) PG_GETARG_POINTER(1);
>> - int i;
>> - Point *centroid;
>> + int i;
>> + Point *centroid;
>>
>> #ifdef USE_MEDIAN
>> /* Use the median values of x and y as the centroid point */
>>
>
> Please remove all unnecessary whitespace changes like this.
>
> @@ -213,14 +215,19 @@ spg_quad_inner_consistent(PG_FUNCTION_ARGS)
>> }
>>
>> Assert(in->nNodes == 4);
>> +
>> + if (in->norderbys > 0)
>> + {
>> + out->distances = palloc(in->nNodes * sizeof (double *));
>> + }
>>
>
> I think that should be "sizeof(double)".
>
> + *distances = (double *) malloc(norderbys * sizeof (double *));
>>
>
> No mallocs allowed in the backend, should be palloc.

Hi Vlad,

I've been revisiting the status of the GSoC projects for this year and saw
this had stalled. Do you have time to address the remaining issues so we
can get your work committed?

Thanks

Thom