Re: KNN-GiST with recheck

From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: KNN-GiST with recheck
Date: 2014-01-28 14:12:39
Message-ID: CAPpHfdvn5jxgh-rVMz7c2vXoYW_W7dGrJjQkVAtaCr=-ZMCADA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Jan 28, 2014 at 5:54 PM, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com
> wrote:

> On 01/13/2014 07:17 PM, Alexander Korotkov wrote:
>
>> Here goes a desription of this patch same as in original thread.
>>
>> KNN-GiST provides ability to get ordered results from index, but this
>> order
>> is based only on index information. For instance, GiST index contains
>> bounding rectangles for polygons, and we can't get exact distance to
>> polygon from index (similar situation is in PostGIS). In attached patch,
>> GiST distance method can set recheck flag (similar to consistent method).
>> This flag means that distance method returned lower bound of distance and
>> we should recheck it from heap.
>>
>> See an example.
>>
>> create table test as (select id, polygon(3+(random()*10)::int,
>> circle(point(random(), random()), 0.0003 + random()*0.001)) as p from
>> generate_series(1,1000000) id);
>> create index test_idx on test using gist (p);
>>
>> We can get results ordered by distance from polygon to point.
>>
>> postgres=# select id, p <-> point(0.5,0.5) from test order by p <->
>> point(0.5,0.5) limit 10;
>> id | ?column?
>> --------+----------------------
>> 755611 | 0.000405855808916853
>> 807562 | 0.000464123777564343
>> 437778 | 0.000738524708741959
>> 947860 | 0.00076250998760724
>> 389843 | 0.000886362723569568
>> 17586 | 0.000981960100555216
>> 411329 | 0.00145338112316853
>> 894191 | 0.00149399559703506
>> 391907 | 0.0016647896049741
>> 235381 | 0.00167554614889509
>> (10 rows)
>>
>> It's fast using just index scan.
>>
>> QUERY PLAN
>>
>> ------------------------------------------------------------
>> ----------------------------------------------------------------------
>> Limit (cost=0.29..1.86 rows=10 width=36) (actual time=0.180..0.230
>> rows=10 loops=1)
>> -> Index Scan using test_idx on test (cost=0.29..157672.29
>> rows=1000000 width=36) (actual time=0.179..0.228 rows=10 loops=1)
>> Order By: (p <-> '(0.5,0.5)'::point)
>> Total runtime: 0.305 ms
>> (4 rows)
>>
>
> Nice! Some thoughts:
>
> 1. This patch introduces a new "polygon <-> point" operator. That seems
> useful on its own, with or without this patch.
>

Yeah, but exact-knn cant come with no one implementation. But it would
better come in a separate patch.

2. I wonder how useful it really is to allow mixing exact and non-exact
> return values from the distance function. The distance function included in
> the patch always returns recheck=true. I have a feeling that all other
> distance function will also always return either true or false.
>

For geometrical datatypes recheck variations in consistent methods are also
very rare (I can't remember any). But imagine opclass for arrays where keys
have different representation depending on array length. For such opclass
and knn on similarity recheck flag could be useful.

> 3. A binary heap would be a better data structure to buffer the rechecked
> values. A Red-Black tree allows random insertions and deletions, but in
> this case you need to insert arbitrary values but only remove the minimum
> item. That's exactly what a binary heap excels at. We have a nice binary
> heap implementation in the backend that you can use, see
> src/backend/lib/binaryheap.c.
>

Hmm. For me binary heap would be a better data structure for KNN-GiST at
all :-)

> 4. (as you mentioned in the other thread: ) It's a modularity violation
> that you peek into the heap tuple from gist. I think the proper way to do
> this would be to extend the IndexScan executor node to perform the
> re-shuffling of tuples that come from the index in wrong order, or perhaps
> add a new node type for it.
>
> Of course that's exactly what your partial sort patch does :-). I haven't
> looked at that in detail, but I don't think the approach the partial sort
> patch takes will work here as is. In the KNN-GiST case, the index is
> returning tuples roughly in the right order, but a tuple that it returns
> might in reality belong somewhere later in the ordering. In the partial
> sort patch, the "input stream" of tuples is divided into non-overlapping
> groups, so that the tuples within the group are not sorted, but the groups
> are. I think the partial sort case is a special case of the KNN-GiST case,
> if you consider the lower bound of each tuple to be the leading keys that
> you don't need to sort.
>

Yes. But, for instance btree accesses heap for unique checking. Is really
it so crimilal? :-)
This is not only question of a new node or extending existing node. We need
to teach planner/executor access method can return value of some expression
which is lower bound of another expression. AFICS now access method can
return only original indexed datums and TIDs. So, I afraid that enormous
infrastructure changes are required. And I can hardly imagine what they
should look like.

------
With best regards,
Alexander Korotkov.

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message salah jubeh 2014-01-28 14:14:08 bison, flex and ./configure
Previous Message Robert Haas 2014-01-28 14:03:51 Re: Planning time in explain/explain analyze