Re: KNN-GiST with recheck

Lists: pgsql-hackers
From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: KNN-GiST with recheck
Date: 2014-01-13 17:17:12
Message-ID: CAPpHfdu_qBLNRnv-r=_tofZYYa6-r=Z5_MGF4_phAOkWcYxfRg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hackers!

This patch was split from thread:
http://www.postgresql.org/message-id/CAPpHfdscOX5an71nHd8WSUH6GNOCf=V7wgDaTXdDd9=goN-gfA@mail.gmail.com

I've split it to separate thead, because it's related to partial sort only
conceptually not technically. Also I renamed it to "knn-gist-recheck" from
"partial-knn" as more appropriate name. In the attached version docs are
updated. Possible weak point of this patch design is that it fetches heap
tuple from GiST scan. However, I didn't receive any notes about its design,
so, I'm going to put it to commitfest.

Here goes a desription of this patch same as in original thread.

KNN-GiST provides ability to get ordered results from index, but this order
is based only on index information. For instance, GiST index contains
bounding rectangles for polygons, and we can't get exact distance to
polygon from index (similar situation is in PostGIS). In attached patch,
GiST distance method can set recheck flag (similar to consistent method).
This flag means that distance method returned lower bound of distance and
we should recheck it from heap.

See an example.

create table test as (select id, polygon(3+(random()*10)::int,
circle(point(random(), random()), 0.0003 + random()*0.001)) as p from
generate_series(1,1000000) id);
create index test_idx on test using gist (p);

We can get results ordered by distance from polygon to point.

postgres=# select id, p <-> point(0.5,0.5) from test order by p <->
point(0.5,0.5) limit 10;
id | ?column?
--------+----------------------
755611 | 0.000405855808916853
807562 | 0.000464123777564343
437778 | 0.000738524708741959
947860 | 0.00076250998760724
389843 | 0.000886362723569568
17586 | 0.000981960100555216
411329 | 0.00145338112316853
894191 | 0.00149399559703506
391907 | 0.0016647896049741
235381 | 0.00167554614889509
(10 rows)

It's fast using just index scan.

QUERY PLAN

----------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.29..1.86 rows=10 width=36) (actual time=0.180..0.230
rows=10 loops=1)
-> Index Scan using test_idx on test (cost=0.29..157672.29
rows=1000000 width=36) (actual time=0.179..0.228 rows=10 loops=1)
Order By: (p <-> '(0.5,0.5)'::point)
Total runtime: 0.305 ms
(4 rows)

------
With best regards,
Alexander Korotkov.

Attachment Content-Type Size
knn-gist-recheck-1.patch.gz application/x-gzip 7.1 KB

From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: KNN-GiST with recheck
Date: 2014-01-28 13:54:53
Message-ID: 52E7B6AD.9090608@vmware.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 01/13/2014 07:17 PM, Alexander Korotkov wrote:
> Here goes a desription of this patch same as in original thread.
>
> KNN-GiST provides ability to get ordered results from index, but this order
> is based only on index information. For instance, GiST index contains
> bounding rectangles for polygons, and we can't get exact distance to
> polygon from index (similar situation is in PostGIS). In attached patch,
> GiST distance method can set recheck flag (similar to consistent method).
> This flag means that distance method returned lower bound of distance and
> we should recheck it from heap.
>
> See an example.
>
> create table test as (select id, polygon(3+(random()*10)::int,
> circle(point(random(), random()), 0.0003 + random()*0.001)) as p from
> generate_series(1,1000000) id);
> create index test_idx on test using gist (p);
>
> We can get results ordered by distance from polygon to point.
>
> postgres=# select id, p <-> point(0.5,0.5) from test order by p <->
> point(0.5,0.5) limit 10;
> id | ?column?
> --------+----------------------
> 755611 | 0.000405855808916853
> 807562 | 0.000464123777564343
> 437778 | 0.000738524708741959
> 947860 | 0.00076250998760724
> 389843 | 0.000886362723569568
> 17586 | 0.000981960100555216
> 411329 | 0.00145338112316853
> 894191 | 0.00149399559703506
> 391907 | 0.0016647896049741
> 235381 | 0.00167554614889509
> (10 rows)
>
> It's fast using just index scan.
>
> QUERY PLAN
>
> ----------------------------------------------------------------------------------------------------------------------------------
> Limit (cost=0.29..1.86 rows=10 width=36) (actual time=0.180..0.230
> rows=10 loops=1)
> -> Index Scan using test_idx on test (cost=0.29..157672.29
> rows=1000000 width=36) (actual time=0.179..0.228 rows=10 loops=1)
> Order By: (p <-> '(0.5,0.5)'::point)
> Total runtime: 0.305 ms
> (4 rows)

Nice! Some thoughts:

1. This patch introduces a new "polygon <-> point" operator. That seems
useful on its own, with or without this patch.

2. I wonder how useful it really is to allow mixing exact and non-exact
return values from the distance function. The distance function included
in the patch always returns recheck=true. I have a feeling that all
other distance function will also always return either true or false.

3. A binary heap would be a better data structure to buffer the
rechecked values. A Red-Black tree allows random insertions and
deletions, but in this case you need to insert arbitrary values but only
remove the minimum item. That's exactly what a binary heap excels at. We
have a nice binary heap implementation in the backend that you can use,
see src/backend/lib/binaryheap.c.

4. (as you mentioned in the other thread: ) It's a modularity violation
that you peek into the heap tuple from gist. I think the proper way to
do this would be to extend the IndexScan executor node to perform the
re-shuffling of tuples that come from the index in wrong order, or
perhaps add a new node type for it.

Of course that's exactly what your partial sort patch does :-). I
haven't looked at that in detail, but I don't think the approach the
partial sort patch takes will work here as is. In the KNN-GiST case, the
index is returning tuples roughly in the right order, but a tuple that
it returns might in reality belong somewhere later in the ordering. In
the partial sort patch, the "input stream" of tuples is divided into
non-overlapping groups, so that the tuples within the group are not
sorted, but the groups are. I think the partial sort case is a special
case of the KNN-GiST case, if you consider the lower bound of each tuple
to be the leading keys that you don't need to sort.

BTW, this capability might also be highly useful for the min/max indexes
as well. A min/max index cannot return an exact ordering of tuples, but
it can also give a lower bound for a group of tuples.

- Heikki


From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: KNN-GiST with recheck
Date: 2014-01-28 14:12:39
Message-ID: CAPpHfdvn5jxgh-rVMz7c2vXoYW_W7dGrJjQkVAtaCr=-ZMCADA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Jan 28, 2014 at 5:54 PM, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com
> wrote:

> On 01/13/2014 07:17 PM, Alexander Korotkov wrote:
>
>> Here goes a desription of this patch same as in original thread.
>>
>> KNN-GiST provides ability to get ordered results from index, but this
>> order
>> is based only on index information. For instance, GiST index contains
>> bounding rectangles for polygons, and we can't get exact distance to
>> polygon from index (similar situation is in PostGIS). In attached patch,
>> GiST distance method can set recheck flag (similar to consistent method).
>> This flag means that distance method returned lower bound of distance and
>> we should recheck it from heap.
>>
>> See an example.
>>
>> create table test as (select id, polygon(3+(random()*10)::int,
>> circle(point(random(), random()), 0.0003 + random()*0.001)) as p from
>> generate_series(1,1000000) id);
>> create index test_idx on test using gist (p);
>>
>> We can get results ordered by distance from polygon to point.
>>
>> postgres=# select id, p <-> point(0.5,0.5) from test order by p <->
>> point(0.5,0.5) limit 10;
>> id | ?column?
>> --------+----------------------
>> 755611 | 0.000405855808916853
>> 807562 | 0.000464123777564343
>> 437778 | 0.000738524708741959
>> 947860 | 0.00076250998760724
>> 389843 | 0.000886362723569568
>> 17586 | 0.000981960100555216
>> 411329 | 0.00145338112316853
>> 894191 | 0.00149399559703506
>> 391907 | 0.0016647896049741
>> 235381 | 0.00167554614889509
>> (10 rows)
>>
>> It's fast using just index scan.
>>
>> QUERY PLAN
>>
>> ------------------------------------------------------------
>> ----------------------------------------------------------------------
>> Limit (cost=0.29..1.86 rows=10 width=36) (actual time=0.180..0.230
>> rows=10 loops=1)
>> -> Index Scan using test_idx on test (cost=0.29..157672.29
>> rows=1000000 width=36) (actual time=0.179..0.228 rows=10 loops=1)
>> Order By: (p <-> '(0.5,0.5)'::point)
>> Total runtime: 0.305 ms
>> (4 rows)
>>
>
> Nice! Some thoughts:
>
> 1. This patch introduces a new "polygon <-> point" operator. That seems
> useful on its own, with or without this patch.
>

Yeah, but exact-knn cant come with no one implementation. But it would
better come in a separate patch.

2. I wonder how useful it really is to allow mixing exact and non-exact
> return values from the distance function. The distance function included in
> the patch always returns recheck=true. I have a feeling that all other
> distance function will also always return either true or false.
>

For geometrical datatypes recheck variations in consistent methods are also
very rare (I can't remember any). But imagine opclass for arrays where keys
have different representation depending on array length. For such opclass
and knn on similarity recheck flag could be useful.

> 3. A binary heap would be a better data structure to buffer the rechecked
> values. A Red-Black tree allows random insertions and deletions, but in
> this case you need to insert arbitrary values but only remove the minimum
> item. That's exactly what a binary heap excels at. We have a nice binary
> heap implementation in the backend that you can use, see
> src/backend/lib/binaryheap.c.
>

Hmm. For me binary heap would be a better data structure for KNN-GiST at
all :-)

> 4. (as you mentioned in the other thread: ) It's a modularity violation
> that you peek into the heap tuple from gist. I think the proper way to do
> this would be to extend the IndexScan executor node to perform the
> re-shuffling of tuples that come from the index in wrong order, or perhaps
> add a new node type for it.
>
> Of course that's exactly what your partial sort patch does :-). I haven't
> looked at that in detail, but I don't think the approach the partial sort
> patch takes will work here as is. In the KNN-GiST case, the index is
> returning tuples roughly in the right order, but a tuple that it returns
> might in reality belong somewhere later in the ordering. In the partial
> sort patch, the "input stream" of tuples is divided into non-overlapping
> groups, so that the tuples within the group are not sorted, but the groups
> are. I think the partial sort case is a special case of the KNN-GiST case,
> if you consider the lower bound of each tuple to be the leading keys that
> you don't need to sort.
>

Yes. But, for instance btree accesses heap for unique checking. Is really
it so crimilal? :-)
This is not only question of a new node or extending existing node. We need
to teach planner/executor access method can return value of some expression
which is lower bound of another expression. AFICS now access method can
return only original indexed datums and TIDs. So, I afraid that enormous
infrastructure changes are required. And I can hardly imagine what they
should look like.

------
With best regards,
Alexander Korotkov.


From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: KNN-GiST with recheck
Date: 2014-01-28 17:32:35
Message-ID: 52E7E9B3.6030101@vmware.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 01/28/2014 04:12 PM, Alexander Korotkov wrote:
> On Tue, Jan 28, 2014 at 5:54 PM, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com
>> wrote:
>
>> 4. (as you mentioned in the other thread: ) It's a modularity violation
>> that you peek into the heap tuple from gist. I think the proper way to do
>> this would be to extend the IndexScan executor node to perform the
>> re-shuffling of tuples that come from the index in wrong order, or perhaps
>> add a new node type for it.
>>
>> Of course that's exactly what your partial sort patch does :-). I haven't
>> looked at that in detail, but I don't think the approach the partial sort
>> patch takes will work here as is. In the KNN-GiST case, the index is
>> returning tuples roughly in the right order, but a tuple that it returns
>> might in reality belong somewhere later in the ordering. In the partial
>> sort patch, the "input stream" of tuples is divided into non-overlapping
>> groups, so that the tuples within the group are not sorted, but the groups
>> are. I think the partial sort case is a special case of the KNN-GiST case,
>> if you consider the lower bound of each tuple to be the leading keys that
>> you don't need to sort.
>
> Yes. But, for instance btree accesses heap for unique checking. Is really
> it so crimilal? :-)

Well, it is generally considered an ugly hack in b-tree too. I'm not
100% opposed to doing such a hack in GiST, but would very much prefer
not to.

> This is not only question of a new node or extending existing node. We need
> to teach planner/executor access method can return value of some expression
> which is lower bound of another expression. AFICS now access method can
> return only original indexed datums and TIDs. So, I afraid that enormous
> infrastructure changes are required. And I can hardly imagine what they
> should look like.

Yeah, I'm not sure either. Maybe a new field in IndexScanDesc, along
with xs_itup. Or as an attribute of xs_itup itself.

- Heikki


From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: KNN-GiST with recheck
Date: 2014-02-03 16:02:07
Message-ID: CAPpHfdtpam+Qah0U6R4_7wf+GFa+UDJ3B2603DcdXS6FYVe7+A@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Jan 28, 2014 at 9:32 PM, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com
> wrote:

> On 01/28/2014 04:12 PM, Alexander Korotkov wrote:
>
>> On Tue, Jan 28, 2014 at 5:54 PM, Heikki Linnakangas <
>> hlinnakangas(at)vmware(dot)com
>>
>>> wrote:
>>>
>>
>> 4. (as you mentioned in the other thread: ) It's a modularity violation
>>> that you peek into the heap tuple from gist. I think the proper way to do
>>> this would be to extend the IndexScan executor node to perform the
>>> re-shuffling of tuples that come from the index in wrong order, or
>>> perhaps
>>> add a new node type for it.
>>>
>>> Of course that's exactly what your partial sort patch does :-). I haven't
>>> looked at that in detail, but I don't think the approach the partial sort
>>> patch takes will work here as is. In the KNN-GiST case, the index is
>>> returning tuples roughly in the right order, but a tuple that it returns
>>> might in reality belong somewhere later in the ordering. In the partial
>>> sort patch, the "input stream" of tuples is divided into non-overlapping
>>> groups, so that the tuples within the group are not sorted, but the
>>> groups
>>> are. I think the partial sort case is a special case of the KNN-GiST
>>> case,
>>> if you consider the lower bound of each tuple to be the leading keys that
>>> you don't need to sort.
>>>
>>
>> Yes. But, for instance btree accesses heap for unique checking. Is really
>> it so crimilal? :-)
>>
>
> Well, it is generally considered an ugly hack in b-tree too. I'm not 100%
> opposed to doing such a hack in GiST, but would very much prefer not to.
>
>
> This is not only question of a new node or extending existing node. We
>> need
>> to teach planner/executor access method can return value of some
>> expression
>> which is lower bound of another expression. AFICS now access method can
>> return only original indexed datums and TIDs. So, I afraid that enormous
>> infrastructure changes are required. And I can hardly imagine what they
>> should look like.
>>
>
> Yeah, I'm not sure either. Maybe a new field in IndexScanDesc, along with
> xs_itup. Or as an attribute of xs_itup itself.

This shouldn't look like a hack too. Otherwise I see no point of it: it's
better to have some isolated hack in access method than hack in
planner/executor. So I see following changes to be needed to implement this
right way:
1) Implement new relation between operators: operator1 is lower bound of
operator2.
2) Extend am interface to let it return values of operators.
3) Implement new node for knn-sorting.
However, it requires a lot of changes in PostgreSQL infrastructure and can
appear to be not enough general too (we don't know until we have another
application).

------
With best regards,
Alexander Korotkov.


From: Emre Hasegeli <emre(at)hasegeli(dot)com>
To: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Subject: Re: KNN-GiST with recheck
Date: 2014-08-03 13:48:09
Message-ID: 20140803134809.GA58181@hasegeli-2.local
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> > 1. This patch introduces a new "polygon <-> point" operator. That seems
> > useful on its own, with or without this patch.
>
> Yeah, but exact-knn cant come with no one implementation. But it would
> better come in a separate patch.

I tried to split them. Separated patches are attached. I changed
the order of the arguments as point <-> polygon, because point was
the first one on all the others. Its commutator was required for
the index, so I added it on the second patch. I also added tests
for the operator. I think it is ready for committer as a separate
patch. We can add it to the open CommitFest.

I have made some cosmetic changes on the patches. I hope they are
useful.

I added support to point <-> circle operator with the same GiST
distance function you added for polygon. I can change it, if it is not
the right way.

> > 2. I wonder how useful it really is to allow mixing exact and non-exact
> > return values from the distance function. The distance function included in
> > the patch always returns recheck=true. I have a feeling that all other
> > distance function will also always return either true or false.
>
> For geometrical datatypes recheck variations in consistent methods are also
> very rare (I can't remember any). But imagine opclass for arrays where keys
> have different representation depending on array length. For such opclass
> and knn on similarity recheck flag could be useful.

I also wonder how useful it is. Your example is convincing, but maybe
setting it index-wide will make the decisions on the framework easier.
For example, how hard would it be to decide if further sorting is
required or not on the planner?

> > 4. (as you mentioned in the other thread: ) It's a modularity violation
> > that you peek into the heap tuple from gist. I think the proper way to do
> > this would be to extend the IndexScan executor node to perform the
> > re-shuffling of tuples that come from the index in wrong order, or perhaps
> > add a new node type for it.
> >
> > Of course that's exactly what your partial sort patch does :-). I haven't
> > looked at that in detail, but I don't think the approach the partial sort
> > patch takes will work here as is. In the KNN-GiST case, the index is
> > returning tuples roughly in the right order, but a tuple that it returns
> > might in reality belong somewhere later in the ordering. In the partial
> > sort patch, the "input stream" of tuples is divided into non-overlapping
> > groups, so that the tuples within the group are not sorted, but the groups
> > are. I think the partial sort case is a special case of the KNN-GiST case,
> > if you consider the lower bound of each tuple to be the leading keys that
> > you don't need to sort.
>
> Yes. But, for instance btree accesses heap for unique checking. Is really
> it so crimilal? :-)
> This is not only question of a new node or extending existing node. We need
> to teach planner/executor access method can return value of some expression
> which is lower bound of another expression. AFICS now access method can
> return only original indexed datums and TIDs. So, I afraid that enormous
> infrastructure changes are required. And I can hardly imagine what they
> should look like.

Unfortunately, I am not experienced enough to judge your implementation.
As far as I understand the problem is partially sorting rows on
the index scan node. It can lead the planner to choose non-optimal
plans, because of not taking into account the cost of sorting.

Attachment Content-Type Size
polygon-distance-op-1.patch text/plain 12.7 KB
knn-gist-recheck-2.patch text/plain 45.9 KB

From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Emre Hasegeli <emre(at)hasegeli(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Subject: Re: KNN-GiST with recheck
Date: 2014-08-04 20:43:41
Message-ID: CAPpHfdu3JmBrsgsrtaAQJMpXyMgrW+T2P7Gmf9oP8L1V7zLvOg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sun, Aug 3, 2014 at 5:48 PM, Emre Hasegeli <emre(at)hasegeli(dot)com> wrote:

> > > 1. This patch introduces a new "polygon <-> point" operator. That seems
> > > useful on its own, with or without this patch.
> >
> > Yeah, but exact-knn cant come with no one implementation. But it would
> > better come in a separate patch.
>
> I tried to split them. Separated patches are attached. I changed
> the order of the arguments as point <-> polygon, because point was
> the first one on all the others. Its commutator was required for
> the index, so I added it on the second patch. I also added tests
> for the operator. I think it is ready for committer as a separate
> patch. We can add it to the open CommitFest.
>
> I have made some cosmetic changes on the patches. I hope they are
> useful.
>
> I added support to point <-> circle operator with the same GiST
> distance function you added for polygon. I can change it, if it is not
> the right way.

Great, thanks!

> > 2. I wonder how useful it really is to allow mixing exact and non-exact

> > > return values from the distance function. The distance function
> included in
> > > the patch always returns recheck=true. I have a feeling that all other
> > > distance function will also always return either true or false.
> >
> > For geometrical datatypes recheck variations in consistent methods are
> also
> > very rare (I can't remember any). But imagine opclass for arrays where
> keys
> > have different representation depending on array length. For such opclass
> > and knn on similarity recheck flag could be useful.
>
> I also wonder how useful it is. Your example is convincing, but maybe
> setting it index-wide will make the decisions on the framework easier.
> For example, how hard would it be to decide if further sorting is
> required or not on the planner?
>

I think that for most use cases just some operators require further sorting
and some of them not. But it could appear one day that some index gives
part of its knn answers exact and part of them inexact. Same happen to
recheck of regular operators. Initially recheck flag was defined in
opclass. But later recheck became runtime flag.

> > 4. (as you mentioned in the other thread: ) It's a modularity violation
> > > that you peek into the heap tuple from gist. I think the proper way to
> do
> > > this would be to extend the IndexScan executor node to perform the
> > > re-shuffling of tuples that come from the index in wrong order, or
> perhaps
> > > add a new node type for it.
> > >
> > > Of course that's exactly what your partial sort patch does :-). I
> haven't
> > > looked at that in detail, but I don't think the approach the partial
> sort
> > > patch takes will work here as is. In the KNN-GiST case, the index is
> > > returning tuples roughly in the right order, but a tuple that it
> returns
> > > might in reality belong somewhere later in the ordering. In the partial
> > > sort patch, the "input stream" of tuples is divided into
> non-overlapping
> > > groups, so that the tuples within the group are not sorted, but the
> groups
> > > are. I think the partial sort case is a special case of the KNN-GiST
> case,
> > > if you consider the lower bound of each tuple to be the leading keys
> that
> > > you don't need to sort.
> >
> > Yes. But, for instance btree accesses heap for unique checking. Is really
> > it so crimilal? :-)
> > This is not only question of a new node or extending existing node. We
> need
> > to teach planner/executor access method can return value of some
> expression
> > which is lower bound of another expression. AFICS now access method can
> > return only original indexed datums and TIDs. So, I afraid that enormous
> > infrastructure changes are required. And I can hardly imagine what they
> > should look like.
>
> Unfortunately, I am not experienced enough to judge your implementation.
> As far as I understand the problem is partially sorting rows on
> the index scan node. It can lead the planner to choose non-optimal
> plans, because of not taking into account the cost of sorting.
>

Cost estimation of GiST is a big problem anyway. It doesn't care (and
can't) about amount of recheck for regular operators. In this patch, same
would be for knn recheck. The problem is that touching heap from access
method breaks incapsulation. One idea about this is to do sorting in
another nodes. However, I wonder if it would be an overengineering and
overhead. In attached patch I propose a different approach: put code
touching heap into separate index_get_heap_values function. Also new
version of patch includes regression tests and some cleanup.

------
With best regards,
Alexander Korotkov.

Attachment Content-Type Size
knn-gist-recheck-3.patch application/octet-stream 41.5 KB

From: Emre Hasegeli <emre(at)hasegeli(dot)com>
To: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Subject: Re: KNN-GiST with recheck
Date: 2014-09-14 18:09:36
Message-ID: 20140914180936.GA5084@hasegeli.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

I added the point to polygon distance operator patch to the open
CommitFest as ready for committer and added myself as reviewer to
both of the patches.

> I think that for most use cases just some operators require further sorting
> and some of them not. But it could appear one day that some index gives
> part of its knn answers exact and part of them inexact. Same happen to
> recheck of regular operators. Initially recheck flag was defined in
> opclass. But later recheck became runtime flag.

I cannot think of an use case, but it makes sense to add the flag to
the distance function just like the consistent function if we will go
with this implementation.

> Cost estimation of GiST is a big problem anyway. It doesn't care (and
> can't) about amount of recheck for regular operators. In this patch, same
> would be for knn recheck. The problem is that touching heap from access
> method breaks incapsulation. One idea about this is to do sorting in
> another nodes. However, I wonder if it would be an overengineering and
> overhead. In attached patch I propose a different approach: put code
> touching heap into separate index_get_heap_values function. Also new
> version of patch includes regression tests and some cleanup.

While looking it at I found a bug. It returns the second column
in wrong order when both of the distance functions return recheck = true.
Test script attached to run on the regression database. I tried to
fix but could not. searchTreeItemDistanceRecheck function is not
very easy to follow. I think it deserves more comments.

Attachment Content-Type Size
knn-gist-recheck-test-multicolumn.sql application/x-sql 1.1 KB

From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Emre Hasegeli <emre(at)hasegeli(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Subject: Re: KNN-GiST with recheck
Date: 2014-09-14 19:34:26
Message-ID: CAPpHfdvEpZAJrMBkNrCguG4fCK5QtZjARPti7LnL5chsSuD=Sg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sun, Sep 14, 2014 at 10:09 PM, Emre Hasegeli <emre(at)hasegeli(dot)com> wrote:

> I added the point to polygon distance operator patch to the open
> CommitFest as ready for committer and added myself as reviewer to
> both of the patches.
>

Thanks.

> Cost estimation of GiST is a big problem anyway. It doesn't care (and
> > can't) about amount of recheck for regular operators. In this patch, same
> > would be for knn recheck. The problem is that touching heap from access
> > method breaks incapsulation. One idea about this is to do sorting in
> > another nodes. However, I wonder if it would be an overengineering and
> > overhead. In attached patch I propose a different approach: put code
> > touching heap into separate index_get_heap_values function. Also new
> > version of patch includes regression tests and some cleanup.
>
> While looking it at I found a bug. It returns the second column
> in wrong order when both of the distance functions return recheck = true.
> Test script attached to run on the regression database. I tried to
> fix but could not. searchTreeItemDistanceRecheck function is not
> very easy to follow. I think it deserves more comments.
>

Fixed, thanks. It was logical error in comparison function implementation.

------
With best regards,
Alexander Korotkov.

Attachment Content-Type Size
knn-gist-recheck-4.patch application/octet-stream 42.0 KB

From: Emre Hasegeli <emre(at)hasegeli(dot)com>
To: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Subject: Re: KNN-GiST with recheck
Date: 2014-09-17 08:12:50
Message-ID: 20140917081250.GA28651@hasegeli.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> > While looking it at I found a bug. It returns the second column
> > in wrong order when both of the distance functions return recheck = true.
> > Test script attached to run on the regression database. I tried to
> > fix but could not. searchTreeItemDistanceRecheck function is not
> > very easy to follow. I think it deserves more comments.
>
> Fixed, thanks. It was logical error in comparison function implementation.

I managed to break it again by ordering rows only by the second column
of the index. Test script attached.

Attachment Content-Type Size
knn-gist-recheck-test-secondcolumn.sql application/x-sql 1.1 KB

From: Emre Hasegeli <emre(at)hasegeli(dot)com>
To: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Subject: Re: KNN-GiST with recheck
Date: 2014-09-17 08:30:00
Message-ID: 20140917083000.GA7869@hasegeli.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> I managed to break it again by ordering rows only by the second column
> of the index. Test script attached.

I was confused. It is undefined behavior. Sorry for the noise.


From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Emre Hasegeli <emre(at)hasegeli(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Subject: Re: KNN-GiST with recheck
Date: 2014-09-17 09:04:16
Message-ID: CAPpHfduyu=Sr8iZZdt-=zGURRr49jHUmn9kPFiqZzTc=bv56hg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Sep 17, 2014 at 12:30 PM, Emre Hasegeli <emre(at)hasegeli(dot)com> wrote:

> > I managed to break it again by ordering rows only by the second column
> > of the index. Test script attached.
>
> I was confused. It is undefined behavior. Sorry for the noise.
>

No problem. Thanks a lot for testing.

------
With best regards,
Alexander Korotkov.


From: Emre Hasegeli <emre(at)hasegeli(dot)com>
To: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Subject: Re: KNN-GiST with recheck
Date: 2014-09-25 17:00:46
Message-ID: 20140925170046.GA29657@hasegeli.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> Fixed, thanks.

Here are my questions and comments about the code.

doc/src/sgml/gist.sgml:812:
> be rechecked from heap tuple before tuple is returned. If
> <literal>recheck</> flag isn't set then it's true by default for
> compatibility reasons. The <literal>recheck</> flag can be used only

Recheck flag is set to false on gistget.c so I think it should say
"false by default". On the other hand, it is true by default on
the consistent function. It is written as "the safest assumption"
on the code comments. I don't know why the safest is chosen over
the backwards compatible for the consistent function.

src/backend/access/gist/gistget.c:505:
> /* Recheck distance from heap tuple if needed */
> if (GISTSearchItemIsHeap(*item) &&
> searchTreeItemNeedDistanceRecheck(scan, so->curTreeItem))
> {
> searchTreeItemDistanceRecheck(scan, so->curTreeItem, item);
> continue;
> }

Why so->curTreeItem is passed to these functions? They can use
scan->opaque->curTreeItem.

src/backend/access/gist/gistscan.c:49:
> /*
> * When all distance values are the same, items without recheck
> * can be immediately returned. So they are placed first.
> */
> if (recheckCmp == 0 && distance_a.recheck != distance_b.recheck)
> recheckCmp = distance_a.recheck ? 1 : -1;

I don't understand why items without recheck can be immediately
returned. Do you think it will work correctly when there is
an operator class which will return recheck true and false for
the items under the same page?

src/backend/access/index/indexam.c:258:
> /* Prepare data structures for getting original indexed values from heap */
> scan->indexInfo = BuildIndexInfo(scan->indexRelation);
> scan->estate = CreateExecutorState();
> scan->slot = MakeSingleTupleTableSlot(RelationGetDescr(heapRelation));

With the changes in indexam.c, heap access become legal for all index
access methods. I think it is better than the previous version but
I am leaving the judgement to someone experienced. I will try to
summarize the pros and cons of sorting the rows in the GiST access
method, as far as I understand.

Pros:

* It does not require another queue. It should be effective to sort
the rows inside the queue the GiST access method already has.
* It does not complicate index access method infrastructure.

Cons:

* It could be done without additional heap access.
* Other access methods could make use of the sorting infrastructure
one day.
* It could be more transparent to the users. Sorting information
could be shown on the explain output.
* A more suitable data structure like binary heap could be used
for the queue to sort the rows.


From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Emre Hasegeli <emre(at)hasegeli(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Subject: Re: KNN-GiST with recheck
Date: 2014-09-25 17:25:02
Message-ID: CAPpHfds_j-c+WtQWA87U_PkF0wp_X0yHGakDrTR0JTOnwCrt-A@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Sep 25, 2014 at 9:00 PM, Emre Hasegeli <emre(at)hasegeli(dot)com> wrote:

> > Fixed, thanks.
>
> Here are my questions and comments about the code.
>
> doc/src/sgml/gist.sgml:812:
> > be rechecked from heap tuple before tuple is returned. If
> > <literal>recheck</> flag isn't set then it's true by default for
> > compatibility reasons. The <literal>recheck</> flag can be used
> only
>
> Recheck flag is set to false on gistget.c so I think it should say
> "false by default". On the other hand, it is true by default on
> the consistent function. It is written as "the safest assumption"
> on the code comments. I don't know why the safest is chosen over
> the backwards compatible for the consistent function.
>

Agree. It should be clarified in docs.

src/backend/access/gist/gistget.c:505:
> > /* Recheck distance from heap tuple if needed */
> > if (GISTSearchItemIsHeap(*item) &&
> > searchTreeItemNeedDistanceRecheck(scan,
> so->curTreeItem))
> > {
> > searchTreeItemDistanceRecheck(scan,
> so->curTreeItem, item);
> > continue;
> > }
>
> Why so->curTreeItem is passed to these functions? They can use
> scan->opaque->curTreeItem.
>

I didn't get the difference. Few lines before:

GISTScanOpaque so = (GISTScanOpaque) scan->opaque;

> src/backend/access/gist/gistscan.c:49:
> > /*
> > * When all distance values are the same, items without
> recheck
> > * can be immediately returned. So they are placed first.
> > */
> > if (recheckCmp == 0 && distance_a.recheck !=
> distance_b.recheck)
> > recheckCmp = distance_a.recheck ? 1 : -1;
>
> I don't understand why items without recheck can be immediately
> returned. Do you think it will work correctly when there is
> an operator class which will return recheck true and false for
> the items under the same page?
>

Yes, I believe so. Item with recheck can't decrease it's distance, it can
only increase it. In the corner case item can have same distance after
recheck as it was before. Then anyway items which distances are the same
can be returned in any order.

> src/backend/access/index/indexam.c:258:
> > /* Prepare data structures for getting original indexed values
> from heap */
> > scan->indexInfo = BuildIndexInfo(scan->indexRelation);
> > scan->estate = CreateExecutorState();
> > scan->slot =
> MakeSingleTupleTableSlot(RelationGetDescr(heapRelation));
>
> With the changes in indexam.c, heap access become legal for all index
> access methods. I think it is better than the previous version but
> I am leaving the judgement to someone experienced. I will try to
> summarize the pros and cons of sorting the rows in the GiST access
> method, as far as I understand.
>
> Pros:
>
> * It does not require another queue. It should be effective to sort
> the rows inside the queue the GiST access method already has.
> * It does not complicate index access method infrastructure.
>
> Cons:
>
> * It could be done without additional heap access.
> * Other access methods could make use of the sorting infrastructure
> one day.
> * It could be more transparent to the users. Sorting information
> could be shown on the explain output.
>

It would be also nice to show some information about KNN itself.

> * A more suitable data structure like binary heap could be used
> for the queue to sort the rows.
>

Binary heap seems to be better data structure for whole KNN-GiST. But it's
a subject for a separate patch: replace RB-tree to heap in KNN-GiST. It's
not related to recheck stuff.

------
With best regards,
Alexander Korotkov.


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc: Emre Hasegeli <emre(at)hasegeli(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Subject: Re: KNN-GiST with recheck
Date: 2014-09-26 01:18:00
Message-ID: 20140926011800.GA8130@momjian.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sun, Sep 14, 2014 at 11:34:26PM +0400, Alexander Korotkov wrote:
> > Cost estimation of GiST is a big problem anyway. It doesn't care (and
> > can't) about amount of recheck for regular operators. In this patch, same
> > would be for knn recheck. The problem is that touching heap from access

This is very important work. While our existing KNN-GiST index code
works fine for scalar values and point-to-point distance ordering, it
doesn't work well for 2-dimensional objects because they are only
indexed by their bounding boxes (a rectangle around the object). The
indexed bounding box can't produce accurate distances to other objects.

As an example, see this PostGIS blog post showing how to use LIMIT in a
CTE to filter results and then compute the closest object (search for
"LIMIT 50"):

http://shisaa.jp/postset/postgis-postgresqls-spatial-partner-part-3.html

This patch fixes our code for distances from a point to indexed 2-D
objects.

Does this also fix the identical PostGIS problem or is there something
PostGIS needs to do?

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ Everyone has their own god. +


From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: Emre Hasegeli <emre(at)hasegeli(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Subject: Re: KNN-GiST with recheck
Date: 2014-09-26 06:49:42
Message-ID: CAPpHfdvDMS1_jqnzZ0FbGePWwCvStj4_h0xU2hYTnLcC6of3Rg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Sep 26, 2014 at 5:18 AM, Bruce Momjian <bruce(at)momjian(dot)us> wrote:

> On Sun, Sep 14, 2014 at 11:34:26PM +0400, Alexander Korotkov wrote:
> > > Cost estimation of GiST is a big problem anyway. It doesn't care
> (and
> > > can't) about amount of recheck for regular operators. In this
> patch, same
> > > would be for knn recheck. The problem is that touching heap from
> access
>
> This is very important work. While our existing KNN-GiST index code
> works fine for scalar values and point-to-point distance ordering, it
> doesn't work well for 2-dimensional objects because they are only
> indexed by their bounding boxes (a rectangle around the object). The
> indexed bounding box can't produce accurate distances to other objects.
>
> As an example, see this PostGIS blog post showing how to use LIMIT in a
> CTE to filter results and then compute the closest object (search for
> "LIMIT 50"):
>
>
> http://shisaa.jp/postset/postgis-postgresqls-spatial-partner-part-3.html
>
> This patch fixes our code for distances from a point to indexed 2-D
> objects.
>
> Does this also fix the identical PostGIS problem or is there something
> PostGIS needs to do?
>

This patch provides general infrastructure for recheck in KNN-GiST. PostGIS
need corresponding change in its GiST opclass. Since PostGIS already define
<-> and <#> operators as distance to bounding box border and bounding box
center, it can't change their behaviour.
it has to support new operator "exact distance" in opclass.

------
With best regards,
Alexander Korotkov.


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc: Emre Hasegeli <emre(at)hasegeli(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Subject: Re: KNN-GiST with recheck
Date: 2014-09-29 02:16:31
Message-ID: 20140929021631.GG12447@momjian.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Sep 26, 2014 at 10:49:42AM +0400, Alexander Korotkov wrote:
> Does this also fix the identical PostGIS problem or is there something
> PostGIS needs to do?
>
>
> This patch provides general infrastructure for recheck in KNN-GiST. PostGIS
> need corresponding change in its GiST opclass. Since PostGIS already define <->
> and <#> operators as distance to bounding box border and bounding box center,
> it can't change their behaviour.
> it has to support new operator "exact distance" in opclass. 

Ah, OK, so they just need something that can be used for the recheck. I
think they currently use ST_Distance() for that. Does it have to be an
operator? If they defined an operator for ST_Distance(), would
ST_Distance() work too for KNN-GiST?

In summary, you still create a normal GiST index on the column:

http://shisaa.jp/postset/postgis-postgresqls-spatial-partner-part-3.html

CREATE INDEX planet_osm_line_ref_index ON planet_osm_line(ref);

which indexes by the bounding box. The new code will allow ordered
index hits to be filtered by something like ST_Distance(), rather than
having to a LIMIT 50 in a CTE, then call ST_Distance(), like this:

EXPLAIN ANALYZE WITH distance AS (
SELECT way AS road, ref AS route
FROM planet_osm_line
WHERE highway = 'secondary'
ORDER BY ST_GeomFromText('POLYGON((14239931.42 3054117.72,14239990.49 3054224.25,14240230.15 3054091.38,14240171.08 3053984.84,14239931.42 3054117.72))', 900913) <#> way
LIMIT 50
)
SELECT ST_Distance(ST_GeomFromText('POLYGON((14239931.42 3054117.72,14239990.49 3054224.25,14240230.15 3054091.38,14240171.08 3053984.84,14239931.42 3054117.72))', 900913), road) AS true_distance, route
FROM distance
ORDER BY true_distance
LIMIT 1;

Notice the CTE uses <#> (bounding box center), and then the outer query
uses ST_Distance and LIMIT 1 to find the closest item.

Excellent!

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ Everyone has their own god. +


From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Bruce Momjian <bruce(at)momjian(dot)us>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: Emre Hasegeli <emre(at)hasegeli(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: KNN-GiST with recheck
Date: 2014-09-29 15:30:26
Message-ID: CAPpHfdvdL0dtoNBmbCsJ=W0bFE7A87hLOtoBN7Pe4hnXkQ9LxA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Sep 29, 2014 at 6:16 AM, Bruce Momjian <bruce(at)momjian(dot)us> wrote:

> On Fri, Sep 26, 2014 at 10:49:42AM +0400, Alexander Korotkov wrote:
> > Does this also fix the identical PostGIS problem or is there
> something
> > PostGIS needs to do?
> >
> >
> > This patch provides general infrastructure for recheck in KNN-GiST.
> PostGIS
> > need corresponding change in its GiST opclass. Since PostGIS already
> define <->
> > and <#> operators as distance to bounding box border and bounding box
> center,
> > it can't change their behaviour.
> > it has to support new operator "exact distance" in opclass.
>
> Ah, OK, so they just need something that can be used for the recheck. I
> think they currently use ST_Distance() for that. Does it have to be an
> operator? If they defined an operator for ST_Distance(), would
> ST_Distance() work too for KNN-GiST?
>

Currently, ST_Distance is a function, but it's no problem to make it an
operator with KNN-GiST support.

> In summary, you still create a normal GiST index on the column:
>
>
> http://shisaa.jp/postset/postgis-postgresqls-spatial-partner-part-3.html
>
> CREATE INDEX planet_osm_line_ref_index ON planet_osm_line(ref);
>
> which indexes by the bounding box. The new code will allow ordered
> index hits to be filtered by something like ST_Distance(), rather than
> having to a LIMIT 50 in a CTE, then call ST_Distance(), like this:
>
> EXPLAIN ANALYZE WITH distance AS (
> SELECT way AS road, ref AS route
> FROM planet_osm_line
> WHERE highway = 'secondary'
> ORDER BY ST_GeomFromText('POLYGON((14239931.42
> 3054117.72,14239990.49 3054224.25,14240230.15 3054091.38,14240171.08
> 3053984.84,14239931.42 3054117.72))', 900913) <#> way
> LIMIT 50
> )
> SELECT ST_Distance(ST_GeomFromText('POLYGON((14239931.42
> 3054117.72,14239990.49 3054224.25,14240230.15 3054091.38,14240171.08
> 3053984.84,14239931.42 3054117.72))', 900913), road) AS true_distance, route
> FROM distance
> ORDER BY true_distance
> LIMIT 1;
>

Yeah. It this query 50 is pure empirical value. It could be both too low or
too high. Too low value can cause wrong query answers. Too high value can
cause lower performance. With patch simple KNN query will work like this
query with always right value in LIMIT clause.

> Notice the CTE uses <#> (bounding box center), and then the outer query
> uses ST_Distance and LIMIT 1 to find the closest item.
>
> Excellent!

Thanks. The main question now is design of this patch. Currently, it does
all the work inside access method. We already have some discussion of pro
and cons of this method. I would like to clarify alternatives now. I can
see following way:

1. Implement new executor node which performs sorting by priority queue.
Let's call it "Priority queue". I think it should be separate node from
"Sort" node. Despite "Priority queue" and "Sort" are essentially similar
from user view, they would be completely different in implementation.
2. Implement some interface to transfer distance values from access
method to "Priority queue" node.
3. Somehow tell the planner that it could use "Priority queue" in
corresponding cases. I see two ways of doing this:
- Add flag to operator in opclass indicating that index can only
order by lower bound of "col op value", not by "col op value" itself.
- Define new relation between operators. Value of one operator could
be lower bound for value of another operator. So, planner can
put "Priority
queue" node when lower bound ordering is possible from index. Also "ALTER
OPERATOR" command would be reasonable, so extensions could upgrade.

Besides overhead, this way makes significant infrastructural changes. So,
it may be over-engineering. However, it's probably more clean and beautiful
solution.
I would like to get some feedback from people familiar with KNN-GiST like
Heikki or Tom. What do you think about this? Any other ideas?

------
With best regards,
Alexander Korotkov.


From: Emre Hasegeli <emre(at)hasegeli(dot)com>
To: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: KNN-GiST with recheck
Date: 2014-10-06 09:36:09
Message-ID: 20141006093609.GA12899@hasegeli.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> Thanks. The main question now is design of this patch. Currently, it does
> all the work inside access method. We already have some discussion of pro
> and cons of this method. I would like to clarify alternatives now. I can
> see following way:
>
> 1. Implement new executor node which performs sorting by priority queue.
> Let's call it "Priority queue". I think it should be separate node from
> "Sort" node. Despite "Priority queue" and "Sort" are essentially similar
> from user view, they would be completely different in implementation.
> 2. Implement some interface to transfer distance values from access
> method to "Priority queue" node.

If we assume that all of them need recheck, maybe it can be done
without passing distance values.

> 3. Somehow tell the planner that it could use "Priority queue" in
> corresponding cases. I see two ways of doing this:
> - Add flag to operator in opclass indicating that index can only
> order by lower bound of "col op value", not by "col op value" itself.
> - Define new relation between operators. Value of one operator could
> be lower bound for value of another operator. So, planner can
> put "Priority
> queue" node when lower bound ordering is possible from index. Also "ALTER
> OPERATOR" command would be reasonable, so extensions could upgrade.

I think, it would be better to make it a property of the operator
class. We can add a column to pg_amop or define another value for
amoppurpose on pg_amop. Syntax can be something like this:

CREATE OPERATOR CLASS circle_ops DEFAULT
FOR TYPE circle USING gist AS
OPERATOR 15 <->(circle, point) FOR ORDER BY pg_catalog.float_ops LOWER BOUND;

While looking at it, I realize that current version of the patch does
not use the sort operator family defined with the operator class. It
assumes that the distance function will return values compatible with
the operator. Operator class definition makes me think that there is
not such an assumption.

> Besides overhead, this way makes significant infrastructural changes. So,
> it may be over-engineering. However, it's probably more clean and beautiful
> solution.
> I would like to get some feedback from people familiar with KNN-GiST like
> Heikki or Tom. What do you think about this? Any other ideas?

I would be happy to test and review the changes. I think it is nice
to solve the problem in a generalized way improving the access method
infrastructure. Definitely, we should have a consensus on the design
before working on the infrastructure changes.


From: Peter Geoghegan <pg(at)heroku(dot)com>
To: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: KNN-GiST with recheck
Date: 2014-11-21 23:20:56
Message-ID: CAM3SWZSMdDZnLUs_AoHabDVndRjtJPVLQJk04Bjtr24Wu6Nv4w@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Jan 13, 2014 at 9:17 AM, Alexander Korotkov
<aekorotkov(at)gmail(dot)com> wrote:
> This patch was split from thread:
> http://www.postgresql.org/message-id/CAPpHfdscOX5an71nHd8WSUH6GNOCf=V7wgDaTXdDd9=goN-gfA@mail.gmail.com
>
> I've split it to separate thead, because it's related to partial sort only
> conceptually not technically. Also I renamed it to "knn-gist-recheck" from
> "partial-knn" as more appropriate name. In the attached version docs are
> updated. Possible weak point of this patch design is that it fetches heap
> tuple from GiST scan. However, I didn't receive any notes about its design,
> so, I'm going to put it to commitfest.

The partial sort thing is not in the current 2014-10 commitfest
(although this patch is). Is that intentional?

--
Peter Geoghegan


From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Peter Geoghegan <pg(at)heroku(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: KNN-GiST with recheck
Date: 2014-11-28 09:29:28
Message-ID: CAPpHfdt6KM965n9ZrHmNcXZ6WkfSPX3KYta1rjbms4xfmghMPw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sat, Nov 22, 2014 at 2:20 AM, Peter Geoghegan <pg(at)heroku(dot)com> wrote:

> On Mon, Jan 13, 2014 at 9:17 AM, Alexander Korotkov
> <aekorotkov(at)gmail(dot)com> wrote:
> > This patch was split from thread:
> >
> http://www.postgresql.org/message-id/CAPpHfdscOX5an71nHd8WSUH6GNOCf=V7wgDaTXdDd9=goN-gfA@mail.gmail.com
> >
> > I've split it to separate thead, because it's related to partial sort
> only
> > conceptually not technically. Also I renamed it to "knn-gist-recheck"
> from
> > "partial-knn" as more appropriate name. In the attached version docs are
> > updated. Possible weak point of this patch design is that it fetches heap
> > tuple from GiST scan. However, I didn't receive any notes about its
> design,
> > so, I'm going to put it to commitfest.
>
>
> The partial sort thing is not in the current 2014-10 commitfest
> (although this patch is). Is that intentional?

It's not. I just didn't revise partial sort yet :(

------
With best regards,
Alexander Korotkov.


From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: GiST kNN search queue (Re: KNN-GiST with recheck)
Date: 2014-12-10 15:50:16
Message-ID: 54886BB8.9040000@vmware.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 01/28/2014 04:12 PM, Alexander Korotkov wrote:
>> >3. A binary heap would be a better data structure to buffer the rechecked
>> >values. A Red-Black tree allows random insertions and deletions, but in
>> >this case you need to insert arbitrary values but only remove the minimum
>> >item. That's exactly what a binary heap excels at. We have a nice binary
>> >heap implementation in the backend that you can use, see
>> >src/backend/lib/binaryheap.c.
>> >
> Hmm. For me binary heap would be a better data structure for KNN-GiST at
> all :-)

I decided to give this a shot, replacing the red-black tree in GiST with
the binary heap we have in lib/binaryheap.c. It made the GiST code
somewhat simpler, as the binaryheap interface is simpler than the
red-black tree one. Unfortunately, performance was somewhat worse. That
was quite surprising, as insertions and deletions are both O(log N) in
both data structures, but the red-black tree implementation is more
complicated.

I implemented another data structure called a Pairing Heap. It's also a
fairly simple data structure, but insertions are O(1) instead of O(log
N). It also performs fairly well in practice.

With that, I got a small but measurable improvement. To test, I created
a table like this:

create table gisttest (id integer, p point);
insert into gisttest select id, point(random(), random()) from
generate_series(1, 1000000) id;
create index i_gisttest on gisttest using gist (p);

And I ran this query with pgbench:

select id from gisttest order by p <-> '(0,0)' limit 1000;

With unpatched master, I got about 650 TPS, and with the patch 720 TPS.
That's a nice little improvement, but perhaps more importantly, the
pairing heap implementation consumes less memory. To measure that, I put
a MemoryContextStats(so->queueCtx) call into gistendscan. With the above
query, but without the "limit" clause, on master I got:

GiST scan context: 2109752 total in 10 blocks; 2088456 free (24998
chunks); 21296 used

And with the patch:

GiST scan context: 1061160 total in 9 blocks; 1040088 free (12502
chunks); 21072 used

That's 2MB vs 1MB. While that's not much in absolute terms, it'd be nice
to reduce that memory consumption, as there is no hard upper bound on
how much might be needed. If the GiST tree is really disorganized for
some reason, a query might need a lot more.

So all in all, I quite like this patch, even though it doesn't do
anything too phenomenal. It adds a some code, in the form of the new
pairing heap implementation, but it makes the GiST code a little bit
simpler. And it gives a small performance gain, and reduces memory usage
a bit.

- Heikki

Attachment Content-Type Size
knn-gist-pairingheap-1.patch text/x-diff 17.5 KB

From: Arthur Silva <arthurprs(at)gmail(dot)com>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: Alexander Korotkov <aekorotkov(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: GiST kNN search queue (Re: KNN-GiST with recheck)
Date: 2014-12-10 20:59:43
Message-ID: CAO_YK0UWipBifVgNY1zqha8DJi84Kr=WEhDFLvneKo9Wzj2fBw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Dec 10, 2014 at 1:50 PM, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com
> wrote:

> On 01/28/2014 04:12 PM, Alexander Korotkov wrote:
>
>> >3. A binary heap would be a better data structure to buffer the rechecked
>>> >values. A Red-Black tree allows random insertions and deletions, but in
>>> >this case you need to insert arbitrary values but only remove the
>>> minimum
>>> >item. That's exactly what a binary heap excels at. We have a nice binary
>>> >heap implementation in the backend that you can use, see
>>> >src/backend/lib/binaryheap.c.
>>> >
>>>
>> Hmm. For me binary heap would be a better data structure for KNN-GiST at
>> all :-)
>>
>
> I decided to give this a shot, replacing the red-black tree in GiST with
> the binary heap we have in lib/binaryheap.c. It made the GiST code somewhat
> simpler, as the binaryheap interface is simpler than the red-black tree
> one. Unfortunately, performance was somewhat worse. That was quite
> surprising, as insertions and deletions are both O(log N) in both data
> structures, but the red-black tree implementation is more complicated.
>
> I implemented another data structure called a Pairing Heap. It's also a
> fairly simple data structure, but insertions are O(1) instead of O(log N).
> It also performs fairly well in practice.
>
> With that, I got a small but measurable improvement. To test, I created a
> table like this:
>
> create table gisttest (id integer, p point);
> insert into gisttest select id, point(random(), random()) from
> generate_series(1, 1000000) id;
> create index i_gisttest on gisttest using gist (p);
>
> And I ran this query with pgbench:
>
> select id from gisttest order by p <-> '(0,0)' limit 1000;
>
> With unpatched master, I got about 650 TPS, and with the patch 720 TPS.
> That's a nice little improvement, but perhaps more importantly, the pairing
> heap implementation consumes less memory. To measure that, I put a
> MemoryContextStats(so->queueCtx) call into gistendscan. With the above
> query, but without the "limit" clause, on master I got:
>
> GiST scan context: 2109752 total in 10 blocks; 2088456 free (24998
> chunks); 21296 used
>
> And with the patch:
>
> GiST scan context: 1061160 total in 9 blocks; 1040088 free (12502 chunks);
> 21072 used
>
> That's 2MB vs 1MB. While that's not much in absolute terms, it'd be nice
> to reduce that memory consumption, as there is no hard upper bound on how
> much might be needed. If the GiST tree is really disorganized for some
> reason, a query might need a lot more.
>
>
> So all in all, I quite like this patch, even though it doesn't do anything
> too phenomenal. It adds a some code, in the form of the new pairing heap
> implementation, but it makes the GiST code a little bit simpler. And it
> gives a small performance gain, and reduces memory usage a bit.
>
> - Heikki
>
>
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers(at)postgresql(dot)org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers
>
>
It may be better to replace the lib/binaryheap altogether as it offers
comparable/better performance.


From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Arthur Silva <arthurprs(at)gmail(dot)com>
Cc: Alexander Korotkov <aekorotkov(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: GiST kNN search queue (Re: KNN-GiST with recheck)
Date: 2014-12-10 21:50:19
Message-ID: 5488C01B.20101@vmware.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 12/10/2014 10:59 PM, Arthur Silva wrote:
> It may be better to replace the lib/binaryheap altogether as it offers
> comparable/better performance.

It's not always better. A binary heap is more memory-efficient, for
starters.

There are only two uses of lib/binaryheap: reorderbuffer.c and merge
append. Reorderbuffer isn't that performance critical, although a binary
heap may well be better there, because the comparison function is very
cheap. For merge append, it might be a win, especially if the comparison
function is expensive. (That's on the assumption that the overall number
of comparisons needed with a pairing heap is smaller - I'm not sure how
true that is). That would be worth testing.

I'd love to test some other heap implementation in in tuplesort.c. It
has a custom binary heap implementation that's used in the final merge
phase of an external sort, and also when doing a so-called bounded sort,
i.e. "ORDER BY x LIMIT Y". But that would be difficult to replace,
because tuplesort.c collects tuples in an array in memory, and then
turns that into a heap. An array is efficient to turn into a binary
heap, but to switch to another data structure, you'd suddenly need extra
memory. And we do the switch when we run out of work_mem, so allocating
more isn't really an option.

- Heikki


From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: Alexander Korotkov <aekorotkov(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: GiST kNN search queue (Re: KNN-GiST with recheck)
Date: 2014-12-15 01:49:04
Message-ID: CAB7nPqT9n--PHbHuR_jOU_ez8dyu7ophdhrc4RD_E0g18_iCmw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Dec 11, 2014 at 12:50 AM, Heikki Linnakangas
<hlinnakangas(at)vmware(dot)com> wrote:
> On 01/28/2014 04:12 PM, Alexander Korotkov wrote:
>>>
>>> >3. A binary heap would be a better data structure to buffer the
>>> > rechecked
>>> >values. A Red-Black tree allows random insertions and deletions, but in
>>> >this case you need to insert arbitrary values but only remove the
>>> > minimum
>>> >item. That's exactly what a binary heap excels at. We have a nice binary
>>> >heap implementation in the backend that you can use, see
>>> >src/backend/lib/binaryheap.c.
>>> >
>>
>> Hmm. For me binary heap would be a better data structure for KNN-GiST at
>> all :-)
>
>
> I decided to give this a shot, replacing the red-black tree in GiST with the
> binary heap we have in lib/binaryheap.c. It made the GiST code somewhat
> simpler, as the binaryheap interface is simpler than the red-black tree one.
> Unfortunately, performance was somewhat worse. That was quite surprising, as
> insertions and deletions are both O(log N) in both data structures, but the
> red-black tree implementation is more complicated.
>
> I implemented another data structure called a Pairing Heap. It's also a
> fairly simple data structure, but insertions are O(1) instead of O(log N).
> It also performs fairly well in practice.
>
> With that, I got a small but measurable improvement. To test, I created a
> table like this:
>
> create table gisttest (id integer, p point);
> insert into gisttest select id, point(random(), random()) from
> generate_series(1, 1000000) id;
> create index i_gisttest on gisttest using gist (p);
>
> And I ran this query with pgbench:
>
> select id from gisttest order by p <-> '(0,0)' limit 1000;
>
> With unpatched master, I got about 650 TPS, and with the patch 720 TPS.
> That's a nice little improvement, but perhaps more importantly, the pairing
> heap implementation consumes less memory. To measure that, I put a
> MemoryContextStats(so->queueCtx) call into gistendscan. With the above
> query, but without the "limit" clause, on master I got:
>
> GiST scan context: 2109752 total in 10 blocks; 2088456 free (24998 chunks);
> 21296 used
>
> And with the patch:
>
> GiST scan context: 1061160 total in 9 blocks; 1040088 free (12502 chunks);
> 21072 used
>
> That's 2MB vs 1MB. While that's not much in absolute terms, it'd be nice to
> reduce that memory consumption, as there is no hard upper bound on how much
> might be needed. If the GiST tree is really disorganized for some reason, a
> query might need a lot more.
>
>
> So all in all, I quite like this patch, even though it doesn't do anything
> too phenomenal. It adds a some code, in the form of the new pairing heap
> implementation, but it makes the GiST code a little bit simpler. And it
> gives a small performance gain, and reduces memory usage a bit.
Hum. It looks that this patch using binary heap is intended to be a
replacement red-black tree method. Any reason why it isn't added to
the CF to track it?
--
Michael


From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: Alexander Korotkov <aekorotkov(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: KNN-GiST with recheck
Date: 2014-12-15 01:56:54
Message-ID: CAB7nPqTODFCmHEU9MfQUmNuWX8VsJFS25=ptNqw0ew6FdZOOww@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Jan 28, 2014 at 10:54 PM, Heikki Linnakangas
<hlinnakangas(at)vmware(dot)com> wrote:
> 1. This patch introduces a new "polygon <-> point" operator. That seems
> useful on its own, with or without this patch.
This patch is tracked with this entry in the commit fest app and is
marked as "Ready for committer". Hence I am moving this specific part
to 2014-12 to keep track of it:

> 3. A binary heap would be a better data structure to buffer the rechecked
> values. A Red-Black tree allows random insertions and deletions, but in this
> case you need to insert arbitrary values but only remove the minimum item.
> That's exactly what a binary heap excels at. We have a nice binary heap
> implementation in the backend that you can use, see
> src/backend/lib/binaryheap.c.
Based on those comments, I am marking this entry as returned with feedback:
https://commitfest.postgresql.org/action/patch_view?id=1367
Heikki has sent as well a new patch to use a binary heap method
instead of the red-black tree here:
http://www.postgresql.org/message-id/54886BB8.9040000@vmware.com
IMO this last patch should be added in the CF app, that's not the case now.
Regards,
--
Michael


From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
Cc: Alexander Korotkov <aekorotkov(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: GiST kNN search queue (Re: KNN-GiST with recheck)
Date: 2014-12-15 13:08:28
Message-ID: 548EDD4C.3030600@vmware.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 12/15/2014 03:49 AM, Michael Paquier wrote:
> On Thu, Dec 11, 2014 at 12:50 AM, Heikki Linnakangas
> <hlinnakangas(at)vmware(dot)com> wrote:
>> On 01/28/2014 04:12 PM, Alexander Korotkov wrote:
>>>>
>>>>> 3. A binary heap would be a better data structure to buffer the
>>>>> rechecked
>>>>> values. A Red-Black tree allows random insertions and deletions, but in
>>>>> this case you need to insert arbitrary values but only remove the
>>>>> minimum
>>>>> item. That's exactly what a binary heap excels at. We have a nice binary
>>>>> heap implementation in the backend that you can use, see
>>>>> src/backend/lib/binaryheap.c.
>>>>>
>>>
>>> Hmm. For me binary heap would be a better data structure for KNN-GiST at
>>> all :-)
>>
>>
>> I decided to give this a shot, replacing the red-black tree in GiST with the
>> binary heap we have in lib/binaryheap.c. It made the GiST code somewhat
>> simpler, as the binaryheap interface is simpler than the red-black tree one.
>> Unfortunately, performance was somewhat worse. That was quite surprising, as
>> insertions and deletions are both O(log N) in both data structures, but the
>> red-black tree implementation is more complicated.
>>
>> I implemented another data structure called a Pairing Heap. It's also a
>> fairly simple data structure, but insertions are O(1) instead of O(log N).
>> It also performs fairly well in practice.
>>
>> With that, I got a small but measurable improvement. To test, I created a
>> table like this:
>>
>> create table gisttest (id integer, p point);
>> insert into gisttest select id, point(random(), random()) from
>> generate_series(1, 1000000) id;
>> create index i_gisttest on gisttest using gist (p);
>>
>> And I ran this query with pgbench:
>>
>> select id from gisttest order by p <-> '(0,0)' limit 1000;
>>
>> With unpatched master, I got about 650 TPS, and with the patch 720 TPS.
>> That's a nice little improvement, but perhaps more importantly, the pairing
>> heap implementation consumes less memory. To measure that, I put a
>> MemoryContextStats(so->queueCtx) call into gistendscan. With the above
>> query, but without the "limit" clause, on master I got:
>>
>> GiST scan context: 2109752 total in 10 blocks; 2088456 free (24998 chunks);
>> 21296 used
>>
>> And with the patch:
>>
>> GiST scan context: 1061160 total in 9 blocks; 1040088 free (12502 chunks);
>> 21072 used
>>
>> That's 2MB vs 1MB. While that's not much in absolute terms, it'd be nice to
>> reduce that memory consumption, as there is no hard upper bound on how much
>> might be needed. If the GiST tree is really disorganized for some reason, a
>> query might need a lot more.
>>
>>
>> So all in all, I quite like this patch, even though it doesn't do anything
>> too phenomenal. It adds a some code, in the form of the new pairing heap
>> implementation, but it makes the GiST code a little bit simpler. And it
>> gives a small performance gain, and reduces memory usage a bit.
> Hum. It looks that this patch using binary heap is intended to be a
> replacement red-black tree method.

Right.

Here's a new version of the patch. It now uses the same pairing heap
code that I posted in the other thread ("advance local xmin more
aggressivley",
http://www.postgresql.org/message-id/5488ACF0.8050901@vmware.com). The
pairingheap_remove() function is unused in this patch, but it is needed
by that other patch.

> Any reason why it isn't added to the CF to track it?

No. Will add.

- Heikki

Attachment Content-Type Size
knn-gist-pairingheap-2.patch text/x-diff 19.6 KB

From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Alexander Korotkov <aekorotkov(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: GiST kNN search queue (Re: KNN-GiST with recheck)
Date: 2014-12-15 13:14:26
Message-ID: 20141215131426.GH5023@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2014-12-15 15:08:28 +0200, Heikki Linnakangas wrote:
> +/*-------------------------------------------------------------------------
> + *
> + * pairingheap.c
> + * A Pairing Heap implementation
> + *
> + * Portions Copyright (c) 2012-2014, PostgreSQL Global Development Group
> + *
> + * IDENTIFICATION
> + * src/backend/lib/pairingheap.c
> + *
> + *-------------------------------------------------------------------------
> + */

> diff --git a/src/include/lib/pairingheap.h b/src/include/lib/pairingheap.h
> new file mode 100644
> index 0000000..e78196d
> --- /dev/null
> +++ b/src/include/lib/pairingheap.h
> @@ -0,0 +1,67 @@
> +/*
> + * pairingheap.h
> + *
> + * A Pairing Heap implementation
> + *
> + * Portions Copyright (c) 2012-2014, PostgreSQL Global Development Group
> + *
> + * src/include/lib/pairingheap.h
> + */
> +

If we add another heap implementation we probably should at least hint
at the different advantages somewhere.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Emre Hasegeli <emre(at)hasegeli(dot)com>, Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: KNN-GiST with recheck
Date: 2014-12-15 15:16:21
Message-ID: 548EFB45.1000705@vmware.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 08/03/2014 04:48 PM, Emre Hasegeli wrote:
>>> 1. This patch introduces a new "polygon <-> point" operator. That seems
>>> useful on its own, with or without this patch.
>>
>> Yeah, but exact-knn cant come with no one implementation. But it would
>> better come in a separate patch.
>
> I tried to split them. Separated patches are attached. I changed
> the order of the arguments as point <-> polygon, because point was
> the first one on all the others. Its commutator was required for
> the index, so I added it on the second patch. I also added tests
> for the operator. I think it is ready for committer as a separate
> patch. We can add it to the open CommitFest.

Ok, committed this part now with minor changes. The implementation was
copy-pasted from circle <-> polygon, so I put the common logic to a
dist_ppoly_internal function, and called that in both dist_cpoly and
dist_ppoly.

I was surprised that there were no documentation changes in the patch,
but looking at the docs, we just list the geometric operators without
explaining what the argument types are. That's not very comprehensive,
might be good to expand the docs on that, but it's not this patch's fault.

- Heikki


From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Alexander Korotkov <aekorotkov(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: GiST kNN search queue (Re: KNN-GiST with recheck)
Date: 2014-12-15 21:59:58
Message-ID: CAMkU=1xTWjb9bfBOHrmJ7YOVwT4kXEEnX=d8Qa81kPK98WUnaQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Dec 15, 2014 at 5:08 AM, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com
> wrote:

Here's a new version of the patch. It now uses the same pairing heap code
> that I posted in the other thread ("advance local xmin more aggressivley",
> http://www.postgresql.org/message-id/5488ACF0.8050901@vmware.com). The
> pairingheap_remove() function is unused in this patch, but it is needed by
> that other patch.

Under enable-cassert, this tries to call pairingheap_empty, but that
function doesn't exist.

I looked in the other patch and didn't find it defined there, either.

cheers,

Jeff


From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Alexander Korotkov <aekorotkov(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: GiST kNN search queue (Re: KNN-GiST with recheck)
Date: 2014-12-16 09:23:56
Message-ID: 548FFA2C.7060000@vmware.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 12/15/2014 11:59 PM, Jeff Janes wrote:
> On Mon, Dec 15, 2014 at 5:08 AM, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com
>> wrote:
>
> Here's a new version of the patch. It now uses the same pairing heap code
>> that I posted in the other thread ("advance local xmin more aggressivley",
>> http://www.postgresql.org/message-id/5488ACF0.8050901@vmware.com). The
>> pairingheap_remove() function is unused in this patch, but it is needed by
>> that other patch.
>
> Under enable-cassert, this tries to call pairingheap_empty, but that
> function doesn't exist.
>
> I looked in the other patch and didn't find it defined there, either.

Ah, I renamed pairingheap_empty to pairingheap_is_empty at the last
minute, and missed the Asserts. Here's a corrected version.

- Heikki

Attachment Content-Type Size
knn-gist-pairingheap-3.patch text/x-diff 19.6 KB

From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Emre Hasegeli <emre(at)hasegeli(dot)com>, Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "PostgreSQL Hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: KNN-GiST with recheck
Date: 2014-12-16 13:37:00
Message-ID: 5490357C.8090800@vmware.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 10/06/2014 12:36 PM, Emre Hasegeli wrote:
>> Thanks. The main question now is design of this patch. Currently, it does
>> all the work inside access method. We already have some discussion of pro
>> and cons of this method. I would like to clarify alternatives now. I can
>> see following way:
>>
>> 1. Implement new executor node which performs sorting by priority queue.
>> Let's call it "Priority queue". I think it should be separate node from
>> "Sort" node. Despite "Priority queue" and "Sort" are essentially similar
>> from user view, they would be completely different in implementation.
>> 2. Implement some interface to transfer distance values from access
>> method to "Priority queue" node.
>
> If we assume that all of them need recheck, maybe it can be done
> without passing distance values.

No, the executor needs the lower-bound distance value, as calculated by
the indexam, so that it knows which tuples it can return from the queue
already. For example, imagine the following items coming from the index:

tuple # lower bound actual distance
1 1 1
2 2 10
3 30 30
4 40 40

After the executor has fetched tuple 2, and re-checked the distance, it
pushes the tuple to the queue. It then fetches tuple 3, with lower bound
30, and it can now immediately return tuple # 2 from the queue. Because
10 < 30, so there cannot be any more tuples coming from the index that
would need to go before tuple # 2.

The executor needs the lower bound as calculated by the index, as well
as the actual distance it calculates itself, to make those decisions.

>> 3. Somehow tell the planner that it could use "Priority queue" in
>> corresponding cases. I see two ways of doing this:
>> - Add flag to operator in opclass indicating that index can only
>> order by lower bound of "col op value", not by "col op value" itself.
>> - Define new relation between operators. Value of one operator could
>> be lower bound for value of another operator. So, planner can
>> put "Priority
>> queue" node when lower bound ordering is possible from index. Also "ALTER
>> OPERATOR" command would be reasonable, so extensions could upgrade.
>
> I think, it would be better to make it a property of the operator
> class. We can add a column to pg_amop or define another value for
> amoppurpose on pg_amop. Syntax can be something like this:
>
> CREATE OPERATOR CLASS circle_ops DEFAULT
> FOR TYPE circle USING gist AS
> OPERATOR 15 <->(circle, point) FOR ORDER BY pg_catalog.float_ops LOWER BOUND;
>
> While looking at it, I realize that current version of the patch does
> not use the sort operator family defined with the operator class. It
> assumes that the distance function will return values compatible with
> the operator. Operator class definition makes me think that there is
> not such an assumption.

Yeah. I also noticed that the type of the argument passed to the
consistent function varies, and doesn't necessarily match that declared
in pg_proc. Looking at gist_point_consistent, the argument type can be a
point, a polygon, or a circle, depending on the "strategy group". But
it's declared as a point in pg_proc.

>> Besides overhead, this way makes significant infrastructural changes. So,
>> it may be over-engineering. However, it's probably more clean and beautiful
>> solution.
>> I would like to get some feedback from people familiar with KNN-GiST like
>> Heikki or Tom. What do you think about this? Any other ideas?
>
> I would be happy to test and review the changes. I think it is nice
> to solve the problem in a generalized way improving the access method
> infrastructure. Definitely, we should have a consensus on the design
> before working on the infrastructure changes.

I took a stab on this. I added the reorder queue directly to the Index
Scan node, rather than adding a whole new node type for it. It seems
reasonable, as Index Scan is responsible for rechecking the quals, too,
even though re-ordering the tuples is more complicated than rechecking
quals.

To recap, the idea is that the index can define an ordering op, even if
it cannot return the tuples in exactly the right order. It is enough
that for each tuple, it returns a lower bound of the expression that is
used for sorting. For example, for "ORDER BY key <-> column", it is
enough that it returns a lower bound of "key <-> column" for each tuple.
The index must return the tuples ordered by the lower bounds. The
executor re-checks the expressions, and re-orders the tuples to the
correct order.

Patch attached. It should be applied on top of my pairing heap patch at
http://www.postgresql.org/message-id/548FFA2C.7060000@vmware.com. Some
caveats:

* The signature of the distance function is unchanged, it doesn't get a
recheck argument. It is just assumed that if the consistent function
sets the recheck flag, then the distance needs to be rechecked as well.
We might want to add the recheck argument, like you Alexander did in
your patch, but it's not important right now.

* I used the "distance" term in the executor, although the ORDER BY expr
machinery is more general than that. The value returned by the ORDER BY
expression doesn't have to be a distance, although that's the only thing
supported by GiST and the built-in opclasses.

* I short-circuited the planner to assume that the ORDER BY expression
always returns a float. That's true today for knn-GiST, but is obviously
a bogus assumption in general.

This needs some work to get into a committable state, but from a
modularity point of view, this is much better than having the indexam to
peek into the heap.

- Heikki

Attachment Content-Type Size
knn-gist-recheck-distance-in-executor-1.patch text/x-diff 34.9 KB

From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Alexander Korotkov <aekorotkov(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: GiST kNN search queue (Re: KNN-GiST with recheck)
Date: 2014-12-17 14:07:34
Message-ID: 54918E26.10201@vmware.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 12/15/2014 03:14 PM, Andres Freund wrote:
> If we add another heap implementation we probably should at least hint
> at the different advantages somewhere.

How about adding a src/backend/lib/README for that, per attached?

- Heikki

Attachment Content-Type Size
knn-gist-pairingheap-4.patch text/x-diff 20.6 KB

From: Peter Geoghegan <pg(at)heroku(dot)com>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Alexander Korotkov <aekorotkov(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: GiST kNN search queue (Re: KNN-GiST with recheck)
Date: 2014-12-20 20:50:09
Message-ID: CAM3SWZQOBog8FiWA5WOYyATzm5y0ac8bj077F0vGqg__1HOMYQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Dec 17, 2014 at 6:07 AM, Heikki Linnakangas
<hlinnakangas(at)vmware(dot)com> wrote:
> How about adding a src/backend/lib/README for that, per attached?

I took a quick look at this. Some observations:

* I like the idea of adding a .../lib README. However, the README
fails to note that ilist.c implements an *integrated* list, unlike the
much more prevalent cell-based List structure. It should note that,
since that's the whole point of ilist.c.

* pairingheap_remove() is technically dead code. It still makes sense
that you'd have it in this patch, but I think there's an argument for
not including it at all on the theory that if you need to use it you
should use a different data structure. After all, the actual
(non-amortized) complexity of that operation is O(n) [1], and if
remove operations are infrequent as we might expect, that might be the
more important consideration. As long as you are including
pairingheap_remove(), though, why is the local variable "prev_ptr" a
pointer to a pointer to a pairingheap_node, rather than just a pointer
to a pairingheap_node?

* Similarly, the function-like macro pairingheap_reset() doesn't seem
to pull its weight. Why does it exist alongside pairingheap_free()?
I'm not seeing a need to re-use a heap like that.

* "Assert(!pairingheap_is_empty(heap))" appears in a few places.
You're basically asserting that a pointer isn't null, often
immediately before dereferencing the pointer. This seems to be of
questionable value.

* I think that the indentation of code could use some tweaking.

* More comments, please. In particular, comment the struct fields in
pairingheap_node. There are various blocks of code that could use at
least an additional terse comment, too.

* You talked about tuplesort.c integration. In order for that to
happen, I think the comparator logic should know less about min-heaps.
This should formally be a max-heap, with the ability to provide
customizations only encapsulated in the comparator (like inverting the
comparison logic to get a min-heap, or like custom NULLs first/last
behavior). So IMV this comment should be more generic/anticipatory:

+ /*
+ * For a max-heap, the comparator must return <0 iff a < b, 0 iff a == b,
+ * and >0 iff a > b. For a min-heap, the conditions are reversed.
+ */
+ typedef int (*pairingheap_comparator) (const pairingheap_node *a,
const pairingheap_node *b, void *arg);

I think the functions should be called pairing_max_heap* for this
reason, too. Although that isn't consistent with binaryheap.c, so I
guess this whole argument is a non-starter.

* We should just move rbtree.c to .../lib. We're not using CVS anymore
-- the history will be magically preserved.

Anyway, to get to the heart of the matter: in general, I think the
argument for the patch is sound. It's not a stellar improvement, but
it's worthwhile. That's all I have for now...

[1] https://www.cise.ufl.edu/~sahni/dsaac/enrich/c13/pairing.htm
--
Peter Geoghegan


From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Peter Geoghegan <pg(at)heroku(dot)com>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Alexander Korotkov <aekorotkov(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: GiST kNN search queue (Re: KNN-GiST with recheck)
Date: 2014-12-22 10:07:35
Message-ID: 5497ED67.2090109@vmware.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 12/20/2014 10:50 PM, Peter Geoghegan wrote:
> On Wed, Dec 17, 2014 at 6:07 AM, Heikki Linnakangas
> <hlinnakangas(at)vmware(dot)com> wrote:
>> How about adding a src/backend/lib/README for that, per attached?
>
> I took a quick look at this. Some observations:
>
> * I like the idea of adding a .../lib README. However, the README
> fails to note that ilist.c implements an *integrated* list, unlike the
> much more prevalent cell-based List structure. It should note that,
> since that's the whole point of ilist.c.

Added a sentence on that.

> * pairingheap_remove() is technically dead code. It still makes sense
> that you'd have it in this patch, but I think there's an argument for
> not including it at all on the theory that if you need to use it you
> should use a different data structure. After all, the actual
> (non-amortized) complexity of that operation is O(n) [1], and if
> remove operations are infrequent as we might expect, that might be the
> more important consideration. As long as you are including
> pairingheap_remove(), though, why is the local variable "prev_ptr" a
> pointer to a pointer to a pairingheap_node, rather than just a pointer
> to a pairingheap_node?
>
> * Similarly, the function-like macro pairingheap_reset() doesn't seem
> to pull its weight. Why does it exist alongside pairingheap_free()?
> I'm not seeing a need to re-use a heap like that.

pairingheap_remove and pairingheap_reset are both unused in this patch,
but they were needed for the other use case, tracking snapshots to
advance xmin more aggressively, discussed here:
http://www.postgresql.org/message-id/5488ACF0.8050901@vmware.com. In
fact, without the pairingheap_remove() operation, the prev_or_parent
pointer wouldn't be necessarily at all. We could've added them as a
separate patch, but that seems like unnecessary code churn.

The prev_ptr variable is used to set the parent's first_child pointer,
or the previous sibling's next_sibling pointer, depending on whether the
removed node is the parent's first child or not. I'll add more comments
in pairingheap_remove to explain that.

> * "Assert(!pairingheap_is_empty(heap))" appears in a few places.
> You're basically asserting that a pointer isn't null, often
> immediately before dereferencing the pointer. This seems to be of
> questionable value.

I copied that from binaryheap.c. It has some documentation value; they
make it easy to see that the functions require the heap to not be empty.
It's also explained in comments, but still.

> * I think that the indentation of code could use some tweaking.
>
> * More comments, please. In particular, comment the struct fields in
> pairingheap_node. There are various blocks of code that could use at
> least an additional terse comment, too.

Added some comments, hope it's better now.

> * You talked about tuplesort.c integration. In order for that to
> happen, I think the comparator logic should know less about min-heaps.
> This should formally be a max-heap, with the ability to provide
> customizations only encapsulated in the comparator (like inverting the
> comparison logic to get a min-heap, or like custom NULLs first/last
> behavior). So IMV this comment should be more generic/anticipatory:
>
> + /*
> + * For a max-heap, the comparator must return <0 iff a < b, 0 iff a == b,
> + * and >0 iff a > b. For a min-heap, the conditions are reversed.
> + */
> + typedef int (*pairingheap_comparator) (const pairingheap_node *a,
> const pairingheap_node *b, void *arg);
>
> I think the functions should be called pairing_max_heap* for this
> reason, too. Although that isn't consistent with binaryheap.c, so I
> guess this whole argument is a non-starter.

I don't see what the problem is. The pairingheap.c (and binaryheap.c)
code works the same for min and max-heaps. The comments assume a
max-heap in a few places, but that seems OK to me in the context.

> * We should just move rbtree.c to .../lib. We're not using CVS anymore
> -- the history will be magically preserved.

Yeah, I tend to agree. Tom Lane has not liked moving things, because it
breaks back-patching. That's true in general, even though git has some
smarts to follow renames. I think it would work in this case, though.
Anyway, let's discuss and do that as a separate patch, so that we don't
get stuck on that.

> Anyway, to get to the heart of the matter: in general, I think the
> argument for the patch is sound. It's not a stellar improvement, but
> it's worthwhile. That's all I have for now...

Ok, thanks for the review! I have committed this, with some cleanup and
more comments added.

- Heikki


From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: Emre Hasegeli <emre(at)hasegeli(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: KNN-GiST with recheck
Date: 2015-01-07 22:12:09
Message-ID: CAPpHfdtm=A3hrw4=mhNxN6U2Q9bx4BYDhW8gRLxgaZD3MAQKCA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Dec 16, 2014 at 4:37 PM, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com
> wrote:

> Patch attached. It should be applied on top of my pairing heap patch at
> http://www.postgresql.org/message-id/548FFA2C.7060000@vmware.com. Some
> caveats:
>
> * The signature of the distance function is unchanged, it doesn't get a
> recheck argument. It is just assumed that if the consistent function sets
> the recheck flag, then the distance needs to be rechecked as well. We might
> want to add the recheck argument, like you Alexander did in your patch, but
> it's not important right now.
>

I didn't get how that expected to work if we have only order by qual
without filter qual. In this case consistent function just isn't called at
all.

* I used the "distance" term in the executor, although the ORDER BY expr
> machinery is more general than that. The value returned by the ORDER BY
> expression doesn't have to be a distance, although that's the only thing
> supported by GiST and the built-in opclasses.
>
> * I short-circuited the planner to assume that the ORDER BY expression
> always returns a float. That's true today for knn-GiST, but is obviously a
> bogus assumption in general.
>
> This needs some work to get into a committable state, but from a
> modularity point of view, this is much better than having the indexam to
> peek into the heap.

Nice idea to put reordering into index scan node. Doesn't look like much of
overengineering. I'm going to bring it to more commitable state.

------
With best regards,
Alexander Korotkov.


From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: Emre Hasegeli <emre(at)hasegeli(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: KNN-GiST with recheck
Date: 2015-02-15 11:08:21
Message-ID: CAPpHfdstkF4RW_1cYje=EZyTx5VRgUK_+L3c2ucBASdD156tVg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Jan 8, 2015 at 1:12 AM, Alexander Korotkov <aekorotkov(at)gmail(dot)com>
wrote:

> On Tue, Dec 16, 2014 at 4:37 PM, Heikki Linnakangas <
> hlinnakangas(at)vmware(dot)com> wrote:
>
>> Patch attached. It should be applied on top of my pairing heap patch at
>> http://www.postgresql.org/message-id/548FFA2C.7060000@vmware.com. Some
>> caveats:
>>
>> * The signature of the distance function is unchanged, it doesn't get a
>> recheck argument. It is just assumed that if the consistent function sets
>> the recheck flag, then the distance needs to be rechecked as well. We might
>> want to add the recheck argument, like you Alexander did in your patch, but
>> it's not important right now.
>>
>
> I didn't get how that expected to work if we have only order by qual
> without filter qual. In this case consistent function just isn't called at
> all.
>
> * I used the "distance" term in the executor, although the ORDER BY expr
>> machinery is more general than that. The value returned by the ORDER BY
>> expression doesn't have to be a distance, although that's the only thing
>> supported by GiST and the built-in opclasses.
>>
>> * I short-circuited the planner to assume that the ORDER BY expression
>> always returns a float. That's true today for knn-GiST, but is obviously a
>> bogus assumption in general.
>>
>> This needs some work to get into a committable state, but from a
>> modularity point of view, this is much better than having the indexam to
>> peek into the heap.
>
>
> Nice idea to put reordering into index scan node. Doesn't look like much
> of overengineering. I'm going to bring it to more commitable state.
>

Following changes has been made in attached patch:

* Get sort operators from pathkeys.
* Recheck argument of distance function has been reverted.

------
With best regards,
Alexander Korotkov.

Attachment Content-Type Size
knn-gist-recheck-5.patch application/octet-stream 42.8 KB

From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: Peter Geoghegan <pg(at)heroku(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: GiST kNN search queue (Re: KNN-GiST with recheck)
Date: 2015-02-17 12:56:10
Message-ID: CAPpHfdvpsHUN=G3cxUPyVAxN2=nhY+x_49BHNnkdr9Q6a4Dt5g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi!

On Mon, Dec 22, 2014 at 1:07 PM, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com
> wrote:

> Ok, thanks for the review! I have committed this, with some cleanup and
> more comments added.
>

ISTM that checks in pairingheap_GISTSearchItem_cmp is incorrect. This
function should perform inverse comparison. Thus, if item a should be
checked first function should return 1. Current behavior doesn't lead to
incorrect query answers, but it could be slower than correct version.

------
With best regards,
Alexander Korotkov.

Attachment Content-Type Size
pairing_heap_cmp_fix.patch application/octet-stream 806 bytes

From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: Emre Hasegeli <emre(at)hasegeli(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: KNN-GiST with recheck
Date: 2015-02-17 13:21:49
Message-ID: CAPpHfduGjVpTLVxwXmeMuFqR-ykH-ERdewF-KHxSQiRdw29-7g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sun, Feb 15, 2015 at 2:08 PM, Alexander Korotkov <aekorotkov(at)gmail(dot)com>
wrote:

> Following changes has been made in attached patch:
>
> * Get sort operators from pathkeys.
> * Recheck argument of distance function has been reverted.
>

Few comments were added and pairing heap comparison function was fixed in
attached version of patch (knn-gist-recheck-6.patch).
Also I expected that reordering in executor would be slower than reordering
in GiST because of maintaining two heaps instead of one. I've revised
version of patch with reordering in GiST to use pairing heap. I compare two
types of reordering on 10^7 random points and polygons. Results are below.
Test shows that overhead of reordering in executor is insignificant (less
than statistical error).

Reorder in GiST Reorder in executor
points
limit=10 0.10615 0.0880125
limit=100 0.23666875 0.2292375
limit=1000 1.51486875 1.5208375
polygons
limit=10 0.11650625 0.1347
limit=100 0.46279375 0.45294375
limit=1000 3.5170125 3.54868125

Revised patch with reordering in GiST is attached
(knn-gist-recheck-in-gist.patch) as well as testing script (test.py).

------
With best regards,
Alexander Korotkov.

Attachment Content-Type Size
knn-gist-recheck-6.patch application/octet-stream 44.1 KB
knn-gist-recheck-in-gist.patch application/octet-stream 40.7 KB
test.py text/x-python-script 1.8 KB

From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc: Peter Geoghegan <pg(at)heroku(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: GiST kNN search queue (Re: KNN-GiST with recheck)
Date: 2015-02-17 21:00:47
Message-ID: 54E3ABFF.8050101@vmware.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 02/17/2015 02:56 PM, Alexander Korotkov wrote:
> Hi!
>
> On Mon, Dec 22, 2014 at 1:07 PM, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com
>> wrote:
>
>> Ok, thanks for the review! I have committed this, with some cleanup and
>> more comments added.
>
> ISTM that checks in pairingheap_GISTSearchItem_cmp is incorrect. This
> function should perform inverse comparison. Thus, if item a should be
> checked first function should return 1. Current behavior doesn't lead to
> incorrect query answers, but it could be slower than correct version.

Good catch. Fixed, thanks.

While testing this, I also noticed a bug in the pairing heap code
itself. Fixed that too.

- Heikki


From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: KNN-GiST with recheck
Date: 2015-02-24 14:39:58
Message-ID: 54EC8D3E.2000908@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi,

On 17.2.2015 14:21, Alexander Korotkov wrote:
> On Sun, Feb 15, 2015 at 2:08 PM, Alexander Korotkov
> <aekorotkov(at)gmail(dot)com <mailto:aekorotkov(at)gmail(dot)com>> wrote:
>
> Revised patch with reordering in GiST is attached
> (knn-gist-recheck-in-gist.patch) as well as testing script (test.py).

I meant to do a bit of testing on this (assuming it's still needed), but
the patches need rebasing - Heikki fixed a few issues, so they don't
apply cleanly.

regards

--
Tomas Vondra http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: KNN-GiST with recheck
Date: 2015-02-25 09:15:38
Message-ID: CAPpHfdsZVJnMMOuTUusJQmcR1QeRN4qc93-KE014_JHqYU717g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi!

On Tue, Feb 24, 2015 at 5:39 PM, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
wrote:

> On 17.2.2015 14:21, Alexander Korotkov wrote:
> > On Sun, Feb 15, 2015 at 2:08 PM, Alexander Korotkov
> > <aekorotkov(at)gmail(dot)com <mailto:aekorotkov(at)gmail(dot)com>> wrote:
> >
> > Revised patch with reordering in GiST is attached
> > (knn-gist-recheck-in-gist.patch) as well as testing script (test.py).
>
> I meant to do a bit of testing on this (assuming it's still needed), but
> the patches need rebasing - Heikki fixed a few issues, so they don't
> apply cleanly.
>

Both patches are revised.

------
With best regards,
Alexander Korotkov.

Attachment Content-Type Size
knn-gist-recheck-in-gist-2.patch application/octet-stream 40.8 KB
knn-gist-recheck-7.patch application/octet-stream 43.5 KB

From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: KNN-GiST with recheck
Date: 2015-04-17 09:05:58
Message-ID: CAPpHfdv40pmc7N3CcpBDEcTAnLT6dWaN76LP5Q9xzTvvdwRMQg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Feb 25, 2015 at 12:15 PM, Alexander Korotkov <aekorotkov(at)gmail(dot)com>
wrote:

> Hi!
>
> On Tue, Feb 24, 2015 at 5:39 PM, Tomas Vondra <
> tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>
>> On 17.2.2015 14:21, Alexander Korotkov wrote:
>> > On Sun, Feb 15, 2015 at 2:08 PM, Alexander Korotkov
>> > <aekorotkov(at)gmail(dot)com <mailto:aekorotkov(at)gmail(dot)com>> wrote:
>> >
>> > Revised patch with reordering in GiST is attached
>> > (knn-gist-recheck-in-gist.patch) as well as testing script (test.py).
>>
>> I meant to do a bit of testing on this (assuming it's still needed), but
>> the patches need rebasing - Heikki fixed a few issues, so they don't
>> apply cleanly.
>>
>
> Both patches are revised.
>

Both patches are rebased against current master.

------
With best regards,
Alexander Korotkov.

Attachment Content-Type Size
knn-gist-recheck-8.patch application/octet-stream 44.0 KB
knn-gist-recheck-in-gist-3.patch application/octet-stream 41.1 KB

From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: Alexander Korotkov <aekorotkov(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: KNN-GiST with recheck
Date: 2015-05-13 19:16:15
Message-ID: 5553A2FF.1030501@iki.fi
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 04/17/2015 12:05 PM, Alexander Korotkov wrote:
> On Wed, Feb 25, 2015 at 12:15 PM, Alexander Korotkov <aekorotkov(at)gmail(dot)com>
> wrote:
>
>> Hi!
>>
>> On Tue, Feb 24, 2015 at 5:39 PM, Tomas Vondra <
>> tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>>
>>> On 17.2.2015 14:21, Alexander Korotkov wrote:
>>>> On Sun, Feb 15, 2015 at 2:08 PM, Alexander Korotkov
>>>> <aekorotkov(at)gmail(dot)com <mailto:aekorotkov(at)gmail(dot)com>> wrote:
>>>>
>>>> Revised patch with reordering in GiST is attached
>>>> (knn-gist-recheck-in-gist.patch) as well as testing script (test.py).
>>>
>>> I meant to do a bit of testing on this (assuming it's still needed), but
>>> the patches need rebasing - Heikki fixed a few issues, so they don't
>>> apply cleanly.
>>>
>>
>> Both patches are revised.
>>
>
> Both patches are rebased against current master.

This looks pretty much ready. I'm going to spend some time on this on
Friday, and if all looks good, commit. (Thursday's a public holiday here).

One quick comment:

It would be good to avoid the extra comparisons of the distances, when
the index doesn't return any lossy items. As the patch stands, it adds
one extra copyDistances() call and a cmp_distances() call for each tuple
(in a knn-search), even if there are no lossy tuples.

- Heikki


From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: KNN-GiST with recheck
Date: 2015-05-13 19:17:54
Message-ID: CAPpHfdtGyAnZK6fRFAmSyR5+AbccgxTajLvGYzdvpUg70Wsg2A@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, May 13, 2015 at 10:16 PM, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
wrote:

> On 04/17/2015 12:05 PM, Alexander Korotkov wrote:
>
>> On Wed, Feb 25, 2015 at 12:15 PM, Alexander Korotkov <
>> aekorotkov(at)gmail(dot)com>
>> wrote:
>>
>> Hi!
>>>
>>> On Tue, Feb 24, 2015 at 5:39 PM, Tomas Vondra <
>>> tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>>>
>>> On 17.2.2015 14:21, Alexander Korotkov wrote:
>>>>
>>>>> On Sun, Feb 15, 2015 at 2:08 PM, Alexander Korotkov
>>>>> <aekorotkov(at)gmail(dot)com <mailto:aekorotkov(at)gmail(dot)com>> wrote:
>>>>>
>>>>> Revised patch with reordering in GiST is attached
>>>>> (knn-gist-recheck-in-gist.patch) as well as testing script (test.py).
>>>>>
>>>>
>>>> I meant to do a bit of testing on this (assuming it's still needed), but
>>>> the patches need rebasing - Heikki fixed a few issues, so they don't
>>>> apply cleanly.
>>>>
>>>>
>>> Both patches are revised.
>>>
>>>
>> Both patches are rebased against current master.
>>
>
> This looks pretty much ready. I'm going to spend some time on this on
> Friday, and if all looks good, commit. (Thursday's a public holiday here).
>

Very good, thanks!

> One quick comment:
>
> It would be good to avoid the extra comparisons of the distances, when the
> index doesn't return any lossy items. As the patch stands, it adds one
> extra copyDistances() call and a cmp_distances() call for each tuple (in a
> knn-search), even if there are no lossy tuples.

I will fix it until Friday.

------
With best regards,
Alexander Korotkov.


From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: KNN-GiST with recheck
Date: 2015-05-14 10:43:22
Message-ID: CAPpHfdvDk4=VRPUGvLL0rwVTys247shUeckcURQ42Y3-2tidAg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, May 13, 2015 at 10:17 PM, Alexander Korotkov <aekorotkov(at)gmail(dot)com>
wrote:

> One quick comment:
>>
>> It would be good to avoid the extra comparisons of the distances, when
>> the index doesn't return any lossy items. As the patch stands, it adds one
>> extra copyDistances() call and a cmp_distances() call for each tuple (in a
>> knn-search), even if there are no lossy tuples.
>
>
> I will fix it until Friday.
>

Attached patch is rebased against current master. Extra extra
copyDistances() call and a cmp_distances() call for each tuple are avoided
in the case of no lossy tuples.

------
With best regards,
Alexander Korotkov.

Attachment Content-Type Size
knn-gist-recheck-9.patch application/octet-stream 44.4 KB

From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: KNN-GiST with recheck
Date: 2015-05-14 23:28:29
Message-ID: 55552F9D.4090603@iki.fi
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 05/14/2015 01:43 PM, Alexander Korotkov wrote:
> On Wed, May 13, 2015 at 10:17 PM, Alexander Korotkov <aekorotkov(at)gmail(dot)com>
> wrote:
>
>> One quick comment:
>>>
>>> It would be good to avoid the extra comparisons of the distances, when
>>> the index doesn't return any lossy items. As the patch stands, it adds one
>>> extra copyDistances() call and a cmp_distances() call for each tuple (in a
>>> knn-search), even if there are no lossy tuples.
>>
>>
>> I will fix it until Friday.
>>
>
> Attached patch is rebased against current master. Extra extra
> copyDistances() call and a cmp_distances() call for each tuple are avoided
> in the case of no lossy tuples.

Thanks! I spent some time cleaning this up:

* fixed a memory leak

* fixed a silly bug in rechecking multi-column scans

* I restructured the changes to IndexNext. I actually created a whole
separate copy of IndexNext, called IndexNextWithReorder, that is used
when there are ORDER BY expressions that might need to be rechecked.
There is now some duplicated code between them, but I think they are
both easier to understand this way. The IndexNext function is now as
simple as before, and the IndexNextWithReorder doesn't need so many
if()-checks on whether the reorder queue exists at all.

* I renamed Distance to OrderByValues in the executor parts. We call the
"ORDER BY x <-> y" construct an ORDER BY expression, so let's continue
using that terminology.

I think this is now ready for committing, but I'm pretty tired now so
I'll read through this one more time in the morning, so that I won't
wake up to a red buildfarm.

- Heikki


From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: KNN-GiST with recheck
Date: 2015-05-14 23:30:05
Message-ID: 55552FFD.8030808@iki.fi
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 05/15/2015 02:28 AM, Heikki Linnakangas wrote:
> I think this is now ready for committing, but I'm pretty tired now so
> I'll read through this one more time in the morning, so that I won't
> wake up to a red buildfarm.

Forgot to attach the latest patch, here you go.

- Heikki

Attachment Content-Type Size
0001-Allow-GiST-distance-function-to-return-merely-a-lowe.patch application/x-patch 44.0 KB

From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: KNN-GiST with recheck
Date: 2015-05-15 08:31:35
Message-ID: CAPpHfdsGYdUr80JBQ31q-7y87+s8j_t3G3hievHgFvWGQOkxYA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, May 15, 2015 at 2:30 AM, Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:

> On 05/15/2015 02:28 AM, Heikki Linnakangas wrote:
>
>> I think this is now ready for committing, but I'm pretty tired now so
>> I'll read through this one more time in the morning, so that I won't
>> wake up to a red buildfarm.
>>
>
> Forgot to attach the latest patch, here you go.

Looks good for me.

------
With best regards,
Alexander Korotkov.


From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: KNN-GiST with recheck
Date: 2015-05-15 11:48:29
Message-ID: 5555DD0D.8070903@iki.fi
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 05/15/2015 11:31 AM, Alexander Korotkov wrote:
> On Fri, May 15, 2015 at 2:30 AM, Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
>
>> On 05/15/2015 02:28 AM, Heikki Linnakangas wrote:
>>
>>> I think this is now ready for committing, but I'm pretty tired now so
>>> I'll read through this one more time in the morning, so that I won't
>>> wake up to a red buildfarm.
>>>
>>
>> Forgot to attach the latest patch, here you go.
>
>
> Looks good for me.

Ok, pushed after some further minor cleanup.

- Heikki


From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: KNN-GiST with recheck
Date: 2015-05-15 11:49:14
Message-ID: CAPpHfdt5m_V7SjipLmpdGN2+qPXPxmiQ8XddNw9Ey73AroYzOg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, May 15, 2015 at 2:48 PM, Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:

> On 05/15/2015 11:31 AM, Alexander Korotkov wrote:
>
>> On Fri, May 15, 2015 at 2:30 AM, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
>> wrote:
>>
>> On 05/15/2015 02:28 AM, Heikki Linnakangas wrote:
>>>
>>> I think this is now ready for committing, but I'm pretty tired now so
>>>> I'll read through this one more time in the morning, so that I won't
>>>> wake up to a red buildfarm.
>>>>
>>>>
>>> Forgot to attach the latest patch, here you go.
>>>
>>
>>
>> Looks good for me.
>>
>
> Ok, pushed after some further minor cleanup.

Great! Thank you!

------
With best regards,
Alexander Korotkov.


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: Alexander Korotkov <aekorotkov(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: KNN-GiST with recheck
Date: 2015-05-15 12:41:57
Message-ID: 20150515124157.GH31667@momjian.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, May 15, 2015 at 02:48:29PM +0300, Heikki Linnakangas wrote:
> On 05/15/2015 11:31 AM, Alexander Korotkov wrote:
> >On Fri, May 15, 2015 at 2:30 AM, Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
> >
> >>On 05/15/2015 02:28 AM, Heikki Linnakangas wrote:
> >>
> >>>I think this is now ready for committing, but I'm pretty tired now so
> >>>I'll read through this one more time in the morning, so that I won't
> >>>wake up to a red buildfarm.
> >>>
> >>
> >>Forgot to attach the latest patch, here you go.
> >
> >
> >Looks good for me.
>
> Ok, pushed after some further minor cleanup.

Great! That PostGIS workaround they had to use for accurate distances
with CTEs and LIMIT 100 was an ugly hack.

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ Everyone has their own god. +


From: Jim Nasby <Jim(dot)Nasby(at)BlueTreble(dot)com>
To: <hlinnaka(at)iki(dot)fi>, Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: KNN-GiST with recheck
Date: 2015-05-15 21:42:04
Message-ID: 5556682C.5010900@BlueTreble.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 5/14/15 6:30 PM, Heikki Linnakangas wrote:
> On 05/15/2015 02:28 AM, Heikki Linnakangas wrote:
>> I think this is now ready for committing, but I'm pretty tired now so
>> I'll read through this one more time in the morning, so that I won't
>> wake up to a red buildfarm.

If anyone feels motivated to fix, there's a typo in the comment for
IndexNextWithReorder (s/his/this/):
+ * Like IndexNext, but his version can also re-check any

--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com


From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: KNN-GiST with recheck
Date: 2015-05-15 21:48:22
Message-ID: CAPpHfdvj1D-Yq7L3vqqeZ6BZxiEjXGR6SgFvSFuTn7_iFKkntw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, May 15, 2015 at 2:49 PM, Alexander Korotkov <aekorotkov(at)gmail(dot)com>
wrote:

> On Fri, May 15, 2015 at 2:48 PM, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
> wrote:
>
>> On 05/15/2015 11:31 AM, Alexander Korotkov wrote:
>>
>>> On Fri, May 15, 2015 at 2:30 AM, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
>>> wrote:
>>>
>>> On 05/15/2015 02:28 AM, Heikki Linnakangas wrote:
>>>>
>>>> I think this is now ready for committing, but I'm pretty tired now so
>>>>> I'll read through this one more time in the morning, so that I won't
>>>>> wake up to a red buildfarm.
>>>>>
>>>>>
>>>> Forgot to attach the latest patch, here you go.
>>>>
>>>
>>>
>>> Looks good for me.
>>>
>>
>> Ok, pushed after some further minor cleanup.
>
>
> Great! Thank you!
>

BTW, I found that now IndexScan node lackof copy and output support for
indexorderbyops.
Attached patch fixes that. Copy and output functions assume that
indexorderbyops has the same length as indexorderby. In order to make this
more evident I move check for best_path->path.pathkeys in create_plan from
"if" into assertion. AFAICS, pathkeys should always present where there are
indexorderby.

------
With best regards,
Alexander Korotkov.

Attachment Content-Type Size
fix-indexscan-node.patch application/octet-stream 2.8 KB

From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: Jim Nasby <Jim(dot)Nasby(at)BlueTreble(dot)com>, Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: KNN-GiST with recheck
Date: 2015-05-18 07:39:43
Message-ID: 5559973F.3070008@iki.fi
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 05/16/2015 12:42 AM, Jim Nasby wrote:
> On 5/14/15 6:30 PM, Heikki Linnakangas wrote:
>> On 05/15/2015 02:28 AM, Heikki Linnakangas wrote:
>>> I think this is now ready for committing, but I'm pretty tired now so
>>> I'll read through this one more time in the morning, so that I won't
>>> wake up to a red buildfarm.
>
> If anyone feels motivated to fix, there's a typo in the comment for
> IndexNextWithReorder (s/his/this/):
> + * Like IndexNext, but his version can also re-check any

Fixed, thanks.

- Heikki