Re: KNN-GiST with recheck

From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Bruce Momjian <bruce(at)momjian(dot)us>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: Emre Hasegeli <emre(at)hasegeli(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: KNN-GiST with recheck
Date: 2014-09-29 15:30:26
Message-ID: CAPpHfdvdL0dtoNBmbCsJ=W0bFE7A87hLOtoBN7Pe4hnXkQ9LxA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Sep 29, 2014 at 6:16 AM, Bruce Momjian <bruce(at)momjian(dot)us> wrote:

> On Fri, Sep 26, 2014 at 10:49:42AM +0400, Alexander Korotkov wrote:
> > Does this also fix the identical PostGIS problem or is there
> something
> > PostGIS needs to do?
> >
> >
> > This patch provides general infrastructure for recheck in KNN-GiST.
> PostGIS
> > need corresponding change in its GiST opclass. Since PostGIS already
> define <->
> > and <#> operators as distance to bounding box border and bounding box
> center,
> > it can't change their behaviour.
> > it has to support new operator "exact distance" in opclass.
>
> Ah, OK, so they just need something that can be used for the recheck. I
> think they currently use ST_Distance() for that. Does it have to be an
> operator? If they defined an operator for ST_Distance(), would
> ST_Distance() work too for KNN-GiST?
>

Currently, ST_Distance is a function, but it's no problem to make it an
operator with KNN-GiST support.

> In summary, you still create a normal GiST index on the column:
>
>
> http://shisaa.jp/postset/postgis-postgresqls-spatial-partner-part-3.html
>
> CREATE INDEX planet_osm_line_ref_index ON planet_osm_line(ref);
>
> which indexes by the bounding box. The new code will allow ordered
> index hits to be filtered by something like ST_Distance(), rather than
> having to a LIMIT 50 in a CTE, then call ST_Distance(), like this:
>
> EXPLAIN ANALYZE WITH distance AS (
> SELECT way AS road, ref AS route
> FROM planet_osm_line
> WHERE highway = 'secondary'
> ORDER BY ST_GeomFromText('POLYGON((14239931.42
> 3054117.72,14239990.49 3054224.25,14240230.15 3054091.38,14240171.08
> 3053984.84,14239931.42 3054117.72))', 900913) <#> way
> LIMIT 50
> )
> SELECT ST_Distance(ST_GeomFromText('POLYGON((14239931.42
> 3054117.72,14239990.49 3054224.25,14240230.15 3054091.38,14240171.08
> 3053984.84,14239931.42 3054117.72))', 900913), road) AS true_distance, route
> FROM distance
> ORDER BY true_distance
> LIMIT 1;
>

Yeah. It this query 50 is pure empirical value. It could be both too low or
too high. Too low value can cause wrong query answers. Too high value can
cause lower performance. With patch simple KNN query will work like this
query with always right value in LIMIT clause.

> Notice the CTE uses <#> (bounding box center), and then the outer query
> uses ST_Distance and LIMIT 1 to find the closest item.
>
> Excellent!

Thanks. The main question now is design of this patch. Currently, it does
all the work inside access method. We already have some discussion of pro
and cons of this method. I would like to clarify alternatives now. I can
see following way:

1. Implement new executor node which performs sorting by priority queue.
Let's call it "Priority queue". I think it should be separate node from
"Sort" node. Despite "Priority queue" and "Sort" are essentially similar
from user view, they would be completely different in implementation.
2. Implement some interface to transfer distance values from access
method to "Priority queue" node.
3. Somehow tell the planner that it could use "Priority queue" in
corresponding cases. I see two ways of doing this:
- Add flag to operator in opclass indicating that index can only
order by lower bound of "col op value", not by "col op value" itself.
- Define new relation between operators. Value of one operator could
be lower bound for value of another operator. So, planner can
put "Priority
queue" node when lower bound ordering is possible from index. Also "ALTER
OPERATOR" command would be reasonable, so extensions could upgrade.

Besides overhead, this way makes significant infrastructural changes. So,
it may be over-engineering. However, it's probably more clean and beautiful
solution.
I would like to get some feedback from people familiar with KNN-GiST like
Heikki or Tom. What do you think about this? Any other ideas?

------
With best regards,
Alexander Korotkov.

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2014-09-29 15:31:46 Re: INSERT ... ON CONFLICT {UPDATE | IGNORE}
Previous Message Andres Freund 2014-09-29 15:29:11 Re: [REVIEW] Re: Compression of full-page-writes