Re: 9.6 phrase search distance specification

From: Ryan Pedela <rpedela(at)datalanche(dot)com>
To: obartunov(at)gmail(dot)com
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Bruce Momjian <bruce(at)momjian(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 9.6 phrase search distance specification
Date: 2016-08-11 16:42:48
Message-ID: CACu89FTgqJyeKCDG1+PqNhJzhO5ywuhfLGxM3nwusL3WVstQKQ@mail.gmail.com
Lists: pgsql-hackers

On Thu, Aug 11, 2016 at 9:27 AM, Oleg Bartunov <obartunov(at)gmail(dot)com> wrote:

> On Tue, Aug 9, 2016 at 9:59 PM, Ryan Pedela <rpedela(at)datalanche(dot)com>
> wrote:
> >
> >
>
> > I would say that it is worth it to have a "phrase slop" operator (Apache
> > Lucene terminology). Proximity search is extremely useful for improving
> > relevance and phrase slop is one of the tools to achieve that.
> >
>
> It'd be great if you explain what is "phrase slop". I assume it's not
> about search, but about relevance.
>

Sure. An exact phrase query has slop = 0, which means find all terms in the
exact positions relative to each other. A phrase query with slop > 0 means
find all terms within <slop> positions of each other. If slop = 10, find all
terms within 10 positions of each other. Here is a concrete example from my
current work searching SEC filings.

Bill Gates' full legal name is William H. Gates, III. In the SEC database
[1], his name is GATES WILLIAM H III. Most users searching the records of
people in the SEC database for Bill Gates will type "bill gates". Since
there are many people with the first name Bill (William) and the last name
Gates, Bill Gates most likely won't be the first result with a standard
keyword query. Likewise, an exact phrase query (slop = 0) will not find him
either, because the first and last names are transposed. What you need is a
phrase query with slop = 2, which will match "William Gates",
"William H Gates", "Gates William", etc. There is still the issue of Bill
vs. William, but that can be solved with synonyms and is a different topic.
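
To make the slop arithmetic concrete, here is a minimal Python sketch of
sloppy phrase matching. It is an illustration only, not Lucene's or
PostgreSQL's actual implementation. It assumes whitespace tokenization,
case-insensitive comparison, and a distance defined as the spread of
(document position - query position) across the query terms. The names
min_slop and sloppy_match are made up for this example, and repeated query
terms are not handled specially.

```python
from itertools import product


def min_slop(query, document):
    """Return the smallest slop at which `query` matches `document`,
    or None if some query term does not occur in the document.

    Assumption: distance is the spread of (document position - query
    position) over the query terms, a simplified model of phrase slop.
    """
    q_terms = query.lower().split()
    d_terms = document.lower().split()

    # For each query term, collect every position where it occurs.
    positions = []
    for term in q_terms:
        hits = [i for i, t in enumerate(d_terms) if t == term]
        if not hits:
            return None
        positions.append(hits)

    # Brute force: try every way of choosing one position per term.
    best = None
    for combo in product(*positions):
        offsets = [p - q for q, p in enumerate(combo)]
        spread = max(offsets) - min(offsets)
        if best is None or spread < best:
            best = spread
    return best


def sloppy_match(query, document, slop):
    """True if `query` matches `document` within `slop` positions."""
    needed = min_slop(query, document)
    return needed is not None and needed <= slop
```

Under these assumptions, the query "william gates" needs slop 0 against
"William Gates", slop 1 against "William H Gates", and slop 2 against
"GATES WILLIAM H III", so slop = 2 covers all three. The query "bill gates"
still fails against the SEC form of the name because "bill" never occurs,
which is the synonym problem mentioned above.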

1. https://www.sec.gov/cgi-bin/browse-edgar?CIK=902012&owner=exclude&action=getcompany&Find=Search

Thanks,
Ryan
