Re: fts, compond words?

Lists: pgsql-general
From: Marcus Engene <mengpg(at)engene(dot)se>
To: POSTGRESQL <pgsql-general(at)postgresql(dot)org>
Subject: fts, compond words?
Date: 2005-12-05 11:40:10
Message-ID: 4394271A.5040703@engene.se
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

Hi,

I use the tsearch full text search with pg 8.0.3. It works great, but I
wonder if it's possible to search for compound words?
Ie if I search for "New York" i want to get a match on
New York has traffic problems.
but not on
New axe murderer incident in brittish York.

Is this possible?

I don't use any wrapper, just
select
...
from
...
where
idxfti @@ to_tsquery('default', 'searchstring')

Thanks,
Marcus


From: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
To: Marcus Engene <mengpg(at)engene(dot)se>
Cc: POSTGRESQL <pgsql-general(at)postgresql(dot)org>
Subject: Re: fts, compond words?
Date: 2005-12-05 12:08:16
Message-ID: Pine.GSO.4.63.0512051507040.13553@ra.sai.msu.su
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

On Mon, 5 Dec 2005, Marcus Engene wrote:

> Hi,
>
> I use the tsearch full text search with pg 8.0.3. It works great, but I
> wonder if it's possible to search for compound words?
> Ie if I search for "New York" i want to get a match on
> New York has traffic problems.
> but not on
> New axe murderer incident in brittish York.
>
> Is this possible?
>
> I don't use any wrapper, just
> select
> ...
> from
> ...
> where
> idxfti @@ to_tsquery('default', 'searchstring')

ranking function is what you need. Read documentation.

>
> Thanks,
> Marcus
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 1: if posting/reading through Usenet, please send an appropriate
> subscribe-nomail command to majordomo(at)postgresql(dot)org so that your
> message can get through to the mailing list cleanly
>

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg(at)sai(dot)msu(dot)su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83


From: Marcus Engene <mengpg(at)engene(dot)se>
To: POSTGRESQL <pgsql-general(at)postgresql(dot)org>
Subject: Re: fts, compond words?
Date: 2005-12-05 20:38:55
Message-ID: 4394A55F.7060504@engene.se
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

Oleg Bartunov wrote:
> On Mon, 5 Dec 2005, Marcus Engene wrote:
>
>> Hi,
>>
>> I use the tsearch full text search with pg 8.0.3. It works great, but
>> I wonder if it's possible to search for compound words?
>> Ie if I search for "New York" i want to get a match on
>> New York has traffic problems.
>> but not on
>> New axe murderer incident in brittish York.
>>
>> Is this possible?
>>
>> I don't use any wrapper, just
>> select
>> ...
>> from
>> ...
>> where
>> idxfti @@ to_tsquery('default', 'searchstring')
>
>
>
> ranking function is what you need. Read documentation.
>

Hi,

I realized from the documentation that I'm not looking for
compound words after all, I meant "exact phrase".

I can't see how to make rank tell me which results has an
exact phrase? Like "there must be a occurence of 'new' before
'york'" (stemmed not really exact phrase)?

Is there something new in rank for pg 8.1?

Thanks!
Marcus


From: Mike Rylander <mrylander(at)gmail(dot)com>
To: Marcus Engene <mengpg(at)engene(dot)se>
Cc: POSTGRESQL <pgsql-general(at)postgresql(dot)org>
Subject: Re: fts, compond words?
Date: 2005-12-05 21:55:52
Message-ID: b918cf3d0512051355q12b0f092g52228d89756732ab@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

On 12/5/05, Marcus Engene <mengpg(at)engene(dot)se> wrote:
> Oleg Bartunov wrote:
> > On Mon, 5 Dec 2005, Marcus Engene wrote:
> >
> >> Hi,
> >>
> >> I use the tsearch full text search with pg 8.0.3. It works great, but
> >> I wonder if it's possible to search for compound words?
> >> Ie if I search for "New York" i want to get a match on
> >> New York has traffic problems.
> >> but not on
> >> New axe murderer incident in brittish York.
> >>
> >> Is this possible?
> >>
> >> I don't use any wrapper, just
> >> select
> >> ...
> >> from
> >> ...
> >> where
> >> idxfti @@ to_tsquery('default', 'searchstring')
> >
> >
> >
> > ranking function is what you need. Read documentation.
> >
>
> Hi,
>
> I realized from the documentation that I'm not looking for
> compound words after all, I meant "exact phrase".
>
> I can't see how to make rank tell me which results has an
> exact phrase? Like "there must be a occurence of 'new' before
> 'york'" (stemmed not really exact phrase)?
>

What you'll want to do is check the original text for the exact phrase
after the tsearch2 index has given you some targets.

Given table foo:

CREATE TABLE foo (
id serial primary key,
txt text,
ts2 tsvector
);

use query:

SELECT id FROM foo WHERE ts2 @@ to_tsquery('new&york') AND txt ILIKE
'%new york%';

You can get rid of the '%'s if you want the entire txt column to match
the search phrase.

--
Mike Rylander
mrylander(at)gmail(dot)com
GPLS -- PINES Development
Database Developer
http://open-ils.org


From: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
To: Marcus Engene <mengpg(at)engene(dot)se>
Cc: POSTGRESQL <pgsql-general(at)postgresql(dot)org>
Subject: Re: fts, compond words?
Date: 2005-12-06 02:12:12
Message-ID: Pine.GSO.4.63.0512060510240.13553@ra.sai.msu.su
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

On Mon, 5 Dec 2005, Marcus Engene wrote:

>
> I realized from the documentation that I'm not looking for
> compound words after all, I meant "exact phrase".
>
> I can't see how to make rank tell me which results has an
> exact phrase? Like "there must be a occurence of 'new' before
> 'york'" (stemmed not really exact phrase)?

http://www.sai.msu.su/~megera/oddmuse/index.cgi/Tsearch_V2_Notes

Phrase search
This tip is by Mike Rylander

To do phrase searching just add an additional WHERE clause to your query:

SELECT id FROM tab WHERE ts_idx_col @@ to_tsquery('history&lesson')
AND text_col ~* '.*history\\s+lesson.*';

The full-text index will still be used, and the regex will be used to
prune the results afterwards.

>
> Is there something new in rank for pg 8.1?

it has some improving, but not for your case.

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg(at)sai(dot)msu(dot)su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83


From: Marcus Engene <mengpg(at)engene(dot)se>
To: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>, pgsql-general(at)postgresql(dot)org
Subject: Re: fts, compond words?
Date: 2005-12-06 11:44:27
Message-ID: 4395799B.7080406@engene.se
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

Oleg Bartunov wrote:
> On Mon, 5 Dec 2005, Marcus Engene wrote:
>> I realized from the documentation that I'm not looking for
>> compound words after all, I meant "exact phrase".
>>
>> I can't see how to make rank tell me which results has an
>> exact phrase? Like "there must be a occurence of 'new' before
>> 'york'" (stemmed not really exact phrase)?
>
> http://www.sai.msu.su/~megera/oddmuse/index.cgi/Tsearch_V2_Notes
>
> Phrase search
> This tip is by Mike Rylander
>
> To do phrase searching just add an additional WHERE clause to your query:
>
> SELECT id FROM tab WHERE ts_idx_col @@ to_tsquery('history&lesson')
> AND text_col ~* '.*history\\s+lesson.*';
>
> The full-text index will still be used, and the regex will be used to
> prune the results afterwards.

Hi,

Thanks for the answer, Oleg and Mike.

This, I guess, will be problematic in a query like
A & (B | C)
or a more complex expression.

say C is "New York" and that tsearch receives

A & (B | (new & york))

I cannot just add the regexp afterwards. What if B is true?
What would be nice to have, given ofcourse the index isn't stripped
is something like

A & (B | (New OperatorTheNextWordMustFollow York))

Would something like that be doable? Right now, intuitively, it would be
two trees in the where clause:
tsearch(A & B) OR
(tsearch (A & C) AND regexpmatch(C))
..and a nightmare in complex queries.

Best regards,
Marcus


From: Mike Rylander <mrylander(at)gmail(dot)com>
To: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>, pgsql-general(at)postgresql(dot)org
Subject: Re: fts, compond words?
Date: 2005-12-07 16:54:43
Message-ID: b918cf3d0512070854h439fdec8pce7c3dd9e587efcf@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

On 12/6/05, Marcus Engene <mengpg(at)engene(dot)se> wrote:

[snip]

> A & (B | (New OperatorTheNextWordMustFollow York))
>

Actually, I love that idea. Oleg, would it be possible to create a
tsquery operator that understands proximity? Or, how allowing a
predicate to the current '&' op, as in '&[dist<=1]' meaning "next
token follows with a max distance of 1". I imagine that it would
only be useful on unstripped tsvectors, but if the lexem position is
already stored ...

--
Mike Rylander
mrylander(at)gmail(dot)com
GPLS -- PINES Development
Database Developer
http://open-ils.org


From: "Andrew J(dot) Kopciuch" <akopciuch(at)bddf(dot)ca>
To: pgsql-general(at)postgresql(dot)org
Cc: Mike Rylander <mrylander(at)gmail(dot)com>, Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
Subject: Re: fts, compond words?
Date: 2005-12-07 17:21:04
Message-ID: 200512071021.05321.akopciuch@bddf.ca
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

> > A & (B | (New OperatorTheNextWordMustFollow York))
>

I had thought about this before myself. Alas I have never had the time to
properly investigate implementing such a feature.

:(

A & (B | (New + York))

Something like that?

> Actually, I love that idea. Oleg, would it be possible to create a
> tsquery operator that understands proximity? Or, how allowing a
> predicate to the current '&' op, as in '&[dist<=1]' meaning "next
> token follows with a max distance of 1". I imagine that it would
> only be useful on unstripped tsvectors, but if the lexem position is
> already stored ...
>

Would the proximity go in both directions? Or just forward? What about tokens
that come before? Just a thought.

Andy


From: Teodor Sigaev <teodor(at)sigaev(dot)ru>
To: Mike Rylander <mrylander(at)gmail(dot)com>
Cc: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>, pgsql-general(at)postgresql(dot)org
Subject: Re: fts, compond words?
Date: 2005-12-07 17:33:17
Message-ID: 43971CDD.2080208@sigaev.ru
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

That is a long discussed thing. We can't formulate unconflicting rules... For
example:
1) a &[dist<=2] ( b &[dist<=3] c )
2) a &[dist<=2] ( b |[dist<=3] c )
3) a &[dist<=2] !c
4) a &[dist<=2] ( b |[dist<=3] !c )
5) a &[dist<=2] ( b & c )
What does exact they mean? What is tsvectors which should be matched by those
queries?

The simple solution is : under operation 'phrase search' (ok, it will be '+'
below) it must be only 'phrase search operations. I.e.:
a | b ( c + ( d + e ) ) - good
a | ( c + ( d & g ) ) - bad.

For example, we have word 'foonish' and after lexize we got two lexemes: 'foo1'
and 'foo2'. So a good query 'a + foonish' becomes 'a + ( foo1 | foo2 )'...

Mike Rylander wrote:
> On 12/6/05, Marcus Engene <mengpg(at)engene(dot)se> wrote:
>
> [snip]
>
>
>> A & (B | (New OperatorTheNextWordMustFollow York))
>>
>
>
> Actually, I love that idea. Oleg, would it be possible to create a
> tsquery operator that understands proximity? Or, how allowing a
> predicate to the current '&' op, as in '&[dist<=1]' meaning "next
> token follows with a max distance of 1". I imagine that it would
> only be useful on unstripped tsvectors, but if the lexem position is
> already stored ...
>
> --
> Mike Rylander
> mrylander(at)gmail(dot)com
> GPLS -- PINES Development
> Database Developer
> http://open-ils.org
>
> ---------------------------(end of broadcast)---------------------------
> TIP 2: Don't 'kill -9' the postmaster

--
Teodor Sigaev E-mail: teodor(at)sigaev(dot)ru
WWW: http://www.sigaev.ru/


From: Mike Rylander <mrylander(at)gmail(dot)com>
To: Teodor Sigaev <teodor(at)sigaev(dot)ru>, POSTGRESQL <pgsql-general(at)postgresql(dot)org>
Subject: Re: fts, compond words?
Date: 2005-12-07 18:20:32
Message-ID: b918cf3d0512071020n2877e80kfebdc9f7533ed956@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

On 12/7/05, Teodor Sigaev <teodor(at)sigaev(dot)ru> wrote:
> That is a long discussed thing. We can't formulate unconflicting rules... For
> example:
> 1) a &[dist<=2] ( b &[dist<=3] c )
> 2) a &[dist<=2] ( b |[dist<=3] c )
> 3) a &[dist<=2] !c
> 4) a &[dist<=2] ( b |[dist<=3] !c )
> 5) a &[dist<=2] ( b & c )
> What does exact they mean? What is tsvectors which should be matched by those
> queries?

1,2,4, and 5 are obviously ambiguous, but 3 seems straightforward to
me, if not more difficult to implement. Would it not be acceptable to
say that proximity modifiers are only valid between two simple lexemes
and can not be placed next to any compound expression?

>
> The simple solution is : under operation 'phrase search' (ok, it will be '+'
> below) it must be only 'phrase search operations. I.e.:
> a | b ( c + ( d + e ) ) - good
> a | ( c + ( d & g ) ) - bad.
>

Same as above. And, while '+' would be a very good shortcut for
"&[follows;dist=1]" (or some such), I think the user should be able to
specify the proximity more explicitly as well.

> For example, we have word 'foonish' and after lexize we got two lexemes: 'foo1'
> and 'foo2'. So a good query 'a + foonish' becomes 'a + ( foo1 | foo2 )'...
>

hrm... that is a problem. Though, I think that's a case of how the
compiled expression is built from user input. Unless I'm mistaken

a + ( foo1 | foo2 )

is exactly equal to

(a + foo1) | (a + foo2)

Ahhh... but then there is the more complex example of

a + foonish + bar

becoming

a + (foo1 | foo2) + bar

.... but I guess that could be

(a + foo1 + bar) | (a + foo2 + bar)

>
>
>
>
> Mike Rylander wrote:
> > On 12/6/05, Marcus Engene <mengpg(at)engene(dot)se> wrote:
> >
> > [snip]
> >
> >
> >> A & (B | (New OperatorTheNextWordMustFollow York))
> >>
> >
> >
> > Actually, I love that idea. Oleg, would it be possible to create a
> > tsquery operator that understands proximity? Or, how allowing a
> > predicate to the current '&' op, as in '&[dist<=1]' meaning "next
> > token follows with a max distance of 1". I imagine that it would
> > only be useful on unstripped tsvectors, but if the lexem position is
> > already stored ...
> >
> > --
> > Mike Rylander
> > mrylander(at)gmail(dot)com
> > GPLS -- PINES Development
> > Database Developer
> > http://open-ils.org
> >
> > ---------------------------(end of broadcast)---------------------------
> > TIP 2: Don't 'kill -9' the postmaster
>
> --
> Teodor Sigaev E-mail: teodor(at)sigaev(dot)ru
> WWW: http://www.sigaev.ru/
>

--
Mike Rylander
mrylander(at)gmail(dot)com
GPLS -- PINES Development
Database Developer
http://open-ils.org


From: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
To: "Andrew J(dot) Kopciuch" <akopciuch(at)bddf(dot)ca>
Cc: pgsql-general(at)postgresql(dot)org, Mike Rylander <mrylander(at)gmail(dot)com>
Subject: Re: fts, compond words?
Date: 2005-12-07 19:50:52
Message-ID: Pine.GSO.4.63.0512072248520.13553@ra.sai.msu.su
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

As Teodor already pointed there is no non-ambiguous solution, or
at least, we don't know it.

Oleg

On Wed, 7 Dec 2005, Andrew J. Kopciuch wrote:

>>> A & (B | (New OperatorTheNextWordMustFollow York))
>>
>
> I had thought about this before myself. Alas I have never had the time to
> properly investigate implementing such a feature.
>
> :(
>
> A & (B | (New + York))
>
> Something like that?
>
>> Actually, I love that idea. Oleg, would it be possible to create a
>> tsquery operator that understands proximity? Or, how allowing a
>> predicate to the current '&' op, as in '&[dist<=1]' meaning "next
>> token follows with a max distance of 1". I imagine that it would
>> only be useful on unstripped tsvectors, but if the lexem position is
>> already stored ...
>>
>
> Would the proximity go in both directions? Or just forward? What about tokens
> that come before? Just a thought.
>
>
>
> Andy
>
> ---------------------------(end of broadcast)---------------------------
> TIP 4: Have you searched our list archives?
>
> http://archives.postgresql.org
>

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg(at)sai(dot)msu(dot)su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83


From: Jeffrey Melloy <jmelloy(at)visualdistortion(dot)org>
To: Mike Rylander <mrylander(at)gmail(dot)com>
Cc: Teodor Sigaev <teodor(at)sigaev(dot)ru>, POSTGRESQL <pgsql-general(at)postgresql(dot)org>
Subject: Re: fts, compond words?
Date: 2005-12-07 22:37:30
Message-ID: 4397642A.50406@visualdistortion.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

Mike Rylander wrote:

>>
>>Mike Rylander wrote:
>>
>>
>>>On 12/6/05, Marcus Engene <mengpg(at)engene(dot)se> wrote:
>>>
>>>[snip]
>>>
>>>
>>>
>>>
>>>> A & (B | (New OperatorTheNextWordMustFollow York))
>>>>
>>>>
>>>>
>>>Actually, I love that idea. Oleg, would it be possible to create a
>>>tsquery operator that understands proximity? Or, how allowing a
>>>predicate to the current '&' op, as in '&[dist<=1]' meaning "next
>>>token follows with a max distance of 1". I imagine that it would
>>>only be useful on unstripped tsvectors, but if the lexem position is
>>>already stored ...
>>>
>>>
This might not be a solution in the longer term, but what I do for that
type of thing is

idxfti @@ '(a&b)' and message ~* 'a b'

Postgres is smart enough to use the results of the GIST index and go
from there with the message scanning.

Jeff


From: Teodor Sigaev <teodor(at)sigaev(dot)ru>
To: Mike Rylander <mrylander(at)gmail(dot)com>
Cc: POSTGRESQL <pgsql-general(at)postgresql(dot)org>
Subject: Re: fts, compond words?
Date: 2005-12-08 10:33:11
Message-ID: 43980BE7.9000601@sigaev.ru
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

> hrm... that is a problem. Though, I think that's a case of how the
> compiled expression is built from user input. Unless I'm mistaken
>
> a + ( foo1 | foo2 )
>
> is exactly equal to
>
> (a + foo1) | (a + foo2)
>
>
> Ahhh... but then there is the more complex example of
>
> a + foonish + bar
>
> becoming
>
> a + (foo1 | foo2) + bar
>
> .... but I guess that could be
>
> (a + foo1 + bar) | (a + foo2 + bar)

That a simple case, what about languages as norwegian or german? They has
compound words and ispell dictionary can split them to lexemes. But, usialy
there is more than one variant of separation:

forbruksvaremerkelov
forbruk vare merke lov
forbruk vare merkelov
forbruk varemerke lov
forbruk varemerkelov
forbruksvare merke lov
forbruksvare merkelov
(notice: I don't know translation, just an example. When we working on compound
word support we found word which has 24 variant of separation!!)

So, query 'a + forbruksvaremerkelov' will be awful:

a + ( (forbruk & vare & merke & lov) | (forbruk & vare & merkelov) | ... )

Of course, that is examle just from mind, but solution of phrase search should
work reasonably with such corner cases.

--
Teodor Sigaev E-mail: teodor(at)sigaev(dot)ru
WWW: http://www.sigaev.ru/


From: Mike Rylander <mrylander(at)gmail(dot)com>
To: Teodor Sigaev <teodor(at)sigaev(dot)ru>
Cc: POSTGRESQL <pgsql-general(at)postgresql(dot)org>
Subject: Re: fts, compond words?
Date: 2005-12-08 16:09:51
Message-ID: b918cf3d0512080809s1ecb1b2fn318ec886dbb1436e@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

On 12/8/05, Teodor Sigaev <teodor(at)sigaev(dot)ru> wrote:
> > (a + foo1 + bar) | (a + foo2 + bar)
>
> That a simple case, what about languages as norwegian or german? They has
> compound words and ispell dictionary can split them to lexemes. But, usialy
> there is more than one variant of separation:
>
> forbruksvaremerkelov
> forbruk vare merke lov
> forbruk vare merkelov
> forbruk varemerke lov
> forbruk varemerkelov
> forbruksvare merke lov
> forbruksvare merkelov
> (notice: I don't know translation, just an example. When we working on compound
> word support we found word which has 24 variant of separation!!)
>
> So, query 'a + forbruksvaremerkelov' will be awful:
>
> a + ( (forbruk & vare & merke & lov) | (forbruk & vare & merkelov) | ... )
>
> Of course, that is examle just from mind, but solution of phrase search should
> work reasonably with such corner cases.
>

WARNING: What follows is wild, hand waving speculation as I don't
fully understand the implications of compound words! ;-)

My naive impression is that it would be both possible and a good idea
to stem any compound words to their versions containing the most
individual lexemes. As an analogy, this would be similar to
transforming composed (Normalization Form C) UTF-8 characters into
their decomposed (Normalization Form D) versions.

From your example above, the stemmed version of 'forbrukvaremerkelov'
would always decompose to 'forbruk vare merke lov', both for indexing
and in to_tsquery(). For the purposes of phrase searching, or more
generally proximity searching, the compiled query

a + forbrukvaremerkelov

might look something like

a + forbruk + vare + merke + lov

and that's it ... all parts of the compound word are required, and
required to be in that order, for the "phrase" search to be valid. A
compiled query like

a + (forbruk & vare & merke & lov)

wouldn't be valid anyway, because the user wants the entire compound
word to be adjacent to 'a', and the bare '&' op would allow any of the
parts to exist anywhere in the document ... or am I missing something?
(I probably am.)

The point is, once you go into an order-and-distance mode for two user
supplied words (pre-stemming) you have to apply that mode to the
entire set of stemmed lexemes that are involved in the "phrase". If
that assumption, that "user requested order and distance" uses a
different set of operators than free-form full text searching, then I
think it's doable. Each sub-statement that comprises a phrase search
is an atomic unit, and can be applied anywhere within the global
compiled query.

[Thinking ...]

Starting from that assumption, take the example of

a + foonish & bar

The implication of the above assumption is that the '+' (or
'&[follows;dist=1]') operator has higher precedence than a bare '&'
operator. So, the next version of the query, before compilation is
complete, might look like:

(a + foonish) & bar

Then we go through these steps:

(a + (foo1 | foo2)) & bar #decompose compound and multi-stem words
( (a + foo1) | (a + foo2) ) & bar # create multiple atoms for
multi-stem words

The end result is both non-ambiguous and reflects the most likely user
intended query. Let's try it with a compound word /and/ a multi-stem
word, remembering that "phrase operators" are only allowed between
simple query terms, not compound terms (grouped terms):

1) a & (foonish + forbrukvaremerkelov) & ! bar # user supplied query

2) a & ( (foo1 | foo2) + forbrukvaremerkelov) & ! bar # decompose
multi-stem words

3) a & ( (foo1 + forbrukvaremerkelov) | (foo2 + forbrukvaremerkelov)
) & ! bar # make multiple atoms from multi-stemmed words involved in
phrases (this creates 1 atom per stem per multi-stem word, and yes,
that could get very big... but, IMHO, slow but working corner cases
are OK)

4) a & ( (foo1 + forbruk + vare + merke + lov) | (foo2 + forbruk +
vare + merke + lov) ) & ! bar # explode the compound words to their
"decomposed" form, because that's what ought to be in the indexed data

That meets the same criteria as the simpler example above, and I've
not said anything about compound and multi-stem word outside the
"phrase mode" portion of the query because the current behaviour is
what we want in those cases.

>
>
> --
> Teodor Sigaev E-mail: teodor(at)sigaev(dot)ru
> WWW: http://www.sigaev.ru/
>

--
Mike Rylander
mrylander(at)gmail(dot)com
GPLS -- PINES Development
Database Developer
http://open-ils.org


From: Marcus Engene <mengpg(at)engene(dot)se>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: fts, compond words?
Date: 2005-12-12 12:39:53
Message-ID: 439D6F99.7070809@engene.se
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

> That a simple case, what about languages as norwegian or german? They
> has compound words and ispell dictionary can split them to lexemes.
> But, usialy there is more than one variant of separation:
>
> forbruksvaremerkelov
> forbruk vare merke lov
> forbruk vare merkelov
> forbruk varemerke lov
> forbruk varemerkelov
> forbruksvare merke lov
> forbruksvare merkelov
> (notice: I don't know translation, just an example. When we working
on > compound word support we found word which has 24 variant of
> separation!!)
>
> So, query 'a + forbruksvaremerkelov' will be awful:
>
> a + ( (forbruk & vare & merke & lov) | (forbruk & vare & merkelov) |
... )
>
> Of course, that is examle just from mind, but solution of phrase
> search should work reasonably with such corner cases.

(Sorry for replying in the wrong place in the thread, I was away for a
trip and unsubscribed meanwhile)

I'm a swede and swedish is similair to norweigan and german. Take this
example:

lång hårig kvinna
långhårig kvinna

Words are put together to make a new word with different meaning. The
first example means "tall hairy woman" and the second is "woman with
long hair". If I would be on f.ex a date site, I'd want the distinction.
;-) If not, i should enter both strings
("lång hårig" | långhårig) & kvinna
...which is perfectly acceptable.

IMHO I don't see any point in splitting these words.

Let's go back to the subject, what about a syntax like this:

idxfti @@ to_tsquery('default', 'pizza & (Chicago | [New York]')

Ie the exact match string is always atomic. Wouldn't that be doable
without any logical implications?

Best regards,
Marcus


From: Mike Rylander <mrylander(at)gmail(dot)com>
To: Marcus Engene <mengpg(at)engene(dot)se>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: fts, compond words?
Date: 2005-12-12 14:43:14
Message-ID: b918cf3d0512120643r50aab029p83950651c1eb01ec@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

On 12/12/05, Marcus Engene <mengpg(at)engene(dot)se> wrote:
> > That a simple case, what about languages as norwegian or german? They
> > has compound words and ispell dictionary can split them to lexemes.
> > But, usialy there is more than one variant of separation:
> >
> > forbruksvaremerkelov
> > forbruk vare merke lov
> > forbruk vare merkelov
> > forbruk varemerke lov
> > forbruk varemerkelov
> > forbruksvare merke lov
> > forbruksvare merkelov
> > (notice: I don't know translation, just an example. When we working
> on > compound word support we found word which has 24 variant of
> > separation!!)
> >
> > So, query 'a + forbruksvaremerkelov' will be awful:
> >
> > a + ( (forbruk & vare & merke & lov) | (forbruk & vare & merkelov) |
> ... )
> >
> > Of course, that is examle just from mind, but solution of phrase
> > search should work reasonably with such corner cases.
>
> (Sorry for replying in the wrong place in the thread, I was away for a
> trip and unsubscribed meanwhile)
>
> I'm a swede and swedish is similair to norweigan and german. Take this
> example:
>
> lång hårig kvinna
> långhårig kvinna
>
> Words are put together to make a new word with different meaning. The
> first example means "tall hairy woman" and the second is "woman with
> long hair". If I would be on f.ex a date site, I'd want the distinction.
> ;-) If not, i should enter both strings
> ("lång hårig" | långhårig) & kvinna
> ...which is perfectly acceptable.

Well, that certainly kills my initial naive implementation plan! :-)
Thank you for the explanation.

[thinking]

Well, if compound words should always be treated as the user has
inserted them then it seems that the current implementation may be
doing the wrong thing with regard to stemming compound words. If the
compound words are being decomposed to constituent stems then you'd be
getting semantically, or at least contextually, incorrect results,
right? (Again, not an expert here. :-) )

[thinking more...]

So, assuming that compound words should not be fully stemmed, due to
the way they are used to create new words with different meanings, if
step (4) were removed from my earlier plan then everything would
continue to work as proposed.

>
> IMHO I don't see any point in splitting these words.
>
>
> Let's go back to the subject, what about a syntax like this:
>
> idxfti @@ to_tsquery('default', 'pizza & (Chicago | [New York]')
>
> Ie the exact match string is always atomic. Wouldn't that be doable
> without any logical implications?
>

I think there are several ways that phrase matching can be done in a
logically consistent way. That is certainly one of them, and takes
the focus off a single infix operator. TS2 already recognises
grouping operations via parens, and restricting brackets ([,]) to
surrounding only simple expressions (no '&', '|', '!' or '()')
shouldn't be too hard. However, I'd still prefer that proximity
searches could be specified more explicitly by the user. Using the
above example:

pizza & (Chicago | [New York])

becomes

pizza & (Chicago | New + Your)

which is implicitly

pizza & (Chicago | New +{follows;dist=1} York)

and that is read as: "Pizza, and chicago, or new followed by york at
a distance of 1."

where the modifier to the '+' operator could be specified by the user
initially if desired.

While I understand and agree that "phrase searching" would be the most
common use for proximity+direction operator modifiers, I see things
like the '+' operator and '[]' groupings as special cases of the more
generalized restriction operation (or set thereof) based on the
positional information recorded in (unstripped) indexes.

Thoughts?

> Best regards,
> Marcus
>
> ---------------------------(end of broadcast)---------------------------
> TIP 6: explain analyze is your friend
>

--
Mike Rylander
mrylander(at)gmail(dot)com
GPLS -- PINES Development
Database Developer
http://open-ils.org