Re: fts, compond words?

From: Mike Rylander <mrylander(at)gmail(dot)com>
To: Teodor Sigaev <teodor(at)sigaev(dot)ru>
Cc: POSTGRESQL <pgsql-general(at)postgresql(dot)org>
Subject: Re: fts, compond words?
Date: 2005-12-08 16:09:51
Message-ID: b918cf3d0512080809s1ecb1b2fn318ec886dbb1436e@mail.gmail.com
Lists: pgsql-general

On 12/8/05, Teodor Sigaev <teodor(at)sigaev(dot)ru> wrote:
> > (a + foo1 + bar) | (a + foo2 + bar)
>
> That's a simple case; what about languages such as Norwegian or German? They
> have compound words, and an ispell dictionary can split them into lexemes.
> But usually there is more than one variant of separation:
>
> forbruksvaremerkelov
> forbruk vare merke lov
> forbruk vare merkelov
> forbruk varemerke lov
> forbruk varemerkelov
> forbruksvare merke lov
> forbruksvare merkelov
> (notice: I don't know the translation, it's just an example. When we were
> working on compound word support we found a word which has 24 variants of
> separation!!)
>
> So, query 'a + forbruksvaremerkelov' will be awful:
>
> a + ( (forbruk & vare & merke & lov) | (forbruk & vare & merkelov) | ... )
>
> Of course, that is an example just off the top of my head, but a solution
> for phrase search should work reasonably with such corner cases.
>

WARNING: What follows is wild, hand-waving speculation, as I don't
fully understand the implications of compound words! ;-)

My naive impression is that it would be both possible and a good idea
to stem any compound words to their versions containing the most
individual lexemes. As an analogy, this would be similar to
transforming composed (Normalization Form C) UTF-8 characters into
their decomposed (Normalization Form D) versions.
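
A minimal sketch of that "most individual lexemes" rule, assuming the
dictionary hands back every split as a list of lexemes. The `variants`
list is copied from the quoted example; `most_decomposed` is a
hypothetical helper name, and how ties between equally long splits
should be broken is left open here (the first one simply wins).

```python
# Splits of 'forbruksvaremerkelov' as listed in the quoted example.
variants = [
    ["forbruk", "vare", "merke", "lov"],
    ["forbruk", "vare", "merkelov"],
    ["forbruk", "varemerke", "lov"],
    ["forbruk", "varemerkelov"],
    ["forbruksvare", "merke", "lov"],
    ["forbruksvare", "merkelov"],
]


def most_decomposed(variants):
    """Return the split with the most lexemes (the first one wins ties)."""
    return max(variants, key=len)


canonical = most_decomposed(variants)
print(" ".join(canonical))  # forbruk vare merke lov
```

Picking one canonical split means the index and the query agree on a
single form, so the alternatives never need to be OR-ed together.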

From your example above, the stemmed version of 'forbruksvaremerkelov'
would always decompose to 'forbruk vare merke lov', both for indexing
and in to_tsquery(). For the purposes of phrase searching, or more
generally proximity searching, the compiled query

a + forbruksvaremerkelov

might look something like

a + forbruk + vare + merke + lov

and that's it ... all parts of the compound word are required, and
required to be in that order, for the "phrase" search to be valid. A
compiled query like

a + (forbruk & vare & merke & lov)

wouldn't be valid anyway, because the user wants the entire compound
word to be adjacent to 'a', and the bare '&' op would allow any of the
parts to exist anywhere in the document ... or am I missing something?
(I probably am.)
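
To make the difference between the two operators concrete, here is a
rough sketch over a toy positional index. Nothing in it is tsearch2
code: `positions`, `match_and` and `match_phrase` are hypothetical
names, and a document is assumed to be just a list of already-stemmed
lexemes.

```python
def positions(doc_lexemes):
    """Map each lexeme to the list of positions where it occurs."""
    index = {}
    for pos, lex in enumerate(doc_lexemes):
        index.setdefault(lex, []).append(pos)
    return index


def match_and(index, lexemes):
    """Bare '&': every lexeme occurs somewhere, in any order."""
    return all(lex in index for lex in lexemes)


def match_phrase(index, lexemes):
    """'+' chain: the lexemes occur at consecutive positions, in order."""
    first, rest = lexemes[0], lexemes[1:]
    return any(
        all(start + off in index.get(lex, ()) for off, lex in enumerate(rest, 1))
        for start in index.get(first, ())
    )


query = ["a", "forbruk", "vare", "merke", "lov"]
adjacent = positions("x a forbruk vare merke lov y".split())
scattered = positions("lov a x merke vare y forbruk".split())

print(match_and(adjacent, query), match_phrase(adjacent, query))    # True True
print(match_and(scattered, query), match_phrase(scattered, query))  # True False
```

The scattered document satisfies the '&' form but not the '+' chain,
which is the reason the '&' version of the compiled query would return
documents the user did not ask for.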

The point is, once you go into an order-and-distance mode for two
user-supplied words (pre-stemming), you have to apply that mode to the
entire set of stemmed lexemes that are involved in the "phrase". If
the assumption holds that "user requested order and distance" uses a
different set of operators than free-form full text searching, then I
think it's doable. Each sub-statement that comprises a phrase search
is an atomic unit, and can be applied anywhere within the global
compiled query.

[Thinking ...]

Starting from that assumption, take the example of

a + foonish & bar

The implication of the above assumption is that the '+' (or
'&[follows;dist=1]') operator has higher precedence than a bare '&'
operator. So, the next version of the query, before compilation is
complete, might look like:

(a + foonish) & bar
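
One way to picture that precedence rule, as a sketch only: split on the
loosest operator first, so every '+' chain becomes a single operand of
the surrounding '&'. The function name `group` and the tuple
representation are assumptions for illustration, and parentheses, '|'
and '!' are deliberately not handled.

```python
def group(query):
    """Parse a flat query of terms joined by '+' and '&' into nested tuples.

    Splitting on '&' first makes it the loosest-binding operator, so each
    '+' chain ends up as one operand of the surrounding '&'.
    """
    def chain(part):
        terms = [t.strip() for t in part.split("+")]
        return terms[0] if len(terms) == 1 else ("+", *terms)

    operands = [chain(p) for p in query.split("&")]
    return operands[0] if len(operands) == 1 else ("&", *operands)


tree = group("a + foonish & bar")
print(tree)  # ('&', ('+', 'a', 'foonish'), 'bar')
```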

Then we go through these steps:

(a + (foo1 | foo2)) & bar
   # decompose compound and multi-stem words

( (a + foo1) | (a + foo2) ) & bar
   # create multiple atoms for multi-stem words

The end result is unambiguous and reflects the query the user most
likely intended. Let's try it with a compound word /and/ a multi-stem
word, remembering that "phrase operators" are only allowed between
simple query terms, not compound terms (grouped terms):

1) a & (foonish + forbruksvaremerkelov) & ! bar
   # user-supplied query

2) a & ( (foo1 | foo2) + forbruksvaremerkelov ) & ! bar
   # decompose multi-stem words

3) a & ( (foo1 + forbruksvaremerkelov)
       | (foo2 + forbruksvaremerkelov) ) & ! bar
   # make multiple atoms from the multi-stem words involved in phrases
   # (this creates 1 atom per stem per multi-stem word, and yes, that
   # could get very big... but, IMHO, slow-but-working corner cases
   # are OK)

4) a & ( (foo1 + forbruk + vare + merke + lov)
       | (foo2 + forbruk + vare + merke + lov) ) & ! bar
   # explode the compound words to their "decomposed" form, because
   # that's what ought to be in the indexed data

That meets the same criteria as the simpler example above, and I've
not said anything about compound and multi-stem words outside the
"phrase mode" portion of the query because the current behaviour is
what we want in those cases.
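
The expansion steps for a single phrase can be sketched in a few lines.
This is an illustration under stated assumptions, not how tsearch2
compiles queries: `STEMS` and `COMPOUNDS` are hypothetical lookup
tables standing in for real dictionary output, and `compile_phrase`
and `render` are made-up names.

```python
from itertools import product

# Hypothetical dictionary output, mirroring the examples in this message.
STEMS = {"foonish": ["foo1", "foo2"]}  # multi-stem words
COMPOUNDS = {"forbruksvaremerkelov": ["forbruk", "vare", "merke", "lov"]}


def compile_phrase(words):
    """Turn one user-supplied '+' chain into a list of alternative atoms.

    Each atom is a flat list of lexemes that must appear adjacent and in
    order; the atoms are meant to be OR-ed together.
    """
    # One list of alternatives per word, one atom per combination.
    choices = [STEMS.get(w, [w]) for w in words]
    atoms = []
    for combo in product(*choices):
        # Replace each compound word by its fully decomposed form.
        atom = []
        for lex in combo:
            atom.extend(COMPOUNDS.get(lex, [lex]))
        atoms.append(atom)
    return atoms


def render(atoms):
    """Format atoms as '(x + y) | (x + z)'."""
    return " | ".join("(" + " + ".join(a) + ")" for a in atoms)


atoms = compile_phrase(["foonish", "forbruksvaremerkelov"])
print(render(atoms))
# (foo1 + forbruk + vare + merke + lov) | (foo2 + forbruk + vare + merke + lov)
```

The number of atoms is the product of the stem counts of the words in
the phrase, which is where the "could get very big" caveat comes from;
the compound word itself adds no alternatives because it always maps
to one decomposed form.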

>
>
> --
> Teodor Sigaev E-mail: teodor(at)sigaev(dot)ru
> WWW: http://www.sigaev.ru/
>

--
Mike Rylander
mrylander(at)gmail(dot)com
GPLS -- PINES Development
Database Developer
http://open-ils.org
