Re: How to switch off Snowball stemmer for tsearch2?

Lists: pgsql-general
From: "Dmitry Koterov" <dmitry(at)koterov(dot)ru>
To: "Postgres General" <pgsql-general(at)postgresql(dot)org>
Subject: How to switch off Snowball stemmer for tsearch2?
Date: 2007-08-22 18:10:06
Message-ID: d7df81620708221110s6adedb07g9b5f93b8f8c3c38e@mail.gmail.com

Hello.

We use ispell dictionaries for tsearch2 (ru_ispell_cp1251).
The Snowball stemmer is now configured as well.

How do I properly switch OFF the Snowball stemmer for Russian without turning off
the ispell stemmer? (This is really needed, because "Ivanov" is not the same as
"Ivan".)
Is it enough, and correct, to simply delete the row from pg_ts_dict?

Here is the dump of pg_ts_dict table:

dict_name:       en_ispell
dict_init:       spell_init(internal)
dict_initoption: DictFile=/usr/lib/ispell/english.med,AffFile=/usr/lib/ispell/english.aff,StopFile=/usr/share/pgsql/contrib/english.stop
dict_lexize:     spell_lexize(internal,internal,integer)
dict_comment:    (empty)

dict_name:       en_stem
dict_init:       snb_en_init(internal)
dict_initoption: contrib/english.stop
dict_lexize:     snb_lexize(internal,internal,integer)
dict_comment:    English Stemmer. Snowball.

dict_name:       ispell_template
dict_init:       spell_init(internal)
dict_initoption: (empty)
dict_lexize:     spell_lexize(internal,internal,integer)
dict_comment:    ISpell interface. Must have .dict and .aff files

dict_name:       ru_ispell_cp1251
dict_init:       spell_init(internal)
dict_initoption: DictFile=/usr/lib/ispell/russian.med,AffFile=/usr/lib/ispell/russian.aff,StopFile=/usr/share/pgsql/contrib/russian.stop.cp1251
dict_lexize:     spell_lexize(internal,internal,integer)
dict_comment:    (empty)

dict_name:       ru_stem_cp1251
dict_init:       snb_ru_init_cp1251(internal)
dict_initoption: contrib/russian.stop.cp1251
dict_lexize:     snb_lexize(internal,internal,integer)
dict_comment:    Russian Stemmer. Snowball. WINDOWS (cp1251) Encoding

dict_name:       ru_stem_koi8
dict_init:       snb_ru_init_koi8(internal)
dict_initoption: contrib/russian.stop
dict_lexize:     snb_lexize(internal,internal,integer)
dict_comment:    Russian Stemmer. Snowball. KOI8 Encoding

dict_name:       ru_stem_utf8
dict_init:       snb_ru_init_utf8(internal)
dict_initoption: contrib/russian.stop.utf8
dict_lexize:     snb_lexize(internal,internal,integer)
dict_comment:    Russian Stemmer. Snowball. UTF8 Encoding

dict_name:       simple
dict_init:       dex_init(internal)
dict_initoption: (empty)
dict_lexize:     dex_lexize(internal,internal,integer)
dict_comment:    Simple example of dictionary.

dict_name:       synonym
dict_init:       syn_init(internal)
dict_initoption: (empty)
dict_lexize:     syn_lexize(internal,internal,integer)
dict_comment:    Example of synonym dictionary

dict_name:       thesaurus_template
dict_init:       thesaurus_init(internal)
dict_initoption: (empty)
dict_lexize:     thesaurus_lexize(internal,internal,integer,internal)
dict_comment:    Thesaurus template, must be pointed Dictionary and DictFile


From: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
To: Dmitry Koterov <dmitry(at)koterov(dot)ru>
Cc: Postgres General <pgsql-general(at)postgresql(dot)org>
Subject: Re: How to switch off Snowball stemmer for tsearch2?
Date: 2007-08-22 18:46:59
Message-ID: Pine.LNX.4.64.0708222244400.2727@sn.sai.msu.ru

On Wed, 22 Aug 2007, Dmitry Koterov wrote:

> Hello.
>
> We use ispell dictionaries for tsearch2 (ru_ispell_cp1251).
> The Snowball stemmer is now configured as well.
>
> How do I properly switch OFF the Snowball stemmer for Russian without turning off
> the ispell stemmer? (This is really needed, because "Ivanov" is not the same as
> "Ivan".)
> Is it enough, and correct, to simply delete the row from pg_ts_dict?
>
> Here is the dump of pg_ts_dict table:

Don't use a dump; a plain SELECT would be better. In your case, I'd
suggest following the standard way: create a synonym file like
ivanov ivanov
and use it before the other dictionaries. The synonym dictionary will recognize
'Ivanov' and return 'ivanov'.
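
As a rough illustration of that lookup, here is a minimal Python model of a synonym dictionary. It is not tsearch2 code: the function names are hypothetical, the file is replaced by an in-memory list, and the case folding is only assumed from the 'Ivanov' -> 'ivanov' example above.

```python
# Illustrative model of a synonym dictionary; not tsearch2 code.

def load_synonyms(lines):
    """Parse 'word replacement' pairs, one pair per line."""
    table = {}
    for line in lines:
        parts = line.split()
        if len(parts) >= 2:              # skip blank or malformed lines
            table[parts[0].lower()] = parts[1]
    return table


def synonym_lexize(table, token):
    """Return the replacement for a known token, or None to pass it on."""
    return table.get(token.lower())


# The one-line synonym file from the message above.
synonyms = load_synonyms(["ivanov ivanov"])
```

A token that is not listed yields None here, which stands for "not recognized, let the next dictionary try".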


Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg(at)sai(dot)msu(dot)su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83


From: "Dmitry Koterov" <dmitry(at)koterov(dot)ru>
To: "Oleg Bartunov" <oleg(at)sai(dot)msu(dot)su>
Cc: "Postgres General" <pgsql-general(at)postgresql(dot)org>
Subject: Re: How to switch off Snowball stemmer for tsearch2?
Date: 2007-08-22 19:21:54
Message-ID: d7df81620708221221h30a575c7m292de73bfa34e6fc@mail.gmail.com

Suppose I cannot add such synonyms, because:

1. There are a lot of surnames; I cannot take care of all of them.
2. After adding a new surname I have to recalculate all full-text indices,
which costs too much (about 10 days to complete the recalculation).

So I need exactly what I asked for: to switch OFF stem guessing if a word is not in
the dictionary.



From: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
To: Dmitry Koterov <dmitry(at)koterov(dot)ru>
Cc: Postgres General <pgsql-general(at)postgresql(dot)org>
Subject: Re: How to switch off Snowball stemmer for tsearch2?
Date: 2007-08-22 19:33:04
Message-ID: Pine.LNX.4.64.0708222330040.2727@sn.sai.msu.ru

On Wed, 22 Aug 2007, Dmitry Koterov wrote:

> Suppose I cannot add such synonyms, because:
>
> 1. There are a lot of surnames; I cannot take care of all of them.
> 2. After adding a new surname I have to recalculate all full-text indices,
> which costs too much (about 10 days to complete the recalculation).
>
> So I need exactly what I asked for: to switch OFF stem guessing if a word is not in
> the dictionary.

No problem, just modify pg_ts_cfgmap, which contains the mapping
from token types to dictionaries.

If you change the configuration you should rebuild the tsvectors and reindex.
10 days looks very suspicious.


Regards,
Oleg


From: "Ivan Zolotukhin" <ivan(dot)zolotukhin(at)gmail(dot)com>
To: "Oleg Bartunov" <oleg(at)sai(dot)msu(dot)su>
Cc: "Dmitry Koterov" <dmitry(at)koterov(dot)ru>, "Postgres General" <pgsql-general(at)postgresql(dot)org>
Subject: Re: How to switch off Snowball stemmer for tsearch2?
Date: 2007-08-22 20:14:52
Message-ID: 751e56400708221314l36b8289i8bf9818d7185af0d@mail.gmail.com

10 days is not suspicious at all if you need to pull out text for
indexing using complex logic and/or a complex schema (i.e. most of the time you
are retrieving text, not indexing it). Example: you index some tree leaves
(i.e. a table with 3 columns: id, parent_id and name) and want to have
a redundant text index. You therefore need to retrieve all of a leaf's
ancestors before doing to_tsvector(), something like that.
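
To make the tree example concrete, here is a small Python sketch of collecting the text for one leaf. The sample rows and the function name are hypothetical; in a real schema each step up the tree would be a database lookup, which is where the time would go.

```python
# Hypothetical rows of a tree table: id -> (parent_id, name).
rows = {
    1: (None, "Electronics"),
    2: (1, "Phones"),
    3: (2, "Acme X100"),
}


def text_for_indexing(rows, leaf_id):
    """Walk from a leaf up to the root and join all names, root first."""
    names = []
    seen = set()
    node = leaf_id
    while node is not None and node not in seen:
        seen.add(node)                   # guard against cycles in bad data
        parent_id, name = rows[node]
        names.append(name)
        node = parent_id
    return " ".join(reversed(names))
```

The returned string is what would then be passed to to_tsvector() for that leaf.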



From: "Dmitry Koterov" <dmitry(at)koterov(dot)ru>
To: "Oleg Bartunov" <oleg(at)sai(dot)msu(dot)su>
Cc: "Postgres General" <pgsql-general(at)postgresql(dot)org>
Subject: Re: How to switch off Snowball stemmer for tsearch2?
Date: 2007-08-22 22:32:34
Message-ID: d7df81620708221532x4a4d62f6k6c0f0923df413771@mail.gmail.com

Oh! Thanks!

delete from pg_ts_cfgmap where dict_name = ARRAY['ru_stem'];

solves the root of the problem. But unfortunately
russian.med (ru_ispell_cp1251) contains all Russian names, so "Ivanov"
is converted to "Ivan" by ispell too. :-(

Now

select lexize('ru_ispell_cp1251', 'Дмитриев') -> "Дмитрий"
select lexize('ru_ispell_cp1251', 'Иванов') -> "Иван"
- it is completely wrong!

I have a database with all Russian names. Is it possible to use it (and how?) to
make lexize() not convert "Ivanov" to "Ivan" even if the ispell
dictionary contains an entry for "Ivan"? So, this pseudo-code logic is
needed:

function new_lexize($string) {
$stem = lexize('ru_ispell_cp1251', $string);
if ($stem in names_database) return $string; else return $stem;
}

Maybe tsearch2 implements this logic already?
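
For reference, the pseudo-code above can be written as a runnable Python sketch. Everything here is a stand-in: ISPELL_STEMS and NAMES_DATABASE are tiny hypothetical samples, ispell_lexize() only mimics lexize('ru_ispell_cp1251', ...), and the handling of unknown words (returning None) is an added assumption that the pseudo-code does not spell out.

```python
# Stand-ins for the real pieces; both tables are hypothetical sample data.
ISPELL_STEMS = {"Иванов": "Иван", "Дмитриев": "Дмитрий", "столы": "стол"}
NAMES_DATABASE = {"Иван", "Дмитрий"}


def ispell_lexize(token):
    """Mimic lexize('ru_ispell_cp1251', token): a stem, or None if unknown."""
    return ISPELL_STEMS.get(token)


def new_lexize(token):
    """Keep the original token when its stem is a known first name."""
    stem = ispell_lexize(token)
    if stem is None:
        return None                      # unknown word: nothing to decide
    if stem in NAMES_DATABASE:
        return token                     # probably a surname: do not reduce it
    return stem
```

One consequence of this rule is that genuine inflected forms of a first name are also left unstemmed, since their stem is in the names database too.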



From: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
To: Dmitry Koterov <dmitry(at)koterov(dot)ru>
Cc: Postgres General <pgsql-general(at)postgresql(dot)org>
Subject: Re: How to switch off Snowball stemmer for tsearch2?
Date: 2007-08-23 05:27:58
Message-ID: Pine.LNX.4.64.0708230925240.2727@sn.sai.msu.ru

On Thu, 23 Aug 2007, Dmitry Koterov wrote:

> Oh! Thanks!
>
> delete from pg_ts_cfgmap where dict_name = ARRAY['ru_stem'];
>
> solves the root of the problem. But unfortunately
> russian.med (ru_ispell_cp1251) contains all Russian names, so "Ivanov"
> is converted to "Ivan" by ispell too. :-(
>
> Now
>
> select lexize('ru_ispell_cp1251', 'Дмитриев') -> "Дмитрий"
> select lexize('ru_ispell_cp1251', 'Иванов') -> "Иван"
> - it is completely wrong!
>
> I have a database with all Russian names. Is it possible to use it (and how?) to

If you have such a database, why not just write a special dictionary and
put it in front?

> make lexize() not convert "Ivanov" to "Ivan" even if the ispell
> dictionary contains an entry for "Ivan"? So, this pseudo-code logic is
> needed:
>
> function new_lexize($string) {
> $stem = lexize('ru_ispell_cp1251', $string);
> if ($stem in names_database) return $string; else return $stem;
> }
>
> Maybe tsearch2 implements this logic already?

sure, it's how text search mapping works. Dmitry, seems your company could be
my client :)


Regards,
Oleg


From: "Dmitry Koterov" <dmitry(at)koterov(dot)ru>
To: "Oleg Bartunov" <oleg(at)sai(dot)msu(dot)su>
Cc: "Postgres General" <pgsql-general(at)postgresql(dot)org>
Subject: Re: How to switch off Snowball stemmer for tsearch2?
Date: 2007-08-23 09:56:46
Message-ID: d7df81620708230256m292ae23fk3aeb1c9c9e756c6@mail.gmail.com

>
> > Now
> >
> > select lexize('ru_ispell_cp1251', 'Дмитриев') -> "Дмитрий"
> > select lexize('ru_ispell_cp1251', 'Иванов') -> "Иван"
> > - it is completely wrong!
> >
> > I have a database with all Russian names. Is it possible to use it (and how?) to
>
> If you have such a database, why not just write a special dictionary and
> put it in front?

Because this is a database of Russian NAMES, NOT a database of
surnames.

> > make lexize() not convert "Ivanov" to "Ivan" even if the ispell
> > dictionary contains an entry for "Ivan"? So, this pseudo-code logic is
> > needed:
> >
> > function new_lexize($string) {
> > $stem = lexize('ru_ispell_cp1251', $string);
> > if ($stem in names_database) return $string; else return $stem;
> > }
> >
> > Maybe tsearch2 implements this logic already?
>
> sure, it's how text search mapping works.

Could you please elaborate?

Of course I can create all word forms of all Russian names using ispell and
then subtract this full list from the ispell dictionary (so I will remove
"Ivan", "Ivanami" etc. from it). But possibly tsearch2 has this subtraction
algorithm already.

> Dmitry, seems your company could be my client :)

Not now, thank you. Maybe later.


From: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
To: Dmitry Koterov <dmitry(at)koterov(dot)ru>
Cc: Postgres General <pgsql-general(at)postgresql(dot)org>
Subject: Re: How to switch off Snowball stemmer for tsearch2?
Date: 2007-08-23 12:05:27
Message-ID: Pine.LNX.4.64.0708231556590.2727@sn.sai.msu.ru

On Thu, 23 Aug 2007, Dmitry Koterov wrote:

>>
>>> Now
>>>
>>> select lexize('ru_ispell_cp1251', 'Дмитриев') -> "Дмитрий"
>>> select lexize('ru_ispell_cp1251', 'Иванов') -> "Иван"
>>> - it is completely wrong!
>>>
>>> I have a database with all Russian names. Is it possible to use it (and how?) to
>>
>> If you have such a database, why not just write a special dictionary and
>> put it in front?
>
>
> Because this is a database of Russian NAMES, NOT a database of
> surnames.
>
>
>>> make lexize() not convert "Ivanov" to "Ivan" even if the ispell
>>> dictionary contains an entry for "Ivan"? So, this pseudo-code logic is
>>> needed:
>>>
>>> function new_lexize($string) {
>>> $stem = lexize('ru_ispell_cp1251', $string);
>>> if ($stem in names_database) return $string; else return $stem;
>>> }
>>>
>>> Maybe tsearch2 implements this logic already?

Write your own dictionary, which implements any logic you need. In your
case it's just a wrapper around ispell which returns the original string,
not the stem. See the example
http://www.sai.msu.su/~megera/postgres/fts/doc/fts-intdict-xmp.html
and the Russian article
http://www.sai.msu.su/~megera/postgres/talks/fts_pgsql_intro.html#ftsdict

>>
>> sure, it's how text search mapping works.
>
>
> Could you please elaborate?

You create a dictionary surnames_dict and configure
pg_ts_cfgmap to process tokens of type nlword with
surnames_dict, ru_ispell, ru_stem, for example.
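
The effect of such an ordered list can be modelled in a few lines of Python. This is only a sketch of the "first dictionary that recognizes the token wins" rule: make_dict() and lexize_chain() are hypothetical helpers, and the dictionary contents are made-up samples, not real tsearch2 data.

```python
def make_dict(table):
    """Build a toy dictionary: returns a lexeme, or None if unrecognized."""
    return lambda token: table.get(token.lower())


def lexize_chain(dictionaries, token):
    """Return the answer of the first dictionary that recognizes the token."""
    for lexize in dictionaries:
        result = lexize(token)
        if result is not None:
            return result
    return None                          # no dictionary recognized the token


# Hypothetical contents; real dictionaries are far larger.
surnames_dict = make_dict({"ivanov": "ivanov"})
ru_ispell = make_dict({"ivanov": "ivan", "ivana": "ivan"})
chain = [surnames_dict, ru_ispell]
```

Because surnames_dict comes first, "Ivanov" never reaches ispell. A Snowball stemmer placed last would accept every remaining token, which is why dropping it from the list is what stops the guessing for unknown words.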

>
> Of course I can create all word forms of all Russian names using ispell and
> then subtract this full list from the ispell dictionary (so I will remove
> "Ivan", "Ivanami" etc. from it). But possibly tsearch2 has this subtraction
> algorithm already.
>

Don't do that! Just go the plain way.

Regards,
Oleg


From: "Dmitry Koterov" <dmitry(at)koterov(dot)ru>
To: "Oleg Bartunov" <oleg(at)sai(dot)msu(dot)su>
Cc: "Postgres General" <pgsql-general(at)postgresql(dot)org>
Subject: Re: How to switch off Snowball stemmer for tsearch2?
Date: 2007-08-23 13:10:01
Message-ID: d7df81620708230610p5d20d009md959e6885c5c0aa3@mail.gmail.com

>
> Write your own dictionary, which implements any logic you need. In your
> case it's just a wrapper around ispell which returns the original string,
> not the stem. See the example
> http://www.sai.msu.su/~megera/postgres/fts/doc/fts-intdict-xmp.html
> and the Russian article
> http://www.sai.msu.su/~megera/postgres/talks/fts_pgsql_intro.html#ftsdict

Ah, I understand you!
You suggest writing a small Postgres contrib module (a new dictionary) in C and
implementing all the logic in it.
That seems a rather complex solution for such a simple task (excluding surnames
from lexization), but it could be implemented, of course.

> > Of course I can create all word forms of all Russian names using ispell and
> > then subtract this full list from the ispell dictionary (so I will remove
> > "Ivan", "Ivanami" etc. from it). But possibly tsearch2 has this subtraction
> > algorithm already.
> >
>
> Don't do that! Just go the plain way.
>

Another method is to generate a single synonym dictionary based on all
word forms of Russian names using ispell (we will get all suspicious surnames in
this set) and add it before ispell. This solution does not require writing
anything in C.
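
A minimal Python sketch of that last idea follows. It assumes the word forms have already been produced by some ispell-based expansion step, which is not shown; build_synonym_lines() and the sample input are hypothetical, and the output uses the "word word" line format from the synonym file example earlier in the thread.

```python
def build_synonym_lines(word_forms_by_name):
    """Map every generated word form to itself so ispell never reduces it."""
    forms = set()
    for name, word_forms in word_forms_by_name.items():
        forms.add(name.lower())
        forms.update(form.lower() for form in word_forms)
    return [f"{form} {form}" for form in sorted(forms)]


# Hypothetical input: forms that ispell would otherwise reduce to the name.
sample = {"Ivan": ["Ivanov", "Ivanova", "Ivanami"]}
lines = build_synonym_lines(sample)
```

The trade-off is that true inflections of the first name, such as "Ivanami", end up in the file as well and are no longer reduced to "Ivan".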