Re: integrated tsearch has different results than tsearch2

Lists: pgsql-hackers
From: "Pavel Stehule" <pavel(dot)stehule(at)gmail(dot)com>
To: "PostgreSQL Hackers" <pgsql-hackers(at)postgresql(dot)org>
Cc: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Oleg Bartunov" <oleg(at)sai(dot)msu(dot)su>
Subject: integrated tsearch has different results than tsearch2
Date: 2007-09-03 07:25:36
Message-ID: 162867790709030025n6448e224x6e86664316247133@mail.gmail.com

Hello

I am testing fulltext.

1. I am not able to use full text search with latin2 encoding :( (I am
missing a note in the docs that only UTF-8 dictionaries are supported).

2. With hunspell dictionaries (a fresh copy from OpenOffice) I got
different and wrong results.

Original (old) result

ts=# select * from ts_debug('Příliš žluťoučký kůň se napil žluté vody');
 ts_name       | tok_type | description | token     | dict_name          | tsvector
---------------+----------+-------------+-----------+--------------------+-------------
 default_czech | word     | Word        | Příliš    | {cz_ispell,simple} | 'příliš'
 default_czech | word     | Word        | žluťoučký | {cz_ispell,simple} | 'žluťoučký'
 default_czech | word     | Word        | kůň       | {cz_ispell,simple} | 'kůň'
 default_czech | lword    | Latin word  | se        | {cz_ispell,simple} |
 default_czech | lword    | Latin word  | napil     | {cz_ispell,simple} | 'napít'
 default_czech | word     | Word        | žluté     | {cz_ispell,simple} | 'žlutý'
 default_czech | lword    | Latin word  | vody      | {cz_ispell,simple} | 'voda'
(7 rows)

New results:
postgres=# create Text search dictionary cspell(template=ispell, dictfile=czech, afffile=czech, stopwords=czech);
CREATE TEXT SEARCH DICTIONARY
postgres=# CREATE text search configuration cs (copy=english);
CREATE TEXT SEARCH CONFIGURATION

postgres=# alter text search configuration cs alter mapping for word, lword with cspell, simple;
ALTER TEXT SEARCH CONFIGURATION
postgres=# select * from ts_debug('cs','Příliš žluťoučký kůň se napil žluté vody');
 Alias | Description   | Token     | Dictionaries    | Lexized token
-------+---------------+-----------+-----------------+---------------------
 word  | Word          | Příliš    | {cspell,simple} | cspell: {příliš}
 blank | Space symbols |           | {}              |
 word  | Word          | žluťoučký | {cspell,simple} | cspell: {žluťoučký}
 blank | Space symbols |           | {}              |
 word  | Word          | kůň       | {cspell,simple} | cspell: {kůň}
 blank | Space symbols |           | {}              |
 lword | Latin word    | se        | {cspell,simple} | cspell: {}
 blank | Space symbols |           | {}              |
 lword | Latin word    | napil     | {cspell,simple} | simple: {napil}
 blank | Space symbols |           | {}              |
 word  | Word          | žluté     | {cspell,simple} | simple: {žluté}
 blank | Space symbols |           | {}              |
 lword | Latin word    | vody      | {cspell,simple} | simple: {vody}
(13 rows)

This query returned true in 8.2, but now it returns false:

postgres=# select to_tsvector('cs','Příliš žlutý kůň se napil žluté vody') @@ to_tsquery('cs','napít');
 ?column?
----------
 f
(1 row)

Regards
Pavel Stehule


From: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
To: Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: integrated tsearch has different results than tsearch2
Date: 2007-09-03 08:46:25
Message-ID: Pine.LNX.4.64.0709031245430.2767@sn.sai.msu.ru

Pavel,

I can't read your posting. Can you use plain text format?

Oleg
On Mon, 3 Sep 2007, Pavel Stehule wrote:

> Hello
> I am testing fulltext.
> 1. I am not able use fulltext with latin2 encoding :( I missing noteabout only utf8 dictionaries in doc).
>
> 2. with hspell dictionaries (fresh copy from open office) I gotdifferent and wrong results.
> Original (old) result
> ts=# select * from ts_debug('P??li? ?lu?ou?k? k?? se napil ?lut? vody'); ts_name | tok_type | description | token | dict_name | tsvector --------------+----------+-------------+-----------+-------------------+ ------------ default_czech | word | Word | P??li? |{cz_ispell,simple} | 'p??li?' default_czech | word | Word | ?lu?ou?k? |{cz_ispell,simple} | '?lu?ou?k?' default_czech | word | Word | k?? | {cz_ispell,simple} | 'k??' default_czech | lword | Latin word | se | {cz_ispell,simple} | default_czech | lword | Latin word | napil |{cz_ispell,simple} | 'nap?t' default_czech | word | Word | ?lut? |{cz_ispell,simple} | '?lut?' default_czech | lword | Latin word | vody |{cz_ispell,simple} | 'voda' (7 ??dek)
> New results:postgres=# create Text search dictionary cspell(template=ispell,dictfile=czech, afffile=czech, stopwords=czech);CREATE TEXT SEARCH DICTIONARYpostgres=# CREATE text search configuration cs (copy=english);CREATE TEXT SEARCH CONFIGURATION
> postgres=# alter text search configuration cs alter mapping for word,lword with cspell, simple;ALTER TEXT SEARCH CONFIGURATIONpostgres=# select * from ts_debug('cs','P??li? ?lu?ou?k? k?? se napil?lut? vody'); Alias | Description | Token | Dictionaries | Lexized token-------+---------------+-----------+-----------------+--------------------- word | Word | P??li? | {cspell,simple} | cspell: {p??li?} blank | Space symbols | | {} | word | Word | ?lu?ou?k? | {cspell,simple} | cspell: {?lu?ou?k?} blank | Space symbols | | {} | word | Word | k?? | {cspell,simple} | cspell: {k??} blank | Space symbols | | {} | lword | Latin word | se | {cspell,simple} | cspell: {} blank | Space symbols | | {} | lword | Latin word | napil | {cspell,simple} | simple: {napil} blank | Space symbols | | {} | word | Word | ?lut? | {cspell,simple} | simple: {?lut?} blank | Space symbols | | {} | lword | Latin word | vody | {cspell,simple} | simple: {vody}(13 rows)
> This query returned true in 8.2 and now:
> postgres=# select to_tsvector('cs','P??li? ?lut? k?? se napil ?lut?vody') @@ to_tsquery('cs','nap?t'); ?column?---------- f(1 row)
> RegardsPavel Stehule
>

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg(at)sai(dot)msu(dot)su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83


From: Teodor Sigaev <teodor(at)sigaev(dot)ru>
To: Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: integrated tsearch has different results than tsearch2
Date: 2007-09-03 10:24:31
Message-ID: 46DBE0DF.4010109@sigaev.ru

> 1. I am not able to use full text search with latin2 encoding :( (I am
> missing a note in the docs that only UTF-8 dictionaries are supported).
You can use any server encoding, but the dictionary's files should be in UTF-8;
the dictionary will convert the UTF-8 files into the server encoding.

> 2. With hunspell dictionaries (a fresh copy from OpenOffice) I got
> different and wrong results.
> postgres=# select to_tsvector('cs','Příliš žlutý kůň se napil žluté vody') @@ to_tsquery('cs','napít');
> ?column?
> ----------
> f
> (1 row)

Please post the output of:
select ts_lexize('cspell','napil');
select to_tsvector('cs','Příliš žlutý kůň se napil žluté vody');

--
Teodor Sigaev E-mail: teodor(at)sigaev(dot)ru
WWW: http://www.sigaev.ru/


From: "Pavel Stehule" <pavel(dot)stehule(at)gmail(dot)com>
To: "Teodor Sigaev" <teodor(at)sigaev(dot)ru>
Cc: "PostgreSQL Hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: integrated tsearch has different results than tsearch2
Date: 2007-09-04 10:57:50
Message-ID: 162867790709040357w22ffa19pd5aabf917dadd48d@mail.gmail.com

2007/9/3, Teodor Sigaev <teodor(at)sigaev(dot)ru>:
> > 1. I am not able to use full text search with latin2 encoding :( (I am
> > missing a note in the docs that only UTF-8 dictionaries are supported).
> You can use any server encoding, but the dictionary's files should be in UTF-8;
> the dictionary will convert the UTF-8 files into the server encoding.
>
> > 2. With hunspell dictionaries (a fresh copy from OpenOffice) I got
> > different and wrong results.
> > postgres=# select to_tsvector('cs','Příliš žlutý kůň se napil žluté vody') @@ to_tsquery('cs','napít');
> > ?column?
> > ----------
> > f
> > (1 row)
>
> Please post the output of:
> select ts_lexize('cspell','napil');
> select to_tsvector('cs','Příliš žlutý kůň se napil žluté vody');
>
postgres=# select ts_lexize('cspell','napil');
ts_lexize
-----------

(1 row)
postgres=# select to_tsvector('cs','Příliš žlutý kůň se napil žluté vody');
to_tsvector
-----------------------------------------------------------
'vody':7 'kůň':3 'napil':5 'žluté':6 'žlutý':2 'příliš':1
(1 row)

There is a difference between the two versions.

8.2.x:
postgres=# select lexize('cz_ispell','jablka');
lexize
----------
{jablko}
(1 row)

8.3:
postgres=# select ts_lexize('cspell','jablka');
ts_lexize
-----------

(1 row)
postgres=# select ts_lexize('cspell','jablko');
ts_lexize
-----------
{jablko}
(1 row)

Pavel Stehule


From: "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>
To: "Pavel Stehule" <pavel(dot)stehule(at)gmail(dot)com>
Cc: "Teodor Sigaev" <teodor(at)sigaev(dot)ru>, "PostgreSQL Hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: integrated tsearch has different results than tsearch2
Date: 2007-09-04 11:14:02
Message-ID: 46DD3DFA.8020401@enterprisedb.com

Can you post a link to the ispell dictionary file you're using so I and
others can reproduce that?

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: "Pavel Stehule" <pavel(dot)stehule(at)gmail(dot)com>
To: "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>
Cc: "Teodor Sigaev" <teodor(at)sigaev(dot)ru>, "PostgreSQL Hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: integrated tsearch has different results than tsearch2
Date: 2007-09-04 11:52:23
Message-ID: 162867790709040452o4f0f2558m37adb4219b3e7ed6@mail.gmail.com

I used the dictionaries from the Fedora Core package

hunspell-cs-20060303-5.fc7.i386.rpm

and then converted them to UTF-8 with iconv.

Pavel



From: "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>
To: "Pavel Stehule" <pavel(dot)stehule(at)gmail(dot)com>
Cc: "Teodor Sigaev" <teodor(at)sigaev(dot)ru>, "PostgreSQL Hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: integrated tsearch has different results than tsearch2
Date: 2007-09-04 12:35:08
Message-ID: 46DD50FC.5010104@enterprisedb.com

Pavel Stehule wrote:
> I used dictionaries from fedora core packages
>
> hunspell-cs-20060303-5.fc7.i386.rpm
>
> then I converted it to utf8 with iconv

Ok, thanks.

Apparently it's a bug I introduced when I refactored spell.c to use the
readline function for reading and recoding the input file. I didn't
notice that some calls to STRNCMP used the non-lowercased version of the
input line. Patch attached.
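
For readers following along, the class of bug described above can be sketched roughly as follows. This is only an illustration under assumptions, not the code from spell.c or the attached patch: the keyword "sfx", the function names, and the 256-byte buffer are all hypothetical. The point is simply that a prefix test against a lowercase keyword silently fails when it is handed the raw line instead of the lowercased copy.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy src into dst, lowercasing ASCII letters only.
 * Truncates if src does not fit; dst is always NUL-terminated. */
void lower_ascii(char *dst, size_t dstlen, const char *src)
{
    size_t i = 0;

    if (dstlen == 0)
        return;
    for (; src[i] != '\0' && i + 1 < dstlen; i++)
        dst[i] = (char) tolower((unsigned char) src[i]);
    dst[i] = '\0';
}

/* Prefix test in the spirit of a STRNCMP-style macro:
 * returns 1 if line begins with keyword, 0 otherwise. */
int starts_with(const char *line, const char *keyword)
{
    return strncmp(line, keyword, strlen(keyword)) == 0;
}

/* Buggy variant: the lowercase keyword is compared against the raw,
 * non-lowercased line, so an uppercase "SFX ..." line is not recognized. */
int is_suffix_rule_buggy(const char *raw_line)
{
    return starts_with(raw_line, "sfx");
}

/* Fixed variant: lowercase the line first, then compare. */
int is_suffix_rule_fixed(const char *raw_line)
{
    char lowered[256];   /* assumed maximum line length for this sketch */

    lower_ascii(lowered, sizeof lowered, raw_line);
    return starts_with(lowered, "sfx");
}
```

With an input line such as "SFX A Y 1", the buggy variant reports no match while the fixed variant does. If affix rules are skipped in this way, a dictionary would still recognize base forms but not inflected ones, which would be consistent with the jablko/jablka results reported earlier in the thread.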

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

Attachment Content-Type Size
spell-fix-1.patch text/x-diff 1.3 KB

From: "Pavel Stehule" <pavel(dot)stehule(at)gmail(dot)com>
To: "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>
Cc: "Teodor Sigaev" <teodor(at)sigaev(dot)ru>, "PostgreSQL Hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: integrated tsearch has different results than tsearch2
Date: 2007-09-04 12:43:04
Message-ID: 162867790709040543m1468abefj1d02d528df92fd2f@mail.gmail.com

2007/9/4, Heikki Linnakangas <heikki(at)enterprisedb(dot)com>:
> Pavel Stehule wrote:
> > I used dictionaries from fedora core packages
> >
> > hunspell-cs-20060303-5.fc7.i386.rpm
> >
> > then I converted it to utf8 with iconv
>
> Ok, thanks.
>
> Apparently it's a bug I introduced when I refactored spell.c to use the
> readline function for reading and recoding the input file. I didn't
> notice that some calls to STRNCMP used the non-lowercased version of the
> input line. Patch attached.

It works

Thank you
Pavel