Re: Tsearch + polish ispell + polish locale

Lists: pgsql-hackers
From: <arkadiusz(dot)staron(at)dreamlab(dot)pl>
To: <pgsql-hackers(at)postgresql(dot)org>
Subject: Tsearch + polish ispell + polish locale
Date: 2006-11-20 10:19:34
Message-ID: EA6A3F5C1E4BC14D91D93A344436440C010D300E@MXMBON01.grupa.onet

Hi all,

I am experiencing a strange problem using tsearch with the Polish locale (initdb --locale=pl_PL.iso88592) and a Polish ispell dictionary.

I have a PL/pgSQL function that creates a tsvector for a given record (it basically gathers text from various tables and builds one tsvector).

The function returns something like this:

RETURN setweight(to_tsvector(fname), ''A'')
    || setweight(to_tsvector(prov), ''C'')
    [ ... 15 more lines like above ... ]
    || setweight(to_tsvector(firm_rec.fax), ''A'');

After several calls to this function I get an error:

psql> update some_table set fts_vect = record_to_tsvector(id) where id < 40;

ERROR: Error in regis: [^ż]ać at pos 3

Any idea how I can fix this?

Even stranger, the lower() function breaks *after* this error occurs.

Before the error it correctly lowercases Polish letters; after it, it no longer does.

After reconnecting to the database everything works fine (until the next error...).

Regards,

Arek.


From: Teodor Sigaev <teodor(at)sigaev(dot)ru>
To: arkadiusz(dot)staron(at)dreamlab(dot)pl
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Tsearch + polish ispell + polish locale
Date: 2006-11-20 14:07:57
Message-ID: 4561B6BD.3010402@sigaev.ru

> ERROR: Error in regis: [^ż]ać at pos 3
> Any idea how I can fix this?
> Even stranger, the lower() function breaks *after* this
> error occurs.
>
> Before the error it correctly lowercases Polish letters; after it, it
> no longer does.
>
> After reconnecting to the database everything works fine (until the next
> error...)

Which version do you use?

I just fixed a bug close to your problem in current CVS; try the new version.

--
Teodor Sigaev E-mail: teodor(at)sigaev(dot)ru
WWW: http://www.sigaev.ru/


From: <arkadiusz(dot)staron(at)dreamlab(dot)pl>
To: <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Tsearch + polish ispell + polish locale
Date: 2006-11-20 14:12:40
Message-ID: EA6A3F5C1E4BC14D91D93A344436440C010D3094@MXMBON01.grupa.onet

>> After reconnecting to the database everything works fine (until the next
>> error...)

> Which version do you use?
>
> I just fixed a bug close to your problem in current CVS; try the new version.

I am using version 8.1.5

I will try and let you know...

Thanks for your answer,
Arek.


From: Teodor Sigaev <teodor(at)sigaev(dot)ru>
To: arkadiusz(dot)staron(at)dreamlab(dot)pl
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Tsearch + polish ispell + polish locale
Date: 2006-11-20 14:20:36
Message-ID: 4561B9B4.1030702@sigaev.ru


> I am using version 8.1.5
Oops, I worked on 8.2.

Can you send me the ispell files (dict and affix)? And please put together a simple
test suite that demonstrates the problem.

--
Teodor Sigaev E-mail: teodor(at)sigaev(dot)ru
WWW: http://www.sigaev.ru/


From: <arkadiusz(dot)staron(at)dreamlab(dot)pl>
To: <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Tsearch + polish ispell + polish locale
Date: 2006-11-20 15:30:49
Message-ID: EA6A3F5C1E4BC14D91D93A344436440C010D30D8@MXMBON01.grupa.onet


I am using ispell files from OpenOffice (converted with my2ispell).
I also tried others (e.g. http://www.kurnik.pl/dictionary/) with the same result.

As for the test suite, I think it will take some time to prepare one.
I will send it as soon as possible.

I think I will first try to port the locale fix to 8.1 and see how it works...

Thanks,
Arek.



From: <arkadiusz(dot)staron(at)dreamlab(dot)pl>
To: <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Tsearch + polish ispell + polish locale
Date: 2006-11-21 17:45:43
Message-ID: EA6A3F5C1E4BC14D91D93A344436440C010D32C2@MXMBON01.grupa.onet

Hi Teodor,

Unfortunately I can't create a test suite...
I tried to make it as simple as possible, but on a simple (small) database everything works fine.
I also cannot give you a copy of my database, since it contains proprietary data...

I solved my problem by writing my own tolower() function and substituting it in the tsearch2 code.
On a database with the locale set to 'C' it works fine.

As far as I could debug the problem, with locale = 'C' RS_compile() is fed only strings that do not contain Polish letters.
With the locale set to 'pl_PL.iso88592', the strings passed to RS_compile() contain Polish letters.
I do not know how, but in some strange, seemingly random cases isalpha() stops returning true for Polish letters, and that is when RS_compile() returns an error.
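
The locale dependence described above can be reproduced outside PostgreSQL. What follows is a minimal sketch, not the tsearch2 code: is_alpha_in_locale() is a hypothetical helper, and it assumes the ISO-8859-2 encoding, in which the letter ż is the single byte 0xBF. Under the 'C' locale isalpha() rejects that byte, which matches the situation reported above where RS_compile() fails on a pattern such as [^ż]ać.

```c
#include <ctype.h>
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper (not part of PostgreSQL or tsearch2).
 * Returns 1 if `byte` is alphabetic under LC_CTYPE = locale_name,
 * 0 if it is not, and -1 if the locale cannot be selected
 * (for example because it is not installed).
 * The previously active LC_CTYPE setting is restored before returning.
 */
int is_alpha_in_locale(const char *locale_name, unsigned char byte)
{
    const char *current = setlocale(LC_CTYPE, NULL);
    char *saved = NULL;
    int result;

    /* Copy the current name: later setlocale() calls may overwrite it. */
    if (current != NULL) {
        saved = malloc(strlen(current) + 1);
        if (saved != NULL)
            strcpy(saved, current);
    }

    if (setlocale(LC_CTYPE, locale_name) == NULL) {
        free(saved);
        return -1;
    }

    /* isalpha() takes an int holding an unsigned char value. */
    result = isalpha(byte) ? 1 : 0;

    if (saved != NULL) {
        setlocale(LC_CTYPE, saved);
        free(saved);
    }
    return result;
}
```

Calling is_alpha_in_locale("C", 0xBF) returns 0, while is_alpha_in_locale("pl_PL.iso88592", 0xBF) returns -1 on a system where that locale is not installed, so the same helper also confirms whether the locale is available at all.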

I will try to compile and run my database on the CVS version of postgres and let you know the results.

Is it safe to run version 8.2 on top of 8.1.5 database files?

BTW, when is the official 8.2 release expected?

Thanks for your time and engagement,
Arek.

PS. BTW, I have found a minor inconsistency in the regis.c code (CVS version).
The function is declared bool but returns the integer 0; see the snippet below:

bool
RS_execute(Regis * r, char *str)
[...]
        if (len < r->nchar)
                return 0;


From: Teodor Sigaev <teodor(at)sigaev(dot)ru>
To: arkadiusz(dot)staron(at)dreamlab(dot)pl
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Tsearch + polish ispell + polish locale
Date: 2006-11-21 18:33:26
Message-ID: 45634676.7070004@sigaev.ru

> I solved my problem by writing my own tolower() function and substituting it in the tsearch2 code.
> On a database with the locale set to 'C' it works fine.
>
> As far as I could debug the problem, with locale = 'C' RS_compile() is fed only strings that do not contain Polish letters.
> With the locale set to 'pl_PL.iso88592', the strings passed to RS_compile() contain Polish letters.
> I do not know how, but in some strange, seemingly random cases isalpha() stops returning true for Polish letters, and that is when RS_compile() returns an error.
Hmm, very strange. Which OS do you use?
Please show the exact output of
# show lc_ctype;
# show lc_collate;
and your tsearch2 configuration.

>
> I will try to compile and run my database on the CVS version of postgres, and let you know the results.
ok

> Is it safe to use 8.2 version over 8.1.5 database files ?
No, that is impossible: the format of the database files changed significantly.

>
> BTW. When the official 8.2 release is expected ?

During 2006 :)

>
> Thanks for your time and engagement,
> Arek.
>
> PS. BTW, I have found a minor inconsistency in the regis.c code (CVS version).
> The function is declared bool but returns the integer 0; see the snippet below...
fixed
--
Teodor Sigaev E-mail: teodor(at)sigaev(dot)ru
WWW: http://www.sigaev.ru/


From: <arkadiusz(dot)staron(at)dreamlab(dot)pl>
To: <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Tsearch + polish ispell + polish locale
Date: 2006-11-22 09:03:46
Message-ID: EA6A3F5C1E4BC14D91D93A344436440C010D330F@MXMBON01.grupa.onet

Hi,

>> I do not know how, but in some strange, seemingly random cases isalpha()
>> stops returning true for Polish letters, and that is when RS_compile()
>> returns an error.
> Hmm, very strange. Which OS do you use?
> Please show the exact output of
> # show lc_ctype;
> # show lc_collate;
> and your tsearch2 configuration.

Linux 2.6.14.4-dl380

lc_ctype
----------------
pl_PL.iso88592

lc_collate
----------------
pl_PL.iso88592

Another interesting thing: although tolower() and isalpha() are broken, sorting of Polish letters works fine...

Tsearch2 is configured as follows:

INSERT INTO pg_ts_cfg (...) VALUES ('default_polish', 'default', 'pl_PL');

INSERT INTO pg_ts_cfgmap (...) VALUES( 'default_polish', 'url', '{simple}');
INSERT INTO pg_ts_cfgmap (...) VALUES( 'default_polish', 'host', '{simple}');
INSERT INTO pg_ts_cfgmap (...) VALUES( 'default_polish', 'sfloat', '{simple}');
INSERT INTO pg_ts_cfgmap (...) VALUES( 'default_polish', 'uri', '{simple}');
INSERT INTO pg_ts_cfgmap (...) VALUES( 'default_polish', 'int', '{simple}');
INSERT INTO pg_ts_cfgmap (...) VALUES( 'default_polish', 'float', '{simple}');
INSERT INTO pg_ts_cfgmap (...) VALUES( 'default_polish', 'email', '{simple}');
INSERT INTO pg_ts_cfgmap (...) VALUES( 'default_polish', 'word', '{simple}');
INSERT INTO pg_ts_cfgmap (...) VALUES( 'default_polish', 'hword', '{simple}');
INSERT INTO pg_ts_cfgmap (...) VALUES( 'default_polish', 'nlword', '{simple}');
INSERT INTO pg_ts_cfgmap (...) VALUES( 'default_polish', 'nlpart_hword', '{simple}');
INSERT INTO pg_ts_cfgmap (...) VALUES( 'default_polish', 'part_hword', '{simple}');
INSERT INTO pg_ts_cfgmap (...) VALUES( 'default_polish', 'nlhword', '{simple}');
INSERT INTO pg_ts_cfgmap (...) VALUES( 'default_polish', 'file', '{simple}');
INSERT INTO pg_ts_cfgmap (...) VALUES( 'default_polish', 'uint', '{simple}');
INSERT INTO pg_ts_cfgmap (...) VALUES( 'default_polish', 'version', '{simple}');
INSERT INTO pg_ts_cfgmap (...) VALUES( 'default_polish', 'lhword', '{pl_ispell,simple}');
INSERT INTO pg_ts_cfgmap (...) VALUES( 'default_polish', 'lpart_hword','{pl_ispell,simple}');
INSERT INTO pg_ts_cfgmap (...) VALUES( 'default_polish', 'lword', '{pl_ispell,simple}');

INSERT INTO pg_ts_dict
(SELECT 'pl_ispell',
dict_init,
'DictFile="/home/astaron/lib/ispell/polish.dic",'
'AffFile="/home/astaron/lib/ispell/polish.aff",'
'StopFile="/home/astaron/lib/ispell/polish.stop"',
dict_lexize
FROM pg_ts_dict
WHERE dict_name = 'ispell_template');

If there is anything I can do to help you debug
this issue (logs, tests, code changes...), please let me know.

For now I will run 8.2 and see whether the problem persists...

Best regards,
Arek.


From: Teodor Sigaev <teodor(at)sigaev(dot)ru>
To: arkadiusz(dot)staron(at)dreamlab(dot)pl
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Tsearch + polish ispell + polish locale
Date: 2006-11-22 14:35:43
Message-ID: 4564603F.5030907@sigaev.ru

> INSERT INTO pg_ts_cfg (...) VALUES ('default_polish', 'default', 'pl_PL');

If you set the locale to 'pl_PL.iso88592' instead of 'pl_PL', tsearch2 will be
able to find the configuration by itself.

> If there is anything I can do to help you debug
> this issue (logs, tests, code changes...), please let me know.
>
> For now I will run 8.2 and see whether the problem persists...
Do the lower()/upper() functions work correctly in postgres?

--
Teodor Sigaev E-mail: teodor(at)sigaev(dot)ru
WWW: http://www.sigaev.ru/


From: <arkadiusz(dot)staron(at)dreamlab(dot)pl>
To: <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Tsearch + polish ispell + polish locale
Date: 2006-11-22 16:00:36
Message-ID: EA6A3F5C1E4BC14D91D93A344436440C010D343F@MXMBON01.grupa.onet

> If you set the locale to 'pl_PL.iso88592' instead of 'pl_PL', tsearch2
> will be able to find the configuration by itself.

Good point... I forgot about this ;-)

> Do the lower()/upper() functions work correctly in postgres?

Until the regis error they work fine... then they break.
As a matter of fact, I wasn't able to determine what breaks them: postgres or tsearch...

Any idea how I can check that?
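
One possible way to see whether the process's character classification actually changed after the error, sketched here under assumptions: count_alpha_bytes() and report_ctype_state() are hypothetical helpers, and they would have to be called from code running inside the backend (for example, temporarily added to the tsearch2 sources) before and after the failing call. A different LC_CTYPE name or a drop in the count would suggest the active locale was switched; identical output would point away from a locale switch.

```c
#include <ctype.h>
#include <locale.h>
#include <stdio.h>

/*
 * Hypothetical diagnostic helper: counts how many of the 256 single-byte
 * values the C library currently classifies as alphabetic.
 * In the "C" locale this is exactly the 52 ASCII letters; a single-byte
 * locale with accented letters reports more.
 */
int count_alpha_bytes(void)
{
    int count = 0;

    for (int c = 0; c < 256; c++) {
        if (isalpha(c))
            count++;
    }
    return count;
}

/*
 * Hypothetical diagnostic helper: writes the active LC_CTYPE name and the
 * alphabetic-byte count to `out`, tagged with `label` so that "before" and
 * "after" lines can be compared.
 */
void report_ctype_state(FILE *out, const char *label)
{
    /* Passing NULL queries the current setting without changing it. */
    const char *name = setlocale(LC_CTYPE, NULL);

    fprintf(out, "%s: LC_CTYPE=%s, alphabetic bytes=%d\n",
            label, name != NULL ? name : "(unknown)", count_alpha_bytes());
}
```

For example, calling report_ctype_state(stderr, "before") and report_ctype_state(stderr, "after") around the failing call gives two lines that can be compared directly.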

Regards,
Arek.


From: Teodor Sigaev <teodor(at)sigaev(dot)ru>
To: arkadiusz(dot)staron(at)dreamlab(dot)pl
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Tsearch + polish ispell + polish locale
Date: 2006-11-22 16:11:52
Message-ID: 456476C8.8080206@sigaev.ru

>> Do the lower()/upper() functions work correctly in postgres?
>
> Until the regis error they work fine... then they break.
> As a matter of fact, I wasn't able to determine what breaks them: postgres or tsearch...
>
> Any idea how I can check that?

It looks to me like memory corruption somewhere.

Try compiling postgres (and tsearch2 too) with
CFLAGS=-O0 ./configure --enable-cassert --enable-debug
and repeat the tests.

If you are using a recent version of Linux libc (later than 5.4.23) or GNU
libc (2.x), it will be useful to set the MALLOC_CHECK_ environment variable to 2
before starting postgres (see man 3 malloc).

--
Teodor Sigaev E-mail: teodor(at)sigaev(dot)ru
WWW: http://www.sigaev.ru/


From: <arkadiusz(dot)staron(at)dreamlab(dot)pl>
To: <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Tsearch + polish ispell + polish locale
Date: 2006-12-01 13:07:15
Message-ID: EA6A3F5C1E4BC14D91D93A344436440C015D2A9B@MXMBON01.grupa.onet

FYI,

The problem does NOT exist in 8.2beta3.
I think it can be assumed that this was some locale-related issue...

Thanks for your help,
Arek.
