Re: ispell file format

Lists: pgsql-hackers
From: "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>
To: "PostgreSQL-development" <pgsql-hackers(at)postgresql(dot)org>
Subject: ispell file format
Date: 2007-08-22 18:09:07
Message-ID: 46CC7BC3.9070707@enterprisedb.com

Is the file format for the ispell dictionary documented somewhere?
There's apparently support for an old and a new format, but I can't
figure out what the formats are.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
To: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: ispell file format
Date: 2007-08-22 18:41:51
Message-ID: Pine.LNX.4.64.0708222240530.2727@sn.sai.msu.ru

On Wed, 22 Aug 2007, Heikki Linnakangas wrote:

> Is the file format for the ispell dictionary documented somewhere?
> There's apparently support for an old and a new format, but I can't
> figure out what the formats are.

ispell, myspell and hunspell formats are supported automagically.
They are available from openoffice.org

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg(at)sai(dot)msu(dot)su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83


From: "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>
To: "Oleg Bartunov" <oleg(at)sai(dot)msu(dot)su>
Cc: "PostgreSQL-development" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: ispell file format
Date: 2007-08-23 09:38:16
Message-ID: 46CD5588.5080404@enterprisedb.com

Oleg Bartunov wrote:
> On Wed, 22 Aug 2007, Heikki Linnakangas wrote:
>
>> Is the file format for the ispell dictionary documented somewhere?
>> There's apparently support for an old and a new format, but I can't
>> figure out what the formats are.
>
> ispell, myspell and hunspell formats are supported automagically.
> They are available from openoffice.org

I downloaded a Finnish ispell dictionary and affix file (the small version)
from ispell-fi.sourceforge.net and converted it to UTF-8 with iconv, but
the affix file is not accepted:

ERROR: syntax error at line 83 of affix file
"/home/hlinnaka/pgsql.cvshead/share/tsearch_data/finnish.affix"

Here's a snippet of the affix file around that line:

> prefixes
>
> flag *A:
> . > ALI # alivaltiosihteeri, alihankkija # line 83
> I > ALI\-

ispell works just fine with it.
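
For anyone reproducing this without iconv, the conversion step can be
sketched in Python. This is only an illustration: the source encoding is
assumed to be ISO-8859-1 (the thread does not state what the downloaded
files use), and latin1_to_utf8 is a hypothetical helper name.

```python
def latin1_to_utf8(data: bytes) -> bytes:
    """Re-encode bytes from ISO-8859-1 (assumed source encoding) to UTF-8."""
    return data.decode("iso-8859-1").encode("utf-8")


# A small Finnish word with non-ASCII letters, encoded as ISO-8859-1.
sample = "ääni".encode("iso-8859-1")
converted = latin1_to_utf8(sample)

# Each 'ä' takes one byte in ISO-8859-1 and two bytes in UTF-8.
print(len(sample), len(converted))
```

For a real file, the same function would be applied to the bytes read from
the affix and dictionary files before writing them back out.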

I found a man page describing the ispell file format with Google:
http://www.delorie.com/gnu/docs/ispell/ispell.4.html. Is this the same
file format tsearch accepts? It looks like the grammar we accept is only
a small subset of the ispell grammar; there are statements like
"boundarychars" and "stringchar" that we apparently don't support.

Now is not a good time to start rewriting that, but at least we need to
know what exactly we support and what not. In the long run, it might be
cleaner to use yacc for the parser.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Teodor Sigaev <teodor(at)sigaev(dot)ru>
To: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: ispell file format
Date: 2007-08-23 13:08:54
Message-ID: 46CD86E6.9070809@sigaev.ru

> Here's a snippet of the affix file around that line:
>
>> prefixes
>>
>> flag *A:
>> . > ALI # alivaltiosihteeri, alihankkija # line 83
>> I > ALI\-
Just remove the rules with \-; tsearch allows only alphabetic characters here.
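
A rough way to do that removal mechanically is sketched below in Python.
It is a minimal sketch, not a full affix parser: it assumes that '#'
always starts a comment and that only the escaped hyphen (\-) needs to be
filtered. The helper names are hypothetical.

```python
def strip_comment(line: str) -> str:
    """Return the part of an affix line before any '#' comment (assumed unescaped)."""
    return line.split("#", 1)[0]


def drop_escaped_hyphen_rules(lines):
    """Keep every line except those whose rule text contains an escaped hyphen."""
    return [line for line in lines if "\\-" not in strip_comment(line)]


# The snippet from the Finnish affix file quoted above.
sample = [
    "prefixes",
    "",
    "flag *A:",
    "    .  >  ALI  # alivaltiosihteeri, alihankkija",
    "    I  >  ALI\\-",
]

cleaned = drop_escaped_hyphen_rules(sample)
print("\n".join(cleaned))
```

The "I > ALI\-" rule is dropped and the remaining lines are left unchanged.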

>
> ispell works just fine with it.
>
> I found a man page describing the ispell file format with Google:
> http://www.delorie.com/gnu/docs/ispell/ispell.4.html. Is this the same
> file format tsearch accepts? It looks like the grammar we accept is only
> a small subset of the ispell grammar; there are statements like
> "boundarychars" and "stringchar" that we apparently don't support.

Yes, those options are useless for a dictionary:
- string characters are already checked by postgres itself (by the recode or verify functions)
- the parser already splits words, and the default parser treats '-' as a word separator

Hmm, I found another problem here. After removing those rules everything works
fine with the fi_FI.ISO8859-1 locale, but not with fi_FI.UTF-8. I'll dig into it
tomorrow.

--
Teodor Sigaev E-mail: teodor(at)sigaev(dot)ru
WWW: http://www.sigaev.ru/