Snowball and ispell in tsearch2

Lists: pgsql-hackers
From: Teodor Sigaev <teodor(at)sigaev(dot)ru>
To: Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Snowball and ispell in tsearch2
Date: 2006-06-07 17:06:43
Message-ID: 448707A3.1020209@sigaev.ru

We have received a lot of requests to include stemmers and ispell dictionaries for
all available languages in tsearch2. I understand that this would bring tsearch2
closer to the end user. But the sources of the Snowball stemmers are about 800 kB,
and each ispell dictionary takes about 0.5-2 MB. All sizes are given with
compression. I am afraid that is too big...

What are opinions?

--
Teodor Sigaev E-mail: teodor(at)sigaev(dot)ru
WWW: http://www.sigaev.ru/


From: Markus Schiltknecht <markus(at)bluegap(dot)ch>
To: Teodor Sigaev <teodor(at)sigaev(dot)ru>
Subject: Re: Snowball and ispell in tsearch2
Date: 2006-06-07 17:29:56
Message-ID: 44870D14.5030401@bluegap.ch

Hello Teodor,

I've just recently implemented an advanced full-text search function on
top of tsearch2. Searching through the manuals and websites to get the
Snowball stemmer and compile my own module took me way too long. I'd
rather go fetch a cup of coffee during a 30-minute download...

That said, I don't necessarily mean that all stemmers must be included
in CVS or such. It should just be simpler for the database administrator
to install ispell or stemmer 'modules'. A non-plus-ultra solution would
be to provide packages for each language (in Debian, Fedora, etc.).

Perhaps we can put together the source code for all available language
modules and provide scripts to fetch ispell data or to generate the
Snowball stemmers. A Debian package maintainer would have to fetch all
the data to generate all language packages. Someone else might just want
to download and compile a Norwegian Snowball stemmer.

I'd be willing to help with such a project. I have experience with
tsearch2 as well as with Gentoo and Debian packaging. I can't help with
RPM, though.

Regards

Markus



From: Teodor Sigaev <teodor(at)sigaev(dot)ru>
To: Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Snowball and ispell in tsearch2
Date: 2006-06-07 18:39:15
Message-ID: 44871D53.1060508@sigaev.ru

> 800kb, each ispell dictionaries will takes about 0.5-2M. All sizes are
Sorry, withOUT compression...

--
Teodor Sigaev E-mail: teodor(at)sigaev(dot)ru
WWW: http://www.sigaev.ru/


From: "John Jawed" <johnjawed(at)gmail(dot)com>
To: "Markus Schiltknecht" <markus(at)bluegap(dot)ch>
Cc: "Teodor Sigaev" <teodor(at)sigaev(dot)ru>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Snowball and ispell in tsearch2
Date: 2006-06-07 18:51:13
Message-ID: a9eb35850606071151l3913dfc7wc28ac81d1c227dfb@mail.gmail.com

OpenFTS ebuild: http://bugs.gentoo.org/show_bug.cgi?id=135859

It has a USE flag for the snowball stemmer. I can take care of
packaging for Gentoo if it will free up time for you to work on other
distros.

John

PS: upstream package size isn't, and shouldn't be, an issue; it should
be left to the packaging systems to discretely fetch what is needed.



From: Christopher Kings-Lynne <chris(dot)kings-lynne(at)calorieking(dot)com>
To: Teodor Sigaev <teodor(at)sigaev(dot)ru>
Cc: Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Snowball and ispell in tsearch2
Date: 2006-06-08 01:38:19
Message-ID: 44877F8B.30808@calorieking.com

> We have received a lot of requests to include stemmers and ispell
> dictionaries for all available languages in tsearch2. I understand that
> this would bring tsearch2 closer to the end user. But the sources of the
> Snowball stemmers are about 800 kB, and each ispell dictionary takes about
> 0.5-2 MB. All sizes are given with compression. I am afraid that is too big...
>
> What are opinions?

Maybe putting it on pgFoundry?


From: Christopher Kings-Lynne <chris(dot)kings-lynne(at)calorieking(dot)com>
To: Markus Schiltknecht <markus(at)bluegap(dot)ch>
Cc: Teodor Sigaev <teodor(at)sigaev(dot)ru>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Snowball and ispell in tsearch2
Date: 2006-06-08 01:39:10
Message-ID: 44877FBE.2000503@calorieking.com

> Perhaps we can put together the source code for all available language
> modules and provide scripts to fetch ispell data or to generate the
> Snowball stemmers. A Debian package maintainer would have to fetch all
> the data to generate all language packages. Someone else might just want
> to download and compile a Norwegian Snowball stemmer.
>
> I'd be willing to help with such a project. I have experience with
> tsearch2 as well as with Gentoo and Debian packaging. I can't help with
> RPM, though.

I could help with a FreeBSD package I suppose.


From: Christopher Kings-Lynne <chris(dot)kings-lynne(at)calorieking(dot)com>
To: Christopher Kings-Lynne <chris(dot)kings-lynne(at)calorieking(dot)com>
Cc: Markus Schiltknecht <markus(at)bluegap(dot)ch>, Teodor Sigaev <teodor(at)sigaev(dot)ru>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Snowball and ispell in tsearch2
Date: 2006-06-08 01:47:07
Message-ID: 4487819B.7010705@calorieking.com

>> I'd be willing to help with such a project. I have experience with
>> tsearch2 as well as with Gentoo and Debian packaging. I can't help
>> with RPM, though.
>
> I could help with a FreeBSD package I suppose.

Although I should probably finish up those damn GIN docs first :)


From: Teodor Sigaev <teodor(at)sigaev(dot)ru>
To: Christopher Kings-Lynne <chris(dot)kings-lynne(at)calorieking(dot)com>
Cc: Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Snowball and ispell in tsearch2
Date: 2006-06-08 16:25:27
Message-ID: 44884F77.3090508@sigaev.ru

> Maybe putting it on pgFoundry?

Hmm, that's an option. We can create a project 'tsearch2_dict', and there I'll
place a contrib module that builds all the Snowball stemmers. Right now I'm working
on supporting OpenOffice's dictionaries in tsearch2, so it will be simple to add it
to the packaging system.

I suggest that somebody manage the packages or package builders for the different
packaging systems in the same CVS (sorry, I don't have any experience with those
systems).

BTW, it would be good if the packaging worked with an already built ("maked")
postgres tree, something like:
% cd PGSQL/contrib/tsearch2
% make LANG=norwegian
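
A minimal sketch of how a packager might wrap such a per-language build. It
only illustrates the idea: the build_lang helper is hypothetical, and it
assumes that PGSQL points at an already built source tree and that the contrib
Makefile honours a LANG=<language> variable, which is only a proposal here.
The helper prints the command instead of running it.

```shell
#!/bin/sh
# Hypothetical helper, not part of tsearch2: print the build command for
# one language. Assumes the contrib Makefile accepts LANG=<language>.
build_lang() {
    lang=$1
    # Accept only non-empty names made of lowercase letters.
    case $lang in
        ''|*[!a-z]*)
            echo "invalid language: $lang" >&2
            return 1
            ;;
    esac
    # PGSQL must point at the already built source tree.
    printf 'cd %s/contrib/tsearch2 && make LANG=%s\n' \
        "${PGSQL:?set PGSQL to the built source tree}" "$lang"
}
```

A packaging script could pipe the printed line to sh once per language it
wants to ship, so each language becomes its own package.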

--
Teodor Sigaev E-mail: teodor(at)sigaev(dot)ru
WWW: http://www.sigaev.ru/


From: Teodor Sigaev <teodor(at)sigaev(dot)ru>
To: Teodor Sigaev <teodor(at)sigaev(dot)ru>
Cc: Christopher Kings-Lynne <chris(dot)kings-lynne(at)calorieking(dot)com>, Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Snowball and ispell in tsearch2
Date: 2006-06-09 14:27:41
Message-ID: 4489855D.4030600@sigaev.ru

> I'll place a contrib module that builds all the Snowball stemmers. Right
> now I'm working on supporting OpenOffice's dictionaries in tsearch2, so
> it will be simple to add it to the packaging system.

done, http://archives.postgresql.org/pgsql-committers/2006-06/msg00112.php

--
Teodor Sigaev E-mail: teodor(at)sigaev(dot)ru
WWW: http://www.sigaev.ru/