does ispell have allaffixes set to on?

From: Brian <brian(at)photoresearchers(dot)com>
To: pgsql-general(at)postgresql(dot)org
Subject: does ispell have allaffixes set to on?
Date: 2010-01-18 07:45:21
Message-ID: 4B541191.4090308@photoresearchers.com
Lists: pgsql-general

I was testing the ispell text search dictionary, and it appears to behave as
if the ispell option "allaffixes" were set to "on". This was not the case for
the original tsearch2 contrib module, nor is it for the ispell program itself,
which defaults to "off".

For example, suppose I create a simple DictFile with a single entry for the
word "brand" (brand/DGRS) and a simple English AffFile that defines the standard
ispell suffixes (*D > ED, *G > ING, *R > ER and *S > S) along with the standard
ispell prefixes (*A: . > RE, *I: . > IN, *U: . > UN). The ispell dictionary
then returns a lexeme for any input token containing a suffix and one of those
prefixes, even though NONE of the prefixes is listed in the dictionary file as
active for that word.
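
A minimal sketch of such a pair of files, assuming they are installed as
test.dict and test.affix under $SHAREDIR/tsearch_data (which is where
DictFile = test and AffFile = test are looked up). The exact contents are an
assumption reconstructed from the description above, written in the classic
ispell affix syntax, and the trailing comments are illustrative only.

test.dict:

```text
brand/DGRS
```

test.affix:

```text
prefixes
flag *A:
    .   >   RE      # not enabled for "brand"
flag *I:
    .   >   IN      # not enabled for "brand"
flag *U:
    .   >   UN      # not enabled for "brand"

suffixes
flag *D:
    .   >   ED      # brand > branded
flag *G:
    .   >   ING     # brand > branding
flag *R:
    .   >   ER      # brand > brander
flag *S:
    .   >   S       # brand > brands
```

Only the suffix flags D, G, R and S appear after the slash in the dictionary
entry, so none of the prefix flags A, I or U is enabled for "brand".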

The following is observed and expected:

mydb=> CREATE TEXT SEARCH DICTIONARY test_ispell (
           TEMPLATE = ispell,
           DictFile = test,
           AffFile = test,
           StopWords = english );

mydb=> SELECT
           ts_lexize('test_ispell', 'branding') AS sfx_yes,
           ts_lexize('test_ispell', 'brandest') AS sfx_no,
           ts_lexize('test_ispell', 'notindict') AS dict_no,
           ts_lexize('test_ispell', 'rebrand') AS pfx_no;
 sfx_yes | sfx_no | dict_no | pfx_no
---------+--------+---------+--------
 {brand} |        |         |
(1 row)

However, the following results are NOT expected:

mydb=> SELECT
           ts_lexize('test_ispell', 'unbranded') AS sfx_wpfx1,
           ts_lexize('test_ispell', 'rebranding') AS sfx_wpfx2;
 sfx_wpfx1 | sfx_wpfx2
-----------+-----------
 {brand}   | {brand}
(1 row)

In that second statement I expected NULL values, indicating that the tokens are
unknown, rather than lexemes indicating a match. Is this intentional behavior or
a bug, and is there any way to control it? I'd like to know before I try to
patch this in the code.

It gets even screwier if you add "rebrand" to the dictionary (e.g. rebrand/DGS).
Then ts_lexize('test_ispell', 'rebranding') returns an array of both lexemes
"{rebrand,brand}", when only the first is anticipated and wanted.
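
A sketch of how this last case can be reproduced, assuming the test_ispell
dictionary created earlier and a test.dict that has been edited by hand to add
the rebrand/DGS line. A session normally reads a dictionary's files only once,
so the ALTER statement here is a dummy update (it changes no parameter value)
issued to make the session pick up the edited file:

```sql
-- Assumes the line "rebrand/DGS" has just been added to test.dict.
-- Dummy update: same DictFile value, issued only to force a reload.
ALTER TEXT SEARCH DICTIONARY test_ispell (DictFile = test);

-- Token that matches both the new "rebrand" entry (suffix only)
-- and the "brand" entry (prefix RE plus suffix ING).
SELECT ts_lexize('test_ispell', 'rebranding') AS both_lexemes;
```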

Thanks,

Brian Carp
