a tsearch2 (8.2.4) dictionary that only filters out stopwords

From: Jan Urbański <j(dot)urbanski(at)students(dot)mimuw(dot)edu(dot)pl>
To: pgsql-patches(at)postgresql(dot)org
Subject: a tsearch2 (8.2.4) dictionary that only filters out stopwords
Date: 2007-11-09 01:22:34
Message-ID: 4733B65A.9030707@students.mimuw.edu.pl
Lists: pgsql-hackers pgsql-patches

Hi,

the rationale for this patch is rather complicated, as it's related to
the peculiarities of Polish grammar. Please read on.

I'm using PostgreSQL 8.2.4 and the ispell tsearch2 dictionary. The
problem is as follows. In Polish (and possibly other languages that
don't come to my mind at the moment) a noun can take different forms
depending on the grammatical context. This is called declension. For
example, the noun 'oda' (which means 'ode' in English) can take the form
'od' in certain cases. However, the Polish word 'od' is also a
preposition. The problem with the ispell dictionary is that it first
reduces a lexeme to its stem and only then checks whether the stem is a
stopword.

This means that I either have to accept that the tsvectors for my
documents will contain large numbers of the noun 'oda' (because each
time the preposition 'od' is used in the text it will be stemmed to
produce 'oda' and then indexed), or I have to include the word 'oda' in
the stopwords file and thus eliminate a perfectly good noun from my
tsvectors.
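
To make the trade-off concrete, here is a toy model in Python. It is
not tsearch2 code: the STEMS table and the ispell_like function are
made up for illustration, and they only mimic the "stem first, check
stopwords second" order described above.

```python
# Toy model only; NOT tsearch2 code. STEMS and ispell_like are hypothetical.
STEMS = {"od": "oda", "oda": "oda"}

def ispell_like(word, stopwords):
    """Stem first, then check the *stem* against the stopword list."""
    stem = STEMS.get(word)
    if stem is None:
        return None      # word not recognized
    if stem in stopwords:
        return []        # stopword: discarded
    return [stem]        # indexed under its stem

# Listing 'od' as a stopword does not help, because only the stem is checked.
print(ispell_like("od", stopwords={"od"}))    # ['oda']
# Listing 'oda' removes the preposition, but also the genuine noun.
print(ispell_like("od", stopwords={"oda"}))   # []
print(ispell_like("oda", stopwords={"oda"}))  # []
```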

The solution I came up with was simple: write a dictionary that does
only one thing: it looks up the lexeme in a stopwords file and either
discards it (if it is a stopword) or returns NULL (so that it is passed
on to the next dictionary). That way I could use it as the first
dictionary in the dictionary stack for the lexeme types I'm interested
in and it would discard every instance of 'od', while passing every
non-stopword (in particular 'oda') to the ispell dictionary.
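
The idea can be sketched as a toy model in Python. This is not the
code from the patch and not the tsearch2 API: stop_sieve, ispell_like,
lexize and the two lookup tables are hypothetical names. None stands
in for NULL ("not recognized, ask the next dictionary") and an empty
list stands in for "recognized as a stopword, discard".

```python
# Toy model of a dictionary stack; NOT tsearch2 code. All names are hypothetical.
STOPWORDS = {"od"}
STEMS = {"od": "oda", "oda": "oda"}

def stop_sieve(word):
    """Discard stopwords as-is; return None for everything else."""
    return [] if word in STOPWORDS else None

def ispell_like(word):
    """Stand-in for the ispell dictionary: reduce a known word to its stem."""
    stem = STEMS.get(word)
    return [stem] if stem is not None else None

def lexize(word, stack):
    """Ask each dictionary in turn; the first non-None answer wins."""
    for dictionary in stack:
        result = dictionary(word)
        if result is not None:
            return result
    return None  # no dictionary recognized the word

stack = [stop_sieve, ispell_like]
print(lexize("od", stack))   # [] : the preposition never reaches the stemmer
print(lexize("oda", stack))  # ['oda'] : the noun is still indexed
```

Because the sieve compares the unstemmed lexeme, 'od' and 'oda' can be
told apart before stemming merges them.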

The attached patch adds a dictionary called stop to the set of standard
dictionaries that one gets after installing tsearch2. The C code may not
be first-class (however, it works for me in a real business solution) -
it's quite trivial and I'd be happy if some more experienced Postgres
hackers would implement the idea in a cleaner/safer way. It's been
tested on 8.2.4 and compiles on 8.2.5. I haven't even looked at the code
for 8.3 yet, but maybe the change could somehow make its way into the
integrated full text search?

Regards,
Jan Urbanski
Warsaw University
http://fiok.pl/

--
Jan Urbanski
GPG key ID: E583D7D2

ouden estin

Attachment: tsearch-stopsieve.patch (text/plain, 3.1 KB)
