Re: Tsearch vs Snowball, or what's a source file?

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Teodor Sigaev <teodor(at)sigaev(dot)ru>, Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Tsearch vs Snowball, or what's a source file?
Date: 2007-06-15 02:47:55
Message-ID: 23992.1181875675@sss.pgh.pa.us
Lists: pgsql-hackers

I wrote:
> Teodor Sigaev <teodor(at)sigaev(dot)ru> writes:
>> 2 Snowball's compiling infrastructure doesn't support Windows target.

> Yeah. Another problem with using their original source code is that
> running the Snowball compiler during build would not work for
> cross-compiled builds of Postgres, at least not without solving the
> problem of building some code for the host platform instead of the
> target.

> So what I'm thinking now is we should import libstemmer instead of the
> snowball_code representation. I haven't gotten as far as thinking about
> exactly how to lay out the files though.

I've done some more work on this point. After looking at the Snowball
code in more detail, I'm thinking it'd be a good idea to keep it at
arm's length in a loadable shared library, instead of incorporating it
directly into the backend. This is because they don't see anything
wrong with exporting random global function names like "eq_v" and
"skip_utf8"; so the probability of name collisions is a bit too high for
my taste. The current tsearch_core patch envisions having a couple of
the snowball stemmers in the core backend and the rest in a loadable
library, but I suggest we just put them all in a loadable library, with
the only entry points being the snowball_init() and snowball_lexize()
tsearch dictionary support functions. (I am thinking of having just one
such function pair, with the init function taking an init option to
select which stemmer to use, instead of a separate Postgres function
pair per stemmer.)
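
To make the single-function-pair idea concrete, here is a minimal sketch in plain C. It is not the attached patch and not the Snowball or tsearch API: the names snowball_init_sketch, snowball_lexize_sketch, snowball_dict and the toy stemmers are all hypothetical stand-ins. It only shows the shape of the design: the per-language routines are static, so they export no global names, and the init function picks one by an option string.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stemmer signature; the real Snowball interface differs. */
typedef void (*stem_fn)(const char *word, char *out, size_t outlen);

/* Copy word into out, always NUL-terminating (truncates if too long). */
static void copy_word(const char *word, char *out, size_t outlen)
{
    if (outlen == 0)
        return;
    strncpy(out, word, outlen - 1);
    out[outlen - 1] = '\0';
}

/* Toy stand-in for an English stemmer: strips one trailing 's'. */
static void stem_english_toy(const char *word, char *out, size_t outlen)
{
    size_t len;

    copy_word(word, out, outlen);
    len = strlen(out);
    if (len > 1 && out[len - 1] == 's')
        out[len - 1] = '\0';
}

/* Toy stand-in for any other language: returns the word unchanged. */
static void stem_identity_toy(const char *word, char *out, size_t outlen)
{
    copy_word(word, out, outlen);
}

/* Lookup table mapping the init option to a stemmer routine. */
typedef struct
{
    const char *name;
    stem_fn     stem;
} stemmer_entry;

static const stemmer_entry stemmers[] = {
    {"english", stem_english_toy},
    {"russian", stem_identity_toy},
};

/* Per-dictionary state, filled in by the init function. */
typedef struct
{
    stem_fn stem;
} snowball_dict;

/* Select a stemmer by name; returns 0 on success, -1 if unknown. */
int snowball_init_sketch(snowball_dict *d, const char *language)
{
    size_t i;

    assert(d != NULL && language != NULL);
    for (i = 0; i < sizeof(stemmers) / sizeof(stemmers[0]); i++)
    {
        if (strcmp(stemmers[i].name, language) == 0)
        {
            d->stem = stemmers[i].stem;
            return 0;
        }
    }
    d->stem = NULL;
    return -1;
}

/* Stem one word with whichever stemmer init selected. */
void snowball_lexize_sketch(const snowball_dict *d, const char *word,
                            char *out, size_t outlen)
{
    assert(d != NULL && d->stem != NULL);
    d->stem(word, out, outlen);
}
```

Only the two non-static functions would be visible outside the module; adding a language means adding one table row rather than another Postgres function pair.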

Attached is a rough proof-of-concept patch for this. It doesn't do
anything useful, but it does prove that we can compile and link the
Snowball stemmers into a Postgres loadable module with only trivial
changes to their source code. The code compiles cleanly (zero warnings
in gcc). The file layout is

    src/backend/snowball/Makefile               our files
    src/backend/snowball/README
    src/backend/snowball/dict_snowball.c
    src/backend/snowball/libstemmer/*.c         their .c files

    src/include/snowball/header.h               intercepting .h file
    src/include/snowball/libstemmer/*.h         their .h files

If there are no objections, I'll push forward with completing the
dictionary support functions to go with this infrastructure.

regards, tom lane

Attachment: snowball-add.tar.gz (application/octet-stream, 2.9 KB)
