Re: tsearch Parser Hacking

Lists: pgsql-hackers
From: "David E(dot) Wheeler" <david(at)kineticode(dot)com>
To: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: tsearch Parser Hacking
Date: 2011-02-14 22:45:07
Message-ID: C62AC9C5-9968-46ED-B952-B026F3865A79@kineticode.com

Hackers,

Is it possible to modify the default tsearch parser so that / doesn't get lexed as a "file" token? That is, instead of this:

try=# select * from ts_debug('simple'::regconfig, 'w/d');
 alias │    description    │ token │ dictionaries │ dictionary │ lexemes
───────┼───────────────────┼───────┼──────────────┼────────────┼─────────
 file  │ File or path name │ w/d   │ {simple}     │ simple     │ {w/d}

Ideally it'd think that / was the same as -:

try=# select * from ts_debug('simple'::regconfig, 'w-d');
      alias      │           description           │ token │ dictionaries │ dictionary │ lexemes
─────────────────┼─────────────────────────────────┼───────┼──────────────┼────────────┼─────────
 asciihword      │ Hyphenated word, all ASCII      │ w-d   │ {simple}     │ simple     │ {w-d}
 hword_asciipart │ Hyphenated word part, all ASCII │ w     │ {simple}     │ simple     │ {w}
 blank           │ Space symbols                   │ -     │ {}           │ [null]     │ [null]
 hword_asciipart │ Hyphenated word part, all ASCII │ d     │ {simple}     │ simple     │ {d}
(4 rows)

Possible? Or would I have to write a completely new parser just to change this bit?

Thanks,

David


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "David E(dot) Wheeler" <david(at)kineticode(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: tsearch Parser Hacking
Date: 2011-02-14 23:57:06
Message-ID: 23485.1297727826@sss.pgh.pa.us

"David E. Wheeler" <david(at)kineticode(dot)com> writes:
> Is it possible to modify the default tsearch parser so that / doesn't get lexed as a "file" token?

There is zero, none, nada, provision for modifying the behavior of the
default parser, other than by changing its compiled-in state transition
tables.

It doesn't help any that said tables are baroquely designed and utterly
undocumented.

IMO, sooner or later we need to trash that code and replace it with
something a bit more modification-friendly.

regards, tom lane
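
For readers unfamiliar with table-driven lexers, here is a toy Python sketch of what "changing the state transition tables" means in principle. It is not PostgreSQL's parser: the states, character classes, and function names below are all invented for illustration, and the real tables are far more involved. It only shows that whether w/d comes out as one "file" token or as separate words is decided by entries in a table rather than by any run-time setting.

```python
# Toy table-driven lexer. NOT PostgreSQL's actual tables; every state name,
# character class, and function here is a made-up illustration.

def char_class(ch):
    """Map a character to a coarse class used as a table key."""
    if ch.isalnum():
        return "alnum"
    if ch == "/":
        return "slash"
    return "other"

# (current state, character class) -> next state.
# A missing entry means "the current token ends here".
TRANSITIONS = {
    ("start", "alnum"): "word",
    ("word", "alnum"): "word",
    ("word", "slash"): "file",   # this entry is what glues "w/d" together
    ("file", "alnum"): "file",
    ("file", "slash"): "file",
}

def lex(text, transitions=TRANSITIONS):
    """Split text into (token type, token) pairs by walking the table."""
    tokens = []
    i = 0
    while i < len(text):
        state, j = "start", i
        while j < len(text):
            nxt = transitions.get((state, char_class(text[j])))
            if nxt is None:
                break
            state, j = nxt, j + 1
        if j == i:
            # No transition out of "start": emit a one-character blank.
            tokens.append(("blank", text[i]))
            i += 1
        else:
            tokens.append((state, text[i:j]))
            i = j
    return tokens

# Dropping the slash transitions makes "/" behave like a separator.
NO_SLASH = {k: v for k, v in TRANSITIONS.items() if k[1] != "slash"}
```

With the full table, lex("w/d") yields a single "file" token; with NO_SLASH it yields a word, a blank, and another word. In the toy the table is a parameter, whereas in the default parser the equivalent tables are compiled in.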


From: Thom Brown <thom(at)linux(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "David E(dot) Wheeler" <david(at)kineticode(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: tsearch Parser Hacking
Date: 2011-02-15 00:02:02
Message-ID: AANLkTi=YsMnssRQUNodKQ59xudzV+S0Dr3PkgoSwOaLw@mail.gmail.com

On 14 February 2011 23:57, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> "David E. Wheeler" <david(at)kineticode(dot)com> writes:
>> Is it possible to modify the default tsearch parser so that / doesn't get lexed as a "file" token?
>
> There is zero, none, nada, provision for modifying the behavior of the
> default parser, other than by changing its compiled-in state transition
> tables.
>
> It doesn't help any that said tables are baroquely designed and utterly
> undocumented.

This is very true. I intended to look into adding new tokens, but gave
up when I couldn't see how those transition tables worked.

> IMO, sooner or later we need to trash that code and replace it with
> something a bit more modification-friendly.

+1 for annihilating the existing code at some point.

--
Thom Brown
Twitter: @darkixion
IRC (freenode): dark_ixion
Registered Linux user: #516935


From: David Blewett <david(at)dawninglight(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "David E(dot) Wheeler" <david(at)kineticode(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: tsearch Parser Hacking
Date: 2011-02-15 02:40:41
Message-ID: AANLkTin6cLgPHT4VxJiyoPq0q86ygbm6_1tkTVAqjB16@mail.gmail.com

On Mon, Feb 14, 2011 at 6:57 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> "David E. Wheeler" <david(at)kineticode(dot)com> writes:
>> Is it possible to modify the default tsearch parser so that / doesn't get lexed as a "file" token?
>
> There is zero, none, nada, provision for modifying the behavior of the
> default parser, other than by changing its compiled-in state transition
> tables.
>
> It doesn't help any that said tables are baroquely designed and utterly
> undocumented.
>
> IMO, sooner or later we need to trash that code and replace it with
> something a bit more modification-friendly.

I added this to the TODO as something that can be tackled in the
future. I've been wishing it would be possible to add other tokens as
well (Python dotted path 'foo.bar.baz', Perl namespace path
'Foo::Bar', more flexible version number parsing, etc).

David Blewett


From: "David E(dot) Wheeler" <david(at)kineticode(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: tsearch Parser Hacking
Date: 2011-02-15 04:03:07
Message-ID: 56FF6D1F-9954-4D32-B403-DC98769831FF@kineticode.com

On Feb 14, 2011, at 3:57 PM, Tom Lane wrote:

> There is zero, none, nada, provision for modifying the behavior of the
> default parser, other than by changing its compiled-in state transition
> tables.
>
> It doesn't help any that said tables are baroquely designed and utterly
> undocumented.
>
> IMO, sooner or later we need to trash that code and replace it with
> something a bit more modification-friendly.

I was afraid you'd say that. Thanks.

David


From: Sushant Sinha <sushant354(at)gmail(dot)com>
To: "David E(dot) Wheeler" <david(at)kineticode(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: tsearch Parser Hacking
Date: 2011-02-15 05:58:52
Message-ID: AANLkTikaWz7u74MVGLvsHtz4JDGOG91heFN7NEzTtTuO@mail.gmail.com

I agree that it will be a good idea to rewrite the entire thing. However, in
the meantime, I sent a proposal earlier:

http://archives.postgresql.org/pgsql-hackers/2010-08/msg00019.php

And a patch later:

http://archives.postgresql.org/pgsql-hackers/2010-09/msg00476.php

Tom asked me to look into Compound Word support but I found it not usable.
Here was my response:
http://archives.postgresql.org/pgsql-hackers/2011-01/msg00419.php

I have not received any response since then.

-Sushant.


From: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
To: "David E(dot) Wheeler" <david(at)kineticode(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: tsearch Parser Hacking
Date: 2011-02-15 07:37:53
Message-ID: Pine.LNX.4.64.1102151034050.278@sn.sai.msu.ru

David,

it's not easy to hack tsearch parser, sorry. You can preparse your input
before to_tsquery,to_tsvector.

Oleg

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg(at)sai(dot)msu(dot)su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83


From: "David E(dot) Wheeler" <david(at)kineticode(dot)com>
To: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: tsearch Parser Hacking
Date: 2011-02-15 07:42:20
Message-ID: 9DBC274C-5B13-4F8E-A3DD-DF94D0A87E05@kineticode.com

On Feb 14, 2011, at 11:37 PM, Oleg Bartunov wrote:

> it's not easy to hack tsearch parser, sorry. You can preparse your input
> before to_tsquery,to_tsvector.

Yeah, I was thinking about s{/}{-}g before passing the values in. Might be the only way to do it for now…

Thanks,

David
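
A minimal sketch of that preprocessing step, written in Python rather than Perl. The helper name preparse is hypothetical, and the query string only shows where the rewritten text would be passed as a parameter; it is not executed here.

```python
def preparse(text: str) -> str:
    """Rewrite "/" to "-", the equivalent of Perl's s{/}{-}g."""
    return text.replace("/", "-")

# The rewritten text is then sent as an ordinary bound parameter.
# "%s" is the placeholder style of some Python drivers; that choice is an
# assumption, as is the use of the 'simple' configuration from the examples.
QUERY = "SELECT to_tsvector('simple', %s)"
params = (preparse("w/d"),)
```

The same rewrite would have to be applied to search terms before to_tsquery, otherwise documents and queries end up tokenized differently.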


From: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "David E(dot) Wheeler" <david(at)kineticode(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: tsearch Parser Hacking
Date: 2011-02-15 07:44:35
Message-ID: Pine.LNX.4.64.1102151038010.278@sn.sai.msu.ru

On Mon, 14 Feb 2011, Tom Lane wrote:

> "David E. Wheeler" <david(at)kineticode(dot)com> writes:
>> Is it possible to modify the default tsearch parser so that / doesn't get lexed as a "file" token?
>
> There is zero, none, nada, provision for modifying the behavior of the
> default parser, other than by changing its compiled-in state transition
> tables.
>
> It doesn't help any that said tables are baroquely designed and utterly
> undocumented.

what do you mean 'baroquely' ? Do you know 'gothic' design :?

>
> IMO, sooner or later we need to trash that code and replace it with
> something a bit more modification-friendly.

We thought about configurable parser, but AFAIR, we didn't get any support
for this at that time.

Regards,
Oleg


From: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
To: "David E(dot) Wheeler" <david(at)kineticode(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: tsearch Parser Hacking
Date: 2011-02-15 07:56:57
Message-ID: Pine.LNX.4.64.1102151046410.278@sn.sai.msu.ru

On Mon, 14 Feb 2011, David E. Wheeler wrote:

> On Feb 14, 2011, at 11:37 PM, Oleg Bartunov wrote:
>
>> it's not easy to hack tsearch parser, sorry. You can preparse your input
>> before to_tsquery,to_tsvector.
>
> Yeah, I was thinking about s{/}{-}g before passing the values in. Might be the only way to do it for now…

Actually, it's not so difficult to *hack* the parser to treat '/' as '-'.
I thought about overriding some of the default parser behaviour, but didn't
come up with any useful solution.
BTW, some users have already written their own parsers, and I even have a
little tutorial:
http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/docs/HOWTO-parser-tsearch2.html
I wonder if it's worth adding it to
http://www.postgresql.org/docs/8.4/static/test-parser.html

Probably a good paper/presentation, along with improved code docs, would be
enough for now, until someone gets a very bright idea about the parser and
the time to implement it.

Regards,
Oleg


From: "David E(dot) Wheeler" <david(at)kineticode(dot)com>
To: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: tsearch Parser Hacking
Date: 2011-02-16 22:22:51
Message-ID: 3C434EAE-819A-4FD0-ADEC-07A0EE955A14@kineticode.com

On Feb 14, 2011, at 11:44 PM, Oleg Bartunov wrote:

>> IMO, sooner or later we need to trash that code and replace it with
>> something a bit more modification-friendly.
>
> We thought about configurable parser, but AFAIR, we didn't get any support for this at that time.

What would it take to change the requirement such that *any* SQL function could be a parser, not only C functions? Maybe require that they return a nested array of tokens? That way I could just write a function in PL/Perl quite easily.

Best,

David
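
To make the idea concrete, here is a rough Python sketch of what such a function might return. No such interface exists in PostgreSQL; the token type names ("word", "blank") and the [type, token] pair layout are assumptions made up for illustration only.

```python
import re

# Hypothetical tokenizer: runs of ASCII letters/digits are "word" tokens,
# everything else (including "/" and "-") is a "blank" separator.
TOKEN_RE = re.compile(r"(?P<word>[A-Za-z0-9]+)|(?P<blank>[^A-Za-z0-9]+)")

def parse(text):
    """Return a nested array of [token type, token] pairs."""
    # Exactly one named group matches per token, so lastgroup is its type.
    return [[m.lastgroup, m.group()] for m in TOKEN_RE.finditer(text)]
```

Under these assumptions, parse("w/d") and parse("w-d") produce the same shape of result (word, blank, word), which is the behaviour asked for at the top of the thread.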


From: Jesper Krogh <jesper(at)krogh(dot)cc>
To: "David E(dot) Wheeler" <david(at)kineticode(dot)com>
Cc: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: tsearch Parser Hacking
Date: 2011-02-17 10:30:05
Message-ID: E47298A1-A85C-4256-86A5-C06E3DEEB6F0@krogh.cc

On 16 Feb 2011, at 23:22, "David E. Wheeler" <david(at)kineticode(dot)com> wrote:

> On Feb 14, 2011, at 11:44 PM, Oleg Bartunov wrote:
>
>>> IMO, sooner or later we need to trash that code and replace it with
>>> something a bit more modification-friendly.
>>
>> We thought about configurable parser, but AFAIR, we didn't get any support for this at that time.
>
> What would it take to change the requirement such that *any* SQL function could be a parser, not only C functions? Maybe require that they return a nested array of tokens? That way I could just write a function in PL/Perl quite easily.

I had just the same thought in mind. So far I systematically substitute _ and a few other characters with ł, which doesn't get interpreted as a blank. But more direct control would be appreciated.

Jesper
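
A sketch of that kind of substitution in Python. Only the "_" to "ł" mapping comes from the message above; the helper name is hypothetical, and the "few other characters" are not listed in the thread, so none are guessed here.

```python
# Characters to shield from the parser, each mapped to the stand-in letter.
# Only "_" is taken from the thread; extend the mapping at your own risk.
PROTECT = str.maketrans({"_": "ł"})

def protect(text: str) -> str:
    """Replace protected characters before the text reaches the parser."""
    return text.translate(PROTECT)
```

Both the indexed text and the search terms would need to go through the same mapping. Text that already contains a genuine "ł" becomes indistinguishable from substituted text, which is one reason more direct control over the parser would be preferable.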


From: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
To: "David E(dot) Wheeler" <david(at)kineticode(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: tsearch Parser Hacking
Date: 2011-02-17 19:57:35
Message-ID: Pine.LNX.4.64.1102172253190.278@sn.sai.msu.ru

David,

As a cool Perl guy, you can easily take OpenFTS (openfts.sourceforge.net),
which provides a Perl interface to the tsearch datatypes, and develop a
PL/Perl version. That would be interesting for many people who like the
flexibility of Perl. We personally use OpenFTS in our web projects, i.e., we
use tsearch as storage and prepare the tsvector externally. The OpenFTS
distribution contains tests and examples of dictionaries and a parser. The
current configuration interface is ugly, but it should not be difficult to
write a table-driven configuration.

What do you think?

Oleg

Regards,
Oleg