Re: exposing more parse was: Re: tsearch2: setting weights on tsquery

Lists: pgsql-general
From: Ivan Sergio Borgonovo <mail(at)webthatworks(dot)it>
To: pgsql-general(at)postgresql(dot)org
Subject: tsearch2: setting weights on tsquery
Date: 2008-10-21 08:29:52
Message-ID: 20081021102952.420deb9a@dawn.webthatworks.it
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

plainto_tsquery is handy to make a string from users turn into a
tsquery.

This strips "control" characters and glue lexemes with &.

Now I've several strings coming from input user and what I'd like to
do is assign a different token to each part.

eg.
input1 = "ratto && matto | gatto & the"
input2 = "sasso|&passo lasso a"
->
tsquery = 'ratto:A & matto:A & gatto:A & sasso:B & passo:B & lasso:B'

I could prepare the input outside postgresql and then
"concatenate" the queries but plainto_tsquery is very comfortable
since it will always actually "clean up" the right way the input and
adds & at the right place.

Otherwise I could use more than one tsvector for searching input1
and input2 but it seems it is slower than
update t1 set FT1IDX=(

setweight(to_tsvector('pg_catalog.english',
coalesce(input1,'')), 'A') || ' ' ||

setweight(to_tsvector('pg_catalog.english',
coalesce(input2,'')), 'B')
)

and I won't be able rank on all fields at a time.

--
Ivan Sergio Borgonovo
http://www.webthatworks.it


From: Ivan Sergio Borgonovo <mail(at)webthatworks(dot)it>
To: pgsql-general(at)postgresql(dot)org
Subject: exposing more parse was: Re: tsearch2: setting weights on tsquery
Date: 2008-10-21 11:20:12
Message-ID: 20081021132012.4fc92f56@dawn.webthatworks.it
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

On Tue, 21 Oct 2008 10:29:52 +0200
Ivan Sergio Borgonovo <mail(at)webthatworks(dot)it> wrote:

I came across this:
http://grokbase.com/topic/2007/08/07/general-tsearch2-plainto-tsquery-with-or/r92nI5l_k9S4iKcWdCxKs05yFQk

And I find it is strictly related to my needs.
Working around ts_parse I could get an array of tokens, filter on
the one that are not considered significative and get an array
similar to: strip(to_tsvector(...))
Since now I've an array I could then loop on it and build up a
ts_query according to my needs.

While Mike Rylander's solution may work the regexp is not a good
enough substitute for tsearch2 parser.

I didn't have the time to look at the source code (and actually I'd
feel a bit pity to download some Mb to just look at a function) but
I bet all the code spend a lot of time to turn strings into
tsvectors.
If there were more tools to manipulate and turn tsvectors into
tsquery we could build more interesting search functions in plpsql
et al. and maybe cache some of the conversions (string -> tsvector)
inspite of spending time serializing an de-serializing stuff.

But here is the more pragmatical question:
### LOOK HERE FIRST ;) ####
Or am I missing something of the already available
functions/operations?

> plainto_tsquery is handy to make a string from users turn into a
> tsquery.
>
> This strips "control" characters and glue lexemes with &.

> Now I've several strings coming from input user and what I'd like
> to do is assign a different token to each part.
>
> eg.
> input1 = "ratto && matto | gatto & the"
> input2 = "sasso|&passo lasso a"
> ->
> tsquery = 'ratto:A & matto:A & gatto:A & sasso:B & passo:B &
> lasso:B'
>
> I could prepare the input outside postgresql and then
> "concatenate" the queries but plainto_tsquery is very comfortable
> since it will always actually "clean up" the right way the input
> and adds & at the right place.
>
> Otherwise I could use more than one tsvector for searching input1
> and input2 but it seems it is slower than
> update t1 set FT1IDX=(
>
> setweight(to_tsvector('pg_catalog.english',
> coalesce(input1,'')), 'A') || ' ' ||
>
> setweight(to_tsvector('pg_catalog.english',
> coalesce(input2,'')), 'B')
> )
>
> and I won't be able rank on all fields at a time.
>

--
Ivan Sergio Borgonovo
http://www.webthatworks.it


From: Ivan Sergio Borgonovo <mail(at)webthatworks(dot)it>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: exposing more parse was: Re: tsearch2: setting weights on tsquery
Date: 2008-10-21 14:03:05
Message-ID: 20081021160305.6e9679ba@dawn.webthatworks.it
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

On Tue, 21 Oct 2008 13:20:12 +0200
Ivan Sergio Borgonovo <mail(at)webthatworks(dot)it> wrote:

> On Tue, 21 Oct 2008 10:29:52 +0200
> Ivan Sergio Borgonovo <mail(at)webthatworks(dot)it> wrote:
>
> I came across this:
> http://grokbase.com/topic/2007/08/07/general-tsearch2-plainto-tsquery-with-or/r92nI5l_k9S4iKcWdCxKs05yFQk

that's the best thing I came up with:

select * from regexp_replace(
strip(
to_tsvector(
'<a:A href="http://www.google.com">pippo</a> pluto gigio'
))::text,
'\'([^\']*)\'','\\1:A','g');

This actually exploit the parser and you can specify the
configuration.

It would still be nice to be able to directly work with tsvector
and tsquery so people could exploit the parser, lexer etc... and
recycle the config.

I'd thinking something in the line of

for lex in select * from to_tsvector('jsjdjd fdsds')

would be nice.

--
Ivan Sergio Borgonovo
http://www.webthatworks.it


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Ivan Sergio Borgonovo <mail(at)webthatworks(dot)it>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: exposing more parse was: Re: tsearch2: setting weights on tsquery
Date: 2008-10-21 14:36:20
Message-ID: 20537.1224599780@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

Ivan Sergio Borgonovo <mail(at)webthatworks(dot)it> writes:
> It would still be nice to be able to directly work with tsvector
> and tsquery so people could exploit the parser, lexer etc... and
> recycle the config.

> I'd thinking something in the line of
> for lex in select * from to_tsvector('jsjdjd fdsds')

Don't the tsearch debug functions provide what you want?
http://www.postgresql.org/docs/8.3/static/textsearch-debugging.html

regards, tom lane


From: Ivan Sergio Borgonovo <mail(at)webthatworks(dot)it>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: exposing more parse was: Re: tsearch2: setting weights on tsquery
Date: 2008-10-21 16:06:25
Message-ID: 20081021180625.4dd3c14c@dawn.webthatworks.it
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

On Tue, 21 Oct 2008 10:36:20 -0400
Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> Ivan Sergio Borgonovo <mail(at)webthatworks(dot)it> writes:
> > It would still be nice to be able to directly work with tsvector
> > and tsquery so people could exploit the parser, lexer etc... and
> > recycle the config.
>
> > I'd thinking something in the line of
> > for lex in select * from to_tsvector('jsjdjd fdsds')
>
> Don't the tsearch debug functions provide what you want?
> http://www.postgresql.org/docs/8.3/static/textsearch-debugging.html

I missed it. Thanks. Nearly perfect. Now I've to understand what a
{} is.
An array with a null element? an empty array? an array containing ''?
Since I can't
where lexemes<>'{}'::text[] and variations.

select token, lexemes from ts_debug('pg_catalog.english',
'a cat eaten on the net set eat seat ate late things'
) where array_upper(lexemes,1)>=1;

The above works as expected so it should be an empty array.

select token, lexemes from ts_debug('pg_catalog.english',
'a cat eaten on the net set eat seat ate late things'
) where lexemes<>'{}'::text[];

This doesn't (?)

I did google and I've found some of your answers about empty arrays
but none worked "embedded" directly in the sql statement.

variations of this still don't work:

create or replace function zsetweight(out _tsquery text) as
$$
declare
_tsdebug record;
_tmp text[]:='{}';
begin
_tsquery:='';
for _tsdebug in select token
from ts_debug('pg_catalog.english',
'a cat eaten on the net set eat seat ate late things'
) where lexemes<>_tmp loop
_tsquery:=coalesce(_tsquery,'') || ' | ' ||
coalesce(_tsdebug.token,'');
end loop;
return;
end;
$$ language plpgsql immutable;

select * from zsetweight();

--
Ivan Sergio Borgonovo
http://www.webthatworks.it


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Ivan Sergio Borgonovo <mail(at)webthatworks(dot)it>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: exposing more parse was: Re: tsearch2: setting weights on tsquery
Date: 2008-10-21 17:40:33
Message-ID: 27925.1224610833@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

Ivan Sergio Borgonovo <mail(at)webthatworks(dot)it> writes:
> I missed it. Thanks. Nearly perfect. Now I've to understand what a
> {} is.
> An array with a null element? an empty array? an array containing ''?

Hmm ... it appears that ts_lexize is returning a one-dimensional array of
no elements, whereas '{}'::text[] returns a zero-dimensional array;
as does array[]::text[] so you can't easily get around it :-(

We probably ought to clean this up...

In the meantime it looks like you could check for
"array_upper(lexemes,1) > 0".

regards, tom lane


From: Ivan Sergio Borgonovo <mail(at)webthatworks(dot)it>
To: pgsql-general(at)postgresql(dot)org
Subject: tsearch2: setting weights on tsquery plainto_tsquery with | or and weight
Date: 2008-10-21 18:37:26
Message-ID: 20081021203726.15c87109@dawn.webthatworks.it
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

On Tue, 21 Oct 2008 13:40:33 -0400
Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> Ivan Sergio Borgonovo <mail(at)webthatworks(dot)it> writes:
> > I missed it. Thanks. Nearly perfect. Now I've to understand what
> > a {} is.
> > An array with a null element? an empty array? an array
> > containing ''?
>
> Hmm ... it appears that ts_lexize is returning a one-dimensional
> array of no elements, whereas '{}'::text[] returns a
> zero-dimensional array; as does array[]::text[] so you can't
> easily get around it :-(

Maybe in 8.4. I'm already thankful this toy is making such wonders.

> We probably ought to clean this up...
>
> In the meantime it looks like you could check for
> "array_upper(lexemes,1) > 0".

My version as posted before was the >=1 variant.

This is what I was trying to obtain a weighted version of
plainto_tsquery with parametrised operator:

I think I'm wasting some cpu cycles when ts_debug try to fill the
lexemes array that I'm not using.
So maybe there should be more finer access to the functions to help
build up tsquery.
The ts_parse is missing the information about what tokens should be
discarded.

create or replace function setweight(configuration regconfig,
queryin text, weight char, op char ,out query tsquery) as
$$
declare
tsdebug record;
querystring text;
begin
querystring:='';
for tsdebug in select token from
ts_debug(configuration, queryin)
where array_upper(lexemes,1)>0 loop
querystring:=querystring ||
coalesce(tsdebug.token ||
coalesce(':' || weight, '') ||
' ', '');
end loop;
querystring:=regexp_replace(querystring, ' (.)',
op || '\\1', 'g');
query:=to_tsquery(configuration, querystring);
return;
end;
$$ language plpgsql immutable;

--
Ivan Sergio Borgonovo
http://www.webthatworks.it