Re: lexemes in prefix search going through dictionary modifications

Lists: pgsql-hackers
From: Sushant Sinha <sushant354(at)gmail(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: lexemes in prefix search going through dictionary modifications
Date: 2011-10-25 15:26:18
Message-ID: 1319556378.2023.6.camel@dragflick

I am currently using the prefix search feature in text search. I find
that the prefix characters are treated the same as a normal lexeme and
passed through stemming and stopword dictionaries. This seems like a bug
to me.

db=# select to_tsquery('english', 's:*');
NOTICE: text-search query contains only stop words or doesn't contain
lexemes, ignored
to_tsquery
------------

(1 row)

db=# select to_tsquery('simple', 's:*');
to_tsquery
------------
's':*
(1 row)

I also think that the following is a mistake; it should only be highlighting "s":

db=# select ts_headline('sushant', to_tsquery('simple', 's:*'));
ts_headline
----------------
<b>sushant</b>
(1 row)

Thanks,
Sushant.


From: Florian Pflug <fgp(at)phlo(dot)org>
To: sushant354(at)gmail(dot)com
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: lexemes in prefix search going through dictionary modifications
Date: 2011-10-25 16:05:46
Message-ID: 7407A709-87E3-484D-9E7B-CCBAE7187BF9@phlo.org

On Oct25, 2011, at 17:26 , Sushant Sinha wrote:
> I am currently using the prefix search feature in text search. I find
> that the prefix characters are treated the same as a normal lexeme and
> passed through stemming and stopword dictionaries. This seems like a bug
> to me.

Hm, I don't think so. If they don't pass through stopword dictionaries,
then queries containing stopwords will fail to find any rows - which is
probably not what one would expect.

Here's an example:

Query for records containing the* and car*. The @@-operator returns true
because the stopword is removed from both the tsvector and the tsquery
(the 'english' dictionary drops 'these' as a stopword and stems 'cars' to
'car'; both the tsvector and the query end up being just 'car'):

postgres=# select to_tsvector('english', 'these cars') @@ to_tsquery('english', 'the:* & car:*');
?column?
----------
t
(1 row)

Here's what happens when stopwords aren't removed from the query
(now the tsvector ends up being 'car', but the query stays 'the:* & car:*'):

postgres=# select to_tsvector('english', 'these cars') @@ to_tsquery('simple', 'the:* & car:*');
?column?
----------
f
(1 row)

best regards,
Florian Pflug


From: Sushant Sinha <sushant354(at)gmail(dot)com>
To: Florian Pflug <fgp(at)phlo(dot)org>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: lexemes in prefix search going through dictionary modifications
Date: 2011-10-25 16:47:25
Message-ID: 1319561245.2023.15.camel@dragflick

On Tue, 2011-10-25 at 18:05 +0200, Florian Pflug wrote:
> On Oct25, 2011, at 17:26 , Sushant Sinha wrote:
> > I am currently using the prefix search feature in text search. I find
> > that the prefix characters are treated the same as a normal lexeme and
> > passed through stemming and stopword dictionaries. This seems like a bug
> > to me.
>
> Hm, I don't think so. If they don't pass through stopword dictionaries,
> then queries containing stopwords will fail to find any rows - which is
> probably not what one would expect.

I think what you are calling a feature is really a bug. I am fairly sure
that when someone says to_tsquery('english', 's:*') one is looking for
an entry that has a *non-stopword* word that starts with 's'. And
especially so in a text search configuration that eliminates stop words.

Does it even make sense to stem, abbreviate, or apply synonyms to just a
few letters? The result will be so unpredictable.
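
For example, with the stock 'english' configuration the stemmer rewrites
the prefix itself before it is ever matched (exact output may vary by
version):

db=# select to_tsquery('english', 'running:*');
 to_tsquery
------------
 'run':*
(1 row)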

-Sushant.


From: Florian Pflug <fgp(at)phlo(dot)org>
To: sushant354(at)gmail(dot)com
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: lexemes in prefix search going through dictionary modifications
Date: 2011-10-25 17:27:07
Message-ID: 5A1A958A-6F52-4112-A28C-540B6AFBA34A@phlo.org

On Oct25, 2011, at 18:47 , Sushant Sinha wrote:
> On Tue, 2011-10-25 at 18:05 +0200, Florian Pflug wrote:
>> On Oct25, 2011, at 17:26 , Sushant Sinha wrote:
>>> I am currently using the prefix search feature in text search. I find
>>> that the prefix characters are treated the same as a normal lexeme and
>>> passed through stemming and stopword dictionaries. This seems like a bug
>>> to me.
>>
>> Hm, I don't think so. If they don't pass through stopword dictionaries,
>> then queries containing stopwords will fail to find any rows - which is
>> probably not what one would expect.
>
> I think what you are calling a feature is really a bug. I am fairly sure
> that when someone says to_tsquery('english', 's:*') one is looking for
> an entry that has a *non-stopword* word that starts with 's'. And
> especially so in a text search configuration that eliminates stop words.

But the whole idea of removing stopwords from the query is that users
*don't* need to be aware of the precise list of stopwords. The way I see
it, stopwords are simply an optimization that helps reduce the size of
your fulltext index.

Assume, for example, that the postgres mailing list archive search used
tsearch (which I think it does, but I'm not sure). It'd then probably make
sense to add "postgres" to the list of stopwords, because it's bound to
appear in nearly every mail. But would you want searches which include
'postgres*' to turn up empty? Quite certainly not.

> Does it even make sense to stem, abbreviate, or apply synonyms to just a
> few letters? The result will be so unpredictable.

That depends on the language. In German (my native tongue), one can
concatenate nouns to form new nouns. It's thus not entirely unreasonable
that one would want the prefix to be stemmed to its singular form before
being matched.

Also, suppose you're using a dictionary which corrects common typos. Who
says you wouldn't want that to be applied to prefix queries?

best regards,
Florian Pflug


From: Sushant Sinha <sushant354(at)gmail(dot)com>
To: Florian Pflug <fgp(at)phlo(dot)org>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: lexemes in prefix search going through dictionary modifications
Date: 2011-10-25 18:15:17
Message-ID: 1319566517.2023.24.camel@dragflick

On Tue, 2011-10-25 at 19:27 +0200, Florian Pflug wrote:

> Assume, for example, that the postgres mailing list archive search used
> tsearch (which I think it does, but I'm not sure). It'd then probably make
> sense to add "postgres" to the list of stopwords, because it's bound to
> appear in nearly every mail. But would you want searches which include
> 'postgres*' to turn up empty? Quite certainly not.

That improves recall for the "postgres:*" query but certainly doesn't
help other queries like "post:*". More importantly, it hurts precision
for all queries like "a:*", "an:*", "and:*", "s:*", "t:*", "the:*", etc.
(When that is the only term, it also hurts recall, since no row matches
an empty tsquery.) Since stopwords tend to be short, this means prefix
search on just a few characters is meaningless. And I would argue that is
exactly when prefix search matters most -- when you only know a few
characters.
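
To illustrate (again assuming a stock 'english' configuration), "the:*"
is reduced to an empty tsquery, so it fails to match even words that do
start with "the":

db=# select to_tsvector('english', 'theory of computation') @@
     to_tsquery('english', 'the:*');
NOTICE: text-search query contains only stop words or doesn't contain
lexemes, ignored
 ?column?
----------
 f
(1 row)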

-Sushant.


From: Sushant Sinha <sushant354(at)gmail(dot)com>
To: Florian Pflug <fgp(at)phlo(dot)org>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: lexemes in prefix search going through dictionary modifications
Date: 2011-11-08 16:25:28
Message-ID: 1320769528.2062.16.camel@dragflick

I think there is a need to allow prefix search to bypass dictionaries.
If you folks think that there is some credibility to such a need, then I
can think about implementing it. How about an operator like ":#" that
does this? The ":*" operator would continue to mean the same as it does
currently.

-Sushant.

On Tue, 2011-10-25 at 23:45 +0530, Sushant Sinha wrote:
> On Tue, 2011-10-25 at 19:27 +0200, Florian Pflug wrote:
>
> > Assume, for example, that the postgres mailing list archive search used
> > tsearch (which I think it does, but I'm not sure). It'd then probably make
> > sense to add "postgres" to the list of stopwords, because it's bound to
> appear in nearly every mail. But would you want searches which include
> 'postgres*' to turn up empty? Quite certainly not.
>
> That improves recall for the "postgres:*" query but certainly doesn't
> help other queries like "post:*". More importantly, it hurts precision
> for all queries like "a:*", "an:*", "and:*", "s:*", "t:*", "the:*", etc.
> (When that is the only term, it also hurts recall, since no row matches
> an empty tsquery.) Since stopwords tend to be short, this means prefix
> search on just a few characters is meaningless. And I would argue that is
> exactly when prefix search matters most -- when you only know a few
> characters.
>
>
> -Sushant


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: sushant354(at)gmail(dot)com
Cc: Florian Pflug <fgp(at)phlo(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: lexemes in prefix search going through dictionary modifications
Date: 2011-11-08 23:39:19
Message-ID: 3361.1320795559@sss.pgh.pa.us

Sushant Sinha <sushant354(at)gmail(dot)com> writes:
> I think there is a need to allow prefix search to bypass dictionaries.
> If you folks think that there is some credibility to such a need, then I
> can think about implementing it. How about an operator like ":#" that
> does this? The ":*" operator would continue to mean the same as it does
> currently.

I don't think that just turning off dictionaries for prefix searches is
going to do much of anything useful, because the lexemes in the index
are still going to have gone through normalization. Somehow we need to
identify which lexemes could match the prefix after accounting for the
fact that they've been through normalization.

An example: if the original word is "transferring", the lexeme (in the
english config) is just "transfer". If you search for "transferring:*"
and suppress dictionaries, you'll fail to get a match, which is simply
wrong. It's not a step forward to suppress some failure cases while
adding new ones.
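
Concretely (output assumes a stock 'english' configuration):

postgres=# select to_tsvector('english', 'transferring');
  to_tsvector
----------------
 'transfer':1
(1 row)

postgres=# select to_tsvector('english', 'transferring') @@
           to_tsquery('simple', 'transferring:*');
 ?column?
----------
 f
(1 row)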

Another point is that whatever we do about this really ought to be
inside the engine, not exposed in a form that makes users do their
queries differently.

regards, tom lane