Re: [HACKERS] Include Lists for Text Search

Lists: pgsql-hackers pgsql-patches
From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Include Lists for Text Search
Date: 2007-09-10 10:50:42
Message-ID: 1189421442.4281.195.camel@ebony.site
Lists: pgsql-hackers pgsql-patches

It seems possible to write your own functions to support various
possibilities with text search.

One of the more common thoughts is to have a list of words that you
would like to include, i.e. the opposite of a stop word list.

There are clear indications that indexing too many words is a problem
for both GIN and GIST. If people already know what they'll be looking
for and what they will never be looking for, it seems easier to supply
that list up front, rather than hide it behind lots of hand-crafted
code.

Can we include that functionality now?

--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com


From: "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>
To: "Simon Riggs" <simon(at)2ndquadrant(dot)com>
Cc: <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Include Lists for Text Search
Date: 2007-09-10 11:58:07
Message-ID: 46E5314F.60308@enterprisedb.com
Lists: pgsql-hackers pgsql-patches

Simon Riggs wrote:
> It seems possible to write your own functions to support various
> possibilities with text search.
>
> One of the more common thoughts is to have a list of words that you
> would like to include, i.e. the opposite of a stop word list.
>
> There are clear indications that indexing too many words is a problem
> for both GIN and GIST. If people already know what they'll be looking
> for and what they will never be looking for, it seems easier to supply
> that list up front, rather than hide it behind lots of hand-crafted
> code.

I don't understand what you're proposing. We already have dict_synonym
that you can use to accept a simple list of words. But that doesn't
change the way GIN and GiST work.

?

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Include Lists for Text Search
Date: 2007-09-10 12:10:26
Message-ID: Pine.LNX.4.64.0709101557220.2767@sn.sai.msu.ru
Lists: pgsql-hackers pgsql-patches

On Mon, 10 Sep 2007, Simon Riggs wrote:

> It seems possible to write your own functions to support various
> possibilities with text search.
>
> One of the more common thoughts is to have a list of words that you
> would like to include, i.e. the opposite of a stop word list.
>
> There are clear indications that indexing too many words is a problem
> for both GIN and GIST. If people already know what they'll be looking
> for and what they will never be looking for, it seems easier to supply
> that list up front, rather than hide it behind lots of hand-crafted
> code.
>
> Can we include that functionality now?

This could be realized very easily using dict_strict, which returns
only known words, and a mapping that contains only this dictionary. So,
feel free to write it and submit it.

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg(at)sai(dot)msu(dot)su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83


From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Include Lists for Text Search
Date: 2007-09-10 12:28:27
Message-ID: 1189427307.4281.226.camel@ebony.site
Lists: pgsql-hackers pgsql-patches

On Mon, 2007-09-10 at 12:58 +0100, Heikki Linnakangas wrote:
> Simon Riggs wrote:
> > It seems possible to write your own functions to support various
> > possibilities with text search.
> >
> > One of the more common thoughts is to have a list of words that you
> > would like to include, i.e. the opposite of a stop word list.
> >
> > There are clear indications that indexing too many words is a problem
> > for both GIN and GIST. If people already know what they'll be looking
> > for and what they will never be looking for, it seems easier to supply
> > that list up front, rather than hide it behind lots of hand-crafted
> > code.
>
> I don't understand what you're proposing. We already have dict_synonym
> that you can use to accept a simple list of words.

How does that allow me to limit the number of words to a known list?

--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com


From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Include Lists for Text Search
Date: 2007-09-10 12:28:59
Message-ID: 1189427339.4281.228.camel@ebony.site
Lists: pgsql-hackers pgsql-patches

On Mon, 2007-09-10 at 16:10 +0400, Oleg Bartunov wrote:
> On Mon, 10 Sep 2007, Simon Riggs wrote:
>
> > It seems possible to write your own functions to support various
> > possibilities with text search.
> >
> > One of the more common thoughts is to have a list of words that you
> > would like to include, i.e. the opposite of a stop word list.
> >
> > There are clear indications that indexing too many words is a problem
> > for both GIN and GIST. If people already know what they'll be looking
> > for and what they will never be looking for, it seems easier to supply
> > that list up front, rather than hide it behind lots of hand-crafted
> > code.
> >
> > Can we include that functionality now?
>
> This could be realized very easily using dict_strict, which returns
> only known words, and mapping contains only this dictionary. So,
> feel free to write it and submit.

So there isn't one yet, but you think it will be easy to write and that
we should call it dict_strict?

--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com


From: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Include Lists for Text Search
Date: 2007-09-10 12:33:03
Message-ID: Pine.LNX.4.64.0709101631050.2767@sn.sai.msu.ru
Lists: pgsql-hackers pgsql-patches

On Mon, 10 Sep 2007, Simon Riggs wrote:

> On Mon, 2007-09-10 at 12:58 +0100, Heikki Linnakangas wrote:
>> Simon Riggs wrote:
>>> It seems possible to write your own functions to support various
>>> possibilities with text search.
>>>
>>> One of the more common thoughts is to have a list of words that you
>>> would like to include, i.e. the opposite of a stop word list.
>>>
>>> There are clear indications that indexing too many words is a problem
>>> for both GIN and GIST. If people already know what they'll be looking
>>> for and what they will never be looking for, it seems easier to supply
>>> that list up front, rather than hide it behind lots of hand-crafted
>>> code.
>>
>> I don't understand what you're proposing. We already have dict_synonym
>> that you can use to accept a simple list of words.
>
> How does that allow me to limit the number of words to a known list?

Text search doesn't index unknown words, so if your mapping contains
only one dictionary, that dictionary controls which words are indexed.
While dict_synonym is fine for a list that is not big, I'd write a
separate dictionary with fast lookup.
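
A minimal sketch of the single-dictionary mapping described above. It assumes
the syntax of released PostgreSQL (8.3 and later) rather than anything
confirmed in this thread. The names include_words and include_only, and the
file include_words.syn, are hypothetical; the file would have to exist under
$SHAREDIR/tsearch_data and list each wanted word followed by the lexeme to
index it as.

```sql
-- Hypothetical synonym file $SHAREDIR/tsearch_data/include_words.syn,
-- one "word lexeme" pair per line, for example:
--   postgres postgres
--   index    index

-- Dictionary backed by that file (all names are illustrative).
CREATE TEXT SEARCH DICTIONARY include_words (
    TEMPLATE = synonym,
    SYNONYMS = include_words
);

-- A configuration that uses the default parser and starts with no mappings.
CREATE TEXT SEARCH CONFIGURATION include_only (
    PARSER = pg_catalog."default"
);

-- Map plain ASCII words to the include-list dictionary only, so a word
-- the dictionary does not recognize has nowhere else to go and is dropped.
ALTER TEXT SEARCH CONFIGURATION include_only
    ADD MAPPING FOR asciiword WITH include_words;

-- With the two-line file above, only "postgres" and "index" should
-- survive into the tsvector.
SELECT to_tsvector('include_only', 'postgres builds an index quickly');
```

Token types other than asciiword stay unmapped in this sketch, so they are
ignored as well; a real configuration would probably map more of them.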

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg(at)sai(dot)msu(dot)su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83


From: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Include Lists for Text Search
Date: 2007-09-10 12:35:40
Message-ID: Pine.LNX.4.64.0709101633280.2767@sn.sai.msu.ru
Lists: pgsql-hackers pgsql-patches

On Mon, 10 Sep 2007, Simon Riggs wrote:

> On Mon, 2007-09-10 at 16:10 +0400, Oleg Bartunov wrote:
>> On Mon, 10 Sep 2007, Simon Riggs wrote:
>>
>>> It seems possible to write your own functions to support various
>>> possibilities with text search.
>>>
>>> One of the more common thoughts is to have a list of words that you
>>> would like to include, i.e. the opposite of a stop word list.
>>>
>>> There are clear indications that indexing too many words is a problem
>>> for both GIN and GIST. If people already know what they'll be looking
>>> for and what they will never be looking for, it seems easier to supply
>>> that list up front, rather than hide it behind lots of hand-crafted
>>> code.
>>>
>>> Can we include that functionality now?
>>
>> This could be realized very easily using dict_strict, which returns
>> only known words, and mapping contains only this dictionary. So,
>> feel free to write it and submit.
>
> So there isn't one yet, but you think it will be easy to write and that
> we should call it dict_strict?

We already have dict_synonym, and if your list is not big you'll be happy with it.

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg(at)sai(dot)msu(dot)su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83


From: Teodor Sigaev <teodor(at)sigaev(dot)ru>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Include Lists for Text Search
Date: 2007-09-10 12:44:03
Message-ID: 46E53C13.3060603@sigaev.ru
Lists: pgsql-hackers pgsql-patches

> How does that allow me to limit the number of words to a known list?

If all dictionaries return NULL for a token, then this token will not be
indexed at all.
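
The NULL behaviour can be observed directly with ts_lexize. This is only a
sketch: it uses the built-in simple template with Accept = false as a stand-in
(an assumption for illustration, not the include-list dictionary discussed in
this thread), and the dictionary name stop_or_pass is hypothetical.

```sql
-- A dictionary that recognizes English stop words and nothing else.
-- With ACCEPT = false, every other token is reported as unrecognized.
CREATE TEXT SEARCH DICTIONARY stop_or_pass (
    TEMPLATE = pg_catalog.simple,
    STOPWORDS = english,
    ACCEPT = false
);

-- A stop word is recognized and discarded: the result is an empty array.
SELECT ts_lexize('stop_or_pass', 'the');

-- Any other word is not recognized: the result is NULL. If every
-- dictionary mapped to the token type answers NULL, the token is left
-- out of the tsvector and therefore out of the index.
SELECT ts_lexize('stop_or_pass', 'postgres') IS NULL AS not_recognized;
```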

--
Teodor Sigaev E-mail: teodor(at)sigaev(dot)ru
WWW: http://www.sigaev.ru/


From: Teodor Sigaev <teodor(at)sigaev(dot)ru>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Include Lists for Text Search
Date: 2007-09-10 12:48:16
Message-ID: 46E53D10.5020300@sigaev.ru
Lists: pgsql-hackers pgsql-patches

> There are clear indications that indexing too many words is a problem
> for both GIN and GIST. If people already know what they'll be looking

GIN doesn't depend strongly on the number of words. It has log(N) behaviour
in the number of words because it uses a B-tree over the words.

--
Teodor Sigaev E-mail: teodor(at)sigaev(dot)ru
WWW: http://www.sigaev.ru/


From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Include Lists for Text Search
Date: 2007-09-10 13:27:59
Message-ID: 1189430879.4281.247.camel@ebony.site
Lists: pgsql-hackers pgsql-patches

On Mon, 2007-09-10 at 16:35 +0400, Oleg Bartunov wrote:
> On Mon, 10 Sep 2007, Simon Riggs wrote:
>
> > On Mon, 2007-09-10 at 16:10 +0400, Oleg Bartunov wrote:
> >> On Mon, 10 Sep 2007, Simon Riggs wrote:
> >>
> >>> It seems possible to write your own functions to support various
> >>> possibilities with text search.
> >>>
> >>> One of the more common thoughts is to have a list of words that you
> >>> would like to include, i.e. the opposite of a stop word list.
> >>>
> >>> There are clear indications that indexing too many words is a problem
> >>> for both GIN and GIST. If people already know what they'll be looking
> >>> for and what they will never be looking for, it seems easier to supply
> >>> that list up front, rather than hide it behind lots of hand-crafted
> >>> code.
> >>>
> >>> Can we include that functionality now?
> >>
> >> This could be realized very easily using dict_strict, which returns
> >> only known words, and mapping contains only this dictionary. So,
> >> feel free to write it and submit.
> >
> > So there isn't one yet, but you think it will be easy to write and that
> > we should call it dict_strict?
>
> we have dict_synonym already and if your list is not big you'll be happy.

So I need to do something like

CREATE TEXT SEARCH DICTIONARY my_diction (
template = snowball,
synonym = include_only_these_words
);

which will then look for a file called include_only_these_words.syn?

I would prefer to be able to do something like this

CREATE TEXT SEARCH DICTIONARY my_diction (
template = snowball,
include = justthese
);
...which makes more sense to anyone reading it
and I also want to make the comparison case insensitive.

Would it be better to
1. include a new dictionary file (dict_strict, as you suggest)
2. a) allow case sensitivity as another option in dictionaries
   b) allow "include" as another word for "stoplist", but with the
      meaning reversed?

e.g.

CREATE TEXT SEARCH DICTIONARY my_diction (
template = snowball,
include = justthese,
case_sensitive = true
);

--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com


From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: Teodor Sigaev <teodor(at)sigaev(dot)ru>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Include Lists for Text Search
Date: 2007-09-10 13:49:42
Message-ID: 1189432182.4281.252.camel@ebony.site
Lists: pgsql-hackers pgsql-patches

On Mon, 2007-09-10 at 16:48 +0400, Teodor Sigaev wrote:
> > There are clear indications that indexing too many words is a problem
> > for both GIN and GIST. If people already know what they'll be looking

GIN is great, sorry if that sounded negative.

> GIN doesn't depend strongly on number of words. It has log(N) behaviour for
> numbers of words because of using B-Tree over words.

log(N) in the number of distinct words, but every word you index results
in an index insert, so if we index more words than we need then the
insert rate will go down.
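
As a rough model of that argument (an assumption for illustration, not a
measurement): let N be the number of distinct words already in the index and
k the number of words a new document contributes. Each word costs one descent
of the B-tree over words, so the insert cost per document is roughly

```latex
C_{\text{insert}}(k, N) \approx k \cdot O(\log N)
```

An include list reduces k directly, while N enters only through the slowly
growing logarithm.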

--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com


From: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Teodor Sigaev <teodor(at)sigaev(dot)ru>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Include Lists for Text Search
Date: 2007-09-10 13:58:37
Message-ID: Pine.LNX.4.64.0709101757160.2767@sn.sai.msu.ru
Lists: pgsql-hackers pgsql-patches

On Mon, 10 Sep 2007, Simon Riggs wrote:

> On Mon, 2007-09-10 at 16:48 +0400, Teodor Sigaev wrote:
>>> There are clear indications that indexing too many words is a problem
>>> for both GIN and GIST. If people already know what they'll be looking
>
> GIN is great, sorry if that sounded negative.
>
>> GIN doesn't depend strongly on number of words. It has log(N) behaviour for
>> numbers of words because of using B-Tree over words.
>
> log(N) in the number of distinct words, but every word you index results
> in an index insert, so if we index more words than we need then the
> insert rate will go down.

Yes, there is room to improve support for very long posting lists.

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg(at)sai(dot)msu(dot)su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83


From: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Include Lists for Text Search
Date: 2007-09-10 14:04:13
Message-ID: Pine.LNX.4.64.0709101758520.2767@sn.sai.msu.ru
Lists: pgsql-hackers pgsql-patches

On Mon, 10 Sep 2007, Simon Riggs wrote:

> On Mon, 2007-09-10 at 16:35 +0400, Oleg Bartunov wrote:
>> On Mon, 10 Sep 2007, Simon Riggs wrote:
>>
>>> On Mon, 2007-09-10 at 16:10 +0400, Oleg Bartunov wrote:
>>>> On Mon, 10 Sep 2007, Simon Riggs wrote:
>>>>
>>>>> It seems possible to write your own functions to support various
>>>>> possibilities with text search.
>>>>>
>>>>> One of the more common thoughts is to have a list of words that you
>>>>> would like to include, i.e. the opposite of a stop word list.
>>>>>
>>>>> There are clear indications that indexing too many words is a problem
>>>>> for both GIN and GIST. If people already know what they'll be looking
>>>>> for and what they will never be looking for, it seems easier to supply
>>>>> that list up front, rather than hide it behind lots of hand-crafted
>>>>> code.
>>>>>
>>>>> Can we include that functionality now?
>>>>
>>>> This could be realized very easily using dict_strict, which returns
>>>> only known words, and mapping contains only this dictionary. So,
>>>> feel free to write it and submit.
>>>
>>> So there isn't one yet, but you think it will be easy to write and that
>>> we should call it dict_strict?
>>
>> we have dict_synonym already and if your list is not big you'll be happy.
>
> So I need to do something like
>
> CREATE TEXT SEARCH DICTIONARY my_diction (
> template = snowball,
> synonym = include_only_these_words
> );
>
> which will then look for a file called include_only_these_words.syn?
>
> I would prefer to be able to do something like this
>
> CREATE TEXT SEARCH DICTIONARY my_diction (
> template = snowball,
> include = justthese
> );
> ...which makes more sense to anyone reading it
> and I also want to make the comparison case insensitive.
>
> Would it be better to
> 1. include a new dictionary file (dict_strict, as you suggest)
> 2. a) allow case sensitivity as another option in dictionaries
> b) allow "include" as another word for "stoplist", but with the
> meaning reversed?
>
> e.g.
>
> CREATE TEXT SEARCH DICTIONARY my_diction (
> template = snowball,
> include = justthese,
> case_sensitive = true
> );

No, you need to write a new template that works efficiently with
big lists and supports case-insensitive comparison.

CREATE TEXT SEARCH TEMPLATE biglist (
.....
);

CREATE TEXT SEARCH DICTIONARY my_diction (
TEMPLATE = biglist,
DictFile = words,
case_sensitive = true
);

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg(at)sai(dot)msu(dot)su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Include Lists for Text Search
Date: 2007-09-10 14:21:38
Message-ID: 3968.1189434098@sss.pgh.pa.us
Lists: pgsql-hackers pgsql-patches

Oleg Bartunov <oleg(at)sai(dot)msu(dot)su> writes:
> On Mon, 10 Sep 2007, Simon Riggs wrote:
>> Can we include that functionality now?

> This could be realized very easily using dict_strict, which returns
> only known words, and mapping contains only this dictionary. So,
> feel free to write it and submit.

... for 8.4.

regards, tom lane


From: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Include Lists for Text Search
Date: 2007-09-10 14:24:46
Message-ID: Pine.LNX.4.64.0709101822390.2767@sn.sai.msu.ru
Lists: pgsql-hackers pgsql-patches

On Mon, 10 Sep 2007, Tom Lane wrote:

> Oleg Bartunov <oleg(at)sai(dot)msu(dot)su> writes:
>> On Mon, 10 Sep 2007, Simon Riggs wrote:
>>> Can we include that functionality now?
>
>> This could be realized very easily using dict_strict, which returns
>> only known words, and mapping contains only this dictionary. So,
>> feel free to write it and submit.
>
> ... for 8.4.

It can be just a contrib module. There are several useful dictionaries
we need to port.

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg(at)sai(dot)msu(dot)su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83


From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>, pgsql-hackers(at)postgresql(dot)org, pgsql-patches(at)postgresql(dot)org
Subject: Re: [HACKERS] Include Lists for Text Search
Date: 2007-09-10 15:37:31
Message-ID: 1189438651.4281.268.camel@ebony.site
Lists: pgsql-hackers pgsql-patches

On Mon, 2007-09-10 at 10:21 -0400, Tom Lane wrote:
> Oleg Bartunov <oleg(at)sai(dot)msu(dot)su> writes:
> > On Mon, 10 Sep 2007, Simon Riggs wrote:
> >> Can we include that functionality now?
>
> > This could be realized very easily using dict_strict, which returns
> > only known words, and mapping contains only this dictionary. So,
> > feel free to write it and submit.
>
> ... for 8.4.

I've coded a small patch to allow CaseSensitive synonyms.

CREATE TEXT SEARCH DICTIONARY my_diction (
TEMPLATE = biglist,
DictFile = words,
CaseSensitive = true
);
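
For comparison, a sketch of how the option reads against the synonym template,
which is what "CaseSensitive synonyms" refers to. The spelling follows the
released PostgreSQL documentation (8.4 and later), not the attached patch,
which is not reproduced here; the dictionary name and file name are
hypothetical.

```sql
-- Assumes a file my_synonyms.syn exists under $SHAREDIR/tsearch_data.
CREATE TEXT SEARCH DICTIONARY my_synonyms (
    TEMPLATE = synonym,
    SYNONYMS = my_synonyms,
    -- Defaults to false, in which case both the file entries and the
    -- input tokens are folded to lower case before comparison.
    CaseSensitive = true
);
```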

--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com

Attachment: ts_casesensitive.v1.patch (text/x-patch, 2.2 KB)

From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>, pgsql-hackers(at)postgresql(dot)org, pgsql-patches(at)postgresql(dot)org
Subject: Re: [HACKERS] Include Lists for Text Search
Date: 2007-09-26 08:52:35
Message-ID: 200709260852.l8Q8qaL16955@momjian.us
Lists: pgsql-hackers pgsql-patches


This has been saved for the 8.4 release:

http://momjian.postgresql.org/cgi-bin/pgpatches_hold

---------------------------------------------------------------------------

Simon Riggs wrote:
> On Mon, 2007-09-10 at 10:21 -0400, Tom Lane wrote:
> > Oleg Bartunov <oleg(at)sai(dot)msu(dot)su> writes:
> > > On Mon, 10 Sep 2007, Simon Riggs wrote:
> > >> Can we include that functionality now?
> >
> > > This could be realized very easily using dict_strict, which returns
> > > only known words, and mapping contains only this dictionary. So,
> > > feel free to write it and submit.
> >
> > ... for 8.4.
>
> I've coded a small patch to allow CaseSensitive synonyms.
>
> CREATE TEXT SEARCH DICTIONARY my_diction (
> TEMPLATE = biglist,
> DictFile = words,
> CaseSensitive = true
> );
>
> --
> Simon Riggs
> 2ndQuadrant http://www.2ndQuadrant.com

[ Attachment, skipping... ]

>
> ---------------------------(end of broadcast)---------------------------
> TIP 6: explain analyze is your friend

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>, pgsql-hackers(at)postgresql(dot)org, pgsql-patches(at)postgresql(dot)org
Subject: Re: Include Lists for Text Search
Date: 2008-03-10 03:03:37
Message-ID: 26421.1205118217@sss.pgh.pa.us
Lists: pgsql-hackers pgsql-patches

Simon Riggs <simon(at)2ndquadrant(dot)com> writes:
> I've coded a small patch to allow CaseSensitive synonyms.

Applied with corrections (it'd be good if you at least pretended to test
stuff before submitting it).

Would a similar parameter be useful for any of the other dictionary
types?

regards, tom lane


From: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org, pgsql-patches(at)postgresql(dot)org
Subject: Re: Include Lists for Text Search
Date: 2008-03-10 06:34:51
Message-ID: Pine.LNX.4.64.0803100930360.10010@sn.sai.msu.ru
Lists: pgsql-hackers pgsql-patches

On Sun, 9 Mar 2008, Tom Lane wrote:

> Simon Riggs <simon(at)2ndquadrant(dot)com> writes:
>> I've coded a small patch to allow CaseSensitive synonyms.
>
> Applied with corrections (it'd be good if you at least pretended to test
> stuff before submitting it).
>
> Would a similar parameter be useful for any of the other dictionary
> types?

There are many things it would be desirable to do with dictionaries: for
example, telling a dictionary to return the original word plus its normal
form. Another feature would be to allow filtering dictionaries, rather than
only recognize-and-stop dictionaries. We have a feeling that a little
middleware would help implement this, and CaseSensitive too.

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg(at)sai(dot)msu(dot)su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83


From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>, pgsql-hackers(at)postgresql(dot)org, pgsql-patches(at)postgresql(dot)org
Subject: Re: Include Lists for Text Search
Date: 2008-03-10 10:43:09
Message-ID: 1205145789.4269.60.camel@ebony.site
Lists: pgsql-hackers pgsql-patches

On Sun, 2008-03-09 at 23:03 -0400, Tom Lane wrote:
> Simon Riggs <simon(at)2ndquadrant(dot)com> writes:
> > I've coded a small patch to allow CaseSensitive synonyms.
>
> Applied with corrections (it'd be good if you at least pretended to test
> stuff before submitting it).

It is a frequent accusation of yours that I don't test things, which is
incorrect. Defending against that makes me a liar twice in your eyes. If
you look more closely at what happens you'll understand that your own
rigid expectations are what causes these problems.

If you thought at all you'd realise that nobody would be stupid enough
to try to sneak untested code into Postgres; all bugs would point
directly back to anybody attempting that. That isn't true just of
Postgres, it's true of any group of people working together on any task,
not just software or open source software.

As Greg mentions on another thread, not all patches are *intended* to be
production quality by their authors. Many patches are shared for the
purpose of eliciting general feedback. You yourself encourage a group
development approach and specifically punish those people dropping
completely "finished" code into the queue and expecting it to be
committed as-is. So people produce patches in various states of
readiness, knowing that they may have to produce many versions before it
is finally accepted. Grabbing at a piece of code, then shouting
"unclean, unclean" just destroys the feedback process and leaves
teamwork in tatters.

My arse doesn't need wiping, thanks, nor does my bottom need smacking,
nor are you ever likely to catch me telling fibs. If you think so,
you're wrong and you should reset.

What you will find from me and others, in the past and realistically in
the future too, are patches that vary according to how near to
completion they are. Not the same thing as "completed, yet varying in
quality". If they are incomplete it is because of the idea to receive
feedback at various points. Some patches need almost none e.g. truncate
triggers (1-2 versions), some patches need almost constant feedback e.g.
async commit (24+ versions before commit). The existence of an
intermediate patch in no way signals laziness, lack of intention to
complete or any other failure to appreciate the software development
process.

If you want people to work on Postgres alongside you, I'd appreciate a
software development process that didn't roughly equate to charging at a
machine gun trench across a minefield. If you insist on following that
you should at least stop wondering why it is that the few people to have
made more than a few steps are determined and grim individuals and start
thinking about the many skilled people who have chosen non-combatant
status, and why.

--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com

PostgreSQL UK 2008 Conference: http://www.postgresql.org.uk


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>, pgsql-hackers(at)postgresql(dot)org, pgsql-patches(at)postgresql(dot)org
Subject: Re: Include Lists for Text Search
Date: 2008-03-10 12:24:08
Message-ID: 47D52868.2020602@dunslane.net
Lists: pgsql-hackers pgsql-patches

Simon Riggs wrote:
> As Greg mentions on another thread, not all patches are *intended* to be
> production quality by their authors. Many patches are shared for the
> purpose of eliciting general feedback. You yourself encourage a group
> development approach and specifically punish those people dropping
> completely "finished" code into the queue and expecting it to be
> committed as-is.
>

If you post a patch that is not intended to be of production quality, it
is best to mark it so explicitly. Then nobody can point fingers at you.
Also, Bruce would then know not to put it in the queue of patches
waiting for application.

cheers

andrew


From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>, pgsql-hackers(at)postgresql(dot)org, pgsql-patches(at)postgresql(dot)org
Subject: Re: Include Lists for Text Search
Date: 2008-03-10 13:31:09
Message-ID: 1205155869.4269.92.camel@ebony.site
Lists: pgsql-hackers pgsql-patches

On Mon, 2008-03-10 at 08:24 -0400, Andrew Dunstan wrote:
>
> Simon Riggs wrote:
> > As Greg mentions on another thread, not all patches are *intended* to be
> > production quality by their authors. Many patches are shared for the
> > purpose of eliciting general feedback. You yourself encourage a group
> > development approach and specifically punish those people dropping
> > completely "finished" code into the queue and expecting it to be
> > committed as-is.

> If you post a patch that is not intended to be of production quality, it
> is best to mark it so explicitly. Then nobody can point fingers at you.
> Also, Bruce would then know not to put it in the queue of patches
> waiting for application.

So it can be forgotten about entirely? Hmmmm.

--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com

PostgreSQL UK 2008 Conference: http://www.postgresql.org.uk


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>, pgsql-hackers(at)postgresql(dot)org, pgsql-patches(at)postgresql(dot)org
Subject: Re: Include Lists for Text Search
Date: 2008-03-10 13:42:43
Message-ID: 47D53AD3.7040603@dunslane.net
Lists: pgsql-hackers pgsql-patches

Simon Riggs wrote:
> On Mon, 2008-03-10 at 08:24 -0400, Andrew Dunstan wrote:
>
>> Simon Riggs wrote:
>>
>>> As Greg mentions on another thread, not all patches are *intended* to be
>>> production quality by their authors. Many patches are shared for the
>>> purpose of eliciting general feedback. You yourself encourage a group
>>> development approach and specifically punish those people dropping
>>> completely "finished" code into the queue and expecting it to be
>>> committed as-is.
>>>
>
>
>> If you post a patch that is not intended to be of production quality, it
>> is best to mark it so explicitly. Then nobody can point fingers at you.
>> Also, Bruce would then know not to put it in the queue of patches
>> waiting for application.
>>
>
> So it can be forgotten about entirely? Hmmmm.
>
>

I think if you post something marked Work In Progress, there is an
implied commitment on your part to post something complete at a later stage.

So if it's forgotten you would be the one doing the forgetting. ;-)

cheers

andrew


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>, pgsql-hackers(at)postgresql(dot)org, pgsql-patches(at)postgresql(dot)org
Subject: Re: Include Lists for Text Search
Date: 2008-03-10 14:01:57
Message-ID: 23115.1205157717@sss.pgh.pa.us
Lists: pgsql-hackers pgsql-patches

Andrew Dunstan <andrew(at)dunslane(dot)net> writes:
> I think if you post something marked Work In Progress, there is an
> implied commitment on your part to post something complete at a later stage.

It *wasn't* marked Work In Progress, and Simon went out of his way to
cross-post it to -patches, where the thread previously had not been:

http://archives.postgresql.org/pgsql-patches/2007-09/msg00150.php

I don't think either Bruce or I can be faulted for assuming that it was
meant to be applied. In future perhaps I should take it as a given that
Simon doesn't expect his patches to be applied?

regards, tom lane


From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>, pgsql-hackers(at)postgresql(dot)org, pgsql-patches(at)postgresql(dot)org
Subject: Re: Include Lists for Text Search
Date: 2008-03-10 14:05:22
Message-ID: 1205157922.4269.117.camel@ebony.site
Lists: pgsql-hackers pgsql-patches

On Mon, 2008-03-10 at 09:42 -0400, Andrew Dunstan wrote:

> I think if you post something marked Work In Progress, there is an
> implied commitment on your part to post something complete at a later stage.
>
> So if it's forgotten you would be the one doing the forgetting. ;-)

But if they aren't on a review list, they won't get reviewed, no matter
what their status. So everybody has to maintain their own status list
and re-submit patches for review monthly until reviewed?

I like the idea of marking things WIP, but I think we need a clear
system where we agree that multiple statuses exist and that they are
described in particular ways.

--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com

PostgreSQL UK 2008 Conference: http://www.postgresql.org.uk


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org, pgsql-patches(at)postgresql(dot)org
Subject: Re: Include Lists for Text Search
Date: 2008-03-10 14:17:31
Message-ID: 23353.1205158651@sss.pgh.pa.us
Lists: pgsql-hackers pgsql-patches

Oleg Bartunov <oleg(at)sai(dot)msu(dot)su> writes:
> On Sun, 9 Mar 2008, Tom Lane wrote:
>> Would a similar parameter be useful for any of the other dictionary
>> types?

> There are many things desirable to do with dictionaries, for example,
> telling a dictionary to return the original word plus its normal form. Another
> feature is to allow filtering dictionaries, not only recognize-and-stop
> dictionaries. We have a feeling that a little middleware would help
> implement this, and CaseSensitive too.

Hmm, I can see how some middleware would help with folding or not
folding the input token, but what about the words coming from the
dictionary file (particularly the *output* lexeme)? It's not apparent
to me that it's sensible to try to control that from outside the
dictionary.

regards, tom lane


From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Andrew Dunstan <andrew(at)dunslane(dot)net>, Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>, pgsql-hackers(at)postgresql(dot)org, pgsql-patches(at)postgresql(dot)org
Subject: Re: Include Lists for Text Search
Date: 2008-03-10 14:24:59
Message-ID: 1205159099.4269.130.camel@ebony.site
Lists: pgsql-hackers pgsql-patches

On Mon, 2008-03-10 at 10:01 -0400, Tom Lane wrote:

> In future perhaps I should take it as a given that
> Simon doesn't expect his patches to be applied?

I think you should take it as a given that Simon would like to try to
work together, sharing ideas and code, without insults and public
derision when things don't fit.

--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com

PostgreSQL UK 2008 Conference: http://www.postgresql.org.uk


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>, pgsql-hackers(at)postgresql(dot)org, pgsql-patches(at)postgresql(dot)org
Subject: Re: [PATCHES] Include Lists for Text Search
Date: 2008-03-10 14:31:53
Message-ID: 200803101431.m2AEVrj02547@momjian.us
Lists: pgsql-hackers pgsql-patches

Andrew Dunstan wrote:
>
>
> Simon Riggs wrote:
> > As Greg mentions on another thread, not all patches are *intended* to be
> > production quality by their authors. Many patches are shared for the
> > purpose of eliciting general feedback. You yourself encourage a group
> > development approach and specifically punish those people dropping
> > completely "finished" code into the queue and expecting it to be
> > committed as-is.
> >
>
> If you post a patch that is not intended to be of production quality, it
> is best to mark it so explicitly. Then nobody can point fingers at you.
> Also, Bruce would then know not to put it in the queue of patches
> waiting for application.

It would still be in that queue because we might just mark it as a TODO.

FYI, during this first release cycle, we need to apply patches and
decide on TODOs. We skipped TODO discussion during feature freeze, so
we need to do it now for held ideas.

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://postgres.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +


From: Teodor Sigaev <teodor(at)sigaev(dot)ru>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>, Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org, pgsql-patches(at)postgresql(dot)org
Subject: Re: Include Lists for Text Search
Date: 2008-03-10 14:59:27
Message-ID: 47D54CCF.1090904@sigaev.ru
Lists: pgsql-hackers pgsql-patches

> Hmm, I can see how some middleware would help with folding or not
> folding the input token, but what about the words coming from the
> dictionary file (particularly the *output* lexeme)? It's not apparent
> to me that it's sensible to try to control that from outside the
> dictionary.

Right now I see a significant advantage of such a layer: two possible extensions
of dictionaries (filtering and storing the original form) are independent of the
nature of the dictionary. So, instead of modifying every dictionary, we can add a
layer common to all dictionaries, with syntax like:

CREATE/ALTER TEXT SEARCH DICTIONARY foo (...) WITH ( filtering=on|off,
store_original=on|off );

Or per token's type/dictionary pair.
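
To make the idea concrete, here is a minimal sketch of such a common layer, written in Python rather than the backend's C. It is only an illustration under assumptions: the names Layer and run_chain are hypothetical, the two flags simply reuse the option names proposed above, and the convention that a dictionary returns a list of lexemes, an empty list for a stop word, or None for an unknown token is modeled loosely on how text search dictionaries behave.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence

# A dictionary returns a list of lexemes for a token it recognizes
# (an empty list means "stop word"), or None if it does not know the token.
Dictionary = Callable[[str], Optional[List[str]]]


@dataclass
class Layer:
    """Hypothetical wrapper: options live outside the dictionary itself."""
    dictionary: Dictionary
    filtering: bool = False        # pass the result on instead of stopping
    store_original: bool = False   # keep the token this dictionary was given


def run_chain(token: str, chain: Sequence[Layer]) -> List[str]:
    current = token
    for layer in chain:
        result = layer.dictionary(current)
        if result is None:
            continue               # not recognized: try the next dictionary
        if layer.filtering and result:
            current = result[0]    # hand the normalized form to the next one
            continue
        if layer.store_original and result and current not in result:
            result = [current] + result
        return result              # recognize-and-stop behaviour
    return []                      # nothing recognized the token: drop it


def lowercase(token: str) -> Optional[List[str]]:
    return [token.lower()]


def synonyms(token: str) -> Optional[List[str]]:
    return {"pgsql": ["postgresql"]}.get(token)


chain = [Layer(lowercase, filtering=True), Layer(synonyms, store_original=True)]
print(run_chain("PgSQL", chain))
```

Neither lowercase nor synonyms knows anything about the flags, which is the point of putting them in a separate layer.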

--
Teodor Sigaev E-mail: teodor(at)sigaev(dot)ru
WWW: http://www.sigaev.ru/


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Teodor Sigaev <teodor(at)sigaev(dot)ru>
Cc: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>, Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org, pgsql-patches(at)postgresql(dot)org
Subject: Re: [PATCHES] Include Lists for Text Search
Date: 2008-03-10 15:10:55
Message-ID: 24143.1205161855@sss.pgh.pa.us
Lists: pgsql-hackers pgsql-patches

Teodor Sigaev <teodor(at)sigaev(dot)ru> writes:
>> Hmm, I can see how some middleware would help with folding or not
>> folding the input token, but what about the words coming from the
>> dictionary file (particularly the *output* lexeme)? It's not apparent
>> to me that it's sensible to try to control that from outside the
>> dictionary.

> Right now I see a significant advantage of such a layer: two possible extensions
> of dictionaries (filtering and storing the original form) are independent of the
> nature of the dictionary. So, instead of modifying every dictionary, we can add a
> layer common to all dictionaries, with syntax like:

> CREATE/ALTER TEXT SEARCH DICTIONARY foo (...) WITH ( filtering=on|off,
> store_original=on|off );

> Or per token's type/dictionary pair.

Well, if you think this can/should be done somewhere outside the
dictionary, should I revert the applied patch?

regards, tom lane


From: Teodor Sigaev <teodor(at)sigaev(dot)ru>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>, Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org, pgsql-patches(at)postgresql(dot)org
Subject: Re: [PATCHES] Include Lists for Text Search
Date: 2008-03-10 15:21:03
Message-ID: 47D551DF.6000305@sigaev.ru
Lists: pgsql-hackers pgsql-patches

> Well, if you think this can/should be done somewhere outside the
> dictionary, should I revert the applied patch?

No, that patch is about the case sensitivity of the synonym dictionary. I suppose
Simon wants to replace 'bill' with 'account', but doesn't want to get
'account Clinton'.

For other dictionaries (a dictionary of numbers, snowball) that option is
meaningless.
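
The 'bill' versus 'Bill' case can be sketched as follows. This is an illustrative Python model, not the synonym dictionary's actual code: make_synonym_dict and its case_sensitive argument are hypothetical names, and folding both the file entries and the input token to lower case in the insensitive mode is an assumption made for the example.

```python
from typing import Callable, Iterable, List, Optional, Tuple


def make_synonym_dict(
    pairs: Iterable[Tuple[str, str]], case_sensitive: bool = False
) -> Callable[[str], Optional[List[str]]]:
    """Build a lookup function from (word, synonym) pairs."""
    if case_sensitive:
        table = dict(pairs)
    else:
        # Assumption: insensitive mode folds the file contents too.
        table = {word.lower(): syn.lower() for word, syn in pairs}

    def lookup(token: str) -> Optional[List[str]]:
        key = token if case_sensitive else token.lower()
        synonym = table.get(key)
        # None means "not recognized", so a later dictionary can try.
        return None if synonym is None else [synonym]

    return lookup


sensitive = make_synonym_dict([("bill", "account")], case_sensitive=True)
insensitive = make_synonym_dict([("bill", "account")])

print(sensitive("bill"))    # replaced by the synonym
print(sensitive("Bill"))    # left alone, so "Bill Clinton" survives
print(insensitive("Bill"))  # folded first, then replaced
```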

--
Teodor Sigaev E-mail: teodor(at)sigaev(dot)ru
WWW: http://www.sigaev.ru/


From: Teodor Sigaev <teodor(at)sigaev(dot)ru>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>, Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org, pgsql-patches(at)postgresql(dot)org
Subject: Re: Include Lists for Text Search
Date: 2008-03-10 16:01:46
Message-ID: 47D55B6A.8080002@sigaev.ru
Lists: pgsql-hackers pgsql-patches

> Right now I see a significant advantage of such a layer: two possible
> extensions of dictionaries (filtering and storing the original form) are

One more extension: drop overly long words. For example, lower the maximum
word length to prevent long words from being indexed; a word with 100 characters
is suspiciously long for human input.
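
As a rough sketch of that extension, again in Python and with hypothetical names: drop_long_words and the max_length default of 40 are assumptions picked for illustration, not an existing setting.

```python
from typing import Iterable, List


def drop_long_words(tokens: Iterable[str], max_length: int = 40) -> List[str]:
    """Keep only tokens short enough to plausibly be human input."""
    return [token for token in tokens if len(token) <= max_length]


tokens = ["index", "x" * 100, "search"]
print(drop_long_words(tokens))
```

The limit would presumably be configurable per dictionary or per token type, in the same way as the other options of the common layer.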

--
Teodor Sigaev E-mail: teodor(at)sigaev(dot)ru
WWW: http://www.sigaev.ru/