Re: dot to be considered as a word delimiter?

Lists: pgsql-hackers
From: Sushant Sinha <sushant354(at)gmail(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Cc: shamnad(at)gmail(dot)com
Subject: dot to be considered as a word delimiter?
Date: 2009-05-30 05:59:29
Message-ID: 1243663169.12123.244.camel@dragflick

Currently it seems that a dot is not considered a word delimiter
by the English parser.

lawdb=# select to_tsvector('english', 'Mr.J.Sai Deepak');
to_tsvector
-------------------------
'deepak':2 'mr.j.sai':1
(1 row)

So the word obtained is "mr.j.sai" rather than the three words "mr", "j",
and "sai".

The parser splits the words correctly if there is a space in between, as
a space is definitely a word delimiter.

lawdb=# select to_tsvector('english', 'Mr. J. Sai Deepak');
to_tsvector
---------------------------------
'j':2 'mr':1 'sai':3 'deepak':4
(1 row)

I think that a dot should be considered a word delimiter because,
when a dot is not followed by a space, most of the time it is a typing
error. Besides, there are not many valid English words that have a dot in
the middle.

-Sushant.


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: <sushant354(at)gmail(dot)com>,<pgsql-hackers(at)postgresql(dot)org>
Cc: <shamnad(at)gmail(dot)com>
Subject: Re: dot to be considered as a word delimiter?
Date: 2009-06-02 01:22:23
Message-ID: 4A24387E.EE98.0025.1@wicourts.gov

Sushant Sinha <sushant354(at)gmail(dot)com> wrote:

> I think that a dot should be considered a word delimiter because,
> when a dot is not followed by a space, most of the time it is a
> typing error. Besides, there are not many valid English words that
> have a dot in the middle.

It's not treating it as an English word, but as a host name.

select ts_debug('english', 'Mr.J.Sai Deepak');
ts_debug
---------------------------------------------------------------------------
(host,Host,Mr.J.Sai,{simple},simple,{mr.j.sai})
(blank,"Space symbols"," ",{},,)
(asciiword,"Word, all ASCII",Deepak,{english_stem},english_stem,{deepak})
(3 rows)

You could run it through a dictionary which would deal with host
tokens differently. Just be aware of what you'll be doing to
www.google.com if you run into it.
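
To make that trade-off concrete, here is a minimal Python sketch (not PostgreSQL code) of what splitting host tokens on dots would produce. The function name `split_host_token` is hypothetical and only mimics the idea of mapping one host token to several lexemes; it says nothing about how a real text search dictionary would be written.

```python
from typing import List


def split_host_token(token: str) -> List[str]:
    """Hypothetical stand-in for a dictionary that splits host tokens.

    Lowercases the token and returns its non-empty dot-separated parts.
    """
    return [part for part in token.lower().split(".") if part]


if __name__ == "__main__":
    # The mistyped name is split into the words the original poster wants.
    print(split_host_token("Mr.J.Sai"))
    # A genuine host name is split as well, so it can no longer be
    # searched for as a single token.
    print(split_host_token("www.google.com"))
```

Under this assumption, a real host name is broken into pieces just like the mistyped name, which is the side effect warned about above.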

I hope this helps.

-Kevin


From: Kenneth Marshall <ktm(at)rice(dot)edu>
To: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: sushant354(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org, shamnad(at)gmail(dot)com
Subject: Re: dot to be considered as a word delimiter?
Date: 2009-06-02 12:47:25
Message-ID: 20090602124725.GD18879@it.is.rice.edu

On Mon, Jun 01, 2009 at 08:22:23PM -0500, Kevin Grittner wrote:
> Sushant Sinha <sushant354(at)gmail(dot)com> wrote:
>
> > I think that a dot should be considered a word delimiter because,
> > when a dot is not followed by a space, most of the time it is a
> > typing error. Besides, there are not many valid English words that
> > have a dot in the middle.
>
> It's not treating it as an English word, but as a host name.
>
> select ts_debug('english', 'Mr.J.Sai Deepak');
> ts_debug
> ---------------------------------------------------------------------------
> (host,Host,Mr.J.Sai,{simple},simple,{mr.j.sai})
> (blank,"Space symbols"," ",{},,)
> (asciiword,"Word, all ASCII",Deepak,{english_stem},english_stem,{deepak})
> (3 rows)
>
> You could run it through a dictionary which would deal with host
> tokens differently. Just be aware of what you'll be doing to
> www.google.com if you run into it.
>
> I hope this helps.
>
> -Kevin
>

In our uses of full text indexing, it is much more important to
be able to find host names and URLs than to find mistyped names.
My two cents.

Cheers,
Ken


From: Sushant Sinha <sushant354(at)gmail(dot)com>
To: Kenneth Marshall <ktm(at)rice(dot)edu>
Cc: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, pgsql-hackers(at)postgresql(dot)org, shamnad(at)gmail(dot)com
Subject: Re: dot to be considered as a word delimiter?
Date: 2009-06-02 20:40:51
Message-ID: 9fb559330906021340l1f9f520s57310aa034af3511@mail.gmail.com

Fair enough. I agree that there is a valid need for returning such tokens as
a host. But I think there is definitely a need to break it down into
individual words. This will help in cases when a document is missing a space
in between the words.

So what we can do is: return the entire compound word as Host and also break
it down into individual words. I can put up a patch for this if you guys
agree.

Returning multiple tokens for the same word is a feature of the text search
parser as explained in the documentation here:
http://www.postgresql.org/docs/8.3/static/textsearch-parsers.html
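
The proposal can be sketched in a few lines of Python. This is only an illustration of the intended output, not parser code: the function `tokens_with_parts` and the token type name `host_part` are hypothetical, and the whitespace-based splitting is a simplification of what the real parser does.

```python
from typing import List, Tuple


def tokens_with_parts(text: str) -> List[Tuple[str, str]]:
    """Return (type, token) pairs for whitespace-separated words.

    A word with an inner dot is emitted once as a whole 'host' token and
    then again as individual 'host_part' tokens (hypothetical type name).
    """
    out: List[Tuple[str, str]] = []
    for word in text.split():
        stripped = word.rstrip(".")  # ignore sentence-final dots
        if not stripped:
            continue
        parts = [p for p in stripped.split(".") if p]
        if len(parts) > 1:
            out.append(("host", stripped))
            out.extend(("host_part", p) for p in parts)
        else:
            out.append(("word", stripped))
    return out


if __name__ == "__main__":
    for token_type, token in tokens_with_parts("Mr.J.Sai Deepak"):
        print(token_type, token)
```

With this shape of output, a search for the full host name and a search for any of its parts could both match the same document.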

Thanks,
Sushant.



From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "Sushant Sinha" <sushant354(at)gmail(dot)com>, "Kenneth Marshall" <ktm(at)rice(dot)edu>
Cc: <shamnad(at)gmail(dot)com>,<pgsql-hackers(at)postgresql(dot)org>
Subject: Re: dot to be considered as a word delimiter?
Date: 2009-06-02 20:57:02
Message-ID: 4A254BCE.EE98.0025.1@wicourts.gov

Sushant Sinha <sushant354(at)gmail(dot)com> wrote:

> So what we can do is: return the entire compound word as Host and
> also break it down into individual words.

So, pretty much like we handle hyphenation?

-Kevin


From: Kenneth Marshall <ktm(at)rice(dot)edu>
To: Sushant Sinha <sushant354(at)gmail(dot)com>
Cc: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, pgsql-hackers(at)postgresql(dot)org, shamnad(at)gmail(dot)com
Subject: Re: dot to be considered as a word delimiter?
Date: 2009-06-02 20:57:49
Message-ID: 20090602205749.GJ18879@it.is.rice.edu

On Tue, Jun 02, 2009 at 04:40:51PM -0400, Sushant Sinha wrote:
> Fair enough. I agree that there is a valid need for returning such tokens as
> a host. But I think there is definitely a need to break it down into
> individual words. This will help in cases when a document is missing a space
> in between the words.
>
>
> So what we can do is: return the entire compound word as Host and also break
> it down into individual words. I can put up a patch for this if you guys
> agree.
>
> Returning multiple tokens for the same word is a feature of the text search
> parser as explained in the documentation here:
> http://www.postgresql.org/docs/8.3/static/textsearch-parsers.html
>
> Thanks,
> Sushant.
>

+1

Ken