BUG #4562: ts_headline() adds space when parsing url

Lists: pgsql-bugs
From: "Denis Monsieur" <dmonsieur(at)gmail(dot)com>
To: pgsql-bugs(at)postgresql(dot)org
Subject: BUG #4562: ts_headline() adds space when parsing url
Date: 2008-12-03 23:33:18
Message-ID: 200812032333.mB3NXInQ049716@wwwmaster.postgresql.org


The following bug has been logged online:

Bug reference: 4562
Logged by: Denis Monsieur
Email address: dmonsieur(at)gmail(dot)com
PostgreSQL version: 8.3.4
Operating system: Debian etch
Description: ts_headline() adds space when parsing url
Details:

My system is 8.3.4, but people in #postgresql with 8.3.5 have confirmed the
issue.

The problem is a space being added to text in the form of
http://some.url/path
Compare the output:

shs=# SELECT ts_headline('http://some.url', to_tsquery('sometext'));
   ts_headline
-----------------
 http://some.url
(1 row)

shs=# SELECT ts_headline('http://some.url/path', to_tsquery('sometext'));
      ts_headline
-----------------------
 http:// some.url/path
(1 row)


From: "gildas prime" <g(dot)prime(at)aeschemunex(dot)com>
To: "Denis Monsieur" <dmonsieur(at)gmail(dot)com>, <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: BUG #4562: ts_headline() adds space when parsing url
Date: 2008-12-04 08:33:18
Message-ID: 043E5400842F594D89696F862B48DEA14428FD@hades.aes-local.fr

Same thing on 8.3.5 (Win32):

ester=# SELECT ts_headline('http://some.url/path', to_tsquery('sometext'));
      ts_headline
-----------------------
 http:// some.url/path
(1 row)

ester=# SELECT ts_headline('http://some.url', to_tsquery('sometext'));
   ts_headline
-----------------
 http://some.url
(1 row)

Gildas


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Denis Monsieur" <dmonsieur(at)gmail(dot)com>
Cc: pgsql-bugs(at)postgresql(dot)org, Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>, Teodor Sigaev <teodor(at)sigaev(dot)ru>
Subject: Re: BUG #4562: ts_headline() adds space when parsing url
Date: 2008-12-09 01:20:28
Message-ID: 4357.1228785628@sss.pgh.pa.us

"Denis Monsieur" <dmonsieur(at)gmail(dot)com> writes:
> The problem is a space being added to text in the form of
> http://some.url/path
> Compare the output:

> shs=# SELECT ts_headline('http://some.url', to_tsquery('sometext'));
>    ts_headline
> -----------------
>  http://some.url
> (1 row)

> shs=# SELECT ts_headline('http://some.url/path', to_tsquery('sometext'));
>       ts_headline
> -----------------------
>  http:// some.url/path
> (1 row)

I looked into this, and it seems that the problem is that
generateHeadline() emits a space for any token marked as replace = 1.
I think it probably shouldn't emit anything at all. AFAICS the cases
where replace will get set are token types URL, TAG, NUMHWORD,
ASCIIHWORD, HWORD. For URL and the HWORD variants the space is
certainly undesirable, because these token types are just respecifying
text that is also covered by their component tokens. The only case
where you could make an argument that the space is useful is TAG,
as in

regression=# SELECT ts_headline('http<foo>blah', to_tsquery('sometext'));
 ts_headline
-------------
 http blah
(1 row)

But it seems to me to be at least as plausible that you should get
nothing as that you should get a space for a removed tag.

Comments?

regards, tom lane
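
The diagnosis above can be illustrated with a toy model. The Python sketch below is not the PostgreSQL source: the Token class, the generate_headline function, and both token streams are hypothetical, and the token splits are only inferred from the outputs shown in this thread. It shows how emitting a space for every replace-marked token would produce the reported result, and how emitting nothing would change it.

```python
from dataclasses import dataclass


@dataclass
class Token:
    """One parser token (hypothetical model, not the real C struct)."""
    text: str
    replace: bool = False  # set for URL, TAG and the HWORD variants per the thread


def generate_headline(tokens, replacement=" "):
    """Join token texts, emitting `replacement` in place of replace-marked tokens."""
    return "".join(replacement if t.replace else t.text for t in tokens)


# Assumed token stream for 'http://some.url/path': the URL token respecifies
# text that its component tokens (host and path) also cover.
url_tokens = [
    Token("http://"),
    Token("some.url/path", replace=True),
    Token("some.url"),
    Token("/path"),
]

# Assumed token stream for 'http<foo>blah': the tag is the only replaced token.
tag_tokens = [
    Token("http"),
    Token("<foo>", replace=True),
    Token("blah"),
]

if __name__ == "__main__":
    print(generate_headline(url_tokens))      # space emitted: http:// some.url/path
    print(generate_headline(url_tokens, ""))  # nothing emitted: http://some.url/path
    print(generate_headline(tag_tokens))      # space emitted: http blah
    print(generate_headline(tag_tokens, ""))  # nothing emitted: httpblah
```

In this model the TAG case is the only one where dropping the space changes how the visible words are separated, which is the trade-off raised in the message above.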


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Denis Monsieur <dmonsieur(at)gmail(dot)com>, pgsql-bugs(at)postgresql(dot)org, Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>, Teodor Sigaev <teodor(at)sigaev(dot)ru>
Subject: Re: BUG #4562: ts_headline() adds space when parsing url
Date: 2009-01-15 01:38:31
Message-ID: 200901150138.n0F1cVf11755@momjian.us


This bug still exists in my testing.

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +


From: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Denis Monsieur <dmonsieur(at)gmail(dot)com>, pgsql-bugs(at)postgresql(dot)org, Teodor Sigaev <teodor(at)sigaev(dot)ru>
Subject: Re: BUG #4562: ts_headline() adds space when parsing url
Date: 2009-01-15 06:19:49
Message-ID: Pine.LNX.4.64.0901150919130.9554@sn.sai.msu.ru

On Wed, 14 Jan 2009, Bruce Momjian wrote:

>
> This bug still exists in my testing.

We have fixed all issues with ts_headline and will submit the fix soon.

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg(at)sai(dot)msu(dot)su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Denis Monsieur <dmonsieur(at)gmail(dot)com>, pgsql-bugs(at)postgresql(dot)org, Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>, Teodor Sigaev <teodor(at)sigaev(dot)ru>
Subject: Re: BUG #4562: ts_headline() adds space when parsing url
Date: 2009-01-15 17:17:29
Message-ID: 200901151717.n0FHHTZ10605@momjian.us


This has been fixed and will be in the next 8.3 minor release.
