Re: Fragments in tsearch2 headline

Lists: pgsql-general
From: "Catalin Marinas" <catalin(dot)marinas(at)gmail(dot)com>
To: pgsql-general(at)postgresql(dot)org
Subject: Fragments in tsearch2 headline
Date: 2007-10-24 21:13:53
Message-ID: b0943d9e0710241413t7e2149c5ud7f11cd49ca903de@mail.gmail.com

Hi,

(I first posted it via google groups and realised that I have to be
subscribed; now posting directly)

I searched the list but couldn't find anyone raising the issue (or it
might simply be my way of using the tool).

I'd like to search through some text documents for words and generate
headlines. The search works fine but, if the words are far apart in
the document, the headline only highlights one of the words. Enlarging
the headline with Min/MaxWords is not an option as I have limited
space for displaying it.

Is there an easy way to generate a headline from separate fragments
containing the search words and maybe separated by "..."? An option is
to generate separate headlines and concatenate them before displaying
(with a problem when the words are in the same fragment). Another
option would be to somehow get the found words position in the
document (does anyone know how?) and generate the headline myself.

Any other ideas?

Thanks.

--
Catalin


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Catalin Marinas" <catalin(dot)marinas(at)gmail(dot)com>
Cc: pgsql-general(at)postgresql(dot)org, Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>, Teodor Sigaev <teodor(at)sigaev(dot)ru>
Subject: Re: Fragments in tsearch2 headline
Date: 2007-10-28 03:39:05
Message-ID: 20717.1193542745@sss.pgh.pa.us

"Catalin Marinas" <catalin(dot)marinas(at)gmail(dot)com> writes:
> Is there an easy way to generate a headline from separate fragments
> containing the search words and maybe separated by "..."?

Hmm, the documentation for ts_headline claims it does this already:

<function>ts_headline</function> accepts a document along
with a query, and returns one or more ellipsis-separated excerpts from
the document in which terms from the query are highlighted.

However, a quick look at the code suggests this is a lie --- I see no
evidence whatever that there's any smarts for putting in ellipses.

Oleg, Teodor, is there something I missed here? Or do we need to change
the documentation?

regards, tom lane


From: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Catalin Marinas <catalin(dot)marinas(at)gmail(dot)com>, pgsql-general(at)postgresql(dot)org, Teodor Sigaev <teodor(at)sigaev(dot)ru>
Subject: Re: Fragments in tsearch2 headline
Date: 2007-10-28 04:14:28
Message-ID: Pine.LNX.4.64.0710280705320.14368@sn.sai.msu.ru

On Sat, 27 Oct 2007, Tom Lane wrote:

> "Catalin Marinas" <catalin(dot)marinas(at)gmail(dot)com> writes:
>> Is there an easy way to generate a headline from separate fragments
>> containing the search words and maybe separated by "..."?
>
> Hmm, the documentation for ts_headline claims it does this already:
>
> <function>ts_headline</function> accepts a document along
> with a query, and returns one or more ellipsis-separated excerpts from
> the document in which terms from the query are highlighted.
>
> However, a quick look at the code suggests this is a lie --- I see no
> evidence whatever that there's any smarts for putting in ellipses.
>
> Oleg, Teodor, is there something I missed here? Or do we need to change
> the documentation?

The documentation is probably not correct here. 'ellipsis-separated' should be
treated as general wording. The default highlighting is <b>..</b>, as
stated below in the docs.

postgres=# select ts_headline('this is a highlighted text','highlight'::tsquery, 'StartSel=...,StopSel=...');
           ts_headline
----------------------------------
 this is a ...highlighted... text

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg(at)sai(dot)msu(dot)su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83


From: "Catalin Marinas" <catalin(dot)marinas(at)gmail(dot)com>
To: "Oleg Bartunov" <oleg(at)sai(dot)msu(dot)su>
Cc: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-general(at)postgresql(dot)org, "Teodor Sigaev" <teodor(at)sigaev(dot)ru>
Subject: Re: Fragments in tsearch2 headline
Date: 2007-10-30 10:41:27
Message-ID: b0943d9e0710300341h34d68bbp4a717d681b769b3f@mail.gmail.com

On 28/10/2007, Oleg Bartunov <oleg(at)sai(dot)msu(dot)su> wrote:
> On Sat, 27 Oct 2007, Tom Lane wrote:
>
> > "Catalin Marinas" <catalin(dot)marinas(at)gmail(dot)com> writes:
> >> Is there an easy way to generate a headline from separate fragments
> >> containing the search words and maybe separated by "..."?
> >
> > Hmm, the documentation for ts_headline claims it does this already:
[...]
> > However, a quick look at the code suggests this is a lie --- I see no
> > evidence whatever that there's any smarts for putting in ellipses.
>
> Probably documentation is not correct here. 'ellipsis-separated' should be
> treated as a general wording. Default highlighting is <b>..</b> as it
> stated below in docs.

It seems that I'll have to implement the headline outside the query
(Python, in my case). I would use to_tsvector and to_tsquery to
generate the lexemes and the word positions, add them to a hash table
and use the position of the matching lexemes to generate the headline.

I could also highlight the full text and generate the headline I want
based on it but if I limit the number of excerpts, it gets complicated
to avoid the same lexeme being shown in all excerpts. Is a lexeme
always a substring of the corresponding token (so that I can use
simple regexp)?
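
A minimal Python sketch of the hash-table idea above, assuming the query
returns the tsvector in its usual text form (for example
'2':2,12 'abc':8). The name parse_tsvector is hypothetical, and backslash
escapes inside lexemes are not handled. On the substring question: with a
stemming configuration a lexeme is not guaranteed to be a substring of its
token, so matching by word position is likely safer than a regexp on the
lexeme.

```python
import re

# One entry looks like 'lexeme':1,5A,9 ; a quote inside a lexeme is doubled.
_ENTRY = re.compile(r"'((?:[^']|'')*)'(?::([0-9A-D,]+))?")


def parse_tsvector(text):
    """Map each lexeme to its list of word positions (hypothetical helper)."""
    positions = {}
    for match in _ENTRY.finditer(text):
        lexeme = match.group(1).replace("''", "'")
        raw = match.group(2) or ""
        # Drop an optional weight letter (A-D) that may follow a position.
        positions[lexeme] = [int(p.rstrip("ABCD")) for p in raw.split(",") if p]
    return positions


doc = parse_tsvector("'2':2,12 '3':3,6,13 'abc':8")
# Positions of the query lexemes that actually occur in the document.
hits = {lexeme: doc[lexeme] for lexeme in ("2", "abc") if lexeme in doc}
```

The positions are 1-based word offsets, so they can index into the original
text once it has been split into words the same way the parser did.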

Any other ideas?

Thanks.

--
Catalin


From: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
To: Catalin Marinas <catalin(dot)marinas(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-general(at)postgresql(dot)org, Teodor Sigaev <teodor(at)sigaev(dot)ru>
Subject: Re: Fragments in tsearch2 headline
Date: 2007-10-30 10:52:07
Message-ID: Pine.LNX.4.64.0710301350190.14368@sn.sai.msu.ru

Catalin,

What do you need? What's wrong with this?

postgres=# select ts_headline('1 2 3 4 5 3 4 abc abc 2 3 xyz','2'::tsquery, 'StartSel=...,StopSel=...');
                ts_headline
-------------------------------------------
 1 ...2... 3 4 5 3 4 abc abc ...2... 3 xyz

Oleg

Regards,
Oleg


From: Richard Huxton <dev(at)archonet(dot)com>
To: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
Cc: Catalin Marinas <catalin(dot)marinas(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-general(at)postgresql(dot)org, Teodor Sigaev <teodor(at)sigaev(dot)ru>
Subject: Re: Fragments in tsearch2 headline
Date: 2007-10-30 11:05:09
Message-ID: 47270FE5.4070102@archonet.com

Oleg Bartunov wrote:
> Catalin,
>
> what is your need ? What's wrong with this ?
>
> postgres=# select ts_headline('1 2 3 4 5 3 4 abc abc 2 3
> xyz','2'::tsquery, 'StartSel=...,StopSel=...')
> ;
> ts_headline
> -------------------------------------------
> 1 ...2... 3 4 5 3 4 abc abc ...2... 3 xyz

I think he wants something like: "1 2 3 ... abc 2 3 ..."

A few characters of context around each match and then ... between. Kind
of like grep -C.

--
Richard Huxton
Archonet Ltd


From: "Catalin Marinas" <catalin(dot)marinas(at)gmail(dot)com>
To: "Richard Huxton" <dev(at)archonet(dot)com>
Cc: "Oleg Bartunov" <oleg(at)sai(dot)msu(dot)su>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-general(at)postgresql(dot)org, "Teodor Sigaev" <teodor(at)sigaev(dot)ru>
Subject: Re: Fragments in tsearch2 headline
Date: 2007-10-30 13:42:25
Message-ID: b0943d9e0710300642u60033bb2m3a2b0cef4308c1a1@mail.gmail.com

On 30/10/2007, Richard Huxton <dev(at)archonet(dot)com> wrote:
> Oleg Bartunov wrote:
> > Catalin,
> >
> > what is your need ? What's wrong with this ?
> >
> > postgres=# select ts_headline('1 2 3 4 5 3 4 abc abc 2 3
> > xyz','2'::tsquery, 'StartSel=...,StopSel=...')
> > ;
> > ts_headline
> > -------------------------------------------
> > 1 ...2... 3 4 5 3 4 abc abc ...2... 3 xyz
>
> I think he wants something like: "1 2 3 ... abc 2 3 ..."
>
> A few characters of context around each match and then ... between. Kind
> of like grep -C.

That's pretty much correct (with the difference that I'd like context
of words rather than lines as in "grep" and StartSel=<b>,
StopSel=</b>).

Since the text I want a headline for might be pretty long (tens of
lines), I'd like to only show the excerpts around the matching words.
Similar to the above example:

select ts_headline('1 2 3 4 5 3 4 abc x y z 2 3', '2 & abc'::tsquery);

should give:

'1 <b>2</b> 3 4 ... 3 4 <b>abc</b> x y'

Currently, if you limit the maximum words so that 'abc' is too far, it
only highlights the first match.

Many of the search engines (including google) show the headline this
way. I think Lucene can do this as well but I've never used it to be
sure.
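
The behaviour asked for above can be sketched in Python, outside the query.
This is only an illustration under simplifying assumptions: it splits on
whitespace and compares raw words instead of lexemes, it uses the first
occurrence of each query term, and the name fragment_headline and its
parameters are hypothetical.

```python
def fragment_headline(text, terms, context=2, start_sel="<b>", stop_sel="</b>"):
    """Build a headline from short fragments around the first hit of each term."""
    words = text.split()
    wanted = {t.lower() for t in terms}

    # Remember only the first position at which each query term occurs.
    first_hit = {}
    for i, word in enumerate(words):
        key = word.lower()
        if key in wanted and key not in first_hit:
            first_hit[key] = i

    # Turn each hit into a window of context words, merging windows that touch.
    windows = []
    for pos in sorted(first_hit.values()):
        lo = max(0, pos - context)
        hi = min(len(words) - 1, pos + context)
        if windows and lo <= windows[-1][1] + 1:
            windows[-1][1] = max(windows[-1][1], hi)
        else:
            windows.append([lo, hi])

    # Highlight query terms inside each window and join fragments with "...".
    fragments = []
    for lo, hi in windows:
        piece = [
            start_sel + w + stop_sel if w.lower() in wanted else w
            for w in words[lo:hi + 1]
        ]
        fragments.append(" ".join(piece))
    return " ... ".join(fragments)


print(fragment_headline("1 2 3 4 5 3 4 abc x y z 2 3", ["2", "abc"]))
# prints: 1 <b>2</b> 3 4 ... 3 4 <b>abc</b> x y
```

Taking only the first hit per term keeps the headline short and avoids
showing the same term in every fragment; a real implementation would have to
decide how many fragments to show and how to rank them.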

--
Catalin


From: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
To: Catalin Marinas <catalin(dot)marinas(at)gmail(dot)com>
Cc: Richard Huxton <dev(at)archonet(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-general(at)postgresql(dot)org, Teodor Sigaev <teodor(at)sigaev(dot)ru>
Subject: Re: Fragments in tsearch2 headline
Date: 2007-10-30 15:39:53
Message-ID: Pine.LNX.4.64.0710301834340.14368@sn.sai.msu.ru

On Tue, 30 Oct 2007, Catalin Marinas wrote:

> On 30/10/2007, Richard Huxton <dev(at)archonet(dot)com> wrote:
>> Oleg Bartunov wrote:
>>> Catalin,
>>>
>>> what is your need ? What's wrong with this ?
>>>
>>> postgres=# select ts_headline('1 2 3 4 5 3 4 abc abc 2 3
>>> xyz','2'::tsquery, 'StartSel=...,StopSel=...')
>>> ;
>>> ts_headline
>>> -------------------------------------------
>>> 1 ...2... 3 4 5 3 4 abc abc ...2... 3 xyz
>>
>> I think he wants something like: "1 2 3 ... abc 2 3 ..."
>>
>> A few characters of context around each match and then ... between. Kind
>> of like grep -C.
>
> That's pretty much correct (with the difference that I'd like context
> of words rather than lines as in "grep" and StartSel=<b>,
> StopSel=</b>).
>
> Since the text I want a headline for might be pretty long (tens of
> lines), I'd like to only show the excerpts around the matching words.
> Similar to the above example:
>
> select ts_headline('1 2 3 4 5 3 4 abc x y z 2 3', '2 & abc'::tsquery);
>
> should give:
>
> '1 <b>2</b> 3 4 ... 3 4 <b>abc</b> x y'
>
> Currently, if you limit the maximum words so that 'abc' is too far, it
> only highlights the first match.

ok, then you have to formalize many things - how long the excerpts should be,
how many excerpts to show, etc. In tsearch2 we have the get_covers() function,
which produces all excerpts like:

=# select get_covers(to_tsvector('1 2 3 4 5 3 4 abc x y z 2 3'), '2&3'::tsquery);
get_covers
------------------------------------------------
1 {1 2 3 }1 4 5 {2 3 4 abc x y z {3 2 }2 3 }3
(1 row)

Once you formalize your requirements, you can look on it and adapt to your
needs (and share with people). I think it could be nice contrib module.
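
As a starting point for such an adaptation, here is a small Python sketch
that computes the same kind of covers from word positions. It is not the
tsearch2 implementation: it assumes a pure AND query and a plain dict of
lexeme positions, and the name minimal_covers is hypothetical.

```python
def minimal_covers(positions, terms):
    """Return minimal (start, end) spans that contain every query term.

    positions maps a lexeme to its 1-based word positions; terms is the
    list of lexemes of an AND query.
    """
    terms = set(terms)
    # All hits of all query terms, in document order.
    hits = sorted((p, t) for t in terms for p in positions.get(t, []))
    last = {}      # most recent position seen for each term
    covers = []
    for pos, term in hits:
        last[term] = pos
        if len(last) < len(terms):
            continue  # not every term has been seen yet
        left = min(last.values())
        # Keep only the shortest span for each left edge.
        if not covers or covers[-1][0] != left:
            covers.append((left, pos))
    return covers


doc = {"2": [2, 12], "3": [3, 6, 13], "abc": [8]}
print(minimal_covers(doc, ["2", "3"]))
# prints: [(2, 3), (6, 12), (12, 13)]
```

For the example document these are the same three spans that get_covers()
marks as {1 ... }1, {2 ... }2 and {3 ... }3 in the output above.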

>
> Many of the search engines (including google) show the headline this
> way. I think Lucene can do this as well but I've never used it to be
> sure.
>
>

Regards,
Oleg


From: "Sushant Sinha" <sushant354(at)gmail(dot)com>
To: "Oleg Bartunov" <oleg(at)sai(dot)msu(dot)su>
Cc: "Catalin Marinas" <catalin(dot)marinas(at)gmail(dot)com>, "Richard Huxton" <dev(at)archonet(dot)com>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-general(at)postgresql(dot)org, "Teodor Sigaev" <teodor(at)sigaev(dot)ru>
Subject: Re: Fragments in tsearch2 headline
Date: 2007-10-30 17:11:58
Message-ID: 9fb559330710301011n77ef2544n4ef73dfce3177ac4@mail.gmail.com

This is a nice idea and seems easy to implement. I will try to write
it down and send a patch to the mailing list.

I was also working on adding support for phrase search. Currently, to
check for a phrase you have to match against the entire document. It would
be better if a filter like are_words_consecutive(tsvector *t, tsquery *q)
could be added to reduce the number of matching documents before we
actually do the phrase search. Do you think this will improve the
performance of phrase search? If so, I would like to write this
function and send a patch.
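
The proposed filter can be illustrated in Python on top of a plain dict of
word positions. This is only a sketch of the idea, not the C function
proposed above: it assumes the phrase is given as an ordered list of
lexemes, and that stop words either do not occur or still keep their
position numbers.

```python
def are_words_consecutive(positions, phrase):
    """True if the lexemes of phrase occur at consecutive word positions.

    positions maps a lexeme to its 1-based word positions in one document.
    """
    if not phrase:
        return False
    # Position sets for the second, third, ... word of the phrase.
    later = [set(positions.get(word, ())) for word in phrase[1:]]
    # Try every occurrence of the first word as the start of the phrase.
    return any(
        all(start + offset + 1 in found for offset, found in enumerate(later))
        for start in positions.get(phrase[0], ())
    )


doc = {"full": [1, 7], "text": [2, 5], "search": [3]}
print(are_words_consecutive(doc, ["full", "text", "search"]))  # prints: True
print(are_words_consecutive(doc, ["text", "full"]))            # prints: False
```

Because it only looks at positions already stored in the tsvector, a check
like this can discard documents without re-reading their text.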

-Sushant.



From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Oleg Bartunov" <oleg(at)sai(dot)msu(dot)su>
Cc: "Catalin Marinas" <catalin(dot)marinas(at)gmail(dot)com>, "Richard Huxton" <dev(at)archonet(dot)com>, pgsql-general(at)postgresql(dot)org, "Teodor Sigaev" <teodor(at)sigaev(dot)ru>
Subject: Re: Fragments in tsearch2 headline
Date: 2007-10-30 17:29:52
Message-ID: 25101.1193765392@sss.pgh.pa.us

> On 10/30/07, Oleg Bartunov <oleg(at)sai(dot)msu(dot)su> wrote:
>> ... In tsearch2 we have get_covers() function,
>> which produces all excerpts like:

I had not realized till just now that the 8.3 core version of tsearch
omitted any material feature of contrib/tsearch2. Why was get_covers()
left out?

regards, tom lane


From: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Catalin Marinas <catalin(dot)marinas(at)gmail(dot)com>, Richard Huxton <dev(at)archonet(dot)com>, pgsql-general(at)postgresql(dot)org, Teodor Sigaev <teodor(at)sigaev(dot)ru>
Subject: Re: Fragments in tsearch2 headline
Date: 2007-10-30 20:54:47
Message-ID: Pine.LNX.4.64.0710302353100.14368@sn.sai.msu.ru

On Tue, 30 Oct 2007, Tom Lane wrote:

>> On 10/30/07, Oleg Bartunov <oleg(at)sai(dot)msu(dot)su> wrote:
>>> ... In tsearch2 we have get_covers() function,
>>> which produces all excerpts like:
>
> I had not realized till just now that the 8.3 core version of tsearch
> omitted any material feature of contrib/tsearch2. Why was get_covers()
> left out?

At that time we considered it a developer function, useful only for debugging.

Regards,
Oleg


From: "Catalin Marinas" <catalin(dot)marinas(at)gmail(dot)com>
To: "Oleg Bartunov" <oleg(at)sai(dot)msu(dot)su>
Cc: "Richard Huxton" <dev(at)archonet(dot)com>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-general(at)postgresql(dot)org, "Teodor Sigaev" <teodor(at)sigaev(dot)ru>
Subject: Re: Fragments in tsearch2 headline
Date: 2007-10-31 23:26:47
Message-ID: b0943d9e0710311626t2dea3849q286cca7799478e92@mail.gmail.com

On 30/10/2007, Oleg Bartunov <oleg(at)sai(dot)msu(dot)su> wrote:
> ok, then you have to formalize many things - how long should be excerpts,
> how much excerpts to show, etc. In tsearch2 we have get_covers() function,
> which produces all excerpts like:
>
> =# select get_covers(to_tsvector('1 2 3 4 5 3 4 abc x y z 2 3'), '2&3'::tsquery);
> get_covers
> ------------------------------------------------
> 1 {1 2 3 }1 4 5 {2 3 4 abc x y z {3 2 }2 3 }3
> (1 row)

This function generates the lexemes, so cannot be used directly, but
it is probably a good starting point.

> Once you formalize your requirements, you can look on it and adapt to your
> needs (and share with people). I think it could be nice contrib module.

It seems that Sushant already wants to implement this function. He
would probably be faster than me :-) (I'm relatively new to db stuff).
Since I mainly rely on whatever a web hosting company provides, I'll
probably stick with a Python implementation outside the SQL query.

Thanks for your answers.

--
Catalin


From: "Sushant Sinha" <sushant354(at)gmail(dot)com>
To: "Catalin Marinas" <catalin(dot)marinas(at)gmail(dot)com>
Cc: "Oleg Bartunov" <oleg(at)sai(dot)msu(dot)su>, "Richard Huxton" <dev(at)archonet(dot)com>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-general(at)postgresql(dot)org, "Teodor Sigaev" <teodor(at)sigaev(dot)ru>
Subject: Re: Fragments in tsearch2 headline
Date: 2007-11-12 03:46:50
Message-ID: 9fb559330711111946s34786d4ckf0fcef0b23626add@mail.gmail.com

I wrote a headline generation function for my app and I have attached
the patch (against the cvs head). It generates multiple contexts in
which the query appears. Essentially, it uses the cover function to
generate all covers, chooses smallest covers and stretches each
selected cover according to the chosen parameters. I think ideally
changes should be made to prsd_headline function but I couldn't
understand that segment of code well.

The sql interface is

headline_with_fragments(text parser, tsvector docvector, text doc,
tsquery queryin, int4 maxcoverSize, int4 mincoverSize, int4 maxWords)
RETURNS text

This will generate a headline that contains up to maxWords words, with each
cover stretched to maxcoverSize. It will not add any fragment shorter than
mincoverSize.
I am running my app with maxcoverSize = 20, mincoverSize = 5, maxWords = 40.
So it shows roughly two fragments per query.
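
The selection step described above might look roughly like this in Python.
It is a sketch of one possible reading of the three parameters, not the
code in the attached patch: covers are (start, end) pairs of 1-based word
positions, the function name choose_fragments is hypothetical, and skipping
overlapping fragments is an assumption.

```python
def choose_fragments(covers, doc_len, max_cover_size=20, min_cover_size=5,
                     max_words=40):
    """Pick the smallest covers and stretch each one to max_cover_size words."""
    chosen = []
    budget = max_words  # words still available for the headline
    # Smallest covers first; ties are broken by position in the document.
    for start, end in sorted(covers, key=lambda c: (c[1] - c[0], c[0])):
        length = end - start + 1
        if length > max_cover_size:
            continue  # this cover cannot fit in one fragment
        size = min(max_cover_size, budget)
        if size < min_cover_size or size < length:
            break  # remaining budget is too small for a useful fragment
        # Stretch around the cover, then clip to the document bounds.
        extra = size - length
        lo = max(1, start - extra // 2)
        hi = min(doc_len, lo + size - 1)
        lo = max(1, hi - size + 1)
        if any(lo <= c_hi and c_lo <= hi for c_lo, c_hi in chosen):
            continue  # assumption: skip fragments that overlap a chosen one
        chosen.append((lo, hi))
        budget -= hi - lo + 1
    return sorted(chosen)


covers = [(2, 3), (6, 12), (12, 13)]
print(choose_fragments(covers, doc_len=13, max_cover_size=4,
                       min_cover_size=2, max_words=8))
# prints: [(1, 4), (10, 13)]
```

With maxcoverSize = 20 and maxWords = 40 the budget is used up after two
full-size fragments, which is consistent with the "roughly two fragments"
observation.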

If Teodor or Oleg want to add this to main branch, I will be happy to
clean it up and test it better.

-Sushant.


Attachment: headline_with_fragments.patch (text/x-patch, 11.1 KB)

From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Sushant Sinha <sushant354(at)gmail(dot)com>
Cc: Catalin Marinas <catalin(dot)marinas(at)gmail(dot)com>, Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>, Richard Huxton <dev(at)archonet(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-general(at)postgresql(dot)org, Teodor Sigaev <teodor(at)sigaev(dot)ru>
Subject: Re: Fragments in tsearch2 headline
Date: 2007-11-21 23:44:18
Message-ID: 200711212344.lALNiI216561@momjian.us


This has been saved for the 8.4 release:

http://momjian.postgresql.org/cgi-bin/pgpatches_hold

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://postgres.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Sushant Sinha <sushant354(at)gmail(dot)com>
Cc: Catalin Marinas <catalin(dot)marinas(at)gmail(dot)com>, Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>, Richard Huxton <dev(at)archonet(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-general(at)postgresql(dot)org, Teodor Sigaev <teodor(at)sigaev(dot)ru>
Subject: Re: Fragments in tsearch2 headline
Date: 2008-03-17 18:27:44
Message-ID: 200803171827.m2HIRiM08492@momjian.us


Teodor, Oleg, do we want this?

http://archives.postgresql.org/pgsql-general/2007-11/msg00508.php

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://postgres.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +


From: Teodor Sigaev <teodor(at)sigaev(dot)ru>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: Sushant Sinha <sushant354(at)gmail(dot)com>, Catalin Marinas <catalin(dot)marinas(at)gmail(dot)com>, Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>, Richard Huxton <dev(at)archonet(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-general(at)postgresql(dot)org
Subject: Re: Fragments in tsearch2 headline
Date: 2008-03-17 19:00:54
Message-ID: 47DEBFE6.2040601@sigaev.ru


> Teodor, Oleg, do we want this?
> http://archives.postgresql.org/pgsql-general/2007-11/msg00508.php

I suppose we want it. But there are some questions/issues:
- Is it necessary to introduce a new function? Maybe it would be better to add
an option to the existing headline function. I'd like to keep the current
layout: ts_headline provides a common interface to headline generation.
Finding and marking fragments is the job of the parser's headline method, and
generating the exact pieces of text is done by ts_headline.
- Covers may overlap, so overlapping fragments will look odd.

In any case, the patch was developed for the contrib version of tsearch.
--
Teodor Sigaev E-mail: teodor(at)sigaev(dot)ru
WWW: http://www.sigaev.ru/


From: Sushant Sinha <sushant354(at)gmail(dot)com>
To: Teodor Sigaev <teodor(at)sigaev(dot)ru>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Fragments in tsearch2 headline
Date: 2008-03-29 04:35:21
Message-ID: 1206765321.10128.25.camel@dragflick
Lists: pgsql-general

Ah, I missed this email. I agree with Teodor that this is not the best
way to implement this functionality. At the time I was in a bit of a hurry
to have something better than the default one and just hacked this together.
And if we want to have this functionality across languages and parsers, it
would be better to implement it in the general framework.

The patch takes the corner case of overlapping covers into account. Here is
the code for that:

    // start check
    if (!startHL && *currentpos >= startpos)
        startHL = 1;

Headline generation does not start until currentpos has reached startpos.
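
To illustrate, here is a minimal, self-contained sketch of how a start check
like this can keep overlapping fragments from repeating text. It is not the
patch itself: the Cover struct, append(), emit_fragment() and build_headline()
are hypothetical names made up for this example, covers are assumed to be
sorted by start position, and words are plain strings rather than parsed
tokens. The only part taken from the patch is the start check.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical cover: inclusive word positions [start, end]. */
typedef struct
{
    int start;
    int end;
} Cover;

/* Append s to out without overflowing a buffer of outsize bytes. */
static void append(char *out, size_t outsize, const char *s)
{
    size_t used = strlen(out);

    if (used + 1 < outsize)
        strncat(out, s, outsize - used - 1);
}

/*
 * Emit the words of one fragment. currentpos is shared between fragments and
 * only moves forward, so a fragment that overlaps the previous one resumes
 * where the previous one stopped instead of at its own startpos.
 */
static void emit_fragment(const char *const *words, int nwords,
                          int startpos, int endpos,
                          int *currentpos, char *out, size_t outsize)
{
    int startHL = 0;

    for (; *currentpos < nwords && *currentpos <= endpos; (*currentpos)++)
    {
        /* start check, as in the snippet above */
        if (!startHL && *currentpos >= startpos)
            startHL = 1;
        if (!startHL)
            continue;           /* still before the fragment: skip the word */
        if (out[0] != '\0')
            append(out, outsize, " ");
        append(out, outsize, words[*currentpos]);
    }
}

/* Join all fragments; a gap between two fragments is shown as "...". */
void build_headline(const char *const *words, int nwords,
                    const Cover *covers, int ncovers,
                    char *out, size_t outsize)
{
    int currentpos = 0;
    int i;

    assert(outsize > 0);
    out[0] = '\0';
    for (i = 0; i < ncovers; i++)
    {
        if (out[0] != '\0' && currentpos < covers[i].start)
            append(out, outsize, " ...");
        emit_fragment(words, nwords, covers[i].start, covers[i].end,
                      &currentpos, out, outsize);
    }
}
```

With ten words w0 to w9 and covers [1,3], [2,5] and [8,9], the second cover
overlaps the first, so it only contributes w4 and w5, and the result is
"w1 w2 w3 w4 w5 ... w8 w9".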

You can also check how this headline function is working at my website
indiankanoon.com. Some example queries are murder, freedom of speech,
freedom of press etc.

Should I develop the patch for the current cvs head of postgres?

Thanks,
-Sushant.

On Mon, 2008-03-17 at 22:00 +0300, Teodor Sigaev wrote:
> > Teodor, Oleg, do we want this?
> > http://archives.postgresql.org/pgsql-general/2007-11/msg00508.php
>
> I suppose we want it. But there are some questions/issues:
> - Does it need a new function? It may be better to add an option to the
> existing headline function. I'd like to keep the current layout: ts_headline
> provides a common interface to headline generation. Finding and marking
> fragments is the job of the parser's headline method, and generating the exact
> pieces of text is done by ts_headline.
> - Covers may overlap, so overlapping fragments will look odd.
>
>
> In any case, the patch was developed for the contrib version of tsearch.


From: Teodor Sigaev <teodor(at)sigaev(dot)ru>
To: sushant354(at)gmail(dot)com
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Fragments in tsearch2 headline
Date: 2008-03-31 12:36:28
Message-ID: 47F0DACC.4030704@sigaev.ru
Lists: pgsql-general

> The patch takes the corner case of overlapping covers into account. Here is
> the code for that:
>
>     // start check
>     if (!startHL && *currentpos >= startpos)
>         startHL = 1;
>
> Headline generation does not start until currentpos has reached startpos.
Ok

>
> You can also check how this headline function is working at my website
> indiankanoon.com. Some example queries are murder, freedom of speech,
> freedom of press etc.
Looks good.

> Should I develop the patch for the current cvs head of postgres?

I'd like to commit your patch, but it should be:
- for current HEAD
- an extension of the existing ts_headline.

--
Teodor Sigaev E-mail: teodor(at)sigaev(dot)ru
WWW: http://www.sigaev.ru/


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Teodor Sigaev <teodor(at)sigaev(dot)ru>
Cc: sushant354(at)gmail(dot)com, pgsql-general(at)postgresql(dot)org
Subject: Re: Fragments in tsearch2 headline
Date: 2008-05-09 03:29:42
Message-ID: 200805090329.m493TgZ26711@momjian.us
Lists: pgsql-general


Where are we on this?

---------------------------------------------------------------------------

Teodor Sigaev wrote:
> > The patch takes the corner case of overlapping covers into account. Here is
> > the code for that:
> >
> >     // start check
> >     if (!startHL && *currentpos >= startpos)
> >         startHL = 1;
> >
> > Headline generation does not start until currentpos has reached startpos.
> Ok
>
> >
> > You can also check how this headline function is working at my website
> > indiankanoon.com. Some example queries are murder, freedom of speech,
> > freedom of press etc.
> Looks good.
>
> > Should I develop the patch for the current cvs head of postgres?
>
> I'd like to commit your patch, but it should be:
> - for current HEAD
> - an extension of the existing ts_headline.
>
> --
> Teodor Sigaev E-mail: teodor(at)sigaev(dot)ru
> WWW: http://www.sigaev.ru/

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +