Re: [GENERAL] Fragments in tsearch2 headline

Lists: pgsql-generalpgsql-hackers
From: "Pierre-Yves Strub" <pierre(dot)yves(dot)strub(at)gmail(dot)com>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: Fragments in tsearch2 headline
Date: 2008-05-22 11:31:55
Message-ID: 592647410805220431w5ae1439by8a89433a0a141dc@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

Hi,

I've ported the patch of Sushant Sinha for fragmented headlines to pg8.3.1
(http://archives.postgresql.org/pgsql-general/2007-11/msg00508.php)

W.r.t, http://archives.postgresql.org/pgsql-general/2008-03/msg00806.php
I can continue the work until this becomes an acceptable patch for pg.

Pierre-yves.

Attachment Content-Type Size
headlines.patch text/x-patch 18.8 KB

From: Sushant Sinha <sushant354(at)gmail(dot)com>
To: Pierre-Yves Strub <pierre(dot)yves(dot)strub(at)gmail(dot)com>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Fragments in tsearch2 headline
Date: 2008-05-23 05:10:27
Message-ID: 1211519427.8498.10.camel@dragflick
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

Thanks Pierre for porting this! I just tested this for my application
and it works. There was a small bug in that startHL has to be
initialized to 0 for each chosen cover. I fixed that and attached the
new patch.

Teodor did not want a separate function. He wanted it as an extension to
ts_headline. One way to do this will be to invoke it only when options
like MaxCoverSize is used. It will be slightly ugly though.

It still seems to have bugs. I will try to clean that up.

-Sushant.

On Thu, 2008-05-22 at 13:31 +0200, Pierre-Yves Strub wrote:
> Hi,
>
> I've ported the patch of Sushant Sinha for fragmented headlines to pg8.3.1
> (http://archives.postgresql.org/pgsql-general/2007-11/msg00508.php)
>
> W.r.t, http://archives.postgresql.org/pgsql-general/2008-03/msg00806.php
> I can continue the work until this becomes an acceptable patch for pg.
>
> Pierre-yves.

Attachment Content-Type Size
headlines_v0.1.patch text/x-patch 18.9 KB

From: "Pierre-Yves Strub" <pierre(dot)yves(dot)strub(at)gmail(dot)com>
To: sushant354(at)gmail(dot)com
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Fragments in tsearch2 headline
Date: 2008-05-23 09:25:22
Message-ID: 592647410805230225k37cf4dcamc7461acabfb21781@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

On Fri, May 23, 2008 at 7:10 AM, Sushant Sinha <sushant354(at)gmail(dot)com> wrote:
> Teodor did not want a separate function. He wanted it as an extension to
> ts_headline. One way to do this will be to invoke it only when options
> like MaxCoverSize is used. It will be slightly ugly though.

What I understand is that Teodor wants hlparsetext to be used instead
of hlparsetext_with_cover. This is not too hard to do if hlparsetext
gives the positions of lexems. I started writting the patch, but I'm
stucked with the function LexizeExec which I do not totally understand
(... and is not well documents too :) )


From: Teodor Sigaev <teodor(at)sigaev(dot)ru>
To: Pierre-Yves Strub <pierre(dot)yves(dot)strub(at)gmail(dot)com>, pgsql-general <pgsql-general(at)postgresql(dot)org>, Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Fragments in tsearch2 headline
Date: 2008-05-24 03:57:16
Message-ID: 4837921C.8000905@sigaev.ru
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

[moved to -hackers, because talk is about implementation details]

> I've ported the patch of Sushant Sinha for fragmented headlines to pg8.3.1
> (http://archives.postgresql.org/pgsql-general/2007-11/msg00508.php)
Thank you.

1 > diff -Nrub postgresql-8.3.1-orig/contrib/tsearch2/tsearch2.c
now contrib/tsearch2 is compatibility layer for old applications - they don't
know about new features. So, this part isn't needed.

2 solution to compile function (ts_headline_with_fragments) into core, but
using it only from contrib module looks very odd. So, new feature can be used
only with compatibility layer for old release :)

3 headline_with_fragments() is hardcoded to use default parser, but what will be
in case when configuration uses another parser? For example, for japanese language.

4 I would prefer the signature ts_headline( [regconfig,] text, tsquery [,text] )
and function should accept 'NumFragments=>N' for default parser. Another parsers
may use another options.

5 it just doesn't work correctly, because new code doesn't care of parser
specific type of lexemes.
contrib_regression=# select headline_with_fragments('english', 'wow asd-wow
wow', 'asd', '');
headline_with_fragments
----------------------------------
...wow asd-wow<b>asd</b>-wow wow
(1 row)

So, I incline to use existing framework/infrastructure although it may be a
subject to change.

Some description:
1 ts_headline defines a correct parser to use
2 it calls hlparsetext to split text into structure suitable for both goals:
find the best fragment(s) and concatenate that fragment(s) back to the text
representation
3 it calls parser specific method prsheadline which works with preparsed text
(parse was done in hlparsetext). Method should mark a needed
words/parts/lexemes etc.
4 ts_headline glues fragments into text and returns that.

We need a parser's headline method because only parser knows all about its lexemes.

--
Teodor Sigaev E-mail: teodor(at)sigaev(dot)ru
WWW: http://www.sigaev.ru/


From: Teodor Sigaev <teodor(at)sigaev(dot)ru>
To: Pierre-Yves Strub <pierre(dot)yves(dot)strub(at)gmail(dot)com>
Cc: sushant354(at)gmail(dot)com, Pierre-Yves Strub <pierre(dot)yves(dot)strub(at)gmail(dot)com>, Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [GENERAL] Fragments in tsearch2 headline
Date: 2008-05-24 04:07:48
Message-ID: 48379494.4020504@sigaev.ru
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers


> stucked with the function LexizeExec which I do not totally understand
> (... and is not well documents too :) )

Sorry for that. LexizeExec() is a play around supporting thesaurus dictionary,
which is designed to replace phrase by phrase. So, if it see first matched word
then it asks the parse to get next word. On next word it have four options:
- asks next word
- returns substituting phrase
- says: Hey, I have a match, but it's possible to find a largest match, so
give a next word
- says about false match, so parser should return back and try next dictionary
from the begining or last match.

Now only thesaurus dictionary can work in that mode but nothing forbids to have
another dictionary with phrase recognition.

--
Teodor Sigaev E-mail: teodor(at)sigaev(dot)ru
WWW: http://www.sigaev.ru/


From: Sushant Sinha <sushant354(at)gmail(dot)com>
To: Teodor Sigaev <teodor(at)sigaev(dot)ru>
Cc: Pierre-Yves Strub <pierre(dot)yves(dot)strub(at)gmail(dot)com>, Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [GENERAL] Fragments in tsearch2 headline
Date: 2008-05-24 21:18:57
Message-ID: 1211663937.7293.26.camel@dragflick
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

Now I understand the code much better. A few more questions on headline
generation that I was not able to get from the code:

1. Why is hlparsetext used to parse the document rather than the
parsetext function? Since words to be included in the headline will be
marked afterwords, it seems more reasonable to just use the parsetext
function.

The main difference I see is the use of hlfinditem and marking whether
some word is repeated.

The reason this is important is that hlparsetext does not seem to be
storing word positions which parsetext does. The word positions are
important for generating headline with fragments.

2.
> I would prefer the signature ts_headline( [regconfig,] text, tsquery
>[,text] )and function should accept 'NumFragments=>N' for default
>parser. Another parsers may use another options.

Does this mean we want a unified function ts_headline and we trigger the
fragments if NumFragments is specified? It seems that introducing a new
function which can take configuration OID, or name is complex as there
are so many functions handling these issues in wparser.c.

If this is true then we need to just add marking of headline words in
prsd_headline. Otherwise we will need another prsd_headline_with_covers
function.

3. In many cases people may already have TSVector for a given document
(for search operation). Would it be faster to pass TSVector to headline
function when compared to computing TSVector each time? If that is the
case then should we have an option to pass TSVector to headline
function?

-Sushant.

On Sat, 2008-05-24 at 07:57 +0400, Teodor Sigaev wrote:
> [moved to -hackers, because talk is about implementation details]
>
> > I've ported the patch of Sushant Sinha for fragmented headlines to pg8.3.1
> > (http://archives.postgresql.org/pgsql-general/2007-11/msg00508.php)
> Thank you.
>
> 1 > diff -Nrub postgresql-8.3.1-orig/contrib/tsearch2/tsearch2.c
> now contrib/tsearch2 is compatibility layer for old applications - they don't
> know about new features. So, this part isn't needed.
>
> 2 solution to compile function (ts_headline_with_fragments) into core, but
> using it only from contrib module looks very odd. So, new feature can be used
> only with compatibility layer for old release :)
>
> 3 headline_with_fragments() is hardcoded to use default parser, but what will be
> in case when configuration uses another parser? For example, for japanese language.
>
> 4 I would prefer the signature ts_headline( [regconfig,] text, tsquery [,text] )
> and function should accept 'NumFragments=>N' for default parser. Another parsers
> may use another options.
>
> 5 it just doesn't work correctly, because new code doesn't care of parser
> specific type of lexemes.
> contrib_regression=# select headline_with_fragments('english', 'wow asd-wow
> wow', 'asd', '');
> headline_with_fragments
> ----------------------------------
> ...wow asd-wow<b>asd</b>-wow wow
> (1 row)
>
>
> So, I incline to use existing framework/infrastructure although it may be a
> subject to change.
>
> Some description:
> 1 ts_headline defines a correct parser to use
> 2 it calls hlparsetext to split text into structure suitable for both goals:
> find the best fragment(s) and concatenate that fragment(s) back to the text
> representation
> 3 it calls parser specific method prsheadline which works with preparsed text
> (parse was done in hlparsetext). Method should mark a needed
> words/parts/lexemes etc.
> 4 ts_headline glues fragments into text and returns that.
>
> We need a parser's headline method because only parser knows all about its lexemes.
>
>
> --
> Teodor Sigaev E-mail: teodor(at)sigaev(dot)ru
> WWW: http://www.sigaev.ru/
>
>


From: "Pierre-Yves Strub" <pierre(dot)yves(dot)strub(at)gmail(dot)com>
To: sushant354(at)gmail(dot)com
Cc: "Teodor Sigaev" <teodor(at)sigaev(dot)ru>, "Pgsql Hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [GENERAL] Fragments in tsearch2 headline
Date: 2008-05-25 00:20:05
Message-ID: 592647410805241720i395bc8e8o770843dfce93cd1f@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

On Sat, May 24, 2008 at 11:18 PM, Sushant Sinha <sushant354(at)gmail(dot)com> wrote:

> Does this mean we want a unified function ts_headline and we trigger the
> fragments if NumFragments is specified? It seems that introducing a new
> function which can take configuration OID, or name is complex as there
> are so many functions handling these issues in wparser.c.
>
> If this is true then we need to just add marking of headline words in
> prsd_headline. Otherwise we will need another prsd_headline_with_covers
> function.

I think that we could merge down the two functions using a default
cover of the whole text when NumCovers if missing. I started writing
the function. The only missing information is, as you stated, the
missing posititions in HeadlineParsedText.


From: Teodor Sigaev <teodor(at)sigaev(dot)ru>
To: sushant354(at)gmail(dot)com
Cc: Pierre-Yves Strub <pierre(dot)yves(dot)strub(at)gmail(dot)com>, Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [GENERAL] Fragments in tsearch2 headline
Date: 2008-05-27 09:30:51
Message-ID: 483BD4CB.3030006@sigaev.ru
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

Hi!

> 1. Why is hlparsetext used to parse the document rather than the
> parsetext function? Since words to be included in the headline will be
> marked afterwords, it seems more reasonable to just use the parsetext
> function.
> The main difference I see is the use of hlfinditem and marking whether
> some word is repeated.
hlparsetext preserves any kind of lexeme - not indexed, spaces etc. parsetext
doesn't.
hlparsetext preserves original form of lexemes. parsetext doesn't.

>
> The reason this is important is that hlparsetext does not seem to be
> storing word positions which parsetext does. The word positions are
> important for generating headline with fragments.
Doesn't needed - hlparsetext preserves the whole text, so, position is a number
of array.

>
> 2.
>> I would prefer the signature ts_headline( [regconfig,] text, tsquery
>> [,text] )and function should accept 'NumFragments=>N' for default
>> parser. Another parsers may use another options.
>
> Does this mean we want a unified function ts_headline and we trigger the
> fragments if NumFragments is specified?

Trigger should be inside parser-specific function (pg_ts_parser.prsheadline).
Another parsers might not recognize that option.

> It seems that introducing a new
> function which can take configuration OID, or name is complex as there
> are so many functions handling these issues in wparser.c.
No, of course - ts_headline takes care about finding configuration and calling
correct parser.

>
> If this is true then we need to just add marking of headline words in
> prsd_headline. Otherwise we will need another prsd_headline_with_covers
> function.
Yeah, pg_ts_parser.prsheadline should mark the lexemes to. It even can change
an array of HeadlineParsedText.

>
> 3. In many cases people may already have TSVector for a given document
> (for search operation). Would it be faster to pass TSVector to headline
> function when compared to computing TSVector each time? If that is the
> case then should we have an option to pass TSVector to headline
> function?
As I mentioned above, tsvector doesn;t contain whole information about text.

--
Teodor Sigaev E-mail: teodor(at)sigaev(dot)ru
WWW: http://www.sigaev.ru/


From: Sushant Sinha <sushant354(at)gmail(dot)com>
To: Teodor Sigaev <teodor(at)sigaev(dot)ru>
Cc: Pierre-Yves Strub <pierre(dot)yves(dot)strub(at)gmail(dot)com>, Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [GENERAL] Fragments in tsearch2 headline
Date: 2008-05-31 23:58:41
Message-ID: 1212278321.5891.24.camel@dragflick
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

I have attached a new patch with respect to the current cvs head. This
produces headline in a document for a given query. Basically it
identifies fragments of text that contain the query and displays them.

DESCRIPTION

HeadlineParsedText contains an array of actual words but not
information about the norms. We need an indexed position vector for each
norm so that we can quickly evaluate a number of possible fragments.
Something that tsvector provides.

So this patch changes HeadlineParsedText to contain the norms
(ParsedText). This field is updated while parsing in hlparsetext. The
position information of the norms corresponds to the position of words
in HeadlineParsedText (not to the norms positions as is the case in
tsvector). This works correctly with the current parser. If you think
there may be issues with other parsers please let me know.

This approach does not change any other interface and fits nicely with
the overall framework.

The norms are converted into tsvector and a number of covers are
generated. The best covers are then chosen to be in the headline. The
covers are separated using a hardcoded coversep. Let me know if you want
to expose this as an option.

Covers that overlap with already chosen covers are excluded.

Some options like ShortWord and MinWords are not taken care of right
now. MaxWords are used as maxcoversize. Let me know if you would like to
see other options for fragment generation as well.

Let me know any more changes you would like to see.

-Sushant.

On Tue, 2008-05-27 at 13:30 +0400, Teodor Sigaev wrote:
> Hi!
>
> > 1. Why is hlparsetext used to parse the document rather than the
> > parsetext function? Since words to be included in the headline will be
> > marked afterwords, it seems more reasonable to just use the parsetext
> > function.
> > The main difference I see is the use of hlfinditem and marking whether
> > some word is repeated.
> hlparsetext preserves any kind of lexeme - not indexed, spaces etc. parsetext
> doesn't.
> hlparsetext preserves original form of lexemes. parsetext doesn't.
>
> >
> > The reason this is important is that hlparsetext does not seem to be
> > storing word positions which parsetext does. The word positions are
> > important for generating headline with fragments.
> Doesn't needed - hlparsetext preserves the whole text, so, position is a number
> of array.
>
> >
> > 2.
> >> I would prefer the signature ts_headline( [regconfig,] text, tsquery
> >> [,text] )and function should accept 'NumFragments=>N' for default
> >> parser. Another parsers may use another options.
> >
> > Does this mean we want a unified function ts_headline and we trigger the
> > fragments if NumFragments is specified?
>
> Trigger should be inside parser-specific function (pg_ts_parser.prsheadline).
> Another parsers might not recognize that option.
>
> > It seems that introducing a new
> > function which can take configuration OID, or name is complex as there
> > are so many functions handling these issues in wparser.c.
> No, of course - ts_headline takes care about finding configuration and calling
> correct parser.
>
> >
> > If this is true then we need to just add marking of headline words in
> > prsd_headline. Otherwise we will need another prsd_headline_with_covers
> > function.
> Yeah, pg_ts_parser.prsheadline should mark the lexemes to. It even can change
> an array of HeadlineParsedText.
>
> >
> > 3. In many cases people may already have TSVector for a given document
> > (for search operation). Would it be faster to pass TSVector to headline
> > function when compared to computing TSVector each time? If that is the
> > case then should we have an option to pass TSVector to headline
> > function?
> As I mentioned above, tsvector doesn;t contain whole information about text.
>

Attachment Content-Type Size
headlines_v0.2.patch text/x-patch 18.2 KB

From: Teodor Sigaev <teodor(at)sigaev(dot)ru>
To: sushant354(at)gmail(dot)com
Cc: Pierre-Yves Strub <pierre(dot)yves(dot)strub(at)gmail(dot)com>, Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [GENERAL] Fragments in tsearch2 headline
Date: 2008-06-02 14:10:21
Message-ID: 4843FF4D.8030109@sigaev.ru
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

> I have attached a new patch with respect to the current cvs head. This
> produces headline in a document for a given query. Basically it
> identifies fragments of text that contain the query and displays them.
New variant is much better, but...

> HeadlineParsedText contains an array of actual words but not
> information about the norms. We need an indexed position vector for each
> norm so that we can quickly evaluate a number of possible fragments.
> Something that tsvector provides.

Why do you need to store norms? The single purpose of norms is identifying words
from query - but it's already done by hlfinditem. It sets
HeadlineWordEntry->item to corresponding QueryOperand in tsquery.
Look, headline function is rather expensive and your patch adds a lot of extra
work - at least in memory usage. And if user calls with NumFragments=0 the that
work is unneeded.

> This approach does not change any other interface and fits nicely with
> the overall framework.
Yeah, it's a really big step forward. Thank you. You are very close to
committing except: Did you find a hlCover() function which produce a cover from
original HeadlineParsedText representation? Is any reason to do not use it?

>
> The norms are converted into tsvector and a number of covers are
> generated. The best covers are then chosen to be in the headline. The
> covers are separated using a hardcoded coversep. Let me know if you want
> to expose this as an option.

>
> Covers that overlap with already chosen covers are excluded.
>
> Some options like ShortWord and MinWords are not taken care of right
> now. MaxWords are used as maxcoversize. Let me know if you would like to
> see other options for fragment generation as well.
ShortWord, MinWords and MaxWords should store their meaning, but for each
fragment, not for the whole headline.

>
> Let me know any more changes you would like to see.

if (num_fragments == 0)
/* call the default headline generator */
mark_hl_words(prs, query, highlight, shortword, min_words, max_words);
else
mark_hl_fragments(prs, query, highlight, num_fragments, max_words);

Suppose, num_fragments < 2?

--
Teodor Sigaev E-mail: teodor(at)sigaev(dot)ru
WWW: http://www.sigaev.ru/


From: Sushant Sinha <sushant354(at)gmail(dot)com>
To: Teodor Sigaev <teodor(at)sigaev(dot)ru>
Cc: Pierre-Yves Strub <pierre(dot)yves(dot)strub(at)gmail(dot)com>, Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [GENERAL] Fragments in tsearch2 headline
Date: 2008-06-03 03:13:49
Message-ID: 1212462829.8047.38.camel@dragflick
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

Efficiency: I realized that we do not need to store all norms. We need
to only store store norms that are in the query. So I moved the addition
of norms from addHLParsedLex to hlfinditem. This should add very little
memory overhead to existing headline generation.

If this is still not acceptable for default headline generation, then I
can push it into mark_hl_fragments. But I think any headline marking
function will benefit by having the norms corresponding to the query.

Why we need norms?

hlCover does the exact thing that Cover in tsrank does which is to find
the cover that contains the query. However hlcover has to go through
words that do not match the query. Cover on the other hand operates on
position indexes for just the query words and so it should be faster.

The main reason why I would I like it to be fast is that I want to
generate all covers for a given query. Then choose covers with smallest
length as they will be the one that will best explain relation of a
query to a document. Finally stretch those covers to the specified size.

In my understanding, the current headline generation tries to find the
biggest cover for display in the headline. I personally think that such
a cover does not explain the context of a query in a document. We may
differ on this and thats why we may need both options.

Let me know what you think on this patch and I will update the patch to
respect other options like MinWords and ShortWord.

NumFragments < 2:
I wanted people to use the new headline marker if they specify
NumFragments >= 1. If they do not specify the NumFragments or put it to
0 then the default marker will be used. This becomes a bit of tricky
parameter so please put in any idea on how to trigger the new marker.

On an another note I found that make_tsvector crashes if it receives a
ParsedText with curwords = 0. Specifically uniqueWORD returns curwords
as 1 even when it gets 0 words. I am not sure if this is the desired
behavior.

-Sushant.

On Mon, 2008-06-02 at 18:10 +0400, Teodor Sigaev wrote:
> > I have attached a new patch with respect to the current cvs head. This
> > produces headline in a document for a given query. Basically it
> > identifies fragments of text that contain the query and displays them.
> New variant is much better, but...
>
> > HeadlineParsedText contains an array of actual words but not
> > information about the norms. We need an indexed position vector for each
> > norm so that we can quickly evaluate a number of possible fragments.
> > Something that tsvector provides.
>
> Why do you need to store norms? The single purpose of norms is identifying words
> from query - but it's already done by hlfinditem. It sets
> HeadlineWordEntry->item to corresponding QueryOperand in tsquery.
> Look, headline function is rather expensive and your patch adds a lot of extra
> work - at least in memory usage. And if user calls with NumFragments=0 the that
> work is unneeded.
>
> > This approach does not change any other interface and fits nicely with
> > the overall framework.
> Yeah, it's a really big step forward. Thank you. You are very close to
> committing except: Did you find a hlCover() function which produce a cover from
> original HeadlineParsedText representation? Is any reason to do not use it?
>
> >
> > The norms are converted into tsvector and a number of covers are
> > generated. The best covers are then chosen to be in the headline. The
> > covers are separated using a hardcoded coversep. Let me know if you want
> > to expose this as an option.
>
>
> >
> > Covers that overlap with already chosen covers are excluded.
> >
> > Some options like ShortWord and MinWords are not taken care of right
> > now. MaxWords are used as maxcoversize. Let me know if you would like to
> > see other options for fragment generation as well.
> ShortWord, MinWords and MaxWords should store their meaning, but for each
> fragment, not for the whole headline.
>
>
> >
> > Let me know any more changes you would like to see.
>
> if (num_fragments == 0)
> /* call the default headline generator */
> mark_hl_words(prs, query, highlight, shortword, min_words, max_words);
> else
> mark_hl_fragments(prs, query, highlight, num_fragments, max_words);
>
>
> Suppose, num_fragments < 2?
>

Attachment Content-Type Size
headlines_v0.3.patch text/x-patch 18.3 KB

From: Teodor Sigaev <teodor(at)sigaev(dot)ru>
To: sushant354(at)gmail(dot)com, Pierre-Yves Strub <pierre(dot)yves(dot)strub(at)gmail(dot)com>, Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [GENERAL] Fragments in tsearch2 headline
Date: 2008-06-03 18:53:06
Message-ID: 48459312.9070505@sigaev.ru
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

> Why we need norms?

We don't need norms at all - all matched HeadlineWordEntry already marked by
HeadlineWordEntry->item! If it equals to NULL then this word isn't contained in
tsquery.

> hlCover does the exact thing that Cover in tsrank does which is to find
> the cover that contains the query. However hlcover has to go through
> words that do not match the query. Cover on the other hand operates on
> position indexes for just the query words and so it should be faster.
Cover, by definition, is a minimal continuous text's piece matched by query. May
be a several covers in text and hlCover will find all of them. Next,
prsd_headline() (for now) tries to define the best one. "Best" means: cover
contains a lot of words from query, not less that MinWords, not greater than
MaxWords, hasn't words shorter that ShortWord on the begin and end of cover etc.
>
> The main reason why I would I like it to be fast is that I want to
> generate all covers for a given query. Then choose covers with smallest
hlCover generates all covers.

> Let me know what you think on this patch and I will update the patch to
> respect other options like MinWords and ShortWord.

As I understand, you very wish to call Cover() function instead of hlCover() -
by design, they should be identical, but accepts different document's
representation. So, the best way is generalize them: develop a new one which can
be called with some kind of callback or/and opaque structure to use it in both
rank and headline.

>
> NumFragments < 2:
> I wanted people to use the new headline marker if they specify
> NumFragments >= 1. If they do not specify the NumFragments or put it to
Ok, but if you unify cover generation and NumFragments == 1 then result for old
and new algorithms should be the same...

> On an another note I found that make_tsvector crashes if it receives a
> ParsedText with curwords = 0. Specifically uniqueWORD returns curwords
> as 1 even when it gets 0 words. I am not sure if this is the desired
> behavior.
In all places there is a check before call of make_tsvector.

--
Teodor Sigaev E-mail: teodor(at)sigaev(dot)ru
WWW: http://www.sigaev.ru/


From: Sushant Sinha <sushant354(at)gmail(dot)com>
To: Teodor Sigaev <teodor(at)sigaev(dot)ru>
Cc: Pierre-Yves Strub <pierre(dot)yves(dot)strub(at)gmail(dot)com>, Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [GENERAL] Fragments in tsearch2 headline
Date: 2008-06-04 00:14:32
Message-ID: 1212538472.5848.29.camel@dragflick
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

My main argument for using Cover instead of hlCover was that Cover will
be faster. I tested the default headline generation that uses hlCover
with the current patch that uses Cover. There was not much difference.
So I think you are right in that we do not need norms and we can just
use hlCover.

I also compared performance of ts_headline with my first patch to
headline generation (one which was a separate function and took tsvector
as input). The performance was dramatically different. For one query
ts_headline took roughly 200 ms while headline_with_fragments took just
70 ms. On an another query ts_headline took 76 ms while
headline_with_fragments took 24 ms. You can find 'explain analyze' for
the first query at the bottom of the page.

These queries were run multiple times to ensure that I never hit the
disk. This is a m/c with 2.0 GhZ Pentium 4 CPU and 512 MB RAM running
Linux 2.6.22-gentoo-r8.

A couple of caveats:

1. ts_headline testing was done with current cvs head where as
headline_with_fragments was done with postgres 8.3.1.

2. For headline_with_fragments, TSVector for the document was obtained
by joining with another table.

Are these differences understandable?

If you think these caveats are the reasons or there is something I am
missing, then I can repeat the entire experiments with exactly the same
conditions.

-Sushant.

Here is 'explain analyze' for both the functions:

ts_headline
------------

lawdb=# explain analyze SELECT ts_headline('english', doc, q, '')
FROM docraw, plainto_tsquery('english', 'freedom of
speech') as q
WHERE docraw.tid = 125596;
QUERY
PLAN

Nested Loop (cost=0.00..8.31 rows=1 width=497) (actual
time=199.692..200.207 rows=1 loops=1)
-> Index Scan using docraw_pkey on docraw (cost=0.00..8.29 rows=1
width=465) (actual time=0.041..0.065 rows=1 loops=1)
Index Cond: (tid = 125596)
-> Function Scan on q (cost=0.00..0.01 rows=1 width=32) (actual
time=0.010..0.014 rows=1 loops=1)
Total runtime: 200.311 ms

headline_with_fragments
-----------------------

lawdb=# explain analyze SELECT headline_with_fragments('english',
docvector, doc, q, 'MaxWords=40')
FROM docraw, docmeta, plainto_tsquery('english', 'freedom
of speech') as q
WHERE docraw.tid = 125596 and docmeta.tid=125596;
QUERY
PLAN
----------------------
Nested Loop (cost=0.00..16.61 rows=1 width=883) (actual
time=70.564..70.949 rows=1 loops=1)
-> Nested Loop (cost=0.00..16.59 rows=1 width=851) (actual
time=0.064..0.094 rows=1 loops=1)
-> Index Scan using docraw_pkey on docraw (cost=0.00..8.29
rows=1 width=454) (actual time=0.040..0.044 rows=1 loops=1)
Index Cond: (tid = 125596)
-> Index Scan using docmeta_pkey on docmeta (cost=0.00..8.29
rows=1 width=397) (actual time=0.017..0.040 rows=1 loops=1)
Index Cond: (docmeta.tid = 125596)
-> Function Scan on q (cost=0.00..0.01 rows=1 width=32) (actual
time=0.012..0.016 rows=1 loops=1)
Total runtime: 71.076 ms
(8 rows)

On Tue, 2008-06-03 at 22:53 +0400, Teodor Sigaev wrote:
> > Why we need norms?
>
> We don't need norms at all - all matched HeadlineWordEntry already marked by
> HeadlineWordEntry->item! If it equals to NULL then this word isn't contained in
> tsquery.
>
> > hlCover does the exact thing that Cover in tsrank does which is to find
> > the cover that contains the query. However hlcover has to go through
> > words that do not match the query. Cover on the other hand operates on
> > position indexes for just the query words and so it should be faster.
> Cover, by definition, is a minimal continuous text's piece matched by query. May
> be a several covers in text and hlCover will find all of them. Next,
> prsd_headline() (for now) tries to define the best one. "Best" means: cover
> contains a lot of words from query, not less that MinWords, not greater than
> MaxWords, hasn't words shorter that ShortWord on the begin and end of cover etc.
> >
> > The main reason why I would I like it to be fast is that I want to
> > generate all covers for a given query. Then choose covers with smallest
> hlCover generates all covers.
>
> > Let me know what you think on this patch and I will update the patch to
> > respect other options like MinWords and ShortWord.
>
> As I understand, you very wish to call Cover() function instead of hlCover() -
> by design, they should be identical, but accepts different document's
> representation. So, the best way is generalize them: develop a new one which can
> be called with some kind of callback or/and opaque structure to use it in both
> rank and headline.
>
> >
> > NumFragments < 2:
> > I wanted people to use the new headline marker if they specify
> > NumFragments >= 1. If they do not specify the NumFragments or put it to
> Ok, but if you unify cover generation and NumFragments == 1 then result for old
> and new algorithms should be the same...
>
>
> > On an another note I found that make_tsvector crashes if it receives a
> > ParsedText with curwords = 0. Specifically uniqueWORD returns curwords
> > as 1 even when it gets 0 words. I am not sure if this is the desired
> > behavior.
> In all places there is a check before call of make_tsvector.
>


From: Teodor Sigaev <teodor(at)sigaev(dot)ru>
To: sushant354(at)gmail(dot)com
Cc: Pierre-Yves Strub <pierre(dot)yves(dot)strub(at)gmail(dot)com>, Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [GENERAL] Fragments in tsearch2 headline
Date: 2008-06-05 16:21:26
Message-ID: 48481286.5080103@sigaev.ru
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

> A couple of caveats:
>
> 1. ts_headline testing was done with current cvs head where as
> headline_with_fragments was done with postgres 8.3.1.
> 2. For headline_with_fragments, TSVector for the document was obtained
> by joining with another table.
> Are these differences understandable?

That is possible situation because ts_headline has several criterias of 'best'
covers - length, number of words from query, good words at the begin and at the
end of headline while your fragment's algorithm takes care only on total number
of words in all covers. It's not very good, but it's acceptable, I think.
Headline (and ranking too) hasn't any formal rules to define is it good or bad?
Just a people's opinions.

Next possible reason: original algorithm had a look on all covers trying to find
the best one while your algorithm tries to find just the shortest covers to fill
a headline.

But it's very desirable to use ShortWord - it's not very comfortable for user if
one option produces unobvious side effect with another one.
`

> If you think these caveats are the reasons or there is something I am
> missing, then I can repeat the entire experiments with exactly the same
> conditions.

Interesting for me test is a comparing hlCover with Cover in your patch, i.e.
develop a patch which uses hlCover instead of Cover and compare old patch with
new one.
--
Teodor Sigaev E-mail: teodor(at)sigaev(dot)ru
WWW: http://www.sigaev.ru/


From: Sushant Sinha <sushant354(at)gmail(dot)com>
To: Teodor Sigaev <teodor(at)sigaev(dot)ru>
Cc: Pierre-Yves Strub <pierre(dot)yves(dot)strub(at)gmail(dot)com>, Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [GENERAL] Fragments in tsearch2 headline
Date: 2008-06-21 14:00:53
Message-ID: 1214056853.8689.10.camel@dragflick
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

I have an attached an updated patch with following changes:

1. Respects ShortWord and MinWords
2. Uses hlCover instead of Cover
3. Does not store norm (or lexeme) for headline marking
4. Removes ts_rank.h
5. Earlier it was counting even NONWORDTOKEN in the headline. Now it
only counts the actual words and excludes spaces etc.

I have also changed NumFragments option to MaxFragments as there may not
be enough covers to display NumFragments.

Another change that I was thinking:

Right now if cover size > max_words then I just cut the trailing words.
Instead I was thinking that we should split the cover into more
fragments such that each fragment contains a few query words. Then each
fragment will not contain all query words but will show more occurrences
of query words in the headline. I would like to know what your opinion
on this is.

-Sushant.

On Thu, 2008-06-05 at 20:21 +0400, Teodor Sigaev wrote:
> > A couple of caveats:
> >
> > 1. ts_headline testing was done with current cvs head where as
> > headline_with_fragments was done with postgres 8.3.1.
> > 2. For headline_with_fragments, TSVector for the document was obtained
> > by joining with another table.
> > Are these differences understandable?
>
> That is possible situation because ts_headline has several criterias of 'best'
> covers - length, number of words from query, good words at the begin and at the
> end of headline while your fragment's algorithm takes care only on total number
> of words in all covers. It's not very good, but it's acceptable, I think.
> Headline (and ranking too) hasn't any formal rules to define is it good or bad?
> Just a people's opinions.
>
> Next possible reason: original algorithm had a look on all covers trying to find
> the best one while your algorithm tries to find just the shortest covers to fill
> a headline.
>
> But it's very desirable to use ShortWord - it's not very comfortable for user if
> one option produces unobvious side effect with another one.
> `
>
> > If you think these caveats are the reasons or there is something I am
> > missing, then I can repeat the entire experiments with exactly the same
> > conditions.
>
> Interesting for me test is a comparing hlCover with Cover in your patch, i.e.
> develop a patch which uses hlCover instead of Cover and compare old patch with
> new one.

Attachment Content-Type Size
headlines_v0.5.patch text/x-patch 11.0 KB

From: Teodor Sigaev <teodor(at)sigaev(dot)ru>
To: sushant354(at)gmail(dot)com
Cc: Pierre-Yves Strub <pierre(dot)yves(dot)strub(at)gmail(dot)com>, Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [GENERAL] Fragments in tsearch2 headline
Date: 2008-06-30 19:05:16
Message-ID: 48692E6C.80309@sigaev.ru
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers


> 1. Respects ShortWord and MinWords
> 2. Uses hlCover instead of Cover
> 3. Does not store norm (or lexeme) for headline marking
> 4. Removes ts_rank.h
> 5. Earlier it was counting even NONWORDTOKEN in the headline. Now it
> only counts the actual words and excludes spaces etc.
>
> I have also changed NumFragments option to MaxFragments as there may not
> be enough covers to display NumFragments.

Nice. But it will be good to resolve following issues:
1) Patch contains mistakes, I didn't investigate or carefully read it. Get
http://www.sai.msu.su/~megera/postgres/fts/apod.dump.gz and load in db.

Queries
# select ts_headline(body, plainto_tsquery('black hole'), 'MaxFragments=1') from
apod where to_tsvector(body) @@ plainto_tsquery('black hole');

and

# select ts_headline(body, plainto_tsquery('black hole'), 'MaxFragments=1') from
apod;

crash postgresql :(

2) pls, include in your patch documentation and regression tests.

>
> Another change that I was thinking:
>
> Right now if cover size > max_words then I just cut the trailing words.
> Instead I was thinking that we should split the cover into more
> fragments such that each fragment contains a few query words. Then each
> fragment will not contain all query words but will show more occurrences
> of query words in the headline. I would like to know what your opinion
> on this is.

Agreed.

--
Teodor Sigaev E-mail: teodor(at)sigaev(dot)ru
WWW: http://www.sigaev.ru/


From: "Sushant Sinha" <sushant354(at)gmail(dot)com>
To: "Teodor Sigaev" <teodor(at)sigaev(dot)ru>
Cc: "Pierre-Yves Strub" <pierre(dot)yves(dot)strub(at)gmail(dot)com>, "Pgsql Hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [GENERAL] Fragments in tsearch2 headline
Date: 2008-07-15 04:50:29
Message-ID: 9fb559330807142150m75fa325fv52f161e6857a712d@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

Attached a new patch that:

1. fixes previous bug
2. better handles the case when cover size is greater than the MaxWords.
Basically it divides a cover greater than MaxWords into fragments of
MaxWords, resizes each such fragment so that each end of the fragment
contains a query word and then evaluates best fragments based on number of
query words in each fragment. In case of tie it picks up the smaller
fragment. This allows more query words to be shown with multiple fragments
in case a single cover is larger than the MaxWords.

The resizing of a fragment such that each end is a query word provides room
for stretching both sides of the fragment. This (hopefully) better presents
the context in which query words appear in the document. If a cover is
smaller than MaxWords then the cover is treated as a fragment.

Let me know if you have any more suggestions or anything is not clear.

I have not yet added the regression tests. The regression test suite seemed
to be only ensuring that the function works. How many tests should I be
adding? Is there any other place that I need to add different test cases for
the function?

-Sushant.

Nice. But it will be good to resolve following issues:
> 1) Patch contains mistakes, I didn't investigate or carefully read it. Get
> http://www.sai.msu.su/~megera/postgres/fts/apod.dump.gz<http://www.sai.msu.su/%7Emegera/postgres/fts/apod.dump.gz>and load in db.
>
> Queries
> # select ts_headline(body, plainto_tsquery('black hole'), 'MaxFragments=1')
> from apod where to_tsvector(body) @@ plainto_tsquery('black hole');
>
> and
>
> # select ts_headline(body, plainto_tsquery('black hole'), 'MaxFragments=1')
> from apod;
>
> crash postgresql :(
>
> 2) pls, include in your patch documentation and regression tests.
>
>
>> Another change that I was thinking:
>>
>> Right now if cover size > max_words then I just cut the trailing words.
>> Instead I was thinking that we should split the cover into more
>> fragments such that each fragment contains a few query words. Then each
>> fragment will not contain all query words but will show more occurrences
>> of query words in the headline. I would like to know what your opinion
>> on this is.
>>
>
> Agreed.
>
>
> --
> Teodor Sigaev E-mail: teodor(at)sigaev(dot)ru
> WWW:
> http://www.sigaev.ru/
>

Attachment Content-Type Size
headlines_v0.6.patch text/x-diff 12.6 KB

From: Teodor Sigaev <teodor(at)sigaev(dot)ru>
To: Sushant Sinha <sushant354(at)gmail(dot)com>
Cc: Pierre-Yves Strub <pierre(dot)yves(dot)strub(at)gmail(dot)com>, Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [GENERAL] Fragments in tsearch2 headline
Date: 2008-07-15 09:29:54
Message-ID: 487C6E12.3030906@sigaev.ru
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

> Attached a new patch that:
>
> 1. fixes previous bug
> 2. better handles the case when cover size is greater than the MaxWords.

Looks good, I'll make some tests with real-world application.

> I have not yet added the regression tests. The regression test suite
> seemed to be only ensuring that the function works. How many tests
> should I be adding? Is there any other place that I need to add
> different test cases for the function?

Just add 3-5 selects to src/test/regress/sql/tsearch.sql with checking basic
functionality and corner cases like
- there is no covers in text
- Cover(s) is too big
- and so on

Add some words in documentation too, pls.

--
Teodor Sigaev E-mail: teodor(at)sigaev(dot)ru
WWW: http://www.sigaev.ru/


From: Sushant Sinha <sushant354(at)gmail(dot)com>
To: Teodor Sigaev <teodor(at)sigaev(dot)ru>
Cc: Pierre-Yves Strub <pierre(dot)yves(dot)strub(at)gmail(dot)com>, Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [GENERAL] Fragments in tsearch2 headline
Date: 2008-07-16 03:54:01
Message-ID: 1216180441.10591.1.camel@dragflick
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

attached are two patches:

1. documentation
2. regression tests
for headline with fragments.

-Sushant.

On Tue, 2008-07-15 at 13:29 +0400, Teodor Sigaev wrote:
> > Attached a new patch that:
> >
> > 1. fixes previous bug
> > 2. better handles the case when cover size is greater than the MaxWords.
>
> Looks good, I'll make some tests with real-world application.
>
> > I have not yet added the regression tests. The regression test suite
> > seemed to be only ensuring that the function works. How many tests
> > should I be adding? Is there any other place that I need to add
> > different test cases for the function?
>
> Just add 3-5 selects to src/test/regress/sql/tsearch.sql with checking basic
> functionality and corner cases like
> - there is no covers in text
> - Cover(s) is too big
> - and so on
>
> Add some words in documentation too, pls.
>
>

Attachment Content-Type Size
headlines_docv0.1.patch text/x-patch 1.7 KB
headlines_regressv0.1.patch text/x-patch 4.5 KB

From: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
To: Sushant Sinha <sushant354(at)gmail(dot)com>
Cc: Teodor Sigaev <teodor(at)sigaev(dot)ru>, Pierre-Yves Strub <pierre(dot)yves(dot)strub(at)gmail(dot)com>, Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [GENERAL] Fragments in tsearch2 headline
Date: 2008-07-16 22:53:12
Message-ID: Pine.LNX.4.64.0807170250470.11363@sn.sai.msu.ru
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

Sushant,

first, please, provide simple test queries, which demonstrate the right work
in the corner cases. This will helps reviewers to test your patch and
helps you to make sure your new version is ok. For example:

=# select ts_headline('1 2 3 4 5 1 2 3 1','1&3'::tsquery);
ts_headline
------------------------------------------------------
<b>1</b> 2 <b>3</b> 4 5 <b>1</b> 2 <b>3</b> <b>1</b>

This select breaks your code:

=# select ts_headline('1 2 3 4 5 1 2 3 1','1&3'::tsquery,'maxfragments=2');
ts_headline
--------------
... 2 ...

and so on ....

Oleg
On Tue, 15 Jul 2008, Sushant Sinha wrote:

> Attached a new patch that:
>
> 1. fixes previous bug
> 2. better handles the case when cover size is greater than the MaxWords.
> Basically it divides a cover greater than MaxWords into fragments of
> MaxWords, resizes each such fragment so that each end of the fragment
> contains a query word and then evaluates best fragments based on number of
> query words in each fragment. In case of tie it picks up the smaller
> fragment. This allows more query words to be shown with multiple fragments
> in case a single cover is larger than the MaxWords.
>
> The resizing of a fragment such that each end is a query word provides room
> for stretching both sides of the fragment. This (hopefully) better presents
> the context in which query words appear in the document. If a cover is
> smaller than MaxWords then the cover is treated as a fragment.
>
> Let me know if you have any more suggestions or anything is not clear.
>
> I have not yet added the regression tests. The regression test suite seemed
> to be only ensuring that the function works. How many tests should I be
> adding? Is there any other place that I need to add different test cases for
> the function?
>
> -Sushant.
>
>
> Nice. But it will be good to resolve following issues:
>> 1) Patch contains mistakes, I didn't investigate or carefully read it. Get
>> http://www.sai.msu.su/~megera/postgres/fts/apod.dump.gz<http://www.sai.msu.su/%7Emegera/postgres/fts/apod.dump.gz>and load in db.
>>
>> Queries
>> # select ts_headline(body, plainto_tsquery('black hole'), 'MaxFragments=1')
>> from apod where to_tsvector(body) @@ plainto_tsquery('black hole');
>>
>> and
>>
>> # select ts_headline(body, plainto_tsquery('black hole'), 'MaxFragments=1')
>> from apod;
>>
>> crash postgresql :(
>>
>> 2) pls, include in your patch documentation and regression tests.
>>
>>
>>> Another change that I was thinking:
>>>
>>> Right now if cover size > max_words then I just cut the trailing words.
>>> Instead I was thinking that we should split the cover into more
>>> fragments such that each fragment contains a few query words. Then each
>>> fragment will not contain all query words but will show more occurrences
>>> of query words in the headline. I would like to know what your opinion
>>> on this is.
>>>
>>
>> Agreed.
>>
>>
>> --
>> Teodor Sigaev E-mail: teodor(at)sigaev(dot)ru
>> WWW:
>> http://www.sigaev.ru/
>>
>

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg(at)sai(dot)msu(dot)su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83


From: Sushant Sinha <sushant354(at)gmail(dot)com>
To: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
Cc: Teodor Sigaev <teodor(at)sigaev(dot)ru>, Pierre-Yves Strub <pierre(dot)yves(dot)strub(at)gmail(dot)com>, Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [GENERAL] Fragments in tsearch2 headline
Date: 2008-07-16 23:08:53
Message-ID: 1216249733.5842.5.camel@dragflick
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

I will add test queries and their results for the corner cases in a
separate file. I guess the only thing I am confused about is what should
be the behavior of headline generation when Query items have words of
size less than ShortWord. I guess the answer is to ignore ShortWord
parameter but let me know if the answer is any different.

-Sushant.

On Thu, 2008-07-17 at 02:53 +0400, Oleg Bartunov wrote:
> Sushant,
>
> first, please, provide simple test queries, which demonstrate the right work
> in the corner cases. This will helps reviewers to test your patch and
> helps you to make sure your new version is ok. For example:
>
> =# select ts_headline('1 2 3 4 5 1 2 3 1','1&3'::tsquery);
> ts_headline
> ------------------------------------------------------
> <b>1</b> 2 <b>3</b> 4 5 <b>1</b> 2 <b>3</b> <b>1</b>
>
> This select breaks your code:
>
> =# select ts_headline('1 2 3 4 5 1 2 3 1','1&3'::tsquery,'maxfragments=2');
> ts_headline
> --------------
> ... 2 ...
>
> and so on ....
>
>
> Oleg
> On Tue, 15 Jul 2008, Sushant Sinha wrote:
>
> > Attached a new patch that:
> >
> > 1. fixes previous bug
> > 2. better handles the case when cover size is greater than the MaxWords.
> > Basically it divides a cover greater than MaxWords into fragments of
> > MaxWords, resizes each such fragment so that each end of the fragment
> > contains a query word and then evaluates best fragments based on number of
> > query words in each fragment. In case of tie it picks up the smaller
> > fragment. This allows more query words to be shown with multiple fragments
> > in case a single cover is larger than the MaxWords.
> >
> > The resizing of a fragment such that each end is a query word provides room
> > for stretching both sides of the fragment. This (hopefully) better presents
> > the context in which query words appear in the document. If a cover is
> > smaller than MaxWords then the cover is treated as a fragment.
> >
> > Let me know if you have any more suggestions or anything is not clear.
> >
> > I have not yet added the regression tests. The regression test suite seemed
> > to be only ensuring that the function works. How many tests should I be
> > adding? Is there any other place that I need to add different test cases for
> > the function?
> >
> > -Sushant.
> >
> >
> > Nice. But it will be good to resolve following issues:
> >> 1) Patch contains mistakes, I didn't investigate or carefully read it. Get
> >> http://www.sai.msu.su/~megera/postgres/fts/apod.dump.gz<http://www.sai.msu.su/%7Emegera/postgres/fts/apod.dump.gz>and load in db.
> >>
> >> Queries
> >> # select ts_headline(body, plainto_tsquery('black hole'), 'MaxFragments=1')
> >> from apod where to_tsvector(body) @@ plainto_tsquery('black hole');
> >>
> >> and
> >>
> >> # select ts_headline(body, plainto_tsquery('black hole'), 'MaxFragments=1')
> >> from apod;
> >>
> >> crash postgresql :(
> >>
> >> 2) pls, include in your patch documentation and regression tests.
> >>
> >>
> >>> Another change that I was thinking:
> >>>
> >>> Right now if cover size > max_words then I just cut the trailing words.
> >>> Instead I was thinking that we should split the cover into more
> >>> fragments such that each fragment contains a few query words. Then each
> >>> fragment will not contain all query words but will show more occurrences
> >>> of query words in the headline. I would like to know what your opinion
> >>> on this is.
> >>>
> >>
> >> Agreed.
> >>
> >>
> >> --
> >> Teodor Sigaev E-mail: teodor(at)sigaev(dot)ru
> >> WWW:
> >> http://www.sigaev.ru/
> >>
> >
>
> Regards,
> Oleg
> _____________________________________________________________
> Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
> Sternberg Astronomical Institute, Moscow University, Russia
> Internet: oleg(at)sai(dot)msu(dot)su, http://www.sai.msu.su/~megera/
> phone: +007(495)939-16-83, +007(495)939-23-83


From: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
To: Sushant Sinha <sushant354(at)gmail(dot)com>
Cc: Teodor Sigaev <teodor(at)sigaev(dot)ru>, Pierre-Yves Strub <pierre(dot)yves(dot)strub(at)gmail(dot)com>, Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [GENERAL] Fragments in tsearch2 headline
Date: 2008-07-16 23:28:54
Message-ID: Pine.LNX.4.64.0807170327060.11363@sn.sai.msu.ru
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

On Wed, 16 Jul 2008, Sushant Sinha wrote:

> I will add test queries and their results for the corner cases in a
> separate file. I guess the only thing I am confused about is what should
> be the behavior of headline generation when Query items have words of
> size less than ShortWord. I guess the answer is to ignore ShortWord
> parameter but let me know if the answer is any different.
>

ShortWord is about headline text, it doesn't affects words in query,
so you can't discard them from query.

> -Sushant.
>
> On Thu, 2008-07-17 at 02:53 +0400, Oleg Bartunov wrote:
>> Sushant,
>>
>> first, please, provide simple test queries, which demonstrate the right work
>> in the corner cases. This will helps reviewers to test your patch and
>> helps you to make sure your new version is ok. For example:
>>
>> =# select ts_headline('1 2 3 4 5 1 2 3 1','1&3'::tsquery);
>> ts_headline
>> ------------------------------------------------------
>> <b>1</b> 2 <b>3</b> 4 5 <b>1</b> 2 <b>3</b> <b>1</b>
>>
>> This select breaks your code:
>>
>> =# select ts_headline('1 2 3 4 5 1 2 3 1','1&3'::tsquery,'maxfragments=2');
>> ts_headline
>> --------------
>> ... 2 ...
>>
>> and so on ....
>>
>>
>> Oleg
>> On Tue, 15 Jul 2008, Sushant Sinha wrote:
>>
>>> Attached a new patch that:
>>>
>>> 1. fixes previous bug
>>> 2. better handles the case when cover size is greater than the MaxWords.
>>> Basically it divides a cover greater than MaxWords into fragments of
>>> MaxWords, resizes each such fragment so that each end of the fragment
>>> contains a query word and then evaluates best fragments based on number of
>>> query words in each fragment. In case of tie it picks up the smaller
>>> fragment. This allows more query words to be shown with multiple fragments
>>> in case a single cover is larger than the MaxWords.
>>>
>>> The resizing of a fragment such that each end is a query word provides room
>>> for stretching both sides of the fragment. This (hopefully) better presents
>>> the context in which query words appear in the document. If a cover is
>>> smaller than MaxWords then the cover is treated as a fragment.
>>>
>>> Let me know if you have any more suggestions or anything is not clear.
>>>
>>> I have not yet added the regression tests. The regression test suite seemed
>>> to be only ensuring that the function works. How many tests should I be
>>> adding? Is there any other place that I need to add different test cases for
>>> the function?
>>>
>>> -Sushant.
>>>
>>>
>>> Nice. But it will be good to resolve following issues:
>>>> 1) Patch contains mistakes, I didn't investigate or carefully read it. Get
>>>> http://www.sai.msu.su/~megera/postgres/fts/apod.dump.gz<http://www.sai.msu.su/%7Emegera/postgres/fts/apod.dump.gz>and load in db.
>>>>
>>>> Queries
>>>> # select ts_headline(body, plainto_tsquery('black hole'), 'MaxFragments=1')
>>>> from apod where to_tsvector(body) @@ plainto_tsquery('black hole');
>>>>
>>>> and
>>>>
>>>> # select ts_headline(body, plainto_tsquery('black hole'), 'MaxFragments=1')
>>>> from apod;
>>>>
>>>> crash postgresql :(
>>>>
>>>> 2) pls, include in your patch documentation and regression tests.
>>>>
>>>>
>>>>> Another change that I was thinking:
>>>>>
>>>>> Right now if cover size > max_words then I just cut the trailing words.
>>>>> Instead I was thinking that we should split the cover into more
>>>>> fragments such that each fragment contains a few query words. Then each
>>>>> fragment will not contain all query words but will show more occurrences
>>>>> of query words in the headline. I would like to know what your opinion
>>>>> on this is.
>>>>>
>>>>
>>>> Agreed.
>>>>
>>>>
>>>> --
>>>> Teodor Sigaev E-mail: teodor(at)sigaev(dot)ru
>>>> WWW:
>>>> http://www.sigaev.ru/
>>>>
>>>
>>
>> Regards,
>> Oleg
>> _____________________________________________________________
>> Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
>> Sternberg Astronomical Institute, Moscow University, Russia
>> Internet: oleg(at)sai(dot)msu(dot)su, http://www.sai.msu.su/~megera/
>> phone: +007(495)939-16-83, +007(495)939-23-83
>

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg(at)sai(dot)msu(dot)su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83


From: Sushant Sinha <sushant354(at)gmail(dot)com>
To: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
Cc: Teodor Sigaev <teodor(at)sigaev(dot)ru>, Pierre-Yves Strub <pierre(dot)yves(dot)strub(at)gmail(dot)com>, Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [GENERAL] Fragments in tsearch2 headline
Date: 2008-07-18 01:16:24
Message-ID: 1216343784.10058.5.camel@dragflick
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

Fixed some off by one errors pointed by Oleg and errors in excluding
overlapping fragments.

Also adding test queries and updating regression tests.

Let me know of any other changes that are needed.

-Sushant.

On Thu, 2008-07-17 at 03:28 +0400, Oleg Bartunov wrote:
> On Wed, 16 Jul 2008, Sushant Sinha wrote:
>
> > I will add test queries and their results for the corner cases in a
> > separate file. I guess the only thing I am confused about is what should
> > be the behavior of headline generation when Query items have words of
> > size less than ShortWord. I guess the answer is to ignore ShortWord
> > parameter but let me know if the answer is any different.
> >
>
> ShortWord is about headline text, it doesn't affects words in query,
> so you can't discard them from query.
>
> > -Sushant.
> >
> > On Thu, 2008-07-17 at 02:53 +0400, Oleg Bartunov wrote:
> >> Sushant,
> >>
> >> first, please, provide simple test queries, which demonstrate the right work
> >> in the corner cases. This will helps reviewers to test your patch and
> >> helps you to make sure your new version is ok. For example:
> >>
> >> =# select ts_headline('1 2 3 4 5 1 2 3 1','1&3'::tsquery);
> >> ts_headline
> >> ------------------------------------------------------
> >> <b>1</b> 2 <b>3</b> 4 5 <b>1</b> 2 <b>3</b> <b>1</b>
> >>
> >> This select breaks your code:
> >>
> >> =# select ts_headline('1 2 3 4 5 1 2 3 1','1&3'::tsquery,'maxfragments=2');
> >> ts_headline
> >> --------------
> >> ... 2 ...
> >>
> >> and so on ....
> >>
> >>
> >> Oleg
> >> On Tue, 15 Jul 2008, Sushant Sinha wrote:
> >>
> >>> Attached a new patch that:
> >>>
> >>> 1. fixes previous bug
> >>> 2. better handles the case when cover size is greater than the MaxWords.
> >>> Basically it divides a cover greater than MaxWords into fragments of
> >>> MaxWords, resizes each such fragment so that each end of the fragment
> >>> contains a query word and then evaluates best fragments based on number of
> >>> query words in each fragment. In case of tie it picks up the smaller
> >>> fragment. This allows more query words to be shown with multiple fragments
> >>> in case a single cover is larger than the MaxWords.
> >>>
> >>> The resizing of a fragment such that each end is a query word provides room
> >>> for stretching both sides of the fragment. This (hopefully) better presents
> >>> the context in which query words appear in the document. If a cover is
> >>> smaller than MaxWords then the cover is treated as a fragment.
> >>>
> >>> Let me know if you have any more suggestions or anything is not clear.
> >>>
> >>> I have not yet added the regression tests. The regression test suite seemed
> >>> to be only ensuring that the function works. How many tests should I be
> >>> adding? Is there any other place that I need to add different test cases for
> >>> the function?
> >>>
> >>> -Sushant.
> >>>
> >>>
> >>> Nice. But it will be good to resolve following issues:
> >>>> 1) Patch contains mistakes, I didn't investigate or carefully read it. Get
> >>>> http://www.sai.msu.su/~megera/postgres/fts/apod.dump.gz<http://www.sai.msu.su/%7Emegera/postgres/fts/apod.dump.gz>and load in db.
> >>>>
> >>>> Queries
> >>>> # select ts_headline(body, plainto_tsquery('black hole'), 'MaxFragments=1')
> >>>> from apod where to_tsvector(body) @@ plainto_tsquery('black hole');
> >>>>
> >>>> and
> >>>>
> >>>> # select ts_headline(body, plainto_tsquery('black hole'), 'MaxFragments=1')
> >>>> from apod;
> >>>>
> >>>> crash postgresql :(
> >>>>
> >>>> 2) pls, include in your patch documentation and regression tests.
> >>>>
> >>>>
> >>>>> Another change that I was thinking:
> >>>>>
> >>>>> Right now if cover size > max_words then I just cut the trailing words.
> >>>>> Instead I was thinking that we should split the cover into more
> >>>>> fragments such that each fragment contains a few query words. Then each
> >>>>> fragment will not contain all query words but will show more occurrences
> >>>>> of query words in the headline. I would like to know what your opinion
> >>>>> on this is.
> >>>>>
> >>>>
> >>>> Agreed.
> >>>>
> >>>>
> >>>> --
> >>>> Teodor Sigaev E-mail: teodor(at)sigaev(dot)ru
> >>>> WWW:
> >>>> http://www.sigaev.ru/
> >>>>
> >>>
> >>
> >> Regards,
> >> Oleg
> >> _____________________________________________________________
> >> Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
> >> Sternberg Astronomical Institute, Moscow University, Russia
> >> Internet: oleg(at)sai(dot)msu(dot)su, http://www.sai.msu.su/~megera/
> >> phone: +007(495)939-16-83, +007(495)939-23-83
> >
>
> Regards,
> Oleg
> _____________________________________________________________
> Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
> Sternberg Astronomical Institute, Moscow University, Russia
> Internet: oleg(at)sai(dot)msu(dot)su, http://www.sai.msu.su/~megera/
> phone: +007(495)939-16-83, +007(495)939-23-83

Attachment Content-Type Size
headlines_v0.8.patch text/x-patch 12.7 KB
headlines_test.txt text/x-vhdl 8.3 KB
headlines_regressv0.2.patch text/x-patch 4.5 KB

From: Teodor Sigaev <teodor(at)sigaev(dot)ru>
To: sushant354(at)gmail(dot)com
Cc: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>, Pierre-Yves Strub <pierre(dot)yves(dot)strub(at)gmail(dot)com>, Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [GENERAL] Fragments in tsearch2 headline
Date: 2008-07-23 11:36:28
Message-ID: 488717BC.5030907@sigaev.ru
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

> Let me know of any other changes that are needed.

Looks like ready to commit, but documentation is needed.

--
Teodor Sigaev E-mail: teodor(at)sigaev(dot)ru
WWW: http://www.sigaev.ru/


From: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
To: Teodor Sigaev <teodor(at)sigaev(dot)ru>
Cc: sushant354(at)gmail(dot)com, Pierre-Yves Strub <pierre(dot)yves(dot)strub(at)gmail(dot)com>, Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [GENERAL] Fragments in tsearch2 headline
Date: 2008-07-23 11:44:27
Message-ID: Pine.LNX.4.64.0807231543080.11363@sn.sai.msu.ru
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

btw, is it intentional to have '....' in headline ?

=# select ts_headline('1 2 3 4 5 1 2 3 1','1&4'::tsquery,'MaxFragments=1');
ts_headline
-------------------------
... <b>4</b> 5 <b>1</b>

Oleg

On Wed, 23 Jul 2008, Teodor Sigaev wrote:

>> Let me know of any other changes that are needed.
>
> Looks like ready to commit, but documentation is needed.
>
>

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg(at)sai(dot)msu(dot)su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83


From: "Sushant Sinha" <sushant354(at)gmail(dot)com>
To: "Oleg Bartunov" <oleg(at)sai(dot)msu(dot)su>
Cc: "Teodor Sigaev" <teodor(at)sigaev(dot)ru>, "Pierre-Yves Strub" <pierre(dot)yves(dot)strub(at)gmail(dot)com>, "Pgsql Hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [GENERAL] Fragments in tsearch2 headline
Date: 2008-07-23 16:17:26
Message-ID: 9fb559330807230917h19b61057p3d5fb59601fd9522@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

I guess it is more readable to add cover separator at the end of a fragment
than in the front. Let me know what you think and I can update it.

I think the right place for cover separator is in the structure
HeadlineParsedText just like startsel and stopsel. This will enable users to
specify their own cover separators. But this will require changes to the
structure as well as to the generateHeadline function. This option will not
also play well with the default headline generation function.

The default MaxWords = 35 seems a bit high for this headline generation
function and 20 seems to be more reasonable. Any thoughts?

-Sushant.

On Wed, Jul 23, 2008 at 7:44 AM, Oleg Bartunov <oleg(at)sai(dot)msu(dot)su> wrote:

> btw, is it intentional to have '....' in headline ?
>
> =# select ts_headline('1 2 3 4 5 1 2 3 1','1&4'::tsquery,'MaxFragments=1');
> ts_headline
> -------------------------
> ... <b>4</b> 5 <b>1</b>
>
>
>
> Oleg
>
> On Wed, 23 Jul 2008, Teodor Sigaev wrote:
>
> Let me know of any other changes that are needed.
>>>
>>
>> Looks like ready to commit, but documentation is needed.
>>
>>
>>
> Regards,
> Oleg
> _____________________________________________________________
> Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
> Sternberg Astronomical Institute, Moscow University, Russia
> Internet: oleg(at)sai(dot)msu(dot)su, http://www.sai.msu.su/~megera/<http://www.sai.msu.su/%7Emegera/>
> phone: +007(495)939-16-83, +007(495)939-23-83
>


From: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
To: Sushant Sinha <sushant354(at)gmail(dot)com>
Cc: Teodor Sigaev <teodor(at)sigaev(dot)ru>, Pierre-Yves Strub <pierre(dot)yves(dot)strub(at)gmail(dot)com>, Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [GENERAL] Fragments in tsearch2 headline
Date: 2008-07-23 20:28:46
Message-ID: Pine.LNX.4.64.0807232348100.11363@sn.sai.msu.ru
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

On Wed, 23 Jul 2008, Sushant Sinha wrote:

> I guess it is more readable to add cover separator at the end of a fragment
> than in the front. Let me know what you think and I can update it.

FragmentsDelimiter should *separate* fragments and that says all.
Not very difficult algorithmic problem, it's like perl's
join(FragmentsDelimiter, @array)

>
> I think the right place for cover separator is in the structure
> HeadlineParsedText just like startsel and stopsel. This will enable users to
> specify their own cover separators. But this will require changes to the
> structure as well as to the generateHeadline function. This option will not
> also play well with the default headline generation function.

As soon as we introduce FragmentsDelimiter we should make it
configurable.

>
> The default MaxWords = 35 seems a bit high for this headline generation
> function and 20 seems to be more reasonable. Any thoughts?

I think we should not change default value because it could change
behaviour of existing applications. I'm not sure if it'd be useful and
possible to define default values in CREATE TEXT SEARCH PARSER

>
> -Sushant.
>
> On Wed, Jul 23, 2008 at 7:44 AM, Oleg Bartunov <oleg(at)sai(dot)msu(dot)su> wrote:
>
>> btw, is it intentional to have '....' in headline ?
>>
>> =# select ts_headline('1 2 3 4 5 1 2 3 1','1&4'::tsquery,'MaxFragments=1');
>> ts_headline
>> -------------------------
>> ... <b>4</b> 5 <b>1</b>
>>
>>
>>
>> Oleg
>>
>> On Wed, 23 Jul 2008, Teodor Sigaev wrote:
>>
>> Let me know of any other changes that are needed.
>>>>
>>>
>>> Looks like ready to commit, but documentation is needed.
>>>
>>>
>>>
>> Regards,
>> Oleg
>> _____________________________________________________________
>> Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
>> Sternberg Astronomical Institute, Moscow University, Russia
>> Internet: oleg(at)sai(dot)msu(dot)su, http://www.sai.msu.su/~megera/<http://www.sai.msu.su/%7Emegera/>
>> phone: +007(495)939-16-83, +007(495)939-23-83
>>
>

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg(at)sai(dot)msu(dot)su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83


From: Sushant Sinha <sushant354(at)gmail(dot)com>
To: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
Cc: Teodor Sigaev <teodor(at)sigaev(dot)ru>, Pierre-Yves Strub <pierre(dot)yves(dot)strub(at)gmail(dot)com>, Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [GENERAL] Fragments in tsearch2 headline
Date: 2008-08-02 15:50:18
Message-ID: 1217692218.6000.8.camel@dragflick
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

Sorry for the delay. Here is the patch with FragmentDelimiter option.
It requires an extra option in HeadlineParsedText and uses that option
during generateHeadline.

Implementing notion of fragments in HeadlineParsedText and a separate
function to join them seems more complicated. So for the time being I
just dump a FragmentDelimiter whenever a new fragment (other than the
first one) starts.

The patch also contains the updated regression tests/results and also a
new test for FragmentDelimiter option. It also contains the
documentation for the new options.

I have also attached a separate file that tests different aspects of the
new headline generation function.

Let me know if anything else is needed.

-Sushant.

On Thu, 2008-07-24 at 00:28 +0400, Oleg Bartunov wrote:
> On Wed, 23 Jul 2008, Sushant Sinha wrote:
>
> > I guess it is more readable to add cover separator at the end of a fragment
> > than in the front. Let me know what you think and I can update it.
>
> FragmentsDelimiter should *separate* fragments and that says all.
> Not very difficult algorithmic problem, it's like perl's
> join(FragmentsDelimiter, @array)
>
> >
> > I think the right place for cover separator is in the structure
> > HeadlineParsedText just like startsel and stopsel. This will enable users to
> > specify their own cover separators. But this will require changes to the
> > structure as well as to the generateHeadline function. This option will not
> > also play well with the default headline generation function.
>
> As soon as we introduce FragmentsDelimiter we should make it
> configurable.
>
> >
> > The default MaxWords = 35 seems a bit high for this headline generation
> > function and 20 seems to be more reasonable. Any thoughts?
>
> I think we should not change default value because it could change
> behaviour of existing applications. I'm not sure if it'd be useful and
> possible to define default values in CREATE TEXT SEARCH PARSER
>
> >
> > -Sushant.
> >
> > On Wed, Jul 23, 2008 at 7:44 AM, Oleg Bartunov <oleg(at)sai(dot)msu(dot)su> wrote:
> >
> >> btw, is it intentional to have '....' in headline ?
> >>
> >> =# select ts_headline('1 2 3 4 5 1 2 3 1','1&4'::tsquery,'MaxFragments=1');
> >> ts_headline
> >> -------------------------
> >> ... <b>4</b> 5 <b>1</b>
> >>
> >>
> >>
> >> Oleg
> >>
> >> On Wed, 23 Jul 2008, Teodor Sigaev wrote:
> >>
> >> Let me know of any other changes that are needed.
> >>>>
> >>>
> >>> Looks like ready to commit, but documentation is needed.
> >>>
> >>>
> >>>
> >> Regards,
> >> Oleg
> >> _____________________________________________________________
> >> Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
> >> Sternberg Astronomical Institute, Moscow University, Russia
> >> Internet: oleg(at)sai(dot)msu(dot)su, http://www.sai.msu.su/~megera/<http://www.sai.msu.su/%7Emegera/>
> >> phone: +007(495)939-16-83, +007(495)939-23-83
> >>
> >
>
> Regards,
> Oleg
> _____________________________________________________________
> Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
> Sternberg Astronomical Institute, Moscow University, Russia
> Internet: oleg(at)sai(dot)msu(dot)su, http://www.sai.msu.su/~megera/
> phone: +007(495)939-16-83, +007(495)939-23-83

Attachment Content-Type Size
headlines_v0.9.patch text/x-patch 23.1 KB
headlines_test.txt text/x-vhdl 8.4 KB

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: sushant354(at)gmail(dot)com
Cc: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>, Teodor Sigaev <teodor(at)sigaev(dot)ru>, Pierre-Yves Strub <pierre(dot)yves(dot)strub(at)gmail(dot)com>, Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [GENERAL] Fragments in tsearch2 headline
Date: 2009-04-14 00:57:07
Message-ID: 26129.1239670627@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

Sushant Sinha <sushant354(at)gmail(dot)com> writes:
> Sorry for the delay. Here is the patch with FragmentDelimiter option.
> It requires an extra option in HeadlineParsedText and uses that option
> during generateHeadline.

I did some editing of the documentation for this patch and noticed that
the explanation of the fragment-based headline method says

If not all query words are found in the
document, then a single fragment of the first <literal>MinWords</>
in the document will be displayed.

(That's what it says now, that is, based on my editing and testing of
the original.) This seems like a pretty dumb fallback approach ---
if you have only a partial match, the headline generation suddenly
becomes about as stupid as it could possibly be. I could understand
doing the above if the text actually contains *none* of the query
words, but surely if it contains some of them we should still select
fragments centered on those words.

regards, tom lane


From: Sushant Sinha <sushant354(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>, Teodor Sigaev <teodor(at)sigaev(dot)ru>, Pierre-Yves Strub <pierre(dot)yves(dot)strub(at)gmail(dot)com>, Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [GENERAL] Fragments in tsearch2 headline
Date: 2009-04-14 01:18:50
Message-ID: 1239671930.6379.3.camel@dragflick
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

Headline generation uses hlCover to get fragments in text with *all*
query items. In case there is no such fragment, it does not return
anything.

What you are asking will either require returning *maximally* matching
covers or handling it as a separate case.

-Sushant.

On Mon, 2009-04-13 at 20:57 -0400, Tom Lane wrote:
> Sushant Sinha <sushant354(at)gmail(dot)com> writes:
> > Sorry for the delay. Here is the patch with FragmentDelimiter option.
> > It requires an extra option in HeadlineParsedText and uses that option
> > during generateHeadline.
>
> I did some editing of the documentation for this patch and noticed that
> the explanation of the fragment-based headline method says
>
> If not all query words are found in the
> document, then a single fragment of the first <literal>MinWords</>
> in the document will be displayed.
>
> (That's what it says now, that is, based on my editing and testing of
> the original.) This seems like a pretty dumb fallback approach ---
> if you have only a partial match, the headline generation suddenly
> becomes about as stupid as it could possibly be. I could understand
> doing the above if the text actually contains *none* of the query
> words, but surely if it contains some of them we should still select
> fragments centered on those words.
>
> regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: sushant354(at)gmail(dot)com
Cc: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>, Teodor Sigaev <teodor(at)sigaev(dot)ru>, Pierre-Yves Strub <pierre(dot)yves(dot)strub(at)gmail(dot)com>, Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [GENERAL] Fragments in tsearch2 headline
Date: 2009-04-14 01:24:58
Message-ID: 26620.1239672298@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

Sushant Sinha <sushant354(at)gmail(dot)com> writes:
> Headline generation uses hlCover to get fragments in text with *all*
> query items. In case there is no such fragment, it does not return
> anything.

> What you are asking will either require returning *maximally* matching
> covers or handling it as a separate case.

Efficiently being useless is still useless --- a headline selection
function needs to be robust, not fragile, and not doing anything useful
for a partial match sounds pretty fragile to me.

regards, tom lane