possible bug in cover density ranking?

Lists: pgsql-hackers
From: Sushant Sinha <sushant354(at)gmail(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: possible bug in cover density ranking?
Date: 2009-01-29 06:07:32
Message-ID: 1233209252.18692.24.camel@dragflick
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

I am running postgres 8.3.1. In tsrank.c I am looking at the cover
density function used for ranking while doing text search:
float4
calc_rank_cd(float4 *arrdata, TSVector txt, TSQuery query, int method)

Here is the excerpt of code that I think may possibly have bug when
document is big enough to exceed the 16383 position limit.

CODE
=======================================================
Cpos = ((double) (ext.end - ext.begin + 1)) / InvSum;

/*
* if doc are big enough then ext.q may be equal to ext.p due to limit
* of posional information. In this case we approximate number of
* noise word as half cover's length
*/
nNoise = (ext.q - ext.p) - (ext.end - ext.begin);
if (nNoise < 0)
nNoise = (ext.end - ext.begin) / 2
Wdoc += Cpos / ((double) (1 + nNoise));
=======================================================

As per my understanding, ext.end -ext.begin + 1 is the number of query
items in the cover and ext.q-ext.p says the length of the cover.

So consider a query with two query items. When we run out of position
information, Cover returns ext.q = 16383 and ext.p = 16383 and the
number of query items= ext.end-ext-begin + 1 = 2

nNoise becomes -1 and then nNoise is initialized to (ext.end
-ext.begin)/2 = 0
Wdoc becomes Cpos = 2/InvSum = 2/(1/0.1+1/0.1) = 0.1

Is this what is desired? It seems to me that Wdoc is getting a high
ranking even when we are not sure of the position information.

The comment above says that "In this case we approximate number of
noise word as half cover's length". But we do not know the cover's
length in this case as ext.p and ext.q are both unreliable. And ext.end
-ext.begin is not the "cover's length". It is the number of query items
found in the cover.

Any clarification would be useful.

Thanks,
-Sushant.


From: Teodor Sigaev <teodor(at)sigaev(dot)ru>
To: sushant354(at)gmail(dot)com
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: possible bug in cover density ranking?
Date: 2009-01-29 17:38:25
Message-ID: 4981E991.3030002@sigaev.ru
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> Is this what is desired? It seems to me that Wdoc is getting a high
> ranking even when we are not sure of the position information.
0.1 is not very high rank, and we could not suggest any reasonable rank in this
case. This document may be good, may be bad. rank_cd is not limited by 1.

>
> The comment above says that "In this case we approximate number of
> noise word as half cover's length". But we do not know the cover's
> length in this case as ext.p and ext.q are both unreliable. And ext.end
> -ext.begin is not the "cover's length". It is the number of query items
> found in the cover.

Yeah, but if there is no information then information is absent :), but I agree
with you to change comment
--
Teodor Sigaev E-mail: teodor(at)sigaev(dot)ru
WWW: http://www.sigaev.ru/


From: Sushant Sinha <sushant354(at)gmail(dot)com>
To: Teodor Sigaev <teodor(at)sigaev(dot)ru>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: possible bug in cover density ranking?
Date: 2009-01-29 18:54:12
Message-ID: 9fb559330901291054l13164cf8ia45caa1ea52cdc96@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Jan 29, 2009 at 12:38 PM, Teodor Sigaev <teodor(at)sigaev(dot)ru> wrote:

> Is this what is desired? It seems to me that Wdoc is getting a high
>> ranking even when we are not sure of the position information.
>>
> 0.1 is not very high rank, and we could not suggest any reasonable rank in
> this case. This document may be good, may be bad. rank_cd is not limited by
> 1.

For a cover of 2 query items, 0.1 is actually the maximum rank. This is only
possible when both query items are adjacent to each other.

0.1 may not seem too high when we look at its absoule value. But the problem
is we are ranking a document for which we have no positional information
available higher than a document for which we may have positional
information available with let suppose the cover length of 3. I think we
should rank the document with cover length 3 higher than the document for
which we have no positional information (and assume cover length of 2 as we
are doing now).

I feel that if ext.p=ext.q for query items > 1, then we should not count
that cover for ranking at all. Or, another option will be to significantly
inflate nNoise in this scenrio to say 100. Putting
nNoise=(ext.end-ext.begin)/2 is way too low for covers that we have no idea
on (it is 0 for query items = 2).

I am not assuming or suggesting that rank_cd is bounded by one. Off course
its rank increases as more and more covers are added.

Thanks,
Sushant.

>
>
>
>> The comment above says that "In this case we approximate number of
>> noise word as half cover's length". But we do not know the cover's
>> length in this case as ext.p and ext.q are both unreliable. And ext.end
>> -ext.begin is not the "cover's length". It is the number of query items
>> found in the cover.
>>
>
> Yeah, but if there is no information then information is absent :), but I
> agree with you to change comment
> --
> Teodor Sigaev E-mail: teodor(at)sigaev(dot)ru
> WWW:
> http://www.sigaev.ru/
>


From: Sushant Sinha <sushant354(at)gmail(dot)com>
To: Teodor Sigaev <teodor(at)sigaev(dot)ru>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: possible bug in cover density ranking?
Date: 2009-05-02 01:20:34
Message-ID: 1241227234.4633.1.camel@dragflick
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

I see this as open items here

http://wiki.postgresql.org/wiki/PostgreSQL_8.4_Open_Items

Any interest in fixing this?

-Sushant.

On Thu, 2009-01-29 at 13:54 -0500, Sushant Sinha wrote:
>
>
> On Thu, Jan 29, 2009 at 12:38 PM, Teodor Sigaev <teodor(at)sigaev(dot)ru>
> wrote:
> Is this what is desired? It seems to me that Wdoc is
> getting a high
> ranking even when we are not sure of the position
> information.
> 0.1 is not very high rank, and we could not suggest any
> reasonable rank in this case. This document may be good, may
> be bad. rank_cd is not limited by 1.
>
>
> For a cover of 2 query items, 0.1 is actually the maximum rank. This
> is only possible when both query items are adjacent to each other.
>
> 0.1 may not seem too high when we look at its absoule value. But the
> problem is we are ranking a document for which we have no positional
> information available higher than a document for which we may have
> positional information available with let suppose the cover length of
> 3. I think we should rank the document with cover length 3 higher than
> the document for which we have no positional information (and assume
> cover length of 2 as we are doing now).
>
> I feel that if ext.p=ext.q for query items > 1, then we should not
> count that cover for ranking at all. Or, another option will be to
> significantly inflate nNoise in this scenrio to say 100. Putting
> nNoise=(ext.end-ext.begin)/2 is way too low for covers that we have no
> idea on (it is 0 for query items = 2).
>
> I am not assuming or suggesting that rank_cd is bounded by one. Off
> course its rank increases as more and more covers are added.
>
> Thanks,
> Sushant.
>
>
>
> The comment above says that "In this case we
> approximate number of
> noise word as half cover's length". But we do not know
> the cover's
> length in this case as ext.p and ext.q are both
> unreliable. And ext.end
> -ext.begin is not the "cover's length". It is the
> number of query items
> found in the cover.
>
>
> Yeah, but if there is no information then information is
> absent :), but I agree with you to change comment
> --
> Teodor Sigaev E-mail:
> teodor(at)sigaev(dot)ru
> WWW:
> http://www.sigaev.ru/
>