gsoc, oprrest function for text search

Lists: pgsql-hackers
From: Jan Urbański <j(dot)urbanski(at)students(dot)mimuw(dot)edu(dot)pl>
To: Postgres - Hackers <pgsql-hackers(at)postgresql(dot)org>
Cc: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Subject: gsoc, oprrest function for text search
Date: 2008-07-19 11:15:27
Message-ID: 4881CCCF.4020905@students.mimuw.edu.pl
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Here's a WIP patch implementing an oprrest function for tsvector @@
tsquery and tsquery @@ tsvector.

The idea is (quoting a comment)
/*
* Traverse the tsquery preorder, calculating selectivity as:
*
* selec(left_oper) * selec(right_oper) in AND nodes,
*
* selec(left_oper) + selec(right_oper) -
* selec(left_oper) * selec(right_oper) in OR nodes,
*
* 1 - select(oper) in NOT nodes
*
* freq[val] in VAL nodes, if the value is in MCELEM
* min(freq[MCELEM]) / 2 in VAL nodes, if it is not
*
*
* Implementation-wise, we sort the MCELEM array to use binary
* search on it.
*/

The patch still has many rough edges, but it applies to HEAD and passes
tests. I'm posting it mostly to get feedback about whether I'm going in
the right direction.

Cheers,
Jan

--
Jan Urbanski
GPG key ID: E583D7D2

ouden estin

Attachment Content-Type Size
tssel-gsoc08-tss.patch text/plain 13.0 KB

From: Jan Urbański <j(dot)urbanski(at)students(dot)mimuw(dot)edu(dot)pl>
To: Postgres - Hackers <pgsql-hackers(at)postgresql(dot)org>
Cc: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Subject: Re: gsoc, oprrest function for text search
Date: 2008-07-19 12:40:59
Message-ID: 4881E0DB.9070402@students.mimuw.edu.pl
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Jan Urbański wrote:
> The idea is (quoting a comment)
> /*
> * Traverse the tsquery preorder, calculating selectivity as:

Ekhm.
This should of course read "postorder"...

--
Jan Urbanski
GPG key ID: E583D7D2

ouden estin


From: "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>
To: Jan Urbański <j(dot)urbanski(at)students(dot)mimuw(dot)edu(dot)pl>
Cc: "Postgres - Hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: gsoc, oprrest function for text search
Date: 2008-07-29 07:23:49
Message-ID: 488EC585.8020502@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Jan Urbański wrote:
> Here's a WIP patch implementing an oprrest function for tsvector @@
> tsquery and tsquery @@ tsvector.
>
> The idea is (quoting a comment)
> /*
> * Traverse the tsquery preorder, calculating selectivity as:
> *
> * selec(left_oper) * selec(right_oper) in AND nodes,
> *
> * selec(left_oper) + selec(right_oper) -
> * selec(left_oper) * selec(right_oper) in OR nodes,
> *
> * 1 - select(oper) in NOT nodes
> *
> * freq[val] in VAL nodes, if the value is in MCELEM
> * min(freq[MCELEM]) / 2 in VAL nodes, if it is not

Seems reasonable.

> *
> * Implementation-wise, we sort the MCELEM array to use binary
> * search on it.
> */

Would it be possible to store the array in sorted order, to avoid
sorting it on every invocation of tssel?

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Jan Urbański <j(dot)urbanski(at)students(dot)mimuw(dot)edu(dot)pl>
To: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc: Postgres - Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: gsoc, oprrest function for text search
Date: 2008-07-29 07:27:11
Message-ID: 488EC64F.20701@students.mimuw.edu.pl
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Heikki Linnakangas wrote:
> Jan Urbański wrote:
>> Here's a WIP patch implementing an oprrest function for tsvector @@
>> tsquery and tsquery @@ tsvector.
>>
>> The idea is (quoting a comment)
>> /*
>> * Traverse the tsquery preorder, calculating selectivity as:
>> *
>> * selec(left_oper) * selec(right_oper) in AND nodes,
>> *
>> * selec(left_oper) + selec(right_oper) -
>> * selec(left_oper) * selec(right_oper) in OR nodes,
>> *
>> * 1 - select(oper) in NOT nodes
>> *
>> * freq[val] in VAL nodes, if the value is in MCELEM
>> * min(freq[MCELEM]) / 2 in VAL nodes, if it is not
>
> Seems reasonable.
>
>> *
>> * Implementation-wise, we sort the MCELEM array to use binary
>> * search on it.
>> */
>
> Would it be possible to store the array in sorted order, to avoid
> sorting it on every invocation of tssel?

It's being stored sorted on frequencies, like so:
[('dog', 0.9), ('cat', 0.8), ('sheep', 0.7)]
and I need it sorted on elements for bsearch().

I don't know if it's OK to break the rule that statistical data is
stored sorted on freqneucies. If so, then ts_typanalyze() would have to
change and do one more qsort() before storing the result.

--
Jan Urbanski
GPG key ID: E583D7D2

ouden estin