Re: Term positions in GIN fulltext index

Lists: pgsql-hackers
From: Yoann Moreau <yoann(dot)moreau(at)univ-avignon(dot)fr>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Term positions in GIN fulltext index
Date: 2011-11-03 15:52:23
Message-ID: 4EB2B8B7.1060806@univ-avignon.fr
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hello,
I'm using a GIN index for a text column on a big table. I use it to rank
the rows, but I also need to get the term positions for each document of a
subset of documents for one or more terms. I suppose these positions are stored
in the index as the to_tsvector shows them : 'lexeme':{positions}

I've searched and asked on general postgresql mailing list, and I assume
there is no simple way to get these term positions.

For example, for 2 rows of a 'docs' table with a text column 'text' (indexed with GIN) :
'I get lexemes and I get term positions.'
'Did you get the positions ?'

I'd need a function like this :
select term_positions(text, 'get') from docs;
id_doc | positions
--------+-----------
1 | {2,6}
2 | {3}

I'd like to add this function in my database, for experimental purpose.
I got a look at the source code but didn't find some code example using the GIN index ;
I can not figure out where the GIN index is read as a tsvector
or where the '@@' operator gets the matching tsvectors for the terms of the tsquery.

Any help about where to start reading would be very welcome :)

Regards,
Yoann Moreau


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: <pgsql-hackers(at)postgresql(dot)org>, "Yoann Moreau" <yoann(dot)moreau(at)univ-avignon(dot)fr>
Subject: Re: Term positions in GIN fulltext index
Date: 2011-11-03 17:29:38
Message-ID: 4EB289320200002500042992@gw.wicourts.gov
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Yoann Moreau <yoann(dot)moreau(at)univ-avignon(dot)fr> wrote:

> I'd need a function like this :
> select term_positions(text, 'get') from docs;
> id_doc | positions
> --------+-----------
> 1 | {2,6}
> 2 | {3}
>
> I'd like to add this function in my database, for experimental
> purpose. I got a look at the source code but didn't find some code
> example using the GIN index ;
> I can not figure out where the GIN index is read as a tsvector
> or where the '@@' operator gets the matching tsvectors for the
> terms of the tsquery.
>
> Any help about where to start reading would be very welcome :)

I'm not really clear on what you want to read about. Do you need
help creating your own function on the fly, or with how to access
the information to write the function?

If the former, these links might help:

http://www.postgresql.org/docs/9.1/interactive/extend.html

http://www.postgresql.org/docs/9.1/interactive/sql-createfunction.html

If the latter, have you looked at this file?:

src/backend/utils/adt/tsrank.c

Or was it something else that I'm missing?

-Kevin


From: Florian Pflug <fgp(at)phlo(dot)org>
To: Yoann Moreau <yoann(dot)moreau(at)univ-avignon(dot)fr>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Term positions in GIN fulltext index
Date: 2011-11-03 18:19:14
Message-ID: 20322C15-13A6-41C2-8860-BED67328378D@phlo.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Nov3, 2011, at 16:52 , Yoann Moreau wrote:
> I'm using a GIN index for a text column on a big table. I use it to rank
> the rows, but I also need to get the term positions for each document of a
> subset of documents for one or more terms. I suppose these positions are stored
> in the index as the to_tsvector shows them : 'lexeme':{positions}

There's a difference between values of type tsvector, and what GIN indices
on columns or expressions of type tsvector store.

Values of type tsvector, of course, store weights and positions for each lexem.

But GIN indices store only the bare lexems without weights and positions. In
general, GIN indices work by extracting "elements" from values to be indexed,
and store these "elements" in a btree, together with pointers to the rows
containing the indexed values.

Thus, if you created a function index on the results of to_tsvector, i.e.
if you do
CREATE INDEX gin_idx ON docs USING gin (to_tsvector(text))
then the weights and positions aren't stored anywhere - they'll only exists in
the transient, in-memory tsvector value that to_tsvector returns, but not in
the on-disk GIN index gin_idx.

For the positions and weights to be store, you need to store the result of
to_tsvector in a column of type tsvector, say text_tsvector, and create the
index as
CREATE INDEX gin_idx ON docs USING gin (text_tsvector)

The GIN index gin_idx still won't store weights and positions, but the column
text_tsvector will.

> For example, for 2 rows of a 'docs' table with a text column 'text' (indexed with GIN) :
> 'I get lexemes and I get term positions.'
> 'Did you get the positions ?'
>
> I'd need a function like this :
> select term_positions(text, 'get') from docs;
> id_doc | positions
> --------+-----------
> 1 | {2,6}
> 2 | {3}

As I pointed out above, you'll first need to make sure to store the result of
to_tsvector in a columns. Then, what you need seems to be a functions that
takes a tsvector value and returns the contained lexems as individual rows.

Postgres doesn't seem to contain such a function currently (don't believe that,
though - go and recheck the documentation. I don't know all thousands of built-in
functions by heart). But it's easy to add one. You could either use PL/pgSQL
to parse the tsvector's textual representation, or write a C function. If you
go the PL/pgSQL route, regexp_split_to_table() might come in handy.

> I'd like to add this function in my database, for experimental purpose.
> I got a look at the source code but didn't find some code example using the GIN index ;
> I can not figure out where the GIN index is read as a tsvector
> or where the '@@' operator gets the matching tsvectors for the terms of the tsquery.

The basic flow of information is:

to_tsvector takes a string, parses and, applies various dictionaries according
to the textsearch configuration, and finally returns a value of type tsvector.
See the files names tsvector* for the implementation of that process, and for
the implementation of the various support functions which work on values of type
tsvector.

The GIN index machinery then calls the tsvector's extractValue() function to extract
the "elements" mentioned above from the tsvector value. That function is called
gin_extract_tsvector() and lives in tsginidx.c. The extracted "elements" are
then added to the GIN index's internal btree.

During query execution, if postgres sees that the operator tsvector @@ tsquery
is used, and that the left argument is a GIN-indexed column, it will use the
extractQuery() and consistent() functions to quickly find matching rows by
scanning the internal btree index. In the case of tsvector and tsquery, the
implementation of these functions are gin_extract_tsquery() and
gin_tsquery_consistent(), found also in tsginidx.c.

I suggest you read http://www.postgresql.org/docs/9.1/interactive/gin.html,
it explains all of this in (much) more detail.

best regards,
Florian Pflug


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Yoann Moreau <yoann(dot)moreau(at)univ-avignon(dot)fr>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Term positions in GIN fulltext index
Date: 2011-11-03 19:01:29
Message-ID: 7024.1320346889@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Yoann Moreau <yoann(dot)moreau(at)univ-avignon(dot)fr> writes:
> I'm using a GIN index for a text column on a big table. I use it to rank
> the rows, but I also need to get the term positions for each document of a
> subset of documents for one or more terms. I suppose these positions are stored
> in the index as the to_tsvector shows them : 'lexeme':{positions}

I'm pretty sure that a GIN index on tsvector does *not* store positions
--- it only knows about the strings. Don't know one way or the other
about GIST.

regards, tom lane


From: Marcin Mańk <marcin(dot)mank(at)gmail(dot)com>
To: Yoann Moreau <yoann(dot)moreau(at)univ-avignon(dot)fr>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Term positions in GIN fulltext index
Date: 2011-11-03 19:34:15
Message-ID: CAK61fk6-oG3s+wK3aJXwrbk5uAEspu__-pGc+D3TBzyH_nCDmw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Nov 3, 2011 at 4:52 PM, Yoann Moreau
<yoann(dot)moreau(at)univ-avignon(dot)fr> wrote:
> I'd need a function like this :
> select term_positions(text, 'get') from docs;
>  id_doc | positions
> --------+-----------
>      1 |     {2,6}
>      2 |       {3}
>

check this out:
http://www.postgresql.org/docs/current/static/textsearch-debugging.html
ts_debug does what You want, and more. Look at it's source - it`s a
plain sql function, You can make something based on it.

Greetings
Marcin Mańk


From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Yoann Moreau <yoann(dot)moreau(at)univ-avignon(dot)fr>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Term positions in GIN fulltext index
Date: 2011-11-03 19:40:13
Message-ID: CAPpHfdtVWHJR3C5WKYHPxBajT5mtmGPKbdOQGrhxGTcACeNZ-A@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Nov 3, 2011 at 11:01 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> Yoann Moreau <yoann(dot)moreau(at)univ-avignon(dot)fr> writes:
> > I'm using a GIN index for a text column on a big table. I use it to rank
> > the rows, but I also need to get the term positions for each document of
> a
> > subset of documents for one or more terms. I suppose these positions are
> stored
> > in the index as the to_tsvector shows them : 'lexeme':{positions}
>
> I'm pretty sure that a GIN index on tsvector does *not* store positions
> --- it only knows about the strings. Don't know one way or the other
> about GIST.
>
GiST index doesn't store positions too. See gtsvector_compress. It converts
tsvector to array of crc32 of words. If that value is anyway too large then
function converts it to signature.

------
With best regards,
Alexander Korotkov.


From: Yoann Moreau <yoann(dot)moreau(at)univ-avignon(dot)fr>
To: Florian Pflug <fgp(at)phlo(dot)org>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Term positions in GIN fulltext index
Date: 2011-11-04 10:15:15
Message-ID: 4EB3BB33.9080801@univ-avignon.fr
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 03/11/11 19:19, Florian Pflug wrote:
> There's a difference between values of type tsvector, and what GIN indices
> on columns or expressions of type tsvector store.

I was wondering what was the point about storing the tsvector in the
table, I now understand. I then should use the GIN index to rank my
documents, and work on the stored tsvectors for positions.

> As I pointed out above, you'll first need to make sure to store the result of
> to_tsvector in a columns. Then, what you need seems to be a functions that
> takes a tsvector value and returns the contained lexems as individual rows.
>
> Postgres doesn't seem to contain such a function currently (don't believe that,
> though - go and recheck the documentation. I don't know all thousands of built-in
> functions by heart). But it's easy to add one. You could either use PL/pgSQL
> to parse the tsvector's textual representation, or write a C function. If you
> go the PL/pgSQL route, regexp_split_to_table() might come in handy.

This seems easier to program than what I was thinking about, I'm going
to do that. But I'm wondering about size of database with the GIN index
plus the tsvector column, and performance about parsing the whole
tsvectors for each document I need positions from (as I need them for a
very few terms).

Maybe some external fulltext engine managing lexemes and positions would
be more efficient for my purpose. I'll try some different things and let
you know the results.

Thanks all for your help
Regards,
Yoann Moreau


From: Florian Pflug <fgp(at)phlo(dot)org>
To: Yoann Moreau <yoann(dot)moreau(at)univ-avignon(dot)fr>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Term positions in GIN fulltext index
Date: 2011-11-04 11:15:56
Message-ID: 25F8CB23-35D6-481A-8AC6-F8396838D7C8@phlo.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Nov4, 2011, at 11:15 , Yoann Moreau wrote:
> On 03/11/11 19:19, Florian Pflug wrote:
>> Postgres doesn't seem to contain such a function currently (don't believe that,
>> though - go and recheck the documentation. I don't know all thousands of built-in
>> functions by heart). But it's easy to add one. You could either use PL/pgSQL
>> to parse the tsvector's textual representation, or write a C function. If you
>> go the PL/pgSQL route, regexp_split_to_table() might come in handy.
>
> This seems easier to program than what I was thinking about, I'm going to do that.
> But I'm wondering about size of database with the GIN index plus the tsvector column,
> and performance about parsing the whole tsvectors for each document I need positions
> from (as I need them for a very few terms).

AFAICS, the internal storage layout of tsvector should allow you to extract an
individual lexem's positions quite efficiently (with time complexity log(N) where
N is the number of lexems in the tsvector). Doing so will require you to implement
your function in C though - any solution that works from a tsvector's textual
representation will obviously have time complexity N.

best regards,
Florian Pflug


From: Yoann Moreau <yoann(dot)moreau(at)univ-avignon(dot)fr>
To: Florian Pflug <fgp(at)phlo(dot)org>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Term positions in GIN fulltext index
Date: 2011-11-04 14:26:01
Message-ID: 4EB3F5F9.4020909@univ-avignon.fr
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 04/11/11 12:15, Florian Pflug wrote:
> AFAICS, the internal storage layout of tsvector should allow you to extract an
> individual lexem's positions quite efficiently (with time complexity log(N) where
> N is the number of lexems in the tsvector). Doing so will require you to implement
> your function in C though - any solution that works from a tsvector's textual
> representation will obviously have time complexity N.
>
> best regards,
> Florian Pflug
>
I'll do a pl/pgsql function first, I need to test it with other parts of
the project. But I will look for more efficient algorithms for a C
function as soon as possible if we still decide to use the postgresql
fulltext engine.

Regards,
Yoann Moreau