Re: building tsquery directly in memory (avoid makepol)

Lists: pgsql-hackers
From: Ivan Sergio Borgonovo <mail(at)webthatworks(dot)it>
To: pgsql-hackers(at)postgresql(dot)org
Subject: building tsquery directly in memory (avoid makepol)
Date: 2010-02-04 18:24:02
Message-ID: 20100204192402.1a9e9a73@dawn.webthatworks.it

I know the structure of the whole tsquery in advance: it has already
been reduced and the lexemes have already been computed.
I'd like to write it directly into memory without having to go
through pushValue/makepol.

However, I'm not quite sure what the layout of a tsquery in memory
is, and I still haven't been able to find the macros that could
help me [1].

Before I do it by trial and error, could somebody give me an
example?
I'm not quite sure about my interpretation of the comments in the
documentation.

This is how I'd write
X:AB | YY:C | ZZZ:D
in memory:

TSQuery
  vl_len_ (total # of bytes of the whole following structure:
           sizeof(QueryItem) * size + total lexeme length)
  size (# of QueryItems in the query)
  QueryItem
    type QI_OPR
    oper OP_OR
    left -> distance from QueryItem X:AB
  QueryItem
    type QI_OPR
    oper OP_OR
    left -> distance from QueryItem ZZZ:D
  QueryItem (X)
    type QI_VAL
    weight 1100
    valcrc ???
    length 1
    distance
  QueryItem (YY)
    type QI_VAL
    weight 0010
    valcrc ???
    length 2
    distance
  QueryItem (ZZZ)
    type QI_VAL
    weight 0001
    valcrc ???
    length 3
    distance
  X
  YY
  ZZZ

[1] The equivalent of POSTDATALEN and WEP_GETWEIGHT, a macro to
compute the size of the various parts of a TSQuery, etc.
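
The size arithmetic and the weight bitmasks in the layout above can be
sketched in standalone C. This is only an illustration under stated
assumptions: SketchItem, SKETCH_HDRSZ, weight_mask and
sketch_query_size are hypothetical names, SketchItem is not the real
QueryItem definition, and the sketch assumes each lexeme is stored
NUL-terminated after the item array. PostgreSQL's
src/include/tsearch/ts_type.h appears to provide HDRSIZETQ,
COMPUTESIZE, GETQUERY and GETOPERAND for the real structure; check
them in your source tree before relying on them.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a tsquery item; NOT PostgreSQL's QueryItem. */
typedef struct
{
    int8_t   type;      /* value or operator */
    uint8_t  weight;    /* bitmask of accepted weights */
    int32_t  valcrc;    /* CRC of the lexeme */
    uint32_t length;    /* lexeme length in bytes */
    uint32_t distance;  /* offset of the lexeme in the operand area */
} SketchItem;

/* Assumed header: vl_len_ plus the item count, both 32-bit. */
#define SKETCH_HDRSZ (sizeof(int32_t) + sizeof(int32_t))

/* Map weight letters to a bitmask: A = 1000, B = 0100, C = 0010, D = 0001. */
static uint8_t
weight_mask(const char *letters)
{
    uint8_t mask = 0;

    for (; *letters != '\0'; letters++)
    {
        assert(*letters >= 'A' && *letters <= 'D');
        mask |= (uint8_t) (1 << ('D' - *letters));
    }
    return mask;
}

/*
 * Total bytes needed: header, then nitems items, then the lexemes.
 * Assumption: every lexeme is followed by one NUL byte.
 */
static size_t
sketch_query_size(int nitems, const char *const lexemes[], int nlexemes)
{
    size_t operand_bytes = 0;

    for (int i = 0; i < nlexemes; i++)
        operand_bytes += strlen(lexemes[i]) + 1;

    return SKETCH_HDRSZ + (size_t) nitems * sizeof(SketchItem) + operand_bytes;
}
```

For X:AB | YY:C | ZZZ:D there are five items (two operators, three
values) and the lexemes X, YY and ZZZ occupy 2 + 3 + 4 bytes including
terminators, so the call would be sketch_query_size(5, lexemes, 3),
and weight_mask("AB") yields the 1100 pattern used in the layout.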

I couldn't see any place in the code where a TSQuery is built in "one
shot" instead of using pushValue.

Another thing I'd like to know is: what is going to be preferred
during a scan between
'java:1A,2B '::tsvector @@ to_tsquery('java:A | java:B');
vs.
'java:1A,2B '::tsvector @@ to_tsquery('java:AB')
?
they look equivalent. Are they?

thanks

--
Ivan Sergio Borgonovo
http://www.webthatworks.it


From: Teodor Sigaev <teodor(at)sigaev(dot)ru>
To: Ivan Sergio Borgonovo <mail(at)webthatworks(dot)it>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: building tsquery directly in memory (avoid makepol)
Date: 2010-02-04 19:13:02
Message-ID: 4B6B1C3E.9060202@sigaev.ru

> Before I do it by trial and error, could somebody give me an
> example?
> I'm not quite sure about my interpretation of the comments in the
> documentation.
> TSQuery
[skipped]
Right, valcrc is computed in pushValue

> I couldn't see any place in the code where a TSQuery is built in "one
> shot" instead of using pushValue.
That's because in all those places we may have to parse a rather
complex structure. A simple OR-ed query could be hardcoded as
pushValue('X');
pushValue('YY');
pushOperator(OP_OR);
pushValue('ZZZ');
pushOperator(OP_OR);

You need to call pushValue/pushOperator in the order of postfix
(reverse Polish) notation, i.e. operands before their operator.
Note that you can also use a different order:
pushValue('X');
pushValue('YY');
pushValue('ZZZ');
pushOperator(OP_OR);
pushOperator(OP_OR);

So the first example will produce ( X | YY ) | ZZZ, and the second one
X | ( YY | ZZZ ).
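
To see why the two call orders give different groupings, here is a
small standalone simulation. It is only a sketch: MockState,
mock_push_value and mock_push_operator are hypothetical stand-ins that
keep a stack of infix strings. They are not PostgreSQL's pushValue and
pushOperator, whose real signatures take a parser state and further
arguments.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_DEPTH 16
#define MAX_TEXT  128

/* Hypothetical mock: a stack of infix strings, not the real parser state. */
typedef struct
{
    char text[MAX_DEPTH][MAX_TEXT];
    int  is_leaf[MAX_DEPTH];
    int  depth;
} MockState;

/* Push a single lexeme onto the stack. */
static void
mock_push_value(MockState *st, const char *lexeme)
{
    assert(st->depth < MAX_DEPTH);
    snprintf(st->text[st->depth], MAX_TEXT, "%s", lexeme);
    st->is_leaf[st->depth] = 1;
    st->depth++;
}

/* Copy an operand, adding parentheses around non-leaf subexpressions. */
static void
wrap(char *out, const char *text, int is_leaf)
{
    if (is_leaf)
        snprintf(out, MAX_TEXT, "%s", text);
    else
        snprintf(out, MAX_TEXT, "( %s )", text);
}

/* Pop the two topmost operands and push "left op right" in their place. */
static void
mock_push_operator(MockState *st, char op)
{
    char left[MAX_TEXT];
    char right[MAX_TEXT];
    char combined[MAX_TEXT];

    assert(st->depth >= 2);
    wrap(right, st->text[st->depth - 1], st->is_leaf[st->depth - 1]);
    wrap(left, st->text[st->depth - 2], st->is_leaf[st->depth - 2]);
    snprintf(combined, MAX_TEXT, "%s %c %s", left, op, right);

    st->depth--;
    snprintf(st->text[st->depth - 1], MAX_TEXT, "%s", combined);
    st->is_leaf[st->depth - 1] = 0;
}
```

Replaying the first sequence (X, YY, OR, ZZZ, OR) on a zeroed
MockState leaves one string on the stack that groups X and YY first,
while the second sequence (X, YY, ZZZ, OR, OR) groups YY and ZZZ
first.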

>
> Another thing I'd like to know is: what is going to be preferred
> during a scan between
> 'java:1A,2B '::tsvector @@ to_tsquery('java:A | java:B');
> vs.
> 'java:1A,2B '::tsvector @@ to_tsquery('java:AB')
> ?
> they look equivalent. Are they?

Yes, but the second one should be more efficient.
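
The equivalence can be checked with a small standalone sketch of
weight matching. It assumes the usual encoding where a position
carries one weight (D = 0 up to A = 3) and a query operand carries a
bitmask of accepted weights, with an empty mask meaning "any weight".
The names operand_matches and equivalent_for_all_weights are
hypothetical and this is not the backend's matching code. One
plausible reason for the efficiency difference is that the merged form
has a single operand, so the lexeme only has to be looked up once.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed position weights: D = 0, C = 1, B = 2, A = 3. */
enum { W_D = 0, W_C = 1, W_B = 2, W_A = 3 };

/*
 * Does an operand with the given weight mask accept a position of the
 * given weight?  Assumption: mask 0 means no weight restriction.
 */
static int
operand_matches(uint8_t mask, int pos_weight)
{
    assert(pos_weight >= W_D && pos_weight <= W_A);
    return mask == 0 || (mask & (1 << pos_weight)) != 0;
}

/*
 * Compare "java:A | java:B" with "java:AB" for every possible weight
 * of a single position.  Returns 1 if both forms always agree.
 */
static int
equivalent_for_all_weights(void)
{
    const uint8_t a_mask = 1 << W_A;
    const uint8_t b_mask = 1 << W_B;

    for (int w = W_D; w <= W_A; w++)
    {
        int ored = operand_matches(a_mask, w) || operand_matches(b_mask, w);
        int merged = operand_matches((uint8_t) (a_mask | b_mask), w);

        if (ored != merged)
            return 0;
    }
    return 1;
}
```

Since a tsvector lexeme matches when any of its positions matches, the
per-position agreement carries over to the whole lexeme.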
--
Teodor Sigaev E-mail: teodor(at)sigaev(dot)ru
WWW: http://www.sigaev.ru/


From: Ivan Sergio Borgonovo <mail(at)webthatworks(dot)it>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: building tsquery directly in memory (avoid makepol)
Date: 2010-02-05 02:12:55
Message-ID: 20100205031255.0ca275d2@dawn.webthatworks.it

On Thu, 04 Feb 2010 22:13:02 +0300
Teodor Sigaev <teodor(at)sigaev(dot)ru> wrote:

> > Before I do it by trial and error, could somebody give me an
> > example?
> > I'm not quite sure about my interpretation of the comments in
> > the documentation.
> > TSQuery
> [skipped]
> Right, valcrc is computed in pushValue

Anyway, the structure I posted is correct, isn't it?
Is there any macro equivalent to POSTDATALEN and WEP_GETWEIGHT, and a
macro to get the memory size of a TSQuery?
I think I've seen macros that could help me determine the size of
a TSQuery, but I haven't noticed anything like POSTDATALEN, which
would come in very handy for traversing a TSQuery.

I was thinking of skipping pushValue and building the TSQuery directly
in memory, since my queries have a very simple structure and are easy
to reduce.
Still, the memory size is not immediately known in advance.
For OR queries it is easy, but for AND queries I'll have to loop over
a tsvector, filter the weights according to a passed parameter and
see how many times I have to duplicate a lexeme, once for each weight.

e.g.

tsvector_to_tsquery(
'pizza:1A,2B risotto:2C,4D barolo:5A,6C', '&', 'ACD'
);

should be turned into

pizza:A & risotto:C & risotto:D & barolo:A & barolo:C

I noticed you actually loop over the tsvector in tsvectorout to
allocate the memory for the string buffer, and I was wondering whether
it is really worth doing the same in my case.
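
The counting pass followed by a build pass can be sketched in
standalone C. Everything here is hypothetical: SketchEntry is a
simplified in-memory form of a tsvector entry (a lexeme plus the
weight letters of its positions), not WordEntry, and the output is a
text query rather than a packed TSQuery. The point is only to show
that the number of operands is known after one cheap loop.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical tsvector entry: lexeme plus the weights of its positions. */
typedef struct
{
    const char *lexeme;
    const char *weights;    /* e.g. "AB" for positions 1A,2B */
} SketchEntry;

/* Pass 1: count one operand per (lexeme, requested weight) pair present. */
static int
count_operands(const SketchEntry *v, int n, const char *filter)
{
    int count = 0;

    for (int i = 0; i < n; i++)
        for (const char *f = filter; *f != '\0'; f++)
            if (strchr(v[i].weights, *f) != NULL)
                count++;
    return count;
}

/*
 * Pass 2: write "lexeme:W op lexeme:W ..." into out.
 * Returns the number of operands written, or -1 if out is too small.
 */
static int
build_query(const SketchEntry *v, int n, const char *filter, char op,
            char *out, size_t outsz)
{
    const char sep[4] = {' ', op, ' ', '\0'};
    size_t     used = 0;
    int        emitted = 0;

    assert(outsz > 0);
    out[0] = '\0';
    for (int i = 0; i < n; i++)
    {
        for (const char *f = filter; *f != '\0'; f++)
        {
            int w;

            if (strchr(v[i].weights, *f) == NULL)
                continue;
            w = snprintf(out + used, outsz - used, "%s%s:%c",
                         emitted > 0 ? sep : "", v[i].lexeme, *f);
            if (w < 0 || (size_t) w >= outsz - used)
                return -1;
            used += (size_t) w;
            emitted++;
        }
    }
    return emitted;
}
```

For the pizza/risotto/barolo example with filter ACD the first pass
counts five operands, so a packed query would need five value items
plus four operator items; the second pass then fills storage that was
sized from that count.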

Any good recipe in Moscow? ;)

thanks

--
Ivan Sergio Borgonovo
http://www.webthatworks.it