Quick Links

XML binary I/O (was Re: tsearch refactorings)

Lists:	pgsql-patches

From:	"Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>
To:	"Patches" <pgsql-patches(at)postgresql(dot)org>
Subject:	tsearch refactorings
Date:	2007-09-03 11:46:10
Message-ID:	46DBF402.4030301@enterprisedb.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-patches

I've been slowly reading through the tsearch code and refactoring things
as I go. Here's a patch of what I've done this far. It has grown larger
than I intended, but it has helped me a lot to understand the code.

I hope I didn't break anything; at least it passes regression tests. If
there's some changes that others don't agree with or feel uncomfortable
doing for 8.3, let me know and I can adjust the patch.

The usage of the QueryItem struct was very confusing. It was used for
both operators and operands. For operators, "val" was a single character
casted to a int4, marking the operator type. For operands, val was the
CRC-32 of the value. Other fields were used only either for operands or
for operators. The biggest change in the patch is that I broke the
QueryItem struct into QueryOperator and QueryOperand. Type was really
the only common field between them. QueryItem still exists, and is used
in the TSQuery struct as before, but it's now a union of the two. Many
other changes fell from that, like separation of pushval_asis function
into pushValue, pushOperator and pushStop.

Other changes include:
- Moved some structs that were for internal use only from header files
to the right .c-files.

- Moved tsvector parser to a new tsvector_parser.c file. Parser code was
about half of the size of tsvector.c, it's also used from tsquery.c, and
it has some data structures of its own, so it seems better to separate
it. Cleaned up the API so that TSVectorParserState is not accessed from
outside tsvector_parser.c.

- Separated enumerations (#defines, really) used for QueryItem.type
field and as return codes from gettoken_query. It was just accidental
code sharing.

- Removed ParseQueryNode struct used internally by makepol and friends.
push*-functions now construct QueryItems directly.

- Changed int4 variables to just ints for variables like "i" or "array
size", where the storage-size was not significant.

- Probably a lot of other stuff I can't remember right now

I have a few more things in mind I'm planning to look into:

- I'm not convinced that there's enough sanity checking in the input
functions. I think you can send queries into the server using the binary
protocol that you couldn't produce otherwise. For example, queries with
multiple VAL nodes, with no operator to connect them. Does that wreak
havoc in any of the functions? And the left-field of QueryOperator is an
int16, what happens if you have query that exceeds
that? And parse_query always produces trees that are in prefix notation,
so the left-field is really redundant, but using tsqueryrecv, you could
inject queries that are not in prefix notation; is there anything in the
code that depends on that?

- There's many internal intermediate representations of a query:
TSQuery, a QTNode-tree, NODE-tree (in tsquery_cleanup.c), prefix
notation stack of QueryItems (in parser), infix-tree. Could we remove
some of these?

- There's a lot of recursive functions. Are they all using
check_stack_depth?

- More commenting and documenting and little refactorings

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

Attachment	Content-Type	Size
tsearch-refactor-2.patch.gz	application/x-gzip	26.4 KB

From:	"Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>
To:	"Patches" <pgsql-patches(at)postgresql(dot)org>
Cc:	"Teodor Sigaev" <teodor(at)sigaev(dot)ru>
Subject:	Re: tsearch refactorings
Date:	2007-09-05 12:33:32
Message-ID:	46DEA21C.1030603@enterprisedb.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-patches

I wrote:
> I have a few more things in mind I'm planning to look into:
>
> - I'm not convinced that there's enough sanity checking in the input
> functions. I think you can send queries into the server using the binary
> protocol that you couldn't produce otherwise. For example, queries with
> multiple VAL nodes, with no operator to connect them. Does that wreak
> havoc in any of the functions?

I added code to check that the query tree is well-formed. It was indeed
possible to send malformed queries in binary mode, which produced all
kinds of strange results.

> And the left-field of QueryOperator is an
> int16, what happens if you have query that exceeds
> that?

Apparently the field wraps around and you get a backend crash when it
tries to do access an array using a negative index. You need to have a
large stack (and increase max_stack_depth now that we have the check for
that in there) to reproduce it. Not sure if it's an exploitable security
vulnerability, but it might be.

I fixed this by making the left-field a uint32. There's no reason to
arbitrarily limit it to 16-bits, and it won't increase the disk/memory
footprint either now that QueryOperator and QueryOperand are separate
structs.

> And parse_query always produces trees that are in prefix notation,
> so the left-field is really redundant, but using tsqueryrecv, you could
> inject queries that are not in prefix notation; is there anything in the
> code that depends on that?

At least infix-function seems to depend on it. By checking that the tree
is well-formed, and the fact that the left operand is implicitly the
next node in the array, we know that it is in prefix notation.

> - There's many internal intermediate representations of a query:
> TSQuery, a QTNode-tree, NODE-tree (in tsquery_cleanup.c), prefix
> notation stack of QueryItems (in parser), infix-tree. Could we remove
> some of these?

I haven't looked into this yet. Might leave it for 8.4.

> - There's a lot of recursive functions. Are they all using
> check_stack_depth?

I added check_stack_depth() call to all recursive functions I found.
Some of them might have a natural limit so that you can't force
arbitrarily deep recursions, but check_stack_depth() is cheap enough
that seems best to just stick it into anything that might be a problem.

Patch attached. It's on top of the tsearch-refactor-2.patch I posted
earlier.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

Attachment	Content-Type	Size
tsearch-incr-1.patch	text/x-diff	11.9 KB

From:	Teodor Sigaev <teodor(at)sigaev(dot)ru>
To:	Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc:	Patches <pgsql-patches(at)postgresql(dot)org>
Subject:	Re: tsearch refactorings
Date:	2007-09-05 15:55:22
Message-ID:	46DED16A.9000505@sigaev.ru
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-patches

Heikki, I see some strange changes in your patch, not related to tsearch at all:
contrib/pageinspect/pageinspect.sql.in
contrib/pageinspect/rawpage.c

> The usage of the QueryItem struct was very confusing. It was used for
> both operators and operands. For operators, "val" was a single character
> casted to a int4, marking the operator type. For operands, val was the
> CRC-32 of the value. Other fields were used only either for operands or
> for operators. The biggest change in the patch is that I broke the
> QueryItem struct into QueryOperator and QueryOperand. Type was really
...
> - Removed ParseQueryNode struct used internally by makepol and friends.
> push*-functions now construct QueryItems directly.

It's needed to set unused bytes in QueryItem to zero, it's common requiremens
for types in pgsql. After allocating space for tsquery in parse_tsquery you copy
just sizeof(QueryOperator) bytes and leave sizeof(QueryItem) -
sizeof(QueryOperator) bytes untouched. QueryOperand is a biggest component in
QueryItem union. I don't check other places.

> that? And parse_query always produces trees that are in prefix notation,
> so the left-field is really redundant, but using tsqueryrecv, you could
> inject queries that are not in prefix notation; is there anything in the
> code that depends on that?
It's used by TS_execute for optimization reason. With clear postfix notation you
should go through every nodes. For example:
FALSE FALSE & FALSE &
You will go to the end of query to produce correct result.
In fact, TSQuery is a prefix notation with pointer to another operand or, by
another words, just a plain view of tree where right operand of operation is
always placed after operation.
That notation allows to calculate only one of operand if it possible:
& FALSE & FALSE FALSE
1 2 3 4 5 --Nodes
After evaluating of second node you can return FALSE for whole expression and do
not evaluate nodes 3-5. For query
& TRUE & FALSE & FALSE
it's needed to evaluate 1,2,3,4 nodes. In most cases checking QI_VAL node is
much more expensive that QI_OPR

>
> - There's many internal intermediate representations of a query:
> TSQuery, a QTNode-tree, NODE-tree (in tsquery_cleanup.c), prefix
> notation stack of QueryItems (in parser), infix-tree. Could we remove
> some of these?
I havn't strong objections, QTNode and NODE are tree-like structures, but
TSQuery is a postfix notation for storage in plain memory. NODE is used only
cleanup stop-word placeholders, so it's a binary tree while QTNode represents
t-ary tree (with any number of children).

Thank you for your interesting in tsearch - after recheck of problem pointed
above I'll commit your patch.
--
Teodor Sigaev E-mail: teodor(at)sigaev(dot)ru
WWW: http://www.sigaev.ru/

From:	Teodor Sigaev <teodor(at)sigaev(dot)ru>
To:	Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc:	Patches <pgsql-patches(at)postgresql(dot)org>
Subject:	Re: tsearch refactorings
Date:	2007-09-05 15:56:42
Message-ID:	46DED1BA.3020007@sigaev.ru
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-patches

> I fixed this by making the left-field a uint32. There's no reason to
> arbitrarily limit it to 16-bits, and it won't increase the disk/memory
> footprint either now that QueryOperator and QueryOperand are separate
> structs.
...
> I added check_stack_depth() call to all recursive functions I found.
> Some of them might have a natural limit so that you can't force
> arbitrarily deep recursions, but check_stack_depth() is cheap enough
> that seems best to just stick it into anything that might be a problem.
Agreed.

--
Teodor Sigaev E-mail: teodor(at)sigaev(dot)ru
WWW: http://www.sigaev.ru/

From:	"Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>
To:	"Teodor Sigaev" <teodor(at)sigaev(dot)ru>
Cc:	"Patches" <pgsql-patches(at)postgresql(dot)org>
Subject:	Re: tsearch refactorings
Date:	2007-09-05 16:30:47
Message-ID:	46DED9B7.3010308@enterprisedb.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-patches

Teodor Sigaev wrote:
> Heikki, I see some strange changes in your patch, not related to tsearch
> at all:
> contrib/pageinspect/pageinspect.sql.in
> contrib/pageinspect/rawpage.c

Oh, sorry about that. Just ignore them, they are some quick&dirty
debugging functions I had in the same source tree.

>> The usage of the QueryItem struct was very confusing. It was used for
>> both operators and operands. For operators, "val" was a single character
>> casted to a int4, marking the operator type. For operands, val was the
>> CRC-32 of the value. Other fields were used only either for operands or
>> for operators. The biggest change in the patch is that I broke the
>> QueryItem struct into QueryOperator and QueryOperand. Type was really
> ...
>> - Removed ParseQueryNode struct used internally by makepol and friends.
>> push*-functions now construct QueryItems directly.
>
> It's needed to set unused bytes in QueryItem to zero, it's common
> requiremens for types in pgsql. After allocating space for tsquery in
> parse_tsquery you copy just sizeof(QueryOperator) bytes and leave
> sizeof(QueryItem) - sizeof(QueryOperator) bytes untouched. QueryOperand
> is a biggest component in QueryItem union. I don't check other places.

Ok. Probably easiest to do that by changing the palloc to palloc0 in
parse_tsquery.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

From:	Teodor Sigaev <teodor(at)sigaev(dot)ru>
To:	Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc:	Patches <pgsql-patches(at)postgresql(dot)org>
Subject:	Re: tsearch refactorings
Date:	2007-09-05 16:41:06
Message-ID:	46DEDC22.6080904@sigaev.ru
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-patches

> Ok. Probably easiest to do that by changing the palloc to palloc0 in
> parse_tsquery.

and change sizeof to sizeof(QueryItem)

--
Teodor Sigaev E-mail: teodor(at)sigaev(dot)ru
WWW: http://www.sigaev.ru/

From:	"Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>
To:	"Teodor Sigaev" <teodor(at)sigaev(dot)ru>
Cc:	"Patches" <pgsql-patches(at)postgresql(dot)org>
Subject:	Re: tsearch refactorings
Date:	2007-09-05 16:55:04
Message-ID:	46DEDF68.8040802@enterprisedb.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-patches

Teodor Sigaev wrote:
>> Ok. Probably easiest to do that by changing the palloc to palloc0 in
>> parse_tsquery.
>
> and change sizeof to sizeof(QueryItem)

Do you mean the sizeofs in the memcpys in parse_tsquery? You can't
change them. The source structs are allocated in
pushOperand/pushOperator, using sizeof(QueryOperand/QueryOperator), so
if you copy sizeof(QueryItem) bytes from them, you'll copy some random
piece of memory. If you just allocate the TSQuery with palloc0, that'll
make sure that any memory area not copied into in the loop will be zero.

BTW, can you explain what the CRC-32 of a value is used for? It looks
like it's used to speed up some operations, by comparing the CRCs before
comparing the values, but I didn't quite figure out how it works. How
much of a performance difference does it make? Would hash_any do a
better/cheaper job?

In any case, I think we need to calculate the CRC/hash in tsqueryrecv,
instead of trusting the client.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

From:	Teodor Sigaev <teodor(at)sigaev(dot)ru>
To:	Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc:	Patches <pgsql-patches(at)postgresql(dot)org>
Subject:	Re: tsearch refactorings
Date:	2007-09-05 17:13:45
Message-ID:	46DEE3C9.8060805@sigaev.ru
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-patches

Heikki Linnakangas wrote:
> Teodor Sigaev wrote:
>>> Ok. Probably easiest to do that by changing the palloc to palloc0 in
>>> parse_tsquery.
>> and change sizeof to sizeof(QueryItem)
>
> Do you mean the sizeofs in the memcpys in parse_tsquery? You can't

Oops, I meant pallocs in push* function. palloc0 in parse_tsquery is another way.

>
> BTW, can you explain what the CRC-32 of a value is used for? It looks
> like it's used to speed up some operations, by comparing the CRCs before
> comparing the values, but I didn't quite figure out how it works. How
It's mostly used in GiST indexes - recalculating crc32 every time for each index
tuple to be checked is rather expensive.

> much of a performance difference does it make? Would hash_any do a
> better/cheaper job?
crc32 was chosen after testing a lot of hash function. Perl's hash was the
fastest, but crc32 makes much less number of collisions. That's interesting for
ASCII a lot of functions produce rather small number of collision, but for upper
part of table (0x7f-0xff) crc32 was the best. CRC32 has evenly distributed
collisions over characters, others - not.

> In any case, I think we need to calculate the CRC/hash in tsqueryrecv,
> instead of trusting the client.
Agreed.
--
Teodor Sigaev E-mail: teodor(at)sigaev(dot)ru
WWW: http://www.sigaev.ru/

From:	"Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>
To:	"Teodor Sigaev" <teodor(at)sigaev(dot)ru>
Cc:	"Patches" <pgsql-patches(at)postgresql(dot)org>
Subject:	Re: tsearch refactorings
Date:	2007-09-06 16:46:09
Message-ID:	46E02ED1.4060508@enterprisedb.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-patches

Teodor Sigaev wrote:
> Heikki Linnakangas wrote:
>> In any case, I think we need to calculate the CRC/hash in tsqueryrecv,
>> instead of trusting the client.
> Agreed.

I started to write a patch for that, as I realized that we're
transferring the strings in a tsvector/tsquery in server encoding.
That's not good, right? A dump created with COPY ... BINARY wouldn't be
portable across clusters with different encodings, for example.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	"Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>
Cc:	"Teodor Sigaev" <teodor(at)sigaev(dot)ru>, "Patches" <pgsql-patches(at)postgresql(dot)org>
Subject:	Re: tsearch refactorings
Date:	2007-09-06 18:59:12
Message-ID:	511.1189105152@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-patches

"Heikki Linnakangas" <heikki(at)enterprisedb(dot)com> writes:
> I started to write a patch for that, as I realized that we're
> transferring the strings in a tsvector/tsquery in server encoding.
> That's not good, right? A dump created with COPY ... BINARY wouldn't be
> portable across clusters with different encodings, for example.

Any portion of a binary value that is considered textual should be
converted to and from client encoding --- cf textsend/textrecv.
This should be pretty trivial to fix, just call a different support
routine.

BTW, Teodor, are you intending to review/apply Heikki's tsearch fixes,
or do you want someone else to do it?

regards, tom lane

From:	"Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>
To:	"Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Teodor Sigaev" <teodor(at)sigaev(dot)ru>
Cc:	"Patches" <pgsql-patches(at)postgresql(dot)org>
Subject:	Re: tsearch refactorings
Date:	2007-09-07 09:53:16
Message-ID:	46E11F8C.5020903@enterprisedb.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-patches

Tom Lane wrote:
> Any portion of a binary value that is considered textual should be
> converted to and from client encoding --- cf textsend/textrecv.
> This should be pretty trivial to fix, just call a different support
> routine.

You do need to adjust length and position fields in the structs as well.
I fixed (rewrote, almost) the send/recv functions, and added a comment
above them describing the on-wire format. The CRC is now recalculated in
tsquery as well per previous discussion.

Patch attached. This is on top of the previous patches I sent. It
includes some additional changes that I had already started with. Most
notably:

- change the alignment requirement of lexemes in TSVector slightly.
Lexeme strings were always padded to 2-byte aligned length to make sure
that if there's position array (uint16[]) it has the right alignment.
The patch changes that so that the padding is not done when there's no
positions. That makes the storage of tsvectors without positions
slightly more compact.

- added some #include "miscadmin.h" lines I missed in the earlier when I
added calls to check_stack_depth().

BTW, the encoding of the XML datatype looks pretty funky. xml_recv first
reads the xml string with pq_getmsgtext, which applies a client->server
conversion. Then the xml declaration is parsed, extracting the encoding
attribute. Then the string is converted again from that encoding (or
UTF-8 if none was specified) to server encoding. I don't understand how
it's supposed to work, but ISTM there's one conversion too much,

> BTW, Teodor, are you intending to review/apply Heikki's tsearch fixes,
> or do you want someone else to do it?

I am getting confused with the patches and version I have lying around
here... I think I'll have to wait for review of the patches I've posted
this far before I continue hacking.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

Attachment	Content-Type	Size
tsearch-incr-2.patch	text/x-diff	34.2 KB

From:	"Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>
To:	"Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	"Teodor Sigaev" <teodor(at)sigaev(dot)ru>, "Patches" <pgsql-patches(at)postgresql(dot)org>
Subject:	XML binary I/O (was Re: tsearch refactorings)
Date:	2007-09-07 11:02:20
Message-ID:	46E12FBC.2080306@enterprisedb.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-patches

Heikki Linnakangas wrote:
> BTW, the encoding of the XML datatype looks pretty funky. xml_recv first
> reads the xml string with pq_getmsgtext, which applies a client->server
> conversion. Then the xml declaration is parsed, extracting the encoding
> attribute. Then the string is converted again from that encoding (or
> UTF-8 if none was specified) to server encoding. I don't understand how
> it's supposed to work, but ISTM there's one conversion too much,

And it's got an unfortunate typo in it as well: it calls "free(result)"
instead of pfree. I think we need regression tests for the more complex
send/recv functions...

What's the difference between text and binary mode for something like
xml anyway? Could we just call the text format in/out functions and be
done with it?

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

From:	Teodor Sigaev <teodor(at)sigaev(dot)ru>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Heikki Linnakangas <heikki(at)enterprisedb(dot)com>, Patches <pgsql-patches(at)postgresql(dot)org>
Subject:	Re: tsearch refactorings
Date:	2007-09-07 14:41:23
Message-ID:	46E16313.2030803@sigaev.ru
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-patches

> BTW, Teodor, are you intending to review/apply Heikki's tsearch fixes,
> or do you want someone else to do it?
I'll do it.

--
Teodor Sigaev E-mail: teodor(at)sigaev(dot)ru
WWW: http://www.sigaev.ru/

From:	Teodor Sigaev <teodor(at)sigaev(dot)ru>
To:	Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Patches <pgsql-patches(at)postgresql(dot)org>
Subject:	Re: tsearch refactorings
Date:	2007-09-07 16:04:59
Message-ID:	46E176AB.3000908@sigaev.ru
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-patches

> I am getting confused with the patches and version I have lying around
> here... I think I'll have to wait for review of the patches I've posted
> this far before I continue hacking.
Sorry for delay - I was busy by another. All your patches are committed with
very small changes.
--
Teodor Sigaev E-mail: teodor(at)sigaev(dot)ru
WWW: http://www.sigaev.ru/

From:	"Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>
To:	"Teodor Sigaev" <teodor(at)sigaev(dot)ru>
Cc:	"Patches" <pgsql-patches(at)postgresql(dot)org>
Subject:	Re: tsearch refactorings
Date:	2007-09-07 21:28:52
Message-ID:	46E1C294.1010007@enterprisedb.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-patches

Teodor Sigaev wrote:
>> I am getting confused with the patches and version I have lying around
>> here... I think I'll have to wait for review of the patches I've posted
>> this far before I continue hacking.
> Sorry for delay - I was busy by another. All your patches are committed
> with very small changes.

Thanks!

Can you please apply this fix for the bug Pavel found as well:

http://archives.postgresql.org/pgsql-hackers/2007-09/msg00127.php

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

From:	"Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>
To:	"Patches" <pgsql-patches(at)postgresql(dot)org>
Cc:	"Peter Eisentraut" <peter_e(at)gmx(dot)net>
Subject:	Re: XML binary I/O (was Re: tsearch refactorings)
Date:	2007-09-14 11:55:14
Message-ID:	46EA76A2.8090303@enterprisedb.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-patches

Heikki Linnakangas wrote:
> Heikki Linnakangas wrote:
>> BTW, the encoding of the XML datatype looks pretty funky. xml_recv first
>> reads the xml string with pq_getmsgtext, which applies a client->server
>> conversion. Then the xml declaration is parsed, extracting the encoding
>> attribute. Then the string is converted again from that encoding (or
>> UTF-8 if none was specified) to server encoding. I don't understand how
>> it's supposed to work, but ISTM there's one conversion too much,
>
> And it's got an unfortunate typo in it as well: it calls "free(result)"
> instead of pfree. I think we need regression tests for the more complex
> send/recv functions...

According to the docs, xml_send is supposed to output the XML document
in client encoding, with that encoding specified in the xml declaration,
eg. ?<xml encoding="LATIN1"?>. xml_recv is supposed to ignore client
encoding, and parse the document according to the encoding specified in
the declaration.

Here's a patch that fixes the send/recv functions to work like the
manual says. It fixes the free/pfree typo as well.

Included is a new regression test for the send/recv functions as well
(unpatched CVS HEAD fails it BTW). It can be merged with the existing
xml test, but I didn't want to do it in the patch because it would've
involved moving the current xml test from sql/ to input/, and made the
patch longer to review.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

Attachment	Content-Type	Size
fix-xml-binaryio-1.patch	text/x-diff	6.6 KB

From:	Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To:	Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc:	Patches <pgsql-patches(at)postgresql(dot)org>, Peter Eisentraut <peter_e(at)gmx(dot)net>
Subject:	Re: XML binary I/O (was Re: tsearch refactorings)
Date:	2007-09-14 13:25:30
Message-ID:	20070914132530.GA4905@alvh.no-ip.org
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-patches

Heikki Linnakangas wrote:

> + -- Basic test
> + COPY xmltest TO '/home/hlinnaka/pg_sandbox/pgsql.cvshead/src/test/regress/results/xmltest.data' WITH BINARY;
> + TRUNCATE xmltest;
> + COPY xmltest FROM '/home/hlinnaka/pg_sandbox/pgsql.cvshead/src/test/regress/results/xmltest.data' WITH BINARY;
> + SELECT * FROM xmltest;
> + id | data
> + ----+--------------------
> + 1 | <value>one</value>
> + 2 | <value>two</value>
> + (2 rows)

This isn't going to work ...

--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

From:	"Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>
To:	"Alvaro Herrera" <alvherre(at)commandprompt(dot)com>
Cc:	"Patches" <pgsql-patches(at)postgresql(dot)org>, "Peter Eisentraut" <peter_e(at)gmx(dot)net>
Subject:	Re: XML binary I/O (was Re: tsearch refactorings)
Date:	2007-09-14 13:37:13
Message-ID:	46EA8E89.8030106@enterprisedb.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-patches

Alvaro Herrera wrote:
> Heikki Linnakangas wrote:
>
>> + -- Basic test
>> + COPY xmltest TO '/home/hlinnaka/pg_sandbox/pgsql.cvshead/src/test/regress/results/xmltest.data' WITH BINARY;
>> + TRUNCATE xmltest;
>> + COPY xmltest FROM '/home/hlinnaka/pg_sandbox/pgsql.cvshead/src/test/regress/results/xmltest.data' WITH BINARY;
>> + SELECT * FROM xmltest;
>> + id | data
>> + ----+--------------------
>> + 1 | <value>one</value>
>> + 2 | <value>two</value>
>> + (2 rows)
>
> This isn't going to work ...

Care to elaborate?

We could test a lot more, but that at least calls the send/recv functions.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

From:	"Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>
To:	"Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>
Cc:	"Alvaro Herrera" <alvherre(at)commandprompt(dot)com>, "Patches" <pgsql-patches(at)postgresql(dot)org>, "Peter Eisentraut" <peter_e(at)gmx(dot)net>
Subject:	Re: XML binary I/O (was Re: tsearch refactorings)
Date:	2007-09-14 13:38:21
Message-ID:	46EA8ECD.3020903@enterprisedb.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-patches

Heikki Linnakangas wrote:
> Alvaro Herrera wrote:
>> Heikki Linnakangas wrote:
>>
>>> + -- Basic test
>>> + COPY xmltest TO '/home/hlinnaka/pg_sandbox/pgsql.cvshead/src/test/regress/results/xmltest.data' WITH BINARY;
>>> + TRUNCATE xmltest;
>>> + COPY xmltest FROM '/home/hlinnaka/pg_sandbox/pgsql.cvshead/src/test/regress/results/xmltest.data' WITH BINARY;
>>> + SELECT * FROM xmltest;
>>> + id | data
>>> + ----+--------------------
>>> + 1 | <value>one</value>
>>> + 2 | <value>two</value>
>>> + (2 rows)
>> This isn't going to work ...
>
> Care to elaborate?
>
> We could test a lot more, but that at least calls the send/recv functions.

Oh, I see what you mean. Yeah, the absolute paths need to be fixed :).

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

From:	Bruce Momjian <bruce(at)momjian(dot)us>
To:	Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc:	Teodor Sigaev <teodor(at)sigaev(dot)ru>, Patches <pgsql-patches(at)postgresql(dot)org>
Subject:	Re: tsearch refactorings
Date:	2007-09-14 16:45:21
Message-ID:	200709141645.l8EGjL404422@momjian.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-patches

Done by Teodor.

---------------------------------------------------------------------------

Heikki Linnakangas wrote:
> Teodor Sigaev wrote:
> >> I am getting confused with the patches and version I have lying around
> >> here... I think I'll have to wait for review of the patches I've posted
> >> this far before I continue hacking.
> > Sorry for delay - I was busy by another. All your patches are committed
> > with very small changes.
>
> Thanks!
>
> Can you please apply this fix for the bug Pavel found as well:
>
> http://archives.postgresql.org/pgsql-hackers/2007-09/msg00127.php
>
> --
> Heikki Linnakangas
> EnterpriseDB http://www.enterprisedb.com
>
> ---------------------------(end of broadcast)---------------------------
> TIP 7: You can help support the PostgreSQL project by donating at
>
> http://www.postgresql.org/about/donate

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	"Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>
Cc:	"Alvaro Herrera" <alvherre(at)commandprompt(dot)com>, "Patches" <pgsql-patches(at)postgresql(dot)org>, "Peter Eisentraut" <peter_e(at)gmx(dot)net>
Subject:	Re: XML binary I/O (was Re: tsearch refactorings)
Date:	2007-09-14 17:04:34
Message-ID:	11704.1189789474@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-patches

"Heikki Linnakangas" <heikki(at)enterprisedb(dot)com> writes:
> Oh, I see what you mean. Yeah, the absolute paths need to be fixed :).

Easier to just remove the regression test altogether. Not every patch
requires permanent memorialization in a regression test.

regards, tom lane

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	"Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>
Cc:	"Patches" <pgsql-patches(at)postgresql(dot)org>, "Peter Eisentraut" <peter_e(at)gmx(dot)net>
Subject:	Re: XML binary I/O (was Re: tsearch refactorings)
Date:	2007-09-23 21:38:40
Message-ID:	9272.1190583520@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-patches

"Heikki Linnakangas" <heikki(at)enterprisedb(dot)com> writes:
> Here's a patch that fixes the send/recv functions to work like the
> manual says. It fixes the free/pfree typo as well.

Applied with a further fix: the patch caused xml_recv to call
parse_xml_decl with a non-null-terminated string, which could in the
worst case lead to a crash.

> Included is a new regression test for the send/recv functions as well

I didn't apply this, as it seemed more trouble than it was worth.

regards, tom lane