XPATH evaluation

Lists: pgsql-hackers
From: Radosław Smogura <rsmogura(at)softperience(dot)eu>
To: PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: XPATH evaluation
Date: 2011-06-17 09:09:39
Message-ID: f64d8629aea32338b688bc8cd58ae566@mail.softperience.eu

Hello,

While reviewing
https://commitfest.postgresql.org/action/patch_view?id=580 I found
the following problems with XPath.

1.
SELECT (XPATH('/root/*', '<root xmlns:o="http://olacle.com/db"
xmlns:p="http://postgresql.org/db"><o:db><a><b></b></a></o:db><p:db></p:db></root>'));
Produces:
"{"<o:db>
<a>
<b/>
</a>
</o:db>",<p:db/>}"
In the above, <b></b> was reduced to <b/>. This is a different infoset than
the input, and the two notations are interpreted differently, e.g. by XML
binding and web services: the first may be mapped to an empty string,
and the second to null.

The result was also formatted, which again produces a different infoset.

Both of the above may cause problems with XML digests.

2.
SELECT (XPATH('/root/*', '<root xmlns:o="http://olacle.com/db"
xmlns:p="http://postgresql.org/db"><o:db></o:db><p:db></p:db></root>'));
"{<o:db/>,<p:db/>}"
In the above, the namespace declarations are missing.

I can take on the first one (the fix is simple), but I have mixed feelings
about the second. I think the second case should transfer the namespaces to
the client in some way.
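
A minimal sketch of why the missing declarations matter on the client side.
It uses Python's standard library purely for illustration (nothing here is
PostgreSQL code), assumes the client hands each returned array element to a
namespace-aware parser, and `parses_standalone` is a hypothetical helper name:

```python
import xml.etree.ElementTree as ET

def parses_standalone(fragment: str) -> bool:
    """Return True if a namespace-aware parser accepts `fragment` on its own."""
    try:
        ET.fromstring(fragment)
        return True
    except ET.ParseError:
        return False

# The element as returned in case 2: prefix "o" is used but never declared.
returned = '<o:db/>'
# The same element carrying the declaration that was in scope in the input.
with_decl = '<o:db xmlns:o="http://olacle.com/db"/>'

print(parses_standalone(returned))   # the prefix is unbound here
print(parses_standalone(with_decl))
```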

What do you think?

Regards,
Radosław Smogura


From: Florian Pflug <fgp(at)phlo(dot)org>
To: Radosław Smogura <rsmogura(at)softperience(dot)eu>
Cc: PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: XPATH evaluation
Date: 2011-06-17 09:41:08
Message-ID: AFE4ACB6-893D-4E70-B5C9-847236B3AFCF@phlo.org

On Jun17, 2011, at 11:09 , Radosław Smogura wrote:
> 1.
> SELECT (XPATH('/root/*', '<root xmlns:o="http://olacle.com/db" xmlns:p="http://postgresql.org/db"><o:db><a><b></b></a></o:db><p:db></p:db></root>'));
> Produces:
> "{"<o:db>
> <a>
> <b/>
> </a>
> </o:db>",<p:db/>}"
> In above <b></b> was reduced to <b/> this is different infoset then input, and those notations are differently interpreted e.g. by XML Binding & WebServices. The 1st one will may be mapped to empty string, and 2nd one to to null.

Oh, joy :-(

Does this happen only with my patch applied or also with unpatched HEAD?

> 2.
> SELECT (XPATH('/root/*', '<root xmlns:o="http://olacle.com/db" xmlns:p="http://postgresql.org/db"><o:db></o:db><p:db></p:db></root>'));
> "{<o:db/>,<p:db/>}"
> In above I missing namespaces.

Hm, that's a hard problem I think. Your problem (1) basically tells us that
ideally we'd return the matching parts of an XML document unmodified. Now,
(2) tells us that isn't the most sensible thing to do either.

> I may take on assignment 1st (fix is simple)

What's your proposed fix for (1)?

> , but for 2nd I have mixed fillings. I think 2nd should transfer namespaces
> in some way to client.

I don't see how XPATH() can do that without breaking its API. The only
thing we could do AFAICS is to define a second XPATH evaluation function
which returns a list of namespace declarations (prefix and uri) for every
node.
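
To make the idea concrete, here is a rough sketch of what "a list of
namespace declarations for every node" could look like. It is written in
Python with the standard library only; the function name
`namespaces_per_element` and the result shape are assumptions for
illustration, not a proposed PostgreSQL API:

```python
import io
import xml.etree.ElementTree as ET

def namespaces_per_element(xml_text: str):
    """Return [(tag, {prefix: uri}), ...] in document order.

    Each dict holds the namespace declarations in scope for that element.
    """
    scopes = [{}]   # stack of in-scope declarations, one entry per open element
    pending = {}    # declarations reported since the last start tag
    result = []
    source = io.BytesIO(xml_text.encode("utf-8"))
    events = ("start-ns", "start", "end")
    for event, payload in ET.iterparse(source, events=events):
        if event == "start-ns":
            prefix, uri = payload   # reported before the element's "start"
            pending[prefix] = uri
        elif event == "start":
            scope = {**scopes[-1], **pending}
            pending = {}
            scopes.append(scope)
            result.append((payload.tag, scope))
        else:  # "end": the element's declarations go out of scope
            scopes.pop()
    return result

doc = ('<root xmlns:o="http://olacle.com/db" xmlns:p="http://postgresql.org/db">'
       '<o:db></o:db><p:db></p:db></root>')
for tag, scope in namespaces_per_element(doc):
    print(tag, scope)
```

A client could use such a list to re-declare the prefixes on each returned
fragment before parsing it.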

best regards,
Florian Pflug


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Florian Pflug <fgp(at)phlo(dot)org>
Cc: Radosław Smogura <rsmogura(at)softperience(dot)eu>, PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: XPATH evaluation
Date: 2011-06-17 13:47:04
Message-ID: 4DFB5AD8.3030705@dunslane.net

On 06/17/2011 05:41 AM, Florian Pflug wrote:
> On Jun17, 2011, at 11:09 , Radosław Smogura wrote:
>> 1.
>> SELECT (XPATH('/root/*', '<root xmlns:o="http://olacle.com/db" xmlns:p="http://postgresql.org/db"><o:db><a><b></b></a></o:db><p:db></p:db></root>'));
>> Produces:
>> "{"<o:db>
>> <a>
>> <b/>
>> </a>
>> </o:db>",<p:db/>}"
>> In above <b></b> was reduced to <b/> this is different infoset then input, and those notations are differently interpreted e.g. by XML Binding & WebServices. The 1st one will may be mapped to empty string, and 2nd one to to null.
> Oh, joy :-(

I thought these were basically supposed to be the same.

The XML Information Set for example specifically excludes:

    The difference between the two forms of an empty element: <foo/>
    and <foo></foo>.

See <http://www.w3.org/TR/2004/REC-xml-infoset-20040204/> Appendix D.
Note that this implies that <foo></foo> does not have content of an
empty string, but that it has no content.
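
As a quick illustration (a sketch using Python's standard-library parser,
which is an assumption of mine and not what PostgreSQL uses), both forms
parse to the same tree, and the parser reports no text at all rather than an
empty string. `shape` is a hypothetical helper:

```python
import xml.etree.ElementTree as ET

def shape(xml_text: str):
    """Reduce a one-element document to (tag, text, number of children)."""
    elem = ET.fromstring(xml_text)
    return (elem.tag, elem.text, len(elem))

short_form = shape("<foo/>")
long_form = shape("<foo></foo>")

print(short_form == long_form)
print(long_form)
```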

cheers

andrew


From: Radosław Smogura <rsmogura(at)softperience(dot)eu>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: Florian Pflug <fgp(at)phlo(dot)org>, PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: XPATH evaluation
Date: 2011-06-17 14:55:06
Message-ID: 201106171655.07407.rsmogura@softperience.eu

Andrew Dunstan <andrew(at)dunslane(dot)net> Friday 17 of June 2011 15:47:04
> On 06/17/2011 05:41 AM, Florian Pflug wrote:
> > On Jun17, 2011, at 11:09 , Radosław Smogura wrote:
> >> 1.
> >> SELECT (XPATH('/root/*', '<root xmlns:o="http://olacle.com/db"
> >> xmlns:p="http://postgresql.org/db"><o:db><a><b></b></a></o:db><p:db></p:db></root>'));
> >> Produces:
> >> "{"<o:db>
> >> <a>
> >> <b/>
> >> </a>
> >> </o:db>",<p:db/>}"
> >> In above <b></b> was reduced to <b/> this is different infoset then
> >> input, and those notations are differently interpreted e.g. by XML
> >> Binding & WebServices. The 1st one will may be mapped to empty string,
> >> and 2nd one to to null.
> >
> > Oh, joy :-(
>
> I thought these were basically supposed to be the same.
>
> The XML Information Set for example specifically excludes:
>
> The difference between the two forms of an empty element: <foo/>
> and <foo></foo>.
>
>
> See <http://www.w3.org/TR/2004/REC-xml-infoset-20040204/> Appendix D.
> Note that this implies that <foo></foo> does not have content of an
> empty string, but that it has no content.
>
>
> cheers
>
> andrew

Indeed, the Infoset spec and the XML Canonicalization spec treat <foo></foo>
the same as <foo/>; my mistake. But XML canonicalization preserves whitespace,
if I remember correctly; I think there is an example in the spec.
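
A small sketch of both points, assuming Python 3.8 or later, whose standard
library ships `xml.etree.ElementTree.canonicalize` (a C14N 2.0
implementation; using it here is my choice for illustration):

```python
import xml.etree.ElementTree as ET

# Both spellings of an empty element canonicalize to the same text.
print(ET.canonicalize("<foo/>"))
print(ET.canonicalize("<foo></foo>"))

# Whitespace between elements is character data and is kept by default.
indented = "<a>\n  <b/>\n</a>"
compact = "<a><b/></a>"
print(repr(ET.canonicalize(indented)))
print(repr(ET.canonicalize(compact)))
```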

In any case, if I store an image in XML (I've seen this done), preserving
whitespace and newlines is important.

Regards,
Radek.


From: Radosław Smogura <rsmogura(at)softperience(dot)eu>
To: Florian Pflug <fgp(at)phlo(dot)org>
Cc: PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: XPATH evaluation
Date: 2011-06-17 15:03:23
Message-ID: 201106171703.24290.rsmogura@softperience.eu

Florian Pflug <fgp(at)phlo(dot)org> Friday 17 of June 2011 11:41:08
> On Jun17, 2011, at 11:09 , Radosław Smogura wrote:
> > 1.
> > SELECT (XPATH('/root/*', '<root xmlns:o="http://olacle.com/db"
> > xmlns:p="http://postgresql.org/db"><o:db><a><b></b></a></o:db><p:db></p:db></root>'));
> > Produces:
> > "{"<o:db>
> > <a>
> > <b/>
> > </a>
> > </o:db>",<p:db/>}"
> > In above <b></b> was reduced to <b/> this is different infoset then
> > input, and those notations are differently interpreted e.g. by XML
> > Binding & WebServices. The 1st one will may be mapped to empty string,
> > and 2nd one to to null.
>
> Oh, joy :-(
>
> Does this happen only with my patch applied or also with unpatched HEAD?
>
> > 2.
> > SELECT (XPATH('/root/*', '<root xmlns:o="http://olacle.com/db"
> > xmlns:p="http://postgresql.org/db"><o:db></o:db><p:db></p:db></root>'));
> > "{<o:db/>,<p:db/>}"
> > In above I missing namespaces.
>
> Hm, that's a hard problem a think. Your problem (1) basically tells us that
> ideally we'd return the matching parts of an XML document unmodified. Now,
> (2) tells us that isn't to most sensible thing to do either.
>
> > I may take on assignment 1st (fix is simple)
>
> Whats your proposed fix for (1)?
>
> > , but for 2nd I have mixed fillings. I think 2nd should transfer
> > namespaces in some way to client.
>
> I don't see how XPATH() can do that without breaking it's API. The only
> thing we could do AFAICS is the define a second XPATH evaluation function
> which returns a list of namespace declarations (prefix and uri) for every
> node.
>
> best regards,
> Florian Pflug

No, this is not about your patch, but it was inspired by it.
Regards,
Radek


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Radosław Smogura <rsmogura(at)softperience(dot)eu>
Cc: Florian Pflug <fgp(at)phlo(dot)org>, PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: XPATH evaluation
Date: 2011-06-17 15:09:25
Message-ID: 4DFB6E25.1000002@dunslane.net

On 06/17/2011 10:55 AM, Radosław Smogura wrote:
> Andrew Dunstan <andrew(at)dunslane(dot)net> Friday 17 of June 2011 15:47:04
>> On 06/17/2011 05:41 AM, Florian Pflug wrote:
>>> On Jun17, 2011, at 11:09 , Radosław Smogura wrote:
>>>> 1.
>>>> SELECT (XPATH('/root/*', '<root xmlns:o="http://olacle.com/db"
>>>> xmlns:p="http://postgresql.org/db"><o:db><a><b></b></a></o:db><p:db></p:db></root>'));
>>>> Produces:
>>>> "{"<o:db>
>>>> <a>
>>>> <b/>
>>>> </a>
>>>> </o:db>",<p:db/>}"
>>>> In above <b></b> was reduced to <b/> this is different infoset then
>>>> input, and those notations are differently interpreted e.g. by XML
>>>> Binding & WebServices. The 1st one will may be mapped to empty string,
>>>> and 2nd one to to null.
>>> Oh, joy :-(
>> I thought these were basically supposed to be the same.
>>
>> The XML Information Set for example specifically excludes:
>>
>> The difference between the two forms of an empty element: <foo/>
>> and <foo></foo>.
>>
>> See <http://www.w3.org/TR/2004/REC-xml-infoset-20040204/> Appendix D.
>> Note that this implies that <foo></foo> does not have content of an
>> empty string, but that it has no content.
>>
>> cheers
>>
>> andrew
> Indeed, Infoset Spec, and XML Canonization Spec treats <foo></foo> same, as
> <foo/> - my wrong, but XML canonization preservs whitespaces, if I remember
> well, I think there is example.
>
> In any case if I will store image in XML (I've seen this), preservation of
> white spaces and new lines is important.

If you store images you should encode them anyway, in base64 or hex.

More generally, data that needs that sort of preservation should
possibly be in CDATA nodes.

cheers

andrew


From: Florian Pflug <fgp(at)phlo(dot)org>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: Radosław Smogura <rsmogura(at)softperience(dot)eu>, PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: XPATH evaluation
Date: 2011-06-17 15:17:13
Message-ID: 213F6C12-BBA4-4605-8D49-EEF802CF1E3A@phlo.org

On Jun17, 2011, at 17:09 , Andrew Dunstan wrote:
> If you store images you should encode them anyway, in base64 or hex.
> More generally, data that needs that sort of preservation should possibly be in CDATA nodes.

All very true.

Still, ideally we'd return the XML exactly as stored, though, even
for the results of XPATH queries. But I've no idea if this is easily
done with libxml or not.

best regards,
Florian Pflug


From: Nicolas Barbier <nicolas(dot)barbier(at)gmail(dot)com>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: Radosław Smogura <rsmogura(at)softperience(dot)eu>, PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: XPATH evaluation
Date: 2011-06-17 15:29:57
Message-ID: BANLkTikHw2=yv=tg-ya5STjpxR0jF3V4PQ@mail.gmail.com

2011/6/17, Andrew Dunstan <andrew(at)dunslane(dot)net>:

> On 06/17/2011 10:55 AM, Radosław Smogura wrote:
>
>> XML canonization preservs whitespaces, if I remember
>> well, I think there is example.
>>
>> In any case if I will store image in XML (I've seen this), preservation of
>> white spaces and new lines is important.
>
> If you store images you should encode them anyway, in base64 or hex.

Whitespace that is not at certain obviously irrelevant places (such as
right after "<", between attributes, outside of the whole document,
etc), and that is not defined to be irrelevant by some schema (if the
parser is schema-aware), is relevant. You cannot just muck around with
it and consider that correct.

> More generally, data that needs that sort of preservation should
> possibly be in CDATA nodes.

CDATA sections are just syntactic sugar (a form of escaping):

<URL:http://www.w3.org/TR/xml-infoset/#omitted>

"Appendix D: What is not in the Information Set
[..]
19. The boundaries of CDATA marked sections."

Therefore, there is no such thing as a "CDATA node" that would be
different from "just text" (Infoset-wise).
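
A short sketch of that equivalence, using Python's standard-library parser
as an illustrative stand-in (not libxml, and not anything PostgreSQL does):
a CDATA section and the entity-escaped spelling yield the same text, and the
section boundaries are gone once the document is parsed.

```python
import xml.etree.ElementTree as ET

with_cdata = ET.fromstring("<msg><![CDATA[a < b & c]]></msg>")
escaped = ET.fromstring("<msg>a &lt; b &amp; c</msg>")

# Same character data either way.
print(with_cdata.text == escaped.text)

# Serializing again uses ordinary escaping; the CDATA markers are not kept.
print(ET.tostring(with_cdata, encoding="unicode"))
```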

Note that that does not mean that binary data is never supposed to be
altered or that all binary data is to be accepted: e.g., whether
newlines are represented using "\n", "\r", or "\r\n" is irrelevant;
also, binary data that is not valid according to the used encoding
must of course not be accepted.

Nicolas

--
A. Because it breaks the logical sequence of discussion.
Q. Why is top posting bad?


From: Radosław Smogura <rsmogura(at)softperience(dot)eu>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: Florian Pflug <fgp(at)phlo(dot)org>, PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: XPATH evaluation
Date: 2011-06-17 15:43:04
Message-ID: 201106171743.04781.rsmogura@softperience.eu

Andrew Dunstan <andrew(at)dunslane(dot)net> Friday 17 of June 2011 17:09:25
> On 06/17/2011 10:55 AM, Radosław Smogura wrote:
> > Andrew Dunstan <andrew(at)dunslane(dot)net> Friday 17 of June 2011 15:47:04
> >
> >> On 06/17/2011 05:41 AM, Florian Pflug wrote:
> >>> On Jun17, 2011, at 11:09 , Radosław Smogura wrote:
> >>>> 1.
> >>>> SELECT (XPATH('/root/*', '<root xmlns:o="http://olacle.com/db"
> >>>> xmlns:p="http://postgresql.org/db"><o:db><a><b></b></a></o:db><p:db></p:db></root>'));
> >>>> Produces:
> >>>> "{"<o:db>
> >>>> <a>
> >>>> <b/>
> >>>> </a>
> >>>> </o:db>",<p:db/>}"
> >>>> In above <b></b> was reduced to <b/> this is different infoset then
> >>>> input, and those notations are differently interpreted e.g. by XML
> >>>> Binding & WebServices. The 1st one will may be mapped to empty
> >>>> string, and 2nd one to to null.
> >>>
> >>> Oh, joy :-(
> >>
> >> I thought these were basically supposed to be the same.
> >>
> >> The XML Information Set for example specifically excludes:
> >> The difference between the two forms of an empty element: <foo/>
> >> and <foo></foo>.
> >>
> >> See <http://www.w3.org/TR/2004/REC-xml-infoset-20040204/> Appendix D.
> >> Note that this implies that <foo></foo> does not have content of an
> >> empty string, but that it has no content.
> >>
> >> cheers
> >>
> >> andrew
> >
> > Indeed, Infoset Spec, and XML Canonization Spec treats <foo></foo> same,
> > as <foo/> - my wrong, but XML canonization preservs whitespaces, if I
> > remember well, I think there is example.
> >
> > In any case if I will store image in XML (I've seen this), preservation
> > of white spaces and new lines is important.
>
> If you store images you should encode them anyway, in base64 or hex.
>
> More generally, data that needs that sort of preservation should
> possibly be in CDATA nodes.
>
> cheers
>
> andrew

I know this answer, and that solution is better. But in one project I
created XSL-FO with the whitespace-preserving attribute; if I wanted to
extract part of such an XSL-FO document, I could destroy the output document.

But those use cases don't change the fact that XPATH output doesn't preserve
whitespace and newlines, and produces a node different from the one in the
original. It is the same as if a regexp on a varchar trimmed its result
without giving you any control.

I emphasize this because it may cause problems with XML digest algorithms,
which are quite popular, and may even cause legal problems when you try to
use Advanced Signatures in the European Union, as well as with other
applications.
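
A rough sketch of the digest concern. The assumptions are mine: SHA-256 as
the hash, and Python's `xml.etree.ElementTree.canonicalize` (C14N 2.0,
Python 3.8+) standing in for whatever canonicalization a real signature
profile prescribes; `digest` is a hypothetical helper:

```python
import hashlib
import xml.etree.ElementTree as ET

def digest(xml_text: str) -> str:
    """SHA-256 over the canonical form of `xml_text`."""
    canonical = ET.canonicalize(xml_text)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

stored = "<db><a><b></b></a></db>"
collapsed = "<db><a><b/></a></db>"                   # empty element rewritten
formatted = "<db>\n  <a>\n    <b/>\n  </a>\n</db>"   # newlines and indentation added

print(digest(stored) == digest(collapsed))   # canonicalization hides this change
print(digest(stored) == digest(formatted))   # added whitespace changes the digest
```

Under these assumptions, it is the added formatting, not the collapsed empty
element, that changes the digest.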

With XML binding it's quite common to interpret <foo/> as null and
<foo></foo> as an empty string. In this particular case the mentioned Infoset
spec doesn't matter.

I think no formatting is a reasonable requirement for the XPATH function.

Regards,
Radek.


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Nicolas Barbier <nicolas(dot)barbier(at)gmail(dot)com>
Cc: Radosław Smogura <rsmogura(at)softperience(dot)eu>, PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: XPATH evaluation
Date: 2011-06-17 15:53:56
Message-ID: 4DFB7894.5040402@dunslane.net

On 06/17/2011 11:29 AM, Nicolas Barbier wrote:
> 2011/6/17, Andrew Dunstan <andrew(at)dunslane(dot)net>:
>
>> On 06/17/2011 10:55 AM, Radosław Smogura wrote:
>>
>>> XML canonization preservs whitespaces, if I remember
>>> well, I think there is example.
>>>
>>> In any case if I will store image in XML (I've seen this), preservation of
>>> white spaces and new lines is important.
>> If you store images you should encode them anyway, in base64 or hex.
> Whitespace that is not at certain obviously irrelevant places (such as
> right after "<", between attributes, outside of the whole document,
> etc), and that is not defined to be irrelevant by some schema (if the
> parser is schema-aware), is relevant. You cannot just muck around with
> it and consider that correct.

Sure, but if you're storing arbitrary binary data such as images,
whitespace is the least of your problems. That's why I've always encoded
them in base64.

>> More generally, data that needs that sort of preservation should
>> possibly be in CDATA nodes.
> CDATA sections are just syntactic sugar (a form of escaping):
>
> <URL:http://www.w3.org/TR/xml-infoset/#omitted>
>
>

Yeah. OTOH doesn't an empty CDATA section force a child element, where a
pure empty element does not?

Anyway, we're getting a bit far from what Postgres needs to be doing.

cheers

andrew


From: Nicolas Barbier <nicolas(dot)barbier(at)gmail(dot)com>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: XPATH evaluation
Date: 2011-06-17 16:37:17
Message-ID: BANLkTikMmW6Ef71jtooVQJWGDXtbVnmM0w@mail.gmail.com

2011/6/17, Andrew Dunstan <andrew(at)dunslane(dot)net>:

> On 06/17/2011 11:29 AM, Nicolas Barbier wrote:
>
>> CDATA sections are just syntactic sugar (a form of escaping):
>
> Yeah. OTOH doesn't an empty CDATA section force a child element, where a
> pure empty element does not?

Wow, some Googling around shows that there is much confusion about
this. I thought that it was obvious that adding <![CDATA[]]> shouldn't
change the content at all, but quite a few people seem to disagree
:-/.

Nicolas



From: Radosław Smogura <rsmogura(at)softperience(dot)eu>
To: Nicolas Barbier <nicolas(dot)barbier(at)gmail(dot)com>
Cc: Andrew Dunstan <andrew(at)dunslane(dot)net>, PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: XPATH evaluation
Date: 2011-06-18 16:51:26
Message-ID: 201106181851.26516.rsmogura@softperience.eu

Nicolas Barbier <nicolas(dot)barbier(at)gmail(dot)com> Friday 17 of June 2011 17:29:57
> 2011/6/17, Andrew Dunstan <andrew(at)dunslane(dot)net>:
> > On 06/17/2011 10:55 AM, Radosław Smogura wrote:
> >> XML canonization preservs whitespaces, if I remember
> >> well, I think there is example.
> >>
> >> In any case if I will store image in XML (I've seen this), preservation
> >> of white spaces and new lines is important.
> >
> > If you store images you should encode them anyway, in base64 or hex.
>
> Whitespace that is not at certain obviously irrelevant places (such as
> right after "<", between attributes, outside of the whole document,
> etc), and that is not defined to be irrelevant by some schema (if the
> parser is schema-aware), is relevant. You cannot just muck around with
> it and consider that correct.
>
> > More generally, data that needs that sort of preservation should
> > possibly be in CDATA nodes.
>
> CDATA sections are just syntactic sugar (a form of escaping):
>
> <URL:http://www.w3.org/TR/xml-infoset/#omitted>
>
> "Appendix D: What is not in the Information Set
> [..]
> 19. The boundaries of CDATA marked sections."
>
> Therefore, there is not such thing as a "CDATA node" that would be
> different from "just text" (Infoset-wise).
>
> Note that that does not mean that binary data is never supposed to be
> altered or that all binary data is to be accepted: e.g., whether
> newlines are represented using "\n", "\r", or "\r\n" is irrelevant;
> also, binary data that is not valid according to the used encoding
> must of course not be accepted.
>
> Nicolas

I would like to send a patch to remove the formatting. I don't know how to
deal with the collapsing of empty nodes.

Regards,
Radek

Attachment: no_format_with_xpath.patch (text/x-patch, 412 bytes)