Review of patch Bugfix for XPATH() if expression returns a scalar value

Lists: pgsql-hackers
From: Radosław Smogura <rsmogura(at)softperience(dot)eu>
To: PG Hackers <pgsql-hackers(at)postgresql(dot)org>, Florian Pflug <fgp(at)phlo(dot)org>
Subject: Review of patch Bugfix for XPATH() if expression returns a scalar value
Date: 2011-06-29 17:57:03
Message-ID: 201106291957.04251.rsmogura@softperience.eu
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

This is review of patch
https://commitfest.postgresql.org/action/patch_view?id=565
"Bugfix for XPATH() if expression returns a scalar value"

Patch applies cleanly, and compiles cleanly too, I didn't checked tests.

Form discussion about patch, and referenced thread in this patch
http://archives.postgresql.org/pgsql-general/2010-07/msg00355.php, if I
understand good such functionality is desired.

This patch, I think, gives good approach for dealing with scalar values,
and, as well converting value to it's string representation is good too
(according to current support for xml), with one exception detailed below.

In this patch submitter, similarly to
https://commitfest.postgresql.org/action/patch_view?id=580, added
functionality for XML-escaping of some kind of values (I think only string
scalars are escaped), which is not-natural and may lead to double escaping in
some situation, example query may be:

SELECT XMLELEMENT(name root, XMLATTRIBUTES(foo.namespace AS sth)) FROM (SELECT
(XPATH('namespace-uri(/*)', x))[1] AS namespace FROM (VALUES (XMLELEMENT(name
"root", XMLATTRIBUTES('<n' AS xmlns, '<v' AS value),'<t'))) v(x)) as foo;

xmlelement
-------------------------
<root sth="&amp;lt;n"/>

It's clearly visible that value from attribute is "&lt;n", not "<". Every
parser will read this as "&lt;n" which is not-natural and will require form
consumer/developer to de-escape this on his side - roughly speaking this will
be reported as serious bug.
I didn't found good reasons why XML-escaping should be included, submitter
wrote about inserting this to xml column and possibility of data damage, but
didn't give an example. Such example is desired.

Conclusion
I vote +1 for this patch if escaping will be removed.

Regards,
Radoslaw Smogura


From: Florian Pflug <fgp(at)phlo(dot)org>
To: Radosław Smogura <rsmogura(at)softperience(dot)eu>
Cc: PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Review of patch Bugfix for XPATH() if expression returns a scalar value
Date: 2011-06-29 20:26:39
Message-ID: 6A2C047F-7EE4-430D-9813-A1B215D81D56@phlo.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Jun29, 2011, at 19:57 , Radosław Smogura wrote:
> This is review of patch
> https://commitfest.postgresql.org/action/patch_view?id=565
> "Bugfix for XPATH() if expression returns a scalar value"
>
> SELECT XMLELEMENT(name root, XMLATTRIBUTES(foo.namespace AS sth)) FROM (SELECT
> (XPATH('namespace-uri(/*)', x))[1] AS namespace FROM (VALUES (XMLELEMENT(name
> "root", XMLATTRIBUTES('<n' AS xmlns, '<v' AS value),'<t'))) v(x)) as foo;
>
> xmlelement
> -------------------------
> <root sth="&amp;lt;n"/>
>
> It's clearly visible that value from attribute is "&lt;n", not "<". Every
> parser will read this as "&lt;n" which is not-natural and will require form
> consumer/developer to de-escape this on his side - roughly speaking this will
> be reported as serious bug.

There's a problem there, no doubt, but your proposed solution just moves the
problem around.

Here's your query, reformatted to be more legible

SELECT
XMLELEMENT(name root,
XMLATTRIBUTES(foo.namespace AS sth)
)
FROM (
SELECT
(XPATH('namespace-uri(/*)', x))[1] AS namespace
FROM (VALUES (XMLELEMENT(name "root",
XMLATTRIBUTES('<n' AS xmlns,
'<v' AS value),'<t')
)
) v(x)) as foo;

What happens is that the XPATH expression selects the xmlns
attribute with the value '<n'. To be well-formed xml, that value
must be escaped, so what is actually returned by the XPATH
call is '&lt;n'. The XMLATTRIBUTES() function, however, doesn't
distinguish between input of type XML and input of type TEXT,
so it figures it has to represent the *textual* value '&lt;n'
in xml, and produces '&amp;lt;n'.

Not escaping textual values returned by XPATH isn't a solution,
though. For example, assume someone inserts the value returned
by the XPATH() call in your query into a column of type XML.
If XPATH() returned '<n' literally instead of escaping it,
the column would contain non-well-formed XML in a column of type
XML. Not pretty, and pretty fatal during your next dump/reload
cycle.

Or, if you want another example, simply remove the XMLATTRIBUTES
call and use the value returned by XPATH as a child node directly.

SELECT
XMLELEMENT(name root,
foo.namespace
)
FROM (
SELECT
(XPATH('namespace-uri(/*)', x))[1] AS namespace
FROM (VALUES (XMLELEMENT(name "root",
XMLATTRIBUTES('<n' AS xmlns,
'<v' AS value),'<t')
)
) v(x)) as foo;

xmlelement
--------------------
<root>&lt;n</root>

Note that this correct! The <root> node contains a text node
representing the textual value '<n'. If XPATH() hadn't return
the value escaped, the query above would have produces

<root><n</root>

which is obviously wrong. It works in this case because XMLELEMENT
is smart enough to distinguish between child nodes gives as TEXT
values (those are escaped) and child nodes provided as XML values
(those are inserted unchanged).

XMLATTRIBUTES, however, doesn't make that distinction, and simply
escaped all attributes values independent of their type. Now, the
reason it does that is probably that *not* all XML values are
well-formed attribute values without additional escaping. For example,
you cannot have literal '<' and '>' in attribute values. So if
XMLATTRIBUTES, like XMLELEMENT never escaped values which are
already of type XML, the following piece of code

XMLELEMENT(name "a", XMLATTRIBUTES('<node/>'::xml as v))

would produce

<a v="<node/>"/>

which, alas, is not well-formed :-(

The correct solution, I believe, is to teach XMLATTRIBUTES to
only escape those characters which are invalid in attribute
values but valid in XML (Probably only '<' and '>' but I'll
check). If there are no objects, I'll prepare a separate patch
for that. I don't want to include it in this patch because it's
really a distinct issue (even modifies a different function).

best regards,
Florian Pflug


From: Radosław Smogura <rsmogura(at)softperience(dot)eu>
To: Florian Pflug <fgp(at)phlo(dot)org>
Cc: PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Review of patch Bugfix for XPATH() if expression returns a scalar value
Date: 2011-06-30 08:33:57
Message-ID: fc947491f738a2c6c2232ec1eecfa51a@mail.softperience.eu
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, 29 Jun 2011 22:26:39 +0200, Florian Pflug wrote:
> On Jun29, 2011, at 19:57 , Radosław Smogura wrote:
>> This is review of patch
>> https://commitfest.postgresql.org/action/patch_view?id=565
>> "Bugfix for XPATH() if expression returns a scalar value"
>>
>> SELECT XMLELEMENT(name root, XMLATTRIBUTES(foo.namespace AS sth))
>> FROM (SELECT
>> (XPATH('namespace-uri(/*)', x))[1] AS namespace FROM (VALUES
>> (XMLELEMENT(name
>> "root", XMLATTRIBUTES('<n' AS xmlns, '<v' AS value),'<t'))) v(x)) as
>> foo;
>>
>> xmlelement
>> -------------------------
>> <root sth="&amp;lt;n"/>
>>
>> It's clearly visible that value from attribute is "&lt;n", not "<".
>> Every
>> parser will read this as "&lt;n" which is not-natural and will
>> require form
>> consumer/developer to de-escape this on his side - roughly speaking
>> this will
>> be reported as serious bug.
>
> There's a problem there, no doubt, but your proposed solution just
> moves the
> problem around.
>
> Here's your query, reformatted to be more legible
>
> SELECT
> XMLELEMENT(name root,
> XMLATTRIBUTES(foo.namespace AS sth)
> )
> FROM (
> SELECT
> (XPATH('namespace-uri(/*)', x))[1] AS namespace
> FROM (VALUES (XMLELEMENT(name "root",
> XMLATTRIBUTES('<n' AS xmlns,
> '<v' AS value),'<t')
> )
> ) v(x)) as foo;
>
> What happens is that the XPATH expression selects the xmlns
> attribute with the value '<n'. To be well-formed xml, that value
> must be escaped, so what is actually returned by the XPATH
> call is '&lt;n'. The XMLATTRIBUTES() function, however, doesn't
> distinguish between input of type XML and input of type TEXT,
> so it figures it has to represent the *textual* value '&lt;n'
> in xml, and produces '&amp;lt;n'.
>
> Not escaping textual values returned by XPATH isn't a solution,
> though. For example, assume someone inserts the value returned
> by the XPATH() call in your query into a column of type XML.
> If XPATH() returned '<n' literally instead of escaping it,
> the column would contain non-well-formed XML in a column of type
> XML. Not pretty, and pretty fatal during your next dump/reload
> cycle.
>
> Or, if you want another example, simply remove the XMLATTRIBUTES
> call and use the value returned by XPATH as a child node directly.
>
> SELECT
> XMLELEMENT(name root,
> foo.namespace
> )
> FROM (
> SELECT
> (XPATH('namespace-uri(/*)', x))[1] AS namespace
> FROM (VALUES (XMLELEMENT(name "root",
> XMLATTRIBUTES('<n' AS xmlns,
> '<v' AS value),'<t')
> )
> ) v(x)) as foo;
>
> xmlelement
> --------------------
> <root>&lt;n</root>
>
> Note that this correct! The <root> node contains a text node
> representing the textual value '<n'. If XPATH() hadn't return
> the value escaped, the query above would have produces
>
> <root><n</root>
>
> which is obviously wrong. It works in this case because XMLELEMENT
> is smart enough to distinguish between child nodes gives as TEXT
> values (those are escaped) and child nodes provided as XML values
> (those are inserted unchanged).
>
> XMLATTRIBUTES, however, doesn't make that distinction, and simply
> escaped all attributes values independent of their type. Now, the
> reason it does that is probably that *not* all XML values are
> well-formed attribute values without additional escaping. For
> example,
> you cannot have literal '<' and '>' in attribute values. So if
> XMLATTRIBUTES, like XMLELEMENT never escaped values which are
> already of type XML, the following piece of code
>
> XMLELEMENT(name "a", XMLATTRIBUTES('<node/>'::xml as v))
>
> would produce
>
> <a v="<node/>"/>
>
> which, alas, is not well-formed :-(
>
> The correct solution, I believe, is to teach XMLATTRIBUTES to
> only escape those characters which are invalid in attribute
> values but valid in XML (Probably only '<' and '>' but I'll
> check). If there are no objects, I'll prepare a separate patch
> for that. I don't want to include it in this patch because it's
> really a distinct issue (even modifies a different function).
>
> best regards,
> Florian Pflug

You may manually enter invalid xml too, You don't need xpath for this.
Much more
In PostgreSQL 9.0.1, compiled by Visual C++ build 1500, 64-bit
I executed
SELECT XMLELEMENT(name "a", XMLATTRIBUTES('<node/>'::xml as v))
and it's looks like I got correct output "<a v="&lt;node/&gt;"/>"
when I looked with text editor into table files I saw same value.

I will check on last master if I can reproduce this.

But indeed, now PostgreSQL is not type safe against XML type. See
SELECT XMLELEMENT(name "root", '<', (xpath('//text()',
'<root>&lt;</root>'))[1])).

You found some problem with escaping, but solution is bad, I think
problem lies with XML type, which may be used to hold strings and proper
xml. For example above query can't distinguish if (xpath('//text()',
'<root>&lt;</root>'))[1] is xml text, or xml element.

For me adding support for scalar is really good and needed, but
escaping is not.

Regards,
Radosław Smogura


From: Florian Pflug <fgp(at)phlo(dot)org>
To: Radosław Smogura <rsmogura(at)softperience(dot)eu>
Cc: PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Review of patch Bugfix for XPATH() if expression returns a scalar value
Date: 2011-06-30 10:04:39
Message-ID: 316D7366-0194-4836-AE6A-D99ACBAE9996@phlo.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Jun30, 2011, at 10:33 , Radosław Smogura wrote:
> You may manually enter invalid xml too, You don't need xpath for this.

Please give an example, instead of merely asserting
I get

postgres=# select '<'::xml;
ERROR: invalid XML content at character 8

> Much more
> In PostgreSQL 9.0.1, compiled by Visual C++ build 1500, 64-bit
> I executed
> SELECT XMLELEMENT(name "a", XMLATTRIBUTES('<node/>'::xml as v))
> and it's looks like I got correct output "<a v="&lt;node/&gt;"/>"
> when I looked with text editor into table files I saw same value.

I fail to see your point. XMLATTRIBUTES currently escapes
the value unconditionally, which happens to work as expected in
this case. But try
SELECT XMLELEMENT(name "a", XMLATTRIBUTES('&lt;'::xml as v))
which returns
<a v="&amp;lt;"/>
which I guess isn't what one would expect, otherwise one wouldn't
have added the cast to xml.

I believe the latter example should probably return
<a v="&lt;"/>
instead, just as XMLELEMENT(name "a", '&lt;'::xml) returns
<a>&lt;</a>

But again, this has little to do with the patch at hand, and should
therefore be discussed separately. The fact that XPATH's failure to
escape it's return value happens to fit XMLATTRIBUTE's failure to
not escape it's input value doesn't prove that both functions are
correct. Especially not if other functions like XMLELEMENT *don't*
escape their input values of they're already of type XML.

> But indeed, now PostgreSQL is not type safe against XML type. See
> SELECT XMLELEMENT(name "root", '<', (xpath('//text()', '<root>&lt;</root>'))[1])).

That is *exactly* what my other patch fixes, the one you deem
undesirable. With that other patch applied, I get

xmlelement
-----------------------
<root>&lt;&lt;</root>

wich is correct. So here too I fail to see your point.

> You found some problem with escaping, but solution is bad,
> I think problem lies with XML type, which may be used to hold strings
> and proper xml.

But it's not *supposed* to hold arbitrary strings. If it was, the
input function wouldn't check whether or not the value was valid or
not.

Do you agree that, for any type, "SELECT value_of_type_X::text::X"
should never throw an error? Because that, IMHO quite basic, guarantee
is broken without proper escaping of XML values. Try

SELECT (XPATH('/text()', '<t>&lt;</t>'::xml))[1]::text::xml

on unpatched 9.0. It fails, which IMNSHO is very clearly a bug.

> For example above query can't distinguish if
> (xpath('//text()', '<root>&lt;</root>'))[1] is xml text, or xml element.

Text *is* a kind of element in XML. There are different types of
nodes, one of which are is "text node". The fact that it's not surrounded
by <..>, </..> doesn't make a different, and doesn't change the quoting
rules one bit.

> For me adding support for scalar is really good and needed, but escaping is not.

So far, I remain unconvinced by your arguments. What you argue for
might be sensible if we *didn't* have an XML type and instead simply
had stuff like XPATH() for type TEXT. But we *do* have a type XML,
which *does* validate it's input, so I fail to see how allowing values
of type XML which *don't* pass that input validation can be anything but
a bug.

Compare this to the textual types. We've gone through some pain in
the past to guarantee that textual types never hold values which aren't
valid text in the database's encoding (e.g, we prevent textual values
from containing bytes sequences which aren't valid UTF-8 if the database
encoding is UTF-8). It follows that we should do the same for XML types.

best regards,
Florian Pflug


From: Florian Pflug <fgp(at)phlo(dot)org>
To: Florian Pflug <fgp(at)phlo(dot)org>
Cc: Radosław Smogura <rsmogura(at)softperience(dot)eu>, PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Review of patch Bugfix for XPATH() if expression returns a scalar value
Date: 2011-07-04 22:16:36
Message-ID: 85158430-820B-45FF-978A-C060B767F923@phlo.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi Radosław,

Do you still have objections, or did I manage to convince you?

Also, aside from the question of whether to escape or not, did
you find any issues with either the patch (like spurious whitespace,
...) or the code, or is the patch OK implementation-wise?

best regards,
Florian Pflug


From: Florian Pflug <fgp(at)phlo(dot)org>
To: Florian Pflug <fgp(at)phlo(dot)org>
Cc: Radosław Smogura <rsmogura(at)softperience(dot)eu>, PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Review of patch Bugfix for XPATH() if expression returns a scalar value
Date: 2011-07-14 22:44:24
Message-ID: B7E3F700-3394-40B1-AC24-FBE3C0994974@phlo.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi Radosław,

Do you plan to comment on this patch (the one that adds support
for XPATH expressions returning scalar values, not the one that
escapes text and attributes nodes which are selected) further,
or should it be marked "Ready for Committer"?

I'm asking because I just noticed that you marked the other patch
as "Ready for Commiter", but not this one.

best regards,
Florian Pflug