Re: [PATCH] Bug in XPATH() if expression returns a scalar value

Lists: pgsql-hackers
From: Florian Pflug <fgp(at)phlo(dot)org>
To: PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: [PATCH] Bug in XPATH() if expression returns a scalar value
Date: 2011-05-30 13:17:02
Message-ID: 33B182D9-D2C3-4BDF-A8C9-B896BC9B8629@phlo.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi

The in-core XPATH() function currently only handles XPath expressions which
return a node set correctly. For XPath expressions which return boolean,
numeric or string values, XPATH() returns an empty array. (This is the case
for XPath expressions whose top-level syntactic construct isn't a path
expression but rather one of the built-in functions like name() or count()).

The XPath expression 'name(/*)', for example, is supposed to return 'root'
when applied to the XML fragment '<root><nested/><nested/></root>'. Postgres,
however, currently returns an empty array.

A patch which fixes that is attached. It makes XPATH() return a single-element
array containing a textual representation of the string, numeric value or boolean
for XPath expressions which don't return a node set. For numeric and boolean
results, the textual representation is obtained by calling float8out() respectively
boolout(). These function's results are assumed to always be well-form XML. For
string results, xml_in() is used to verify that the string returned by the XPath
expression is a well-formed XML fragment. This is required because XPATH() could
otherwise be used to insert invalid XML fragments into columns of type XML.

The patch add a few tests of non-nodeset returning XPath expressions to the
XML regression tests.

I'm not 100% clear on whether I should add this to the next commit fest or not -
after all, it's more a bug-fix rather than a new feature. But to prevent this
from getting lost, I'll add it unless someone tells me not to.

best regards,
Florian Pflug

Attachment Content-Type Size
pg_xpath_returnvalue.v1.patch application/octet-stream 8.3 KB

From: Florian Pflug <fgp(at)phlo(dot)org>
To: PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PATCH] Bug in XPATH() if expression returns a scalar value
Date: 2011-05-31 14:19:29
Message-ID: BD84C376-8F4F-4ADC-9139-E5D7C6940B2A@phlo.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Sorry for the self-reply but I figured it'd be helpful to add information
that I discovered only after my initial post.

On May30, 2011, at 15:17 , Florian Pflug wrote:
> The XPath expression 'name(/*)', for example, is supposed to return 'root'
> when applied to the XML fragment '<root><nested/><nested/></root>'. Postgres,
> however, currently returns an empty array.

In the mean while, I've discovered that this was discussed previously about
a year ago here:
http://archives.postgresql.org/pgsql-general/2010-07/msg00355.php

The basic confusion seems to be whether XPATH() is supposed to work on
everything that http://www.w3.org/TR/xpath/ consideres to be an "Expression",
or only on what that document calls a "Location Path".

The difference is basically that "Location Paths" server purely as
predicates, i.e. *select* a subset of nodes from an XML fragment, while
"Expressions" can produce node sets *or* arbitrary scalar values
(boolean, numeric or string).

According to the thread from last summer, XLST handles this by defining
*two* constructs which evaluate XPath expressions, one for those which
return node sets (<xsl:template match="...">) and one for those which
return scalar values (<xsl:value-of select="...">).

My patch makes XPATH() work for both nodset-returning
*and* scalar-value-returning expressions. This has the advantage
of being simpler, but it does force the scalar values produced
by an XPath expression to be valid XML fragments. For boolean and
numeric values this isn't a problem, but it does limit what you
can do with string-returning XPath expressions.

If people deem this to be a problem, we could instead add a separate
function XPATH_VALUE() that returns VARCHAR, and make people use that
for scalar-value-returning expressions. However, to avoid confusion,
XPATH() should then be taught to raise an error if used for scalar-value
returning expressions, instead of silently returning an empty array as
it does now.

Thoughts, anyone?

best regards,
Florian Pflug


From: "Ross J(dot) Reedstrom" <reedstrm(at)rice(dot)edu>
To: Florian Pflug <fgp(at)phlo(dot)org>
Cc: PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PATCH] Bug in XPATH() if expression returns a scalar value
Date: 2011-05-31 17:15:36
Message-ID: 20110531171536.GB2642@rice.edu
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, May 31, 2011 at 04:19:29PM +0200, Florian Pflug wrote:
> Sorry for the self-reply but I figured it'd be helpful to add information
> that I discovered only after my initial post.
>
> On May30, 2011, at 15:17 , Florian Pflug wrote:
> > The XPath expression 'name(/*)', for example, is supposed to return 'root'
> > when applied to the XML fragment '<root><nested/><nested/></root>'. Postgres,
> > however, currently returns an empty array.
>
> In the mean while, I've discovered that this was discussed previously about
> a year ago here:
> http://archives.postgresql.org/pgsql-general/2010-07/msg00355.php
>
> The basic confusion seems to be whether XPATH() is supposed to work on
> everything that http://www.w3.org/TR/xpath/ consideres to be an "Expression",
> or only on what that document calls a "Location Path".
>
> The difference is basically that "Location Paths" server purely as
> predicates, i.e. *select* a subset of nodes from an XML fragment, while
> "Expressions" can produce node sets *or* arbitrary scalar values
> (boolean, numeric or string).
>
> According to the thread from last summer, XLST handles this by defining
> *two* constructs which evaluate XPath expressions, one for those which
> return node sets (<xsl:template match="...">) and one for those which
> return scalar values (<xsl:value-of select="...">).
>
> My patch makes XPATH() work for both nodset-returning
> *and* scalar-value-returning expressions. This has the advantage
> of being simpler, but it does force the scalar values produced
> by an XPath expression to be valid XML fragments. For boolean and
> numeric values this isn't a problem, but it does limit what you
> can do with string-returning XPath expressions.
>
> If people deem this to be a problem, we could instead add a separate
> function XPATH_VALUE() that returns VARCHAR, and make people use that
> for scalar-value-returning expressions. However, to avoid confusion,
> XPATH() should then be taught to raise an error if used for scalar-value
> returning expressions, instead of silently returning an empty array as
> it does now.
>
> Thoughts, anyone?

I think it's important to note here that the nodeset returning nature of
XPATH in XSLT is a context setting functionality: these nodes are then
further processed by the template. In the postgresql case, the
distinction between returning a value and doing further processing isn't
so clear. My one use-cases tend toward processing a table full of
records with an XML field, using the XPATH to select out fragments and
records ids into a secondary table for further processing/analysis.

What you describe, making XPATH return something for the scalar
functions, is sorely needed. Constraining the return values to be valid
XML fragments is the sort of wart that makes XML processing in
postgresql seem odd to those familiar with other tools, though. How
about naming the other function XPATH_VALUE_OF, just to make it the XSLT
connection clear?

Ross
--
Ross Reedstrom, Ph.D. reedstrm(at)rice(dot)edu
Systems Engineer & Admin, Research Scientist phone: 713-348-6166
Connexions http://cnx.org fax: 713-348-3665
Rice University MS-375, Houston, TX 77005
GPG Key fingerprint = F023 82C8 9B0E 2CC6 0D8E F888 D3AE 810E 88F0 BEDE


From: Florian Pflug <fgp(at)phlo(dot)org>
To: Ross J(dot) Reedstrom <reedstrm(at)rice(dot)edu>
Cc: PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PATCH] Bug in XPATH() if expression returns a scalar value
Date: 2011-05-31 19:28:50
Message-ID: 2CB0EF81-619A-464C-AA93-3A603598704D@phlo.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On May31, 2011, at 19:15 , Ross J. Reedstrom wrote:
> What you describe, making XPATH return something for the scalar
> functions, is sorely needed. Constraining the return values to be valid
> XML fragments is the sort of wart that makes XML processing in
> postgresql seem odd to those familiar with other tools, though.

I've now changes things so that the results of scalar-value
returning XPath expressions are correctly entity-encoded (i.e.,
a literal "<" gets translated to "&lt;").

After realizing that this is necessary even for node-set
returning XPath expressions (see my other mail from today),
because they may very well select text nodes, I came to the
conclusion that doing this unconditionally (well, except for
element nodes obviously) is the least surprising behaviour.

The following subsumes the behavior with this and the patch
from my other e-mail applied.

SELECT
(XPATH('namespace-uri(/*)', x))[1] AS namespace,
(XPATH('/*/@value', x))[1] AS value,
(XPATH('/*/text()', x))[1] AS text
FROM (VALUES (XMLELEMENT(name "root",
XMLATTRIBUTES('<n' AS xmlns, '<v' AS value),
'<t'
))) v(x);

namespace | value | text
-----------+-------+-------
&lt;n | &lt;v | &lt;t

Without the patch from the other mail, the "namespace" result
stays the same, but "value" and "text" are "<v" and "<t"
respectively.

Updated patch is attached

best regards,
Florian Pflug

PS: Btw, while trying this I think I found another problem. If you
do

SELECT (XPATH(
'/*',
XMLELEMENT(NAME "root",
XMLATTRIBUTES('<n' AS xmlns,
'<v' AS value))
))[1];

you get

xpath
----------------------------------
<root xmlns="<n" value="&lt;v"/>

i.e. the "<" in the namespace URI isn't quoted properly.
Trying to cast that value to text and back to xml fails.
Funnily enough, if you skip the XPATH() call, things work properly

SELECT XMLELEMENT(NAME "root",
XMLATTRIBUTES('<n' AS xmlns,
'<v' AS value))

gives

xmlelement
-------------------------------------
<root xmlns="&lt;n" value="&lt;v"/>

I'll start a new thread for this issue...

Attachment Content-Type Size
pg_xpath_returnvalue.v2.patch application/octet-stream 7.4 KB

From: Peter Eisentraut <peter_e(at)gmx(dot)net>
To: Florian Pflug <fgp(at)phlo(dot)org>
Cc: PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PATCH] Bug in XPATH() if expression returns a scalar value
Date: 2011-06-06 12:56:59
Message-ID: 1307365019.20678.6.camel@fsopti579.F-Secure.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On tis, 2011-05-31 at 16:19 +0200, Florian Pflug wrote:
> If people deem this to be a problem, we could instead add a separate
> function XPATH_VALUE() that returns VARCHAR, and make people use that
> for scalar-value-returning expressions.

Why not replicate what contrib/xml2 provides, namely

xpath_string()
xpath_number()
xpath_bool()

That way, types are preserved.

> However, to avoid confusion,
> XPATH() should then be taught to raise an error if used for
> scalar-value
> returning expressions, instead of silently returning an empty array as
> it does now.

Sounds reasonable.


From: Florian Pflug <fgp(at)phlo(dot)org>
To: Peter Eisentraut <peter_e(at)gmx(dot)net>
Cc: PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PATCH] Bug in XPATH() if expression returns a scalar value
Date: 2011-06-08 08:14:09
Message-ID: BECD717C-8232-4427-B31D-748D9CA77A38@phlo.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Jun6, 2011, at 14:56 , Peter Eisentraut wrote:
> On tis, 2011-05-31 at 16:19 +0200, Florian Pflug wrote:
>> If people deem this to be a problem, we could instead add a separate
>> function XPATH_VALUE() that returns VARCHAR, and make people use that
>> for scalar-value-returning expressions.
>
> Why not replicate what contrib/xml2 provides, namely
>
> xpath_string()
> xpath_number()
> xpath_bool()
>
> That way, types are preserved.

But then you lose the ability to evaluate user-supplied
XPath expressions, because there's no way of telling which of these
function to use.

Since XPATH_BOOL() can be emulated by doing XPATH(...)::text::boolean
if XPATH() implements the more lenient behaviour I proposed, that
seems like a bad tradeoff overall.

best regards,
Florian Pflug


From: Florian Pflug <fgp(at)phlo(dot)org>
To: Florian Pflug <fgp(at)phlo(dot)org>
Cc: Peter Eisentraut <peter_e(at)gmx(dot)net>, PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PATCH] Bug in XPATH() if expression returns a scalar value
Date: 2011-06-09 23:28:21
Message-ID: 115517DA-921E-4828-8D4F-0F5C6AE92671@phlo.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Jun8, 2011, at 10:14 , Florian Pflug wrote:
> On Jun6, 2011, at 14:56 , Peter Eisentraut wrote:
>> On tis, 2011-05-31 at 16:19 +0200, Florian Pflug wrote:
>>> If people deem this to be a problem, we could instead add a separate
>>> function XPATH_VALUE() that returns VARCHAR, and make people use that
>>> for scalar-value-returning expressions.
>>
>> Why not replicate what contrib/xml2 provides, namely
>>
>> xpath_string()
>> xpath_number()
>> xpath_bool()
>>
>> That way, types are preserved.
>
> But then you lose the ability to evaluate user-supplied
> XPath expressions, because there's no way of telling which of these
> function to use.
>
> Since XPATH_BOOL() can be emulated by doing XPATH(...)::text::boolean
> if XPATH() implements the more lenient behaviour I proposed, that
> seems like a bad tradeoff overall.

Patch rebased onto HEAD.

best regards,
Florian Pflug

Attachment Content-Type Size
pg_xpath_returnvalue.v3.patch application/octet-stream 12.9 KB

From: Peter Eisentraut <peter_e(at)gmx(dot)net>
To: Florian Pflug <fgp(at)phlo(dot)org>
Cc: PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PATCH] Bug in XPATH() if expression returns a scalar value
Date: 2011-06-13 19:24:49
Message-ID: 1307993090.5621.0.camel@vanquo.pezone.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On ons, 2011-06-08 at 10:14 +0200, Florian Pflug wrote:
> But then you lose the ability to evaluate user-supplied
> XPath expressions, because there's no way of telling which of these
> function to use.

Perhaps having both variants, one type-safe and one not, would work. I
don't agree with doing away with type-safety completely for the sake of
convenience.


From: Florian Pflug <fgp(at)phlo(dot)org>
To: Peter Eisentraut <peter_e(at)gmx(dot)net>
Cc: PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PATCH] Bug in XPATH() if expression returns a scalar value
Date: 2011-06-14 09:56:31
Message-ID: 90EF0E6F-04BC-4CBF-9EE6-44A1CBBE56CB@phlo.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Jun13, 2011, at 21:24 , Peter Eisentraut wrote:
> On ons, 2011-06-08 at 10:14 +0200, Florian Pflug wrote:
>> But then you lose the ability to evaluate user-supplied
>> XPath expressions, because there's no way of telling which of these
>> function to use.
>
> Perhaps having both variants, one type-safe and one not, would work. I
> don't agree with doing away with type-safety completely for the sake of
> convenience.

In theory, I agree.

In practice, however, XPath 1.0 isn't very strongly typed itself. The
built-in types are all auto-converted into one another (As if
string(), number(), boolean() had been called). Also, only three of
the functions defined by XPath 1.0 seem interesting. Here's the break-down

The functions returning "string" are
string(): Converts arbitrary values to strings
local-name(): Name of node-set's first top-level w/o namespace prefix
namespace-uri(): Namespace of node-set's first top-level
name(): Namespace prefix and name of node-set's first top-level node
concat()
starts-with()
contains()
substring-before()
substring-after()
substring()
string-length()
translate()

For all of these function postgres provides corresponding SQL functions,
which the exception of
local-name()
namespace-uri()
name()

In fact, these three functions are the raison d'être for my patch and this thread.
I needed to find the name of a tag returned by an XPath expression, and to my
dismay discovered that XPATH('local-name(...)', ...) returns an empty array. The
only reason I added support for boolean and numeric XPath expressions at all was
for the sake of completeness.

Here's the rest of the functions defined by XPath 1.0. I'm convinces that none
of them are particularly useful as top-level functions, and therefore believe
that adding XPATH_BOOLEAN() and XPATH_NUMBER() is largely overkill.

The functions returning "number" are
number(): Converts arbitrary values to numbers
last()
position()
count()
sum(): Sum over a node-set after implicit conversion of nodes to numbers
floor()
ceiling()
round()
operators +, -, *, div, mod

The functions returning "boolean" are
boolean(): Converts arbitrary to boolean
not()
true()
false()
operators or, and, =, !=, <=, <, >=, >

best regards,
Florian Pflug