XML Issue with DTDs

Lists: pgsql-hackers
From: Florian Pflug <fgp(at)phlo(dot)org>
To: pgsql-hackers Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: XML Issue with DTDs
Date: 2013-12-19 23:40:11
Message-ID: 8E3B4E77-5539-431A-9E14-CAC3AD9938A3@phlo.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi,

While looking into ways to implement a XMLSTRIP function which extracts the textual contents of an XML value and de-escapes them (i.e. replaces entity references by their text equivalent), I've ran into another issue with the XML type.

XML values can either contain a DOCUMENT or CONTENT. In the first case, the value is well-formed XML according to the XML specification. In the latter case, the value is a collection of nodes, each of which may contain children. Without DTDs in the mix, CONTENT is thus a generalization of DOCUMENT, i.e. a DOCUMENT may contain only a single root node while a CONTENT may contain multiple. That guarantees that a concatenation of two XML values is always at least valid CONTENT. That, however, is no longer true once DTDs enter the picture. A DOCUMENT may contain a DTD as long as it precedes the root node (processing instructions and comments may precede the DTD, though). Yet CONTENT may not include a DTD at all. A concatenation of a DOCUMENT with a DTD and CONTENT thus yields something that is neither a DOCUMENT nor a CONTENT, yet XMLCONCAT fails to complain. The following example fails for XMLOPTION set to DOCUMENT as well as for XMLOPTION set to CONTENT.

select xmlconcat(
xmlparse(document '<!DOCTYPE test [<!ELEMENT test EMPTY>]><test/>'),
xmlparse(content '<test/>')
)::text::xml;

Solving this seems a bit messy, unfortunately. First, I think we need to have some XMLOPTION value which is a superset of all the others - otherwise, dump & restore won't work reliably. That means either allowing DTDs if XMLOPTION is CONTENT, or inventing a third XMLOPTION, say ANY.

We then need to ensure that combining XML values yields something that is valid according to the most general XMLOPTION setting. That means either

(1) Removing the DTD from all but the first argument to XMLCONCAT, and similarly all but the first value passed to XMLAGG

or

(2) Complaining if these values contain a DTD.

or

(3) Allowing multiple DTDs in a document if XMLOPTION is, say, ANY.

I'm not in favour of (3), since clients are unlikely to be able to process such a value. (1) matches how we currently handle XML declarations (<?xml …?>), so I'm slightly in favour of that.

Thoughts?

best regards,
Florian Pflug


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Florian Pflug <fgp(at)phlo(dot)org>
Cc: pgsql-hackers Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: XML Issue with DTDs
Date: 2013-12-20 17:52:39
Message-ID: CA+TgmoZWQ=tkeijPmF4d-CQ_4jHo=gw1DnST8Pr_HcPjN0goUg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Dec 19, 2013 at 6:40 PM, Florian Pflug <fgp(at)phlo(dot)org> wrote:
> While looking into ways to implement a XMLSTRIP function which extracts the textual contents of an XML value and de-escapes them (i.e. > Solving this seems a bit messy, unfortunately. First, I think we need to have some XMLOPTION value which is a superset of all the others - otherwise, dump & restore won't work reliably. That means either allowing DTDs if XMLOPTION is CONTENT, or inventing a third XMLOPTION, say ANY.

Or we can just decide that it was a bug that this was ever allowed,
and if you upgrade to $FIXEDVERSION you'll need to sanitize your data.
This is roughly what we did with encoding checks.

> We then need to ensure that combining XML values yields something that is valid according to the most general XMLOPTION setting. That means either
>
> (1) Removing the DTD from all but the first argument to XMLCONCAT, and similarly all but the first value passed to XMLAGG
>
> or
>
> (2) Complaining if these values contain a DTD.
>
> or
>
> (3) Allowing multiple DTDs in a document if XMLOPTION is, say, ANY.
>
> I'm not in favour of (3), since clients are unlikely to be able to process such a value. (1) matches how we currently handle XML declarations (<?xml …?>), so I'm slightly in favour of that.

I don't like #3, mostly because I don't like XMLOPTION ANY in the
first place. Either #1 or #2 sounds OK.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Florian Pflug <fgp(at)phlo(dot)org>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-hackers Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: XML Issue with DTDs
Date: 2013-12-21 01:16:56
Message-ID: EF895053-B7CC-4900-B33B-4358B0B3C795@phlo.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Dec20, 2013, at 18:52 , Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Thu, Dec 19, 2013 at 6:40 PM, Florian Pflug <fgp(at)phlo(dot)org> wrote:
>> Solving this seems a bit messy, unfortunately. First, I think we need to have some XMLOPTION value which is a superset of all the others - otherwise, dump & restore won't work reliably. That means either allowing DTDs if XMLOPTION is CONTENT, or inventing a third XMLOPTION, say ANY.
>
> Or we can just decide that it was a bug that this was ever allowed,
> and if you upgrade to $FIXEDVERSION you'll need to sanitize your data.
> This is roughly what we did with encoding checks.

What exactly do you suggest we outlaw? If there are XML values which
are CONTENT but not a DOCUMENT, and other values which are a DOCUMENT
but not CONTENT, then what is pg_restore supposed to set XMLOPTION
to?

best regards,
Florian Pflug


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Florian Pflug <fgp(at)phlo(dot)org>
Cc: pgsql-hackers Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: XML Issue with DTDs
Date: 2013-12-23 02:45:02
Message-ID: CA+TgmoZNHVhco-kQPvQRvCv+s9SR5sYkzmexO5aAMKkLBsUfqg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Dec 20, 2013 at 8:16 PM, Florian Pflug <fgp(at)phlo(dot)org> wrote:
> On Dec20, 2013, at 18:52 , Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>> On Thu, Dec 19, 2013 at 6:40 PM, Florian Pflug <fgp(at)phlo(dot)org> wrote:
>>> Solving this seems a bit messy, unfortunately. First, I think we need to have some XMLOPTION value which is a superset of all the others - otherwise, dump & restore won't work reliably. That means either allowing DTDs if XMLOPTION is CONTENT, or inventing a third XMLOPTION, say ANY.
>>
>> Or we can just decide that it was a bug that this was ever allowed,
>> and if you upgrade to $FIXEDVERSION you'll need to sanitize your data.
>> This is roughly what we did with encoding checks.
>
> What exactly do you suggest we outlaw?

<!DOCTYPE> anywhere but at the beginning.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Peter Eisentraut <peter_e(at)gmx(dot)net>
To: Florian Pflug <fgp(at)phlo(dot)org>, pgsql-hackers Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: XML Issue with DTDs
Date: 2013-12-23 17:39:37
Message-ID: 52B87559.7090003@gmx.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 12/19/13, 6:40 PM, Florian Pflug wrote:
> The following example fails for XMLOPTION set to DOCUMENT as well as for XMLOPTION set to CONTENT.
>
> select xmlconcat(
> xmlparse(document '<!DOCTYPE test [<!ELEMENT test EMPTY>]><test/>'),
> xmlparse(content '<test/>')
> )::text::xml;

The SQL standard specifies that DTDs are dropped by xmlconcat. It's
just not implemented.


From: Florian Pflug <fgp(at)phlo(dot)org>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-hackers Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: XML Issue with DTDs
Date: 2013-12-26 20:28:45
Message-ID: AE499D25-0910-4CFD-AF98-D6103918495E@phlo.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Dec23, 2013, at 03:45 , Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Fri, Dec 20, 2013 at 8:16 PM, Florian Pflug <fgp(at)phlo(dot)org> wrote:
>> On Dec20, 2013, at 18:52 , Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>>> On Thu, Dec 19, 2013 at 6:40 PM, Florian Pflug <fgp(at)phlo(dot)org> wrote:
>>>> Solving this seems a bit messy, unfortunately. First, I think we need to have some XMLOPTION value which is a superset of all the others - otherwise, dump & restore won't work reliably. That means either allowing DTDs if XMLOPTION is CONTENT, or inventing a third XMLOPTION, say ANY.
>>>
>>> Or we can just decide that it was a bug that this was ever allowed,
>>> and if you upgrade to $FIXEDVERSION you'll need to sanitize your data.
>>> This is roughly what we did with encoding checks.
>>
>> What exactly do you suggest we outlaw?
>
> <!DOCTYPE> anywhere but at the beginning.

I think we're talking past one another here. Fixing XMLCONCAT/XMLAGG
to not produce XML values which are neither valid DOCUMENTS nor valid
CONTENT fixes *one* part of the problem.

The other part of the problem is that since not every DOCUMENT
is valid CONTENT (because CONTENT forbids DTDs) and not every CONTENT
is a valid DOCUMENT (because DOCUMENT forbids multiple root nodes), it's
impossible to set XMLOPTION to a value which accepts *all* valid XML
values. That breaks pg_dump/pg_restore. To fix this, we must provide
a way to insert XML data which accepts both DOCUMENTS and CONTENT, and
not only one or the other. Due to the way COPY works, we cannot call
a special conversion function, so we must modify the input functions.

My initial thought was to simply allow XML values which are CONTENT,
not DOCUMENTS, to contain a DTD (at the beginning), thus making CONTENT
a superset of DOCUMENT. But I've since then realized that the 2003
standard explicitly constrains CONTENT to *not* contain a DTD. The
only other option that I can see is to invert a third, non-standard
XMLOPTION value, ANY. ANY would accept anything accepted by either
DOCUMENT or CONTENT, but no more than that.

best regards,
Florian Pflug


From: Florian Pflug <fgp(at)phlo(dot)org>
To: Peter Eisentraut <peter_e(at)gmx(dot)net>
Cc: pgsql-hackers Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: XML Issue with DTDs
Date: 2013-12-26 20:30:00
Message-ID: 73B8C774-BE59-4F8C-9CF8-3D63D74C9EEF@phlo.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Dec23, 2013, at 18:39 , Peter Eisentraut <peter_e(at)gmx(dot)net> wrote:
> On 12/19/13, 6:40 PM, Florian Pflug wrote:
>> The following example fails for XMLOPTION set to DOCUMENT as well as for XMLOPTION set to CONTENT.
>>
>> select xmlconcat(
>> xmlparse(document '<!DOCTYPE test [<!ELEMENT test EMPTY>]><test/>'),
>> xmlparse(content '<test/>')
>> )::text::xml;
>
> The SQL standard specifies that DTDs are dropped by xmlconcat. It's
> just not implemented.

OK, cool, I'll try to figure out how to do that with libxml

best regards,
Florian Pflug


From: Florian Pflug <fgp(at)phlo(dot)org>
To: Peter Eisentraut <peter_e(at)gmx(dot)net>
Cc: pgsql-hackers Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: XML Issue with DTDs
Date: 2013-12-29 23:33:56
Message-ID: 41B1A9EC-029D-4CED-9830-5E7E42CAA960@phlo.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Dec26, 2013, at 21:30 , Florian Pflug <fgp(at)phlo(dot)org> wrote:
> On Dec23, 2013, at 18:39 , Peter Eisentraut <peter_e(at)gmx(dot)net> wrote:
>> On 12/19/13, 6:40 PM, Florian Pflug wrote:
>>> The following example fails for XMLOPTION set to DOCUMENT as well as for XMLOPTION set to CONTENT.
>>>
>>> select xmlconcat(
>>> xmlparse(document '<!DOCTYPE test [<!ELEMENT test EMPTY>]><test/>'),
>>> xmlparse(content '<test/>')
>>> )::text::xml;
>>
>> The SQL standard specifies that DTDs are dropped by xmlconcat. It's
>> just not implemented.
>
> OK, cool, I'll try to figure out how to do that with libxml

Hm, I've read through the (draft) SQL/XML 2003 standard, and it seems that
it mandates more processing of DTDs than we currently do. In particular, it
says that attribute default values and custom entities are to be expanded
by xmlparse(). Without doing that, stripping the DTD can change the meaning
of an XML document, or make it not well-formed (in the case of custom
entity definitions). So I think that we unless we implement that, I we have
to raise an error, not silently strip the DTD.

best regards,
Florian Pflug