XML Issue with DTDs

From: Florian Pflug <fgp(at)phlo(dot)org>
To: pgsql-hackers Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: XML Issue with DTDs
Date: 2013-12-19 23:40:11
Message-ID: 8E3B4E77-5539-431A-9E14-CAC3AD9938A3@phlo.org
Lists: pgsql-hackers

Hi,

While looking into ways to implement an XMLSTRIP function which extracts the textual contents of an XML value and de-escapes them (i.e. replaces entity references by their text equivalent), I've run into another issue with the XML type.

XML values can contain either a DOCUMENT or CONTENT. In the first case, the value is well-formed XML according to the XML specification. In the latter case, the value is a collection of nodes, each of which may contain children. Without DTDs in the mix, CONTENT is thus a generalization of DOCUMENT, i.e. a DOCUMENT may contain only a single root node while CONTENT may contain multiple. That guarantees that a concatenation of two XML values is always at least valid CONTENT.

That, however, is no longer true once DTDs enter the picture. A DOCUMENT may contain a DTD as long as it precedes the root node (processing instructions and comments may precede the DTD, though). Yet CONTENT may not include a DTD at all. A concatenation of a DOCUMENT with a DTD and CONTENT thus yields something that is neither a DOCUMENT nor CONTENT, yet XMLCONCAT fails to complain. The following example fails with XMLOPTION set to DOCUMENT as well as with XMLOPTION set to CONTENT, because the result of XMLCONCAT cannot be cast to text and back to xml under either setting.

select xmlconcat(
  xmlparse(document '<!DOCTYPE test [<!ELEMENT test EMPTY>]><test/>'),
  xmlparse(content '<test/>')
)::text::xml;
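
To narrow down where things go wrong, the pieces can be inspected one at a time with the IS DOCUMENT predicate and a plain cast to text. This is only a minimal sketch reusing the values from the example above; no output is shown, since results may differ between server versions.

```sql
-- Check each argument on its own.
SELECT xmlparse(document '<!DOCTYPE test [<!ELEMENT test EMPTY>]><test/>') IS DOCUMENT;
SELECT xmlparse(content '<test/>') IS DOCUMENT;

-- Check the concatenated value. It consists of a DTD followed by two
-- top-level elements, so it is not expected to qualify as a DOCUMENT.
SELECT xmlconcat(
  xmlparse(document '<!DOCTYPE test [<!ELEMENT test EMPTY>]><test/>'),
  xmlparse(content '<test/>')
) IS DOCUMENT;

-- Show the serialized string that the final ::xml cast has to re-parse.
SELECT xmlconcat(
  xmlparse(document '<!DOCTYPE test [<!ELEMENT test EMPTY>]><test/>'),
  xmlparse(content '<test/>')
)::text;
```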

Solving this seems a bit messy, unfortunately. First, I think we need to have some XMLOPTION value which is a superset of all the others - otherwise, dump & restore won't work reliably. That means either allowing DTDs if XMLOPTION is CONTENT, or inventing a third XMLOPTION, say ANY.
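
The lack of a superset setting can be illustrated by casting a few strings to xml under each XMLOPTION. This is a sketch under the behaviour described in this message (CONTENT rejecting DTDs); the sample strings are made up for illustration, and servers that treat DTDs differently will give different results.

```sql
-- Under DOCUMENT, a value with two top-level elements is rejected.
SET xmloption TO document;
SELECT '<a/><b/>'::xml;
SELECT '<!DOCTYPE a [<!ELEMENT a EMPTY>]><a/>'::xml;

-- Under CONTENT, two top-level elements are fine, but the value
-- carrying a DTD is the case at issue here.
SET xmloption TO content;
SELECT '<a/><b/>'::xml;
SELECT '<!DOCTYPE a [<!ELEMENT a EMPTY>]><a/>'::xml;
```

If neither setting accepts both kinds of value, a dump containing both cannot be restored under a single XMLOPTION.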

We then need to ensure that combining XML values yields something that is valid according to the most general XMLOPTION setting. That means either

(1) Removing the DTD from all but the first argument to XMLCONCAT, and similarly from all but the first value passed to XMLAGG,

or

(2) Complaining if these values contain a DTD,

or

(3) Allowing multiple DTDs in a document if XMLOPTION is, say, ANY.

I'm not in favour of (3), since clients are unlikely to be able to process such a value. Option (1) matches how we currently handle XML declarations (<?xml …?>), so I'm slightly in favour of that.
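
For comparison, the existing treatment of XML declarations can be observed by concatenating two values that each carry their own declaration. This is a small sketch with made-up element names; the result should contain at most one declaration at the front rather than one per argument.

```sql
-- Both arguments carry an XML declaration; XMLCONCAT combines them
-- instead of repeating a declaration in the middle of the result.
SELECT xmlconcat(
  '<?xml version="1.1"?><foo/>',
  '<?xml version="1.1" standalone="no"?><bar/>'
);
```

Option (1) would apply the same idea to DTDs: keep the DTD of the first argument, if any, and drop the rest.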

Thoughts?

best regards,
Florian Pflug
