Re: Another issue with invalid XML values

From: Noah Misch <noah(at)leadboat(dot)com>
To: Florian Pflug <fgp(at)phlo(dot)org>
Cc: PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Another issue with invalid XML values
Date: 2011-06-20 17:57:51
Message-ID: 20110620175751.GA17037@tornado.leadboat.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi Florian,

I tested this patch using two systems, a CentOS 5 box with
libxml2-2.6.26-2.1.2.8.el5_5.1 and an Ubuntu 8.04 box with libxml2
2.6.31.dfsg-2ubuntu1.6. Both failed to build with this error:

xml.c: In function `pg_xml_init':
xml.c:934: error: `xmlStructuredErrorContext' undeclared (first use in this function)
xml.c:934: error: (Each undeclared identifier is reported only once
xml.c:934: error: for each function it appears in.)
make[4]: *** [xml.o] Error 1

It seems `xmlStructuredErrorContext' was added rather recently. It's not
obvious which version added it, but 2.7.8 has it and 2.7.2 does not. Looking
at 2.6.26's sources, that version seems to use `xmlGenericErrorContext' for
both structured and generic handler functions. Based on that, I tried with a
change from `xmlStructuredErrorContext' to `xmlGenericErrorContext' in our
xml.c. I ran the test suite and hit some failures in the xml test; see
attached regression.diffs. I received warnings where the tests expected
nothing or expected errors. Looks like my version of libxml2 (2.6.26 in this
case) classifies some of these messages differently.

I suggest testing the next version with both libxml2 2.6.23, the minimum
supported version, and libxml2 2.7.8, the latest version.

On Thu, Jun 09, 2011 at 06:11:46PM +0200, Florian Pflug wrote:
> On Jun2, 2011, at 01:34 , Florian Pflug wrote:
> > On Jun2, 2011, at 00:02 , Noah Misch wrote:
> >> On Wed, Jun 01, 2011 at 06:16:21PM +0200, Florian Pflug wrote:
> >>> Anyway, I'll try to come up with a patch that replaces xmlSetGenericErrorFunc() with xmlSetStructuredErrorFunc().
> >>
> >> Sounds sensible. Will this impose any new libxml2 version dependency?
> >
> > xmlSetStructuredErrorFunc() seems to be available starting with libxml 2.6.0,
> > release on Oct 20, 2003. Since we already require the version to be >= 2.6.23,
> > we should be OK.
> >
> > I won't have access to my PC the next few days, but I'll try to come up with
> > a patch some time next week.
>
> Phew... I did manage to produce a patch, but it was way more work than I
> had intended to put into this.
>
> As it turns out, you loose the nicely formatted context information that
> libxml2 provides via the generic error func once you switch to structured
> error reporting. Registering handlers for both doesn't help either, since
> the generic error handler isn't called once you register a structured one.
>
> Fortunately, libxml does export xmlParserPrintFileContext() which
> generates these context messages. It, however, doesn't return a string,
> but instead passes them to the generic error handler (this time, independent
> from whether a structural error handler is registered or not).

Interesting API, there.

> As it stood, the code assumed that all third-party library re-install
> their libxml error handlers before each library call, and thus didn't
> bother to restore the old error handler itself. Since I revamped the
> error handling anyway, I removed that requirement. There is now a
> function pg_xml_done() which restores the original error handler that
> we overwrote in pg_xml_init().

Seems reasonable.

> I also realized that some libxml error (like undefined namespace prefixes)
> must be ignored during xmlparse() and friends. Otherwise, it becomes
> impossible to build XML documents from individual fragments. pg_xml_init()
> therefore now takes an argument which specifies how strict the error
> checking is supposed to be. For the moment, only XPATH() uses the strict
> mode in which we report all errors. XMLPARSE() and friends only report
> parse errors, not namespace errors.

Seems fair offhand. We must preserve the invariant that any object with type
"xml" can be passed through xml_out() and then be accepted by xml_in(). I'm
not seeing any specific risks there, but I figure I'd mention it.

I couldn't reconcile this description with the code. I do see that
xpath_internal() uses XML_PURPOSE_OTHER, but so does xmlelement(). Why is
xmlelement() also ready for the strictest error reporting?

Also note that we may be able to be more strict when xmloption = document.

>
> Finally, I had to adjust contrib/xml2 because it uses some parts of
> the core XML support like pg_xml_init().
>
> Heres the indended behaviour with the patch applied:
> ----------------------------------------------------
>
> We always use structured error handling. For now, the error messages
> pretty much resemble the old ones, but it's now easy to add additional
> information.
>
> XMLPARSE() and casting to XML check for parse errors only, like they do
> without the patch. They're also capable of reporting warnings, but I
> didn't find a case where the libxml parser generates a warning.
>
> XPATH() reports all errors and warnings. Trying to use XPATH() on
> a document with e.g. inconsistent namespace usage or invalid
> namespace URIs therefore now raises an error. This is *necessary*
> because libxml's XPath evaluator gets confused if it encounters
> e.g. invalid namespace URI and outputs invalid XML in response.

Ideally, an invalid namespace URI should always be an error. While namespace
prefixes could become valid as you assemble a larger document, a namespace
URI's validity is context-free.

> contrib/xml2's behaviour hasn't changed.
>
> Patch is attached, and comments are welcome.

As far as the code itself, a few comments:

The patch adds some trailing whitespace.

> *** a/contrib/xml2/xpath.c
> --- b/contrib/xml2/xpath.c
> *************** xpath_table(PG_FUNCTION_ARGS)
> *** 720,725 ****
> --- 726,732 ----
> if (comppath == NULL)
> {
> xmlFreeDoc(doctree);
> + pg_xml_done();
> xml_ereport(ERROR, ERRCODE_EXTERNAL_ROUTINE_EXCEPTION,
> "XPath Syntax Error");
> }

Remembering to put a pg_xml_done() in every relevant elog(ERROR, ...) path
seems fragile. How about using RegisterXactCallback/RegisterSubXactCallback
in pgxml_parser_init() to handle those cases? You can also use it to Assert
at non-abort transaction shutdown that pg_xml_done() was called as needed.

> *************** xmlelement(XmlExprState *xmlExpr, ExprCo
> *** 597,614 ****
> }
>
> /* now safe to run libxml */
> ! pg_xml_init();
>
> PG_TRY();
> {
> buf = xmlBufferCreate();
> ! if (!buf)
> ! xml_ereport(ERROR, ERRCODE_OUT_OF_MEMORY,
> ! "could not allocate xmlBuffer");
> writer = xmlNewTextWriterMemory(buf, 0);
> ! if (!writer)
> ! xml_ereport(ERROR, ERRCODE_OUT_OF_MEMORY,
> ! "could not allocate xmlTextWriter");
>
> xmlTextWriterStartElement(writer, (xmlChar *) xexpr->name);
>
> --- 607,622 ----
> }
>
> /* now safe to run libxml */
> ! pg_xml_init(XML_PURPOSE_OTHER);
>
> PG_TRY();
> {
> buf = xmlBufferCreate();
> ! XML_REPORT_ERROR(buf != NULL, ERROR, ERRCODE_OUT_OF_MEMORY,
> ! "could not allocate xmlBuffer");
> writer = xmlNewTextWriterMemory(buf, 0);
> ! XML_REPORT_ERROR(writer != NULL, ERROR, ERRCODE_OUT_OF_MEMORY,
> ! "could not allocate xmlTextWriter");
>
> xmlTextWriterStartElement(writer, (xmlChar *) xexpr->name);
>

Consider renaming XML_REPORT_ERROR it to something like XML_CHECK_ERROR,
emphasizing that it wraps a conditional.

> *************** xml_parse(text *data, XmlOptionType xmlo
> *** 1175,1182 ****
> int32 len;
> xmlChar *string;
> xmlChar *utf8string;
> ! xmlParserCtxtPtr ctxt;
> ! xmlDocPtr doc;
>
> len = VARSIZE(data) - VARHDRSZ; /* will be useful later */
> string = xml_text2xmlChar(data);
> --- 1244,1251 ----
> int32 len;
> xmlChar *string;
> xmlChar *utf8string;
> ! xmlParserCtxtPtr ctxt = NULL;
> ! xmlDocPtr doc = NULL;
>
> len = VARSIZE(data) - VARHDRSZ; /* will be useful later */
> string = xml_text2xmlChar(data);

Is this change needed?

> *************** void
> *** 1337,1366 ****
> xml_ereport(int level, int sqlcode, const char *msg)
> {
> char *detail;
> !
> /*
> ! * It might seem that we should just pass xml_err_buf->data directly to
> ! * errdetail. However, we want to clean out xml_err_buf before throwing
> * error, in case there is another function using libxml further down the
> * call stack.
> */
> ! if (xml_err_buf->len > 0)
> {
> ! detail = pstrdup(xml_err_buf->data);
> ! resetStringInfo(xml_err_buf);
> }
> else
> detail = NULL;
>
> if (detail)
> {
> - size_t len;
> -
> - /* libxml error messages end in '\n'; get rid of it */
> - len = strlen(detail);
> - if (len > 0 && detail[len - 1] == '\n')
> - detail[len - 1] = '\0';
> -
> ereport(level,
> (errcode(sqlcode),
> errmsg("%s", msg),
> --- 1408,1430 ----
> xml_ereport(int level, int sqlcode, const char *msg)
> {
> char *detail;
> !
> /*
> ! * It might seem that we should just pass xml_error_detail_buf->data directly to
> ! * errdetail. However, we want to clean out xml_error_detail_buf before throwing
> * error, in case there is another function using libxml further down the
> * call stack.
> */
> ! if (xml_error_detail_buf->len > 0)
> {
> ! detail = pstrdup(xml_error_detail_buf->data);
> ! resetStringInfo(xml_error_detail_buf);

The xml_err_buf -> xml_error_detail_buf change should be its own patch, if you
think it's important. Changing the name at the same time makes it harder to
verify this patch.

> }
> else
> detail = NULL;
>
> if (detail)
> {
> ereport(level,
> (errcode(sqlcode),
> errmsg("%s", msg),
> *************** xml_ereport(int level, int sqlcode, cons
> *** 1376,1402 ****
>
>
> /*
> ! * Error handler for libxml error messages
> */
> static void
> ! xml_errorHandler(void *ctxt, const char *msg,...)
> {
> ! /* Append the formatted text to xml_err_buf */
> ! for (;;)
> {
> ! va_list args;
> ! bool success;
>
> - /* Try to format the data. */
> - va_start(args, msg);
> - success = appendStringInfoVA(xml_err_buf, msg, args);
> - va_end(args);
>
> ! if (success)
> break;
> !
> ! /* Double the buffer size and try again. */
> ! enlargeStringInfo(xml_err_buf, xml_err_buf->maxlen);
> }
> }
>
> --- 1440,1566 ----
>
>
> /*
> ! * Append a newline after removing all existing trailing newlines
> */
> static void
> ! appendStringInfoLineSeparator(StringInfo str)
> {
> ! chopStringInfoNewlines(str);
> !
> ! if (str->len > 0)
> ! appendStringInfoChar(str, '\n');
> ! }
> !
> !
> ! /*
> ! * Remove all trailing newlines
> ! */
> ! static void
> ! chopStringInfoNewlines(StringInfo str)
> ! {
> ! while ((str->len > 0) &&
> ! (str->data[str->len-1] == '\n'))
> {
> ! str->data[--str->len] = '\0';
> ! }
> ! }

Is there something about this patch's other changes that make it necessary to
strip multiple newlines, vs. one newline as we did before, and to do it in a
different place in the code?

>
>
> ! /*
> ! * Error handler for libxml errors and warnings
> ! */
> ! static void
> ! xml_errorHandler(void* data, xmlErrorPtr error)
> ! {
> ! /* Buffer to hold the error detail */
> ! StringInfo errorBuf = makeStringInfo();
> !
> ! /* Get parser and input context */
> ! xmlParserCtxtPtr ctxt = (xmlParserCtxtPtr) error->ctxt;
> ! xmlParserInputPtr input = (ctxt != NULL) ? ctxt->input : NULL;
> ! xmlNodePtr node = error->node;
> ! const xmlChar* name = ((node != NULL) && (node->type == XML_ELEMENT_NODE)) ?
> ! node->name : NULL;
> !
> ! switch (error->domain) {
> ! case XML_FROM_NONE:
> ! case XML_FROM_PARSER:
> ! case XML_FROM_MEMORY:
> ! case XML_FROM_IO:
> ! /* Accept regardless of the parsing purpose */
> ! break;
> !
> ! default:
> ! /* Ignore during well-formedness check */
> ! if (xml_purpose == XML_PURPOSE_WELLFORMED)
> ! return;
> break;
> ! }
> !
> ! /* Append error message to xml_error_detail_buf */
> ! if ((error->domain == XML_FROM_PARSER) && (error->line > 0))
> ! appendStringInfo(errorBuf, "line %d: ", error->line);
> ! if (name != NULL)
> ! appendStringInfo(errorBuf, "element %s: ", name);
> ! appendStringInfo(errorBuf, "%s", error->message);
> !
> ! /* Append context information to xml_error_detail_buf.
> ! * xmlParserPrintFileContext() uses the *generic* error handle to
> ! * write the context. Since we don't want to duplicate libxml
> ! * functionality here, we setup a generic error handler temporarily
> ! */
> ! if (input != NULL)
> ! {
> ! /* Save generic error func and setup the xml_error_detail_buf
> ! * appender as generic error func. This should work because
> ! * appendStringInfo() has essentially the same signature as
> ! * xmlGenericErrorFunc().
> ! */
> ! xmlGenericErrorFunc errFuncSaved = xmlGenericError;
> ! void* errCtxSaved = xmlGenericErrorContext;
> ! xmlSetGenericErrorFunc(errorBuf, (xmlGenericErrorFunc)appendStringInfo);
> !
> ! /* Generic context information */
> ! appendStringInfoLineSeparator(errorBuf);
> ! xmlParserPrintFileContext(input);
> !
> ! /* Restore generic error func */
> ! xmlSetGenericErrorFunc(errCtxSaved, errFuncSaved);
> ! }
> !
> ! chopStringInfoNewlines(errorBuf);
> !
> ! /* Legacy error handling. The error flag is never set. Exists because
> ! * the xml2 contrib module uses our error-handling infrastructure, but
> ! * we don't want to change its behaviour since it's deprecated anyway
> ! */
> ! if (xml_purpose == XML_PURPOSE_LEGACY)
> ! {
> ! appendStringInfoLineSeparator(xml_error_detail_buf);
> ! appendStringInfoString(xml_error_detail_buf, errorBuf->data);
> ! return;
> ! }
> !
> ! /* We don't want to ereport() here because that'd probably leave
> ! * libxml in an inconsistent state. Instead, we remember the
> ! * error and ereport() from xml_ereport().
> ! *
> ! * Warnings and notices are reported immediatly since they don't cause a

typo

> ! * longjmp() out of libxml.
> ! */
> ! if (error->level >= XML_ERR_ERROR)
> ! {
> ! appendStringInfoLineSeparator(xml_error_detail_buf);
> ! appendStringInfoString(xml_error_detail_buf, errorBuf->data);
> ! xml_error_occured = true;
> ! }
> ! else if (error->level >= XML_ERR_WARNING)
> ! {
> ! ereport(WARNING, (errmsg("%s", errorBuf->data)));
> ! }
> ! else
> ! {
> ! ereport(NOTICE, (errmsg("%s", errorBuf->data)));
> }
> }
>

> diff --git a/src/include/utils/xml.h b/src/include/utils/xml.h
> index 6359cd6..6735521 100644
> *** a/src/include/utils/xml.h
> --- b/src/include/utils/xml.h
> *************** typedef enum
> *** 68,74 ****
> XML_STANDALONE_OMITTED
> } XmlStandaloneType;
>
> ! extern void pg_xml_init(void);
> extern void xml_ereport(int level, int sqlcode, const char *msg);
> extern xmltype *xmlconcat(List *args);
> extern xmltype *xmlelement(XmlExprState *xmlExpr, ExprContext *econtext);
> --- 68,93 ----
> XML_STANDALONE_OMITTED
> } XmlStandaloneType;
>
> ! typedef enum
> ! {
> ! XML_PURPOSE_NONE /* Don't setup error handler. pg_xml_done() not required */,

It's odd to have a purpose of "NONE" that pg_xml_init() callers can actually
use. How about XML_PURPOSE_TRIVIAL, for callers that only call libxml2
functions that can't error? Also, I think it would be better to require a
call to pg_xml_done() in all cases, just to keep things simple. Actually, I
wonder if it's worth having XML_PURPOSE_TRIVIAL for the one user thereof. It
doesn't save much.

> ! XML_PURPOSE_LEGACY /* Save error message only, don't set error flag */,

It's fine to set the flag for legacy users, considering they could just not
read it. The important distinction seems to be that we don't emit warnings or
notices in this case. Is that correct? If so, the comment should reflect
that emphasis. Then, consider updating the flag unconditionally.

> ! XML_PURPOSE_WELLFORMED /* Ignore non-parser errors, nothing else*/,
> ! XML_PURPOSE_OTHER /* Report all errors */
> ! } XmlPurposeType;

This is currently about error reporting only. How about:

typedef enum
{
XML_ER_NONE, /* no calls to libxml2 functions permitted */
XML_ER_WELLFORMED, /* report errors and warnings from the parser only */
XML_ER_LEGACY, /* report all errors, no warnings */
XML_ER_ALL /* report all errors and warnings */
} XmlErrorReporting;

> !
> ! extern bool xml_error_occured;
> !
> ! #define XML_REPORT_ERROR(assertion,level,sqlcode,msg) \
> ! if (pg_xml_erroroccurred() || !(assertion)) { \
> ! xml_ereport(level, sqlcode, msg); \
> ! } \
> ! else { \
> ! }

Consider moving this definition into xml.c. It's more of an internal
shorthand for keeping that file simple than part of the API.

> !
> ! extern void pg_xml_init(XmlPurposeType purpose);
> ! extern void pg_xml_done(void);
> ! extern bool pg_xml_erroroccurred(void);
> extern void xml_ereport(int level, int sqlcode, const char *msg);
> extern xmltype *xmlconcat(List *args);
> extern xmltype *xmlelement(XmlExprState *xmlExpr, ExprContext *econtext);
> diff --git a/src/test/regress/expected/xml.out b/src/test/regress/expected/xml.out
> index eaa5a74..da2f290 100644
> *** a/src/test/regress/expected/xml.out
> --- b/src/test/regress/expected/xml.out
> *************** INSERT INTO xmltest VALUES (3, '<wrong')
> *** 8,14 ****
> ERROR: invalid XML content
> LINE 1: INSERT INTO xmltest VALUES (3, '<wrong');
> ^
> ! DETAIL: Entity: line 1: parser error : Couldn't find end of Start Tag wrong line 1
> <wrong
> ^
> SELECT * FROM xmltest;
> --- 8,14 ----
> ERROR: invalid XML content
> LINE 1: INSERT INTO xmltest VALUES (3, '<wrong');
> ^
> ! DETAIL: line 1: Couldn't find end of Start Tag wrong line 1

Any reason to change the error text this way?

Thanks,
nm

Attachment Content-Type Size
regression.diffs text/plain 4.3 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Darren Duncan 2011-06-20 18:38:58 Re: Range Types and extensions
Previous Message Tom Lane 2011-06-20 17:43:27 Re: Range Types and extensions