Re: Another issue with invalid XML values

From: Florian Pflug <fgp(at)phlo(dot)org>
To: Noah Misch <noah(at)leadboat(dot)com>
Cc: PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Another issue with invalid XML values
Date: 2011-06-20 19:15:51
Message-ID: 9B26777E-1BFA-4432-BB77-0D93F794AB31@phlo.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi

Thanks for the extensive review, it's very much appreciated!

On Jun20, 2011, at 19:57 , Noah Misch wrote:
> I tested this patch using two systems, a CentOS 5 box with
> libxml2-2.6.26-2.1.2.8.el5_5.1 and an Ubuntu 8.04 box with libxml2
> 2.6.31.dfsg-2ubuntu1.6. Both failed to build with this error:
>
> xml.c: In function `pg_xml_init':
> xml.c:934: error: `xmlStructuredErrorContext' undeclared (first use in this function)
> xml.c:934: error: (Each undeclared identifier is reported only once
> xml.c:934: error: for each function it appears in.)
> make[4]: *** [xml.o] Error 1
>
> It seems `xmlStructuredErrorContext' was added rather recently. It's not
> obvious which version added it, but 2.7.8 has it and 2.7.2 does not. Looking
> at 2.6.26's sources, that version seems to use `xmlGenericErrorContext' for
> both structured and generic handler functions. Based on that, I tried with a
> change from `xmlStructuredErrorContext' to `xmlGenericErrorContext' in our
> xml.c. I ran the test suite and hit some failures in the xml test; see
> attached regression.diffs. I received warnings where the tests expected
> nothing or expected errors.

Great :-(. I wonder if maybe I should simply remove the part of the patch
which try to restore the error handler and context in pg_xml_done(). I've
added that because I feared that some 3rd-party libraries setup their libxml
error handle just once, not before every library class. The code previously
didn't care about this either, but since we're using structured instead of
generic error reporting now, we'd now collide with 3-rd party libs which
also use structured error handling, whereas previously that would have worked.
OTOH, once there's more than one such library...

Whats your stand on this?

> Looks like my version of libxml2 (2.6.26 in this
> case) classifies some of these messages differently.

Hm, that's exactly what I hoped that this patch would *prevent* from happening..
Will look into this.

> On Thu, Jun 09, 2011 at 06:11:46PM +0200, Florian Pflug wrote:
>> On Jun2, 2011, at 01:34 , Florian Pflug wrote:
>> I also realized that some libxml error (like undefined namespace prefixes)
>> must be ignored during xmlparse() and friends. Otherwise, it becomes
>> impossible to build XML documents from individual fragments. pg_xml_init()
>> therefore now takes an argument which specifies how strict the error
>> checking is supposed to be. For the moment, only XPATH() uses the strict
>> mode in which we report all errors. XMLPARSE() and friends only report
>> parse errors, not namespace errors.
>
> Seems fair offhand. We must preserve the invariant that any object with type
> "xml" can be passed through xml_out() and then be accepted by xml_in(). I'm
> not seeing any specific risks there, but I figure I'd mention it.
>
> I couldn't reconcile this description with the code. I do see that
> xpath_internal() uses XML_PURPOSE_OTHER, but so does xmlelement(). Why is
> xmlelement() also ready for the strictest error reporting?

xmlelement() doesn't use libxml's parser, only the xml writer. I didn't
want to suppress any errors the writer might raise (I'm not sure it raises
error at all, though).

> Also note that we may be able to be more strict when xmloption = document.

Yeah, probably. I think that better done in a separate patch, though - I've
tried not to change existing behaviour with this patch except where absolutely
necessary (i.e. for XPATH)

> Ideally, an invalid namespace URI should always be an error. While namespace
> prefixes could become valid as you assemble a larger document, a namespace
> URI's validity is context-free.

I agree. That, however, would require xmlelement() to check all xmlns*
attributes, and I didn't find a way to do that with libxml() (Well,
other than to parse the element after creating it, but that's a huge
kludge). So I left that issue for a later time...

> The patch adds some trailing whitespace.

Ups, sorry for that. Will fix.

> Remembering to put a pg_xml_done() in every relevant elog(ERROR, ...) path
> seems fragile. How about using RegisterXactCallback/RegisterSubXactCallback
> in pgxml_parser_init() to handle those cases? You can also use it to Assert
> at non-abort transaction shutdown that pg_xml_done() was called as needed.

Hm, isn't that a bit fragile too? It seems that PG_TRY/PG_CATCH blocks don't
necessarily create sub-transactions, do they?

Also, most elog() paths already contained xmlFree() calls, because libxml
seemingly cannot be made to use context-based memory management. So one
already needs to be pretty careful when touching these...

Anyway, should we decide to give up on trying to restore the error handler
and context, we could get rid of pg_xml_done() again... (See above for the
details on that)

> Consider renaming XML_REPORT_ERROR it to something like XML_CHECK_ERROR,
> emphasizing that it wraps a conditional.

Hm, I think it was named XML_CHECK_ERROR at one point, and I changed it
because I felt it didn't convey the fact that it's actually ereport()
and not just return false on error. But I agree that XML_REPORT_ERROR isn't
very self-explanatory, either. What about XML_REPORT_PENDING_ERROR()
or XML_CHECK_AND_REPORT_ERROR()?

>> *************** xml_parse(text *data, XmlOptionType xmlo
>> *** 1175,1182 ****
>> int32 len;
>> xmlChar *string;
>> xmlChar *utf8string;
>> ! xmlParserCtxtPtr ctxt;
>> ! xmlDocPtr doc;
>>
>> len = VARSIZE(data) - VARHDRSZ; /* will be useful later */
>> string = xml_text2xmlChar(data);
>> --- 1244,1251 ----
>> int32 len;
>> xmlChar *string;
>> xmlChar *utf8string;
>> ! xmlParserCtxtPtr ctxt = NULL;
>> ! xmlDocPtr doc = NULL;
>>
>> len = VARSIZE(data) - VARHDRSZ; /* will be useful later */
>> string = xml_text2xmlChar(data);
>
> Is this change needed?

For "doc" it definitely is. Since we can no report an error even if
the parser returned a valid doc, there's now a "if (doc) xmlFree(doc)"
in the PG_CATCH patch.

For "ctxt", strictly, speaking no. But only because "ctxt = xmlNewParserCtxt()"
is the first line in the PG_TRY block, and therefore made the "xmlFree(ctxt)"
unconditional. But doing it this way seems more consistent and less error-prone.

> The xml_err_buf -> xml_error_detail_buf change should be its own patch, if you
> think it's important. Changing the name at the same time makes it harder to
> verify this patch.

Agreed. That name change is a left-over from a version where there were two
error buffers. Will fix.

> Is there something about this patch's other changes that make it necessary to
> strip multiple newlines, vs. one newline as we did before, and to do it in a
> different place in the code?

There previously wasn't a separate function for that. I made it strip off
multiple newlines because that allowed me to keep appendStringInfoLineSeparator()
simple while still making sure that the end result is exactly one newline
at the end of the string. Previously, the coded depended on libxml always
adding a newline at the end of messages, which I'm not even sure it true
for messages reported via the structured error handler.

> It's odd to have a purpose of "NONE" that pg_xml_init() callers can actually
> use. How about XML_PURPOSE_TRIVIAL, for callers that only call libxml2
> functions that can't error? Also, I think it would be better to require a
> call to pg_xml_done() in all cases, just to keep things simple. Actually, I
> wonder if it's worth having XML_PURPOSE_TRIVIAL for the one user thereof. It
> doesn't save much.

Yeah, that name really is a perfect example of horrible naming ;-)

The pg_xml_done()-not-required semantics are necessary though. The problem is
that parse_xml_decl() is sometimes called after the caller has already called
pg_xml_init() himself. Not always though, so one cannot simply remove the
pg_xml_init() call from parse_xml_decl() - it might then use libxml without
it being initialized. So I made pg_xml_init() skip the error-handler setup
when purpose == XML_PURPOSE_NONE. This requires then, however, that pg_xml_done()
is *not* called, because it'd restore settings that were never saved (well,
actually it'd tripe an assert that protects against that...).

But yeah, this is all quite contorted. Maybe pg_xml_init() should simply be
split into two functions, one which initialized the library and one which
sets up error handling. That means that two calls instead of one are required,
but it makes the purpose of the individual functions much clearer...

>> ! XML_PURPOSE_LEGACY /* Save error message only, don't set error flag */,
>
> It's fine to set the flag for legacy users, considering they could just not
> read it. The important distinction seems to be that we don't emit warnings or
> notices in this case. Is that correct? If so, the comment should reflect
> that emphasis. Then, consider updating the flag unconditionally.

I took me a while to remember while I did it that way, but I think I have now.

I initialled wanted to add an Assert(!xml_error_occurred) to catch any
missing XML_REPORT_ERROR() calls. I seems to have forgotten to do that,
though...

So I guess I should either refrain from setting the flag as you suggested,
or add such an Assert(). I think I very slightly prefer the latter, what
do you think?

>> ! XML_PURPOSE_WELLFORMED /* Ignore non-parser errors, nothing else*/,
>> ! XML_PURPOSE_OTHER /* Report all errors */
>> ! } XmlPurposeType;
>
> This is currently about error reporting only. How about:
>
> typedef enum
> {
> XML_ER_NONE, /* no calls to libxml2 functions permitted */
> XML_ER_WELLFORMED, /* report errors and warnings from the parser only */
> XML_ER_LEGACY, /* report all errors, no warnings */
> XML_ER_ALL /* report all errors and warnings */
> } XmlErrorReporting;

Sounds good. Will do.

>> ! extern bool xml_error_occured;
>> !
>> ! #define XML_REPORT_ERROR(assertion,level,sqlcode,msg) \
>> ! if (pg_xml_erroroccurred() || !(assertion)) { \
>> ! xml_ereport(level, sqlcode, msg); \
>> ! } \
>> ! else { \
>> ! }
>
> Consider moving this definition into xml.c. It's more of an internal
> shorthand for keeping that file simple than part of the API.

The idea was that the xml stuff in contrib/ might one day be converted
to use that macro also. But yeah, should that really happen that macro
can easily be moved back to xml.h, so will fix.

>> ! extern void pg_xml_init(XmlPurposeType purpose);
>> ! extern void pg_xml_done(void);
>> ! extern bool pg_xml_erroroccurred(void);
>> extern void xml_ereport(int level, int sqlcode, const char *msg);
>> extern xmltype *xmlconcat(List *args);
>> extern xmltype *xmlelement(XmlExprState *xmlExpr, ExprContext *econtext);
>> diff --git a/src/test/regress/expected/xml.out b/src/test/regress/expected/xml.out
>> index eaa5a74..da2f290 100644
>> *** a/src/test/regress/expected/xml.out
>> --- b/src/test/regress/expected/xml.out
>> *************** INSERT INTO xmltest VALUES (3, '<wrong')
>> *** 8,14 ****
>> ERROR: invalid XML content
>> LINE 1: INSERT INTO xmltest VALUES (3, '<wrong');
>> ^
>> ! DETAIL: Entity: line 1: parser error : Couldn't find end of Start Tag wrong line 1
>> <wrong
>> ^
>> SELECT * FROM xmltest;
>> --- 8,14 ----
>> ERROR: invalid XML content
>> LINE 1: INSERT INTO xmltest VALUES (3, '<wrong');
>> ^
>> ! DETAIL: line 1: Couldn't find end of Start Tag wrong line 1
>
> Any reason to change the error text this way?

The "Entity:" prefix is added by libxml only for messages reported
to the generic error handler. It *doesn't* refer to entities as defined
in xml, but instead used in place of the file name if libxml
doesn't have that at hand (because it's parsing from memory).

So that "Entity:" prefix really doesn't have any meaning whatsoever.

I believe that the compatibility risk is pretty low here, and removing
the meaningless prefix makes the error message are bit nicer to read.
But if you are concerned that there maybe applications out there who
parse the error text, we could of course add it back. I must say that
I don't really know what our policy regarding error message stability is...

I'll try to produce an updated patch within a day or two.

best regards,
Florian Pflug

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Florian Pflug 2011-06-20 19:19:04 Re: Range Types and extensions
Previous Message Kevin Grittner 2011-06-20 19:00:43 Re: [WIP] cache estimates, cache access cost