Re: Another issue with invalid XML values

From: Florian Pflug <fgp(at)phlo(dot)org>
To: Noah Misch <noah(at)leadboat(dot)com>
Cc: PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Another issue with invalid XML values
Date: 2011-06-22 10:47:42
Message-ID: 8F927318-CA92-42CD-996F-3ED71C3E51E2@phlo.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi

Updated patch attached.

On Jun20, 2011, at 22:45 , Noah Misch wrote:
> On Mon, Jun 20, 2011 at 09:15:51PM +0200, Florian Pflug wrote:
>> On Jun20, 2011, at 19:57 , Noah Misch wrote:
>>> I tested this patch using two systems, a CentOS 5 box with
>>> libxml2-2.6.26-2.1.2.8.el5_5.1 and an Ubuntu 8.04 box with libxml2
>>> 2.6.31.dfsg-2ubuntu1.6. Both failed to build with this error:
>>>
>>> xml.c: In function `pg_xml_init':
>>> xml.c:934: error: `xmlStructuredErrorContext' undeclared (first use in this function)
>>> xml.c:934: error: (Each undeclared identifier is reported only once
>>> xml.c:934: error: for each function it appears in.)
>>> make[4]: *** [xml.o] Error 1
>>>
>>> It seems `xmlStructuredErrorContext' was added rather recently. It's not
>>> obvious which version added it, but 2.7.8 has it and 2.7.2 does not. Looking
>>> at 2.6.26's sources, that version seems to use `xmlGenericErrorContext' for
>>> both structured and generic handler functions. Based on that, I tried with a
>>> change from `xmlStructuredErrorContext' to `xmlGenericErrorContext' in our
>>> xml.c. I ran the test suite and hit some failures in the xml test; see
>>> attached regression.diffs. I received warnings where the tests expected
>>> nothing or expected errors.
>>
>> Great :-(. I wonder if maybe I should simply remove the part of the patch
>> which try to restore the error handler and context in pg_xml_done(). I've
>> added that because I feared that some 3rd-party libraries setup their libxml
>> error handle just once, not before every library class. The code previously
>> didn't care about this either, but since we're using structured instead of
>> generic error reporting now, we'd now collide with 3-rd party libs which
>> also use structured error handling, whereas previously that would have worked.
>> OTOH, once there's more than one such library...
>>
>> Whats your stand on this?
>
> Assuming it's sufficient to save/restore xmlStructuredErrorContext when it's
> available and xmlGenericErrorContext otherwise, I would just add an Autoconf
> check to identify which one to use.

It seems that before xmlStructuredErrorContext was introduced, the structured
and the generic error handler shared an error context, so doing what you
suggested looks sensible.

I don't have any autoconf-fu whatsoever, though, so I'm not 100% certain I did
this the right way. I basically copied a similar check (for setlongjump AFAIR)
and adapted it to check for xmlStructuredErrorContext instead.

>>> Consider renaming XML_REPORT_ERROR it to something like XML_CHECK_ERROR,
>>> emphasizing that it wraps a conditional.
>>
>> Hm, I think it was named XML_CHECK_ERROR at one point, and I changed it
>> because I felt it didn't convey the fact that it's actually ereport()
>> and not just return false on error. But I agree that XML_REPORT_ERROR isn't
>> very self-explanatory, either. What about XML_REPORT_PENDING_ERROR()
>> or XML_CHECK_AND_REPORT_ERROR()?
>
> XML_CHECK_AND_REPORT_ERROR() seems fine. Or XML_CHECK_AND_EREPORT()?

I ended up picking XML_CHECK_AND_EREPORT().

>>> It's odd to have a purpose of "NONE" that pg_xml_init() callers can actually
>>> use. How about XML_PURPOSE_TRIVIAL, for callers that only call libxml2
>>> functions that can't error? Also, I think it would be better to require a
>>> call to pg_xml_done() in all cases, just to keep things simple. Actually, I
>>> wonder if it's worth having XML_PURPOSE_TRIVIAL for the one user thereof. It
>>> doesn't save much.
>>
>> Yeah, that name really is a perfect example of horrible naming ;-)
>>
>> The pg_xml_done()-not-required semantics are necessary though. The problem is
>> that parse_xml_decl() is sometimes called after the caller has already called
>> pg_xml_init() himself. Not always though, so one cannot simply remove the
>> pg_xml_init() call from parse_xml_decl() - it might then use libxml without
>> it being initialized. So I made pg_xml_init() skip the error-handler setup
>> when purpose == XML_PURPOSE_NONE. This requires then, however, that pg_xml_done()
>> is *not* called, because it'd restore settings that were never saved (well,
>> actually it'd tripe an assert that protects against that...).
>>
>> But yeah, this is all quite contorted. Maybe pg_xml_init() should simply be
>> split into two functions, one which initialized the library and one which
>> sets up error handling. That means that two calls instead of one are required,
>> but it makes the purpose of the individual functions much clearer...
>
> Oh, tricky. That's an option, and the error-handler-initialization function
> could just call the more-primitive one, so most callers would not need to
> know. Another way is to make xml_error_initialized a counter. If it's called
> a second time, increment and do nothing more. Only unregister the handler
> when it's decremented to zero. No strong preference here.

After some pondering, I picked your first suggestion, i.e. factored
pg_xml_init_library() out of pg_xml_init() and made the one caller who
used XML_PURPOSE_NONE use that instead().

>>>> ! XML_PURPOSE_LEGACY /* Save error message only, don't set error flag */,
>>>
>>> It's fine to set the flag for legacy users, considering they could just not
>>> read it. The important distinction seems to be that we don't emit warnings or
>>> notices in this case. Is that correct? If so, the comment should reflect
>>> that emphasis. Then, consider updating the flag unconditionally.
>>
>> I took me a while to remember while I did it that way, but I think I have now.
>>
>> I initialled wanted to add an Assert(!xml_error_occurred) to catch any
>> missing XML_REPORT_ERROR() calls. I seems to have forgotten to do that,
>> though...
>>
>> So I guess I should either refrain from setting the flag as you suggested,
>> or add such an Assert(). I think I very slightly prefer the latter, what
>> do you think?
>
> I do like the idea of that assert. How about setting the flag anyway, but
> making it "Assert(xml_purpose == XML_PURPOSE_LEGACY || !xml_error_occurred)"?

I added the Assert, but didn't make the setting of the error flag unconditional.

If I did that, XML_CHECK_AND_EREPORT() would stop working for the LEGACY
case. Now, that case currently isn't exercised, but breaking nevertheless
seemed wrong to me.

>>>> *** a/src/test/regress/expected/xml.out
>>>> --- b/src/test/regress/expected/xml.out
>>>> *************** INSERT INTO xmltest VALUES (3, '<wrong')
>>>> *** 8,14 ****
>>>> ERROR: invalid XML content
>>>> LINE 1: INSERT INTO xmltest VALUES (3, '<wrong');
>>>> ^
>>>> ! DETAIL: Entity: line 1: parser error : Couldn't find end of Start Tag wrong line 1
>>>> <wrong
>>>> ^
>>>> SELECT * FROM xmltest;
>>>> --- 8,14 ----
>>>> ERROR: invalid XML content
>>>> LINE 1: INSERT INTO xmltest VALUES (3, '<wrong');
>>>> ^
>>>> ! DETAIL: line 1: Couldn't find end of Start Tag wrong line 1
>>>
>>> Any reason to change the error text this way?
>>
>> The "Entity:" prefix is added by libxml only for messages reported
>> to the generic error handler. It *doesn't* refer to entities as defined
>> in xml, but instead used in place of the file name if libxml
>> doesn't have that at hand (because it's parsing from memory).
>>
>> So that "Entity:" prefix really doesn't have any meaning whatsoever.
>
> So xmlParserPrintFileContext() sends different content to the generic error
> handler from what "natural" calls to the handler would send?

xmlParserPrintFileContext() only generates the "context" part of the error
message. In the example above, these are the two lines
<wrong
^
These lines don't contain the "Entity:" prefix - neither with the patch
attached nor without.

(Btw: Yeah, with generic error handling, libxml sometimes calls
the error handler multiple times for a single error, each time passing
parts of the whole error message. These "generic error handlers" really
aren't error handlers at all, but rather message output callbacks...)

>> I believe that the compatibility risk is pretty low here, and removing
>> the meaningless prefix makes the error message are bit nicer to read.
>> But if you are concerned that there maybe applications out there who
>> parse the error text, we could of course add it back. I must say that
>> I don't really know what our policy regarding error message stability is...
>
> If the libxml2 API makes it a major pain to preserve exact messages, I
> wouldn't worry.

It'd only require us to prepend "Entity: " to the message string, so
there's no pain there. The question is rather whether we want to continue
having a pure noise word in front of every libxml error message for
the sake of compatibility.

I vote "Nay", on the grounds that I estimate the actual breakage from
such a change to be approximately zero. Plus the fact that libxml
error messages aren't completely stable either. I had to suppress the
DETAIL output for one of the regression tests to make them work for
both 2.6.23 and 2.7.8; libxml chooses to quote an invalid namespace
URI in it's error message in one of these versions but not the other...

best regards,
Florian Pflug

Attachment Content-Type Size
pg_xml_errorhandling.v2.patch application/octet-stream 51.9 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Heikki Linnakangas 2011-06-22 10:55:35 Re: [COMMITTERS] pgsql: Make the visibility map crash-safe.
Previous Message Noah Misch 2011-06-22 10:18:08 Re: [v9.2] DROP Reworks Part.0 - 'missing_ok' support of get_object_address