Re: Initial Review: JSON contrib module was: Re: Another swing at JSON

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Joey Adams <joeyadams3(dot)14159(at)gmail(dot)com>
Cc: Bernd Helmle <mailings(at)oopsware(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Dimitri Fontaine <dimitri(at)2ndquadrant(dot)fr>, David Fetter <david(at)fetter(dot)org>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Initial Review: JSON contrib module was: Re: Another swing at JSON
Date: 2011-07-18 19:00:43
Message-ID: CA+TgmoZgKCRLTMd+vmAJh3Dgfayz4=GG19yue8U9nc-ckxK_Tw@mail.gmail.com
Lists: pgsql-hackers

On Fri, Jul 15, 2011 at 3:56 PM, Joey Adams <joeyadams3(dot)14159(at)gmail(dot)com> wrote:
> On Mon, Jul 4, 2011 at 10:22 PM, Joseph Adams
> <joeyadams3(dot)14159(at)gmail(dot)com> wrote:
>> I'll try to submit a revised patch within the next couple days.
>
> Sorry this is later than I said.
>
> I addressed the issues covered in the review.  I also fixed a bug
> where "\u0022" would become """, which is invalid JSON, causing an
> assertion failure.
>
> However, I want to put this back into WIP for a number of reasons:
>
>  * The current code accepts invalid surrogate pairs (e.g.
> "\uD800\uD800").  The problem with accepting them is that it would be
> inconsistent with PostgreSQL's Unicode support, and with the Unicode
> standard itself.  In my opinion: as long as the server encoding is
> universal (i.e. UTF-8), decoding a JSON-encoded string should not fail
> (barring data corruption and resource limitations).
>
>  * I'd like to go ahead with the parser rewrite I mentioned earlier.
> The new parser will be able to construct a parse tree when needed, and
> it won't use those overkill parsing macros.
>
>  * I recently learned that not all supported server encodings can be
> converted to Unicode losslessly.  The current code, on output,
> converts non-ASCII characters to Unicode escapes under some
> circumstances (see the comment above json_need_to_escape_unicode).
>
> I'm having a really hard time figuring out how the JSON module should
> handle non-Unicode character sets.  \uXXXX escapes in JSON literals
> can be used to encode characters not available in the server encoding.
>  On the other hand, the server encoding can encode characters not
> present in Unicode (see the third bullet point above).  This means
> JSON normalization and comparison (along with member lookup) are not
> possible in general.
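
Just for reference, the invalid-surrogate-pair case you mention comes
down to a pair of range checks: a lead surrogate (U+D800-U+DBFF) has
to be followed by a trail surrogate (U+DC00-U+DFFF), and the two then
combine into a single code point above U+FFFF.  A minimal sketch in C
(the helper name is made up for illustration; it's not from the
patch):

/*
 * Hypothetical helper: validate and combine a surrogate pair taken
 * from two consecutive \uXXXX escapes.  Returns the combined code
 * point, or -1 if the pair is invalid (e.g. "\uD800\uD800").
 */
static long
combine_surrogates(unsigned int lead, unsigned int trail)
{
    if (lead < 0xD800 || lead > 0xDBFF)
        return -1;              /* not a lead surrogate */
    if (trail < 0xDC00 || trail > 0xDFFF)
        return -1;              /* lead not followed by a trail surrogate */
    return 0x10000 + (((long) (lead - 0xD800) << 10) | (trail - 0xDC00));
}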

I previously suggested that, instead of trying to implement JSON, you
should just try to implement
JSON-without-the-restriction-that-everything-must-be-UTF8. Most
people are going to be using UTF-8 simply because it's the default,
and if you forget about transcoding then ISTM that this all becomes a
lot simpler. We don't, in general, have the ability to support data
in multiple encodings inside PostgreSQL, and it seems to me that by
trying to invent a mechanism for making that work as part of this
patch, you are setting the bar for yourself awfully high.

One thing to think about here is that transcoding between UTF-8 and
the server encoding seems like the wrong thing all around. After all,
the user does not want the data in the server encoding; they want it
in their chosen client encoding. If you are transcoding between UTF-8
and the server encoding, then that suggests that there's some
double-transcoding going on here, which creates additional
opportunities for (1) inefficiency and (2) outright failure. I'm
guessing that's because you're dealing with an interface that expects
the internal representation of the datum on one side and the server
encoding on the other side, which gets back to the point in the
preceding paragraph. You'd probably need to revise that interface in
order to make this really work the way it should, and that might be
more than you want to get into. At any rate, it probably is a
separate project from making JSON work.
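
Just to make the double-conversion point concrete, here's a rough
sketch of the round trip I'm worried about.  pg_do_encoding_conversion
and GetDatabaseEncoding are the existing backend calls; the function
itself is made up for illustration and is not from your patch:

#include "postgres.h"
#include "mb/pg_wchar.h"

/*
 * Illustrative only: resolving \uXXXX escapes forces a conversion to
 * UTF-8, and handing the result back to the rest of the system forces
 * a second conversion to the server encoding, before the usual
 * server-to-client conversion happens on output.
 */
static char *
json_roundtrip_sketch(char *input, int len)
{
    char   *as_utf8;
    char   *back_to_server;

    /* conversion #1: server encoding -> UTF-8 */
    as_utf8 = (char *) pg_do_encoding_conversion((unsigned char *) input,
                                                  len,
                                                  GetDatabaseEncoding(),
                                                  PG_UTF8);

    /* ... decode escapes, normalize, etc. ... */

    /* conversion #2: UTF-8 -> server encoding */
    back_to_server = (char *) pg_do_encoding_conversion((unsigned char *) as_utf8,
                                                         strlen(as_utf8),
                                                         PG_UTF8,
                                                         GetDatabaseEncoding());
    return back_to_server;
}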

If in spite of the above you're bent on continuing down your present
course, then it seems to me that you'd better make the on-disk
representation UTF-8, with all \uXXXX escapes converted to the
corresponding characters. If you hit an invalid surrogate pair, or a
character that exists in the server encoding but not UTF-8, it's not a
legal JSON object and you throw an error on input, just as you would
for mismatched braces or similar. On output, you should probably just
use \uXXXX to represent any unrepresentable characters - i.e. option 3
from your original list. That may be slow, but I think that it's not
really worth devoting a lot of mental energy to this case. Most
people are going to be using UTF-8 because that's the default, and
those who are not shouldn't expect a data format built around UTF-8 to
work perfectly in their environment, especially if they insist on
using characters that are representable in only some of the encodings
they are using.
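
To be concrete about what option 3 could look like on the output
side, here's a rough sketch.  The function name and the little UTF-8
decoding loop are made up for illustration, and it assumes the stored
text is already valid UTF-8: printable ASCII passes through verbatim,
and everything else becomes \uXXXX, with a surrogate pair for code
points above U+FFFF.

#include "postgres.h"
#include "lib/stringinfo.h"

/*
 * Illustrative sketch: escape a valid UTF-8 string so the result is
 * pure-ASCII JSON, using \uXXXX (and surrogate pairs above U+FFFF)
 * for everything that is not printable ASCII.
 */
static void
escape_json_ascii(StringInfo buf, const unsigned char *s, int len)
{
    int         i = 0;

    while (i < len)
    {
        unsigned long cp;
        int           nbytes;

        /* minimal UTF-8 decode, assuming the input is already valid */
        if (s[i] < 0x80)
        {
            cp = s[i];
            nbytes = 1;
        }
        else if ((s[i] & 0xE0) == 0xC0)
        {
            cp = ((s[i] & 0x1F) << 6) | (s[i + 1] & 0x3F);
            nbytes = 2;
        }
        else if ((s[i] & 0xF0) == 0xE0)
        {
            cp = ((s[i] & 0x0F) << 12) | ((s[i + 1] & 0x3F) << 6)
                | (s[i + 2] & 0x3F);
            nbytes = 3;
        }
        else
        {
            cp = ((s[i] & 0x07) << 18) | ((s[i + 1] & 0x3F) << 12)
                | ((s[i + 2] & 0x3F) << 6) | (s[i + 3] & 0x3F);
            nbytes = 4;
        }

        if (cp >= 0x20 && cp < 0x7F && cp != '"' && cp != '\\')
            appendStringInfoChar(buf, (char) cp);
        else if (cp <= 0xFFFF)
            appendStringInfo(buf, "\\u%04lX", cp);
        else
        {
            /* supplementary plane: emit a UTF-16 surrogate pair */
            cp -= 0x10000;
            appendStringInfo(buf, "\\u%04lX\\u%04lX",
                             0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF));
        }
        i += nbytes;
    }
}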

But, again, why not just forget about transcoding and define it as
"JSON, if you happen to be using UTF-8 as the server encoding, and
otherwise some variant of JSON that uses the server encoding as its
native format"?  It seems to me that that would be a heck of a lot
simpler and more reliable, and I'm not sure it's any less useful in
practice.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
