Re: JSON and unicode surrogate pairs

From: Hannu Krosing <hannu(at)2ndQuadrant(dot)com>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: Hannu Krosing <hannu(at)2ndQuadrant(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: JSON and unicode surrogate pairs
Date: 2013-06-11 13:58:02
Message-ID: 51B72CEA.4000809@2ndQuadrant.com
Lists: pgsql-hackers

On 06/11/2013 03:42 PM, Andrew Dunstan wrote:
>
> On 06/11/2013 09:16 AM, Hannu Krosing wrote:
>
>
>>>
>>> It's a pity that we don't have a non-error producing conversion
>>> function
>>> (or if we do that I haven't found it). Then we might adopt a rule for
>>> processing
>>> unicode escapes that said "convert unicode escapes to the database
>>> encoding
>> only when extracting JSON keys or values to text does it make sense to
>> unescape
>> to database encoding.
>
> That's exactly the scenario we are talking about. When emitting JSON
> the functions have always emitted unicode escapes as they are in the
> text, and will continue to do so.
>
>>
>> strings inside JSON itself are by definition utf8
>
>
> We have deliberately extended that to allow JSON strings to be in any
> database server encoding.
Ugh!

Does that imply that we do not just "allow" it, but rather "require" it?

Why are we arguing about "unicode surrogate pairs" as a "JSON thing" then?

Should it not be a "client to server encoding conversion thing" instead?

> That was argued back in the 9.2 timeframe and I am not interested in
> re-litigating it.
>
> The only issue at hand is how to handle unicode escapes (which in
> their string form are pure ASCII) when emitting text strings.
Unicode escapes in non-unicode strings seem to be something that is
ill-defined by nature ;)

That is, you can't come up with a good general answer for this.
>>> if possible, and if not then emit them unchanged." which might be a
>>> reasonable
>>> compromise.
>> I'd opt for "... and if not then emit them quoted". The default should
>> be not losing
>> any data.
>>
>>
>>
>
>
> I don't know what this means at all. Quoted how? Let's say I have a
> Latin1 database and have the following JSON string: "\u20AC2.00". In a
> UTF8 database the text representation of this is €2.00 - what are you
> saying it should be in the Latin1 database?

Keep the '€' as a unicode escape - "\u20AC2.00"

That is, convert unicode-->Latin1 for whatever has a corresponding
character, and keep anything that does not as a \uXXXX escape.

If we allow unicode escapes in non-unicode strings anyway, then this
seems the most logical thing to do.
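
To make the rule concrete, here is a rough sketch in Python. It is only an
illustration of the behaviour I am proposing, not how the server code works
or would have to work. The function name unescape_for_encoding is made up,
Python codec names stand in for database encodings, and the input is assumed
to be the contents of a JSON string with the surrounding quotes already
removed.

```python
import re

# One backslash escape: a \uXXXX\uXXXX surrogate pair, a single \uXXXX,
# or any other escaped character (so that "\\u20AC" is not misread).
_ESCAPE = re.compile(
    r'\\(?:u([dD][89abAB][0-9a-fA-F]{2})\\u([dD][c-fC-F][0-9a-fA-F]{2})'
    r'|u([0-9a-fA-F]{4})'
    r'|(.))'
)


def unescape_for_encoding(s, encoding):
    """Hypothetical helper: replace unicode escapes by the real character
    when `encoding` can represent it, otherwise keep the escape as is."""

    def repl(m):
        if m.group(4) is not None:
            # Some other escape such as \n or \\ : not our business here.
            return m.group(0)
        if m.group(1) is not None:
            # Combine a high/low surrogate pair into one code point.
            hi, lo = int(m.group(1), 16), int(m.group(2), 16)
            cp = 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)
        else:
            cp = int(m.group(3), 16)
            if 0xD800 <= cp <= 0xDFFF:
                # Lone surrogate: no character to convert to, keep it.
                return m.group(0)
        ch = chr(cp)
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            # No correspondence in the target encoding: keep the escape,
            # so no data is lost.
            return m.group(0)
        return ch

    return _ESCAPE.sub(repl, s)


print(unescape_for_encoding(r'\u20AC2.00', 'utf-8'))
print(unescape_for_encoding(r'\u20AC2.00', 'latin-1'))
print(unescape_for_encoding(r'caf\u00e9 \u20AC2.00', 'latin-1'))
```

With a UTF8 target the first call gives the plain euro sign followed by
2.00. With a Latin1 target the \u20AC escape is left alone, while \u00e9
in the last call is converted because Latin1 has that character.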

>
> cheers
>
> andrew
>
>

--
Hannu Krosing
PostgreSQL Consultant
Performance, Scalability and High Availability
2ndQuadrant Nordic OÜ
