From: | Hannu Krosing <hannu(at)2ndQuadrant(dot)com> |
---|---|
To: | Andrew Dunstan <andrew(at)dunslane(dot)net> |
Cc: | Hannu Krosing <hannu(at)2ndQuadrant(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: JSON and unicode surrogate pairs |
Date: | 2013-06-11 13:58:02 |
Message-ID: | 51B72CEA.4000809@2ndQuadrant.com |
Lists: | pgsql-hackers |
On 06/11/2013 03:42 PM, Andrew Dunstan wrote:
>
> On 06/11/2013 09:16 AM, Hannu Krosing wrote:
>
>
>>>
>>> It's a pity that we don't have a non-error producing conversion
>>> function
>>> (or if we do that I haven't found it). Then we might adopt a rule for
>>> processing
>>> unicode escapes that said "convert unicode escapes to the database
>>> encoding
>> only when extracting JSON keys or values to text does it make sense to
>> unescape
>> to database encoding.
>
> That's exactly the scenario we are talking about. When emitting JSON
> the functions have always emitted unicode escapes as they are in the
> text, and will continue to do so.
>
>>
>> strings inside JSON itself are by definition utf8
>
>
> We have deliberately extended that to allow JSON strings to be in any
> database server encoding.
Ugh!
Does that imply that we do not just "allow" it, but rather "require" it?
Why are we arguing about "unicode surrogate pairs" as a "JSON thing" then?
Should it not be a "client to server encoding conversion thing" instead?
> That was argued back in the 9.2 timeframe and I am not interested in
> re-litigating it.
>
> The only issue at hand is how to handle unicode escapes (which in
> their string form are pure ASCII) when emitting text strings.
Unicode escapes in non-unicode strings seem to be something that is
ill-defined by nature ;)
That is, you can't come up with a good general answer for this.
>>> if possible, and if not then emit them unchanged." which might be a
>>> reasonable
>>> compromise.
>> I'd opt for "... and if not then emit them quoted". The default should
>> be not losing
>> any data.
>>
>>
>>
>
>
> I don't know what this means at all. Quoted how? Let's say I have a
> Latin1 database and have the following JSON string: "\u20AC2.00". In a
> UTF8 database the text representation of this is €2.00 - what are you
> saying it should be in the Latin1 database?
utf8-quote the '€' - "\u20AC2.00"
That is, convert unicode --> Latin1 wherever a corresponding character
exists, and utf8-quote (keep as a \uXXXX escape) anything that has none.
If we allow unicode escapes in non-unicode strings anyway, then this
seems the most logical thing to do.
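
To make the proposed rule concrete, here is a minimal sketch in Python. It is
only an illustration and not PostgreSQL code: the function name
unescape_for_encoding is hypothetical, Python codec names stand in for the
database encoding, and the input is assumed to be the body of a JSON string
whose other escapes (including escaped backslashes) are handled elsewhere.
Surrogate pairs are combined before the conversion is attempted.

```python
import re

# Matches either a high+low surrogate pair of \uXXXX escapes,
# or a single \uXXXX escape.
_ESCAPE = re.compile(
    r'\\u([dD][89abAB][0-9a-fA-F]{2})\\u([dD][c-fC-F][0-9a-fA-F]{2})'
    r'|\\u([0-9a-fA-F]{4})'
)


def unescape_for_encoding(s, encoding):
    """Replace \\uXXXX escapes with the real character when `encoding`
    can represent it; otherwise leave the escape text unchanged."""

    def repl(m):
        if m.group(1):
            # Combine the surrogate pair into one code point.
            hi = int(m.group(1), 16)
            lo = int(m.group(2), 16)
            ch = chr(0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00))
        else:
            cp = int(m.group(3), 16)
            if 0xD800 <= cp <= 0xDFFF:
                # Lone surrogate: not a character, keep the escape.
                return m.group(0)
            ch = chr(cp)
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            # No correspondence in the target encoding: keep it quoted.
            return m.group(0)
        return ch

    return _ESCAPE.sub(repl, s)


print(unescape_for_encoding(r'\u20AC2.00', 'utf-8'))
print(unescape_for_encoding(r'\u20AC2.00', 'latin-1'))
```

With 'utf-8' the euro sign is unescaped; with 'latin-1' it has no
correspondence, so the escape is emitted as it was and no data is lost.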
>
> cheers
>
> andrew
>
>
--
Hannu Krosing
PostgreSQL Consultant
Performance, Scalability and High Availability
2ndQuadrant Nordic OÜ