Re: JSON and unicode surrogate pairs

From: Noah Misch <noah(at)leadboat(dot)com>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: JSON and unicode surrogate pairs
Date: 2013-06-11 22:26:52
Message-ID: 20130611222652.GA577456@tornado.leadboat.com
Lists: pgsql-hackers

On Tue, Jun 11, 2013 at 02:10:45PM -0400, Andrew Dunstan wrote:
>
> On 06/10/2013 11:22 PM, Noah Misch wrote:
>> On Mon, Jun 10, 2013 at 11:20:13AM -0400, Andrew Dunstan wrote:
>>> On 06/10/2013 10:18 AM, Tom Lane wrote:
>>>> Andrew Dunstan <andrew(at)dunslane(dot)net> writes:
>>>>> After thinking about this some more, I have come to the conclusion that
>>>>> we should only de-escape \uxxxx sequences, whether or not they are for
>>>>> BMP characters, when the server encoding is utf8. Any other encoding is
>>>>> already a violation of the JSON standard anyway and should be avoided if
>>>>> you're dealing with JSON, so there we should just pass the sequences
>>>>> through, even in text output. This will be a simple and very localized fix.
>>>> Hmm. I'm not sure that users will like this definition --- it will seem
>>>> pretty arbitrary to them that conversion of \u sequences happens in some
>>>> databases and not others.
>> Yep. Suppose you have a LATIN1 database. Changing it to a UTF8 database
>> where everyone uses client_encoding = LATIN1 should not change the semantics
>> of successful SQL statements. Some statements that fail with one database
>> encoding will succeed in the other, but a user should not witness a changed
>> non-error result. (Except functions like decode() that explicitly expose byte
>> representations.) Having "SELECT '["\u00e4"]'::json ->> 0" emit 'ä' in the
>> UTF8 database and '\u00e4' in the LATIN1 database would move PostgreSQL in the
>> wrong direction relative to that ideal.
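
To make that concrete, here is a sketch of the two results at issue (this is
the proposal's hypothetical behavior, not current code):

    -- In a UTF8 database: the escape would be decoded.
    SELECT '["\u00e4"]'::json ->> 0;   -- 'ä'

    -- In a LATIN1 database: the escape would be passed through verbatim.
    SELECT '["\u00e4"]'::json ->> 0;   -- '\u00e4'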

> As a final counterexample, let me note that Postgres itself handles
> Unicode escapes differently in UTF8 databases - in other databases it
> only accepts Unicode escapes up to U+007f, i.e. ASCII characters.
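
For reference, that existing behavior looks roughly like this (a sketch; the
exact error wording varies across versions):

    -- Accepted under any server encoding (code point <= U+007F):
    SELECT E'\u0041';   -- 'A'

    -- Accepted only when the server encoding is UTF8:
    SELECT E'\u00e4';   -- 'ä' in a UTF8 database; an error under other encodings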

I don't see a counterexample there; every database that accepts a given Unicode
escape without error produces the same text value from it. The proposal to
which I objected was akin to having non-UTF8 databases silently translate
E'\u0220' to E'\\u0220'.
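
To spell the analogy out (the second behavior is hypothetical, not anything in
the tree):

    -- Today, a LATIN1 database rejects the escape outright:
    SELECT E'\u0220';    -- ERROR (escapes above 007F are rejected when the
                         -- server encoding is not UTF8)

    -- The objected-to behavior would instead succeed and silently return the
    -- six characters '\u0220', exactly as if the input had been:
    SELECT E'\\u0220';   -- '\u0220'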

--
Noah Misch
EnterpriseDB http://www.enterprisedb.com
