
Re: chr() is still too loose about UTF8 code points

From:	Andrew Dunstan <andrew(at)dunslane(dot)net>
To:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgreSQL(dot)org
Subject:	Re: chr() is still too loose about UTF8 code points
Date:	2014-05-16 17:11:08
Message-ID:	537646AC.2020201@dunslane.net
Lists:	pgsql-hackers

On 05/16/2014 12:43 PM, Heikki Linnakangas wrote:
> On 05/16/2014 06:05 PM, Tom Lane wrote:
>> Quite some time ago, we made the chr() function accept Unicode code points
>> up to U+1FFFFF, which is the largest value that will fit in a 4-byte UTF8
>> string. It was pointed out to me though that RFC3629 restricted the
>> original definition of UTF8 to only allow code points up to U+10FFFF (for
>> compatibility with UTF16). While that might not be something we feel we
>> need to follow exactly, pg_utf8_islegal implements the checking algorithm
>> specified by RFC3629, and will therefore reject points above U+10FFFF.
>>
>> This means you can use chr() to create values that will be rejected on
>> dump and reload:
>>
>> u8=# create table tt (f1 text);
>> CREATE TABLE
>> u8=# insert into tt values(chr('x001fffff'::bit(32)::int));
>> INSERT 0 1
>> u8=# select * from tt;
>> f1
>> ----
>>
>> (1 row)
>>
>> u8=# \copy tt to 'junk'
>> COPY 1
>> u8=# \copy tt from 'junk'
>> ERROR: 22021: invalid byte sequence for encoding "UTF8": 0xf7 0xbf 0xbf 0xbf
>> CONTEXT: COPY tt, line 1
>> LOCATION: report_invalid_encoding, wchar.c:2011
>>
>> I think this probably means we need to change chr() to reject code points
>> above 10ffff. Should we back-patch that, or just do it in HEAD?
>
> +1 for back-patching. A value that cannot be restored is bad, and I
> can't imagine any legitimate use case for producing a Unicode
> character larger than U+10FFFF with chr(x), when the rest of the
> system doesn't handle it. Fully supporting such values might be
> useful, but that's a different story.
>
>

My understanding is that U+10FFFF is the highest legal Unicode code
point anyway. So this is really just tightening our routines to make
sure we don't produce an invalid value. We won't be disallowing anything
that is legal Unicode.
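
To illustrate, here is a minimal sketch of the kind of range check being
discussed. It is not the actual chr() code or a proposed patch; the names
encode_utf8_checked and MAX_UNICODE_CODE_POINT are made up for this example,
and the sketch only enforces the upper bound (other RFC3629 rules, such as
the surrogate range, are left out). Without the bound check, 0x1FFFFF would
encode as 0xf7 0xbf 0xbf 0xbf, which is the sequence rejected in the COPY
example above.

```c
#include <stddef.h>
#include <stdint.h>

/* Highest code point RFC3629 allows in UTF8 (name is hypothetical). */
#define MAX_UNICODE_CODE_POINT 0x10FFFF

/*
 * Hypothetical helper: encode cp as UTF8 into buf, which must have room
 * for at least 4 bytes.  Returns the number of bytes written, or 0 if cp
 * is above U+10FFFF and is therefore rejected.
 *
 * Only the upper bound is checked here; this sketch does not attempt the
 * other validity rules that RFC3629 specifies.
 */
static size_t
encode_utf8_checked(uint32_t cp, unsigned char *buf)
{
    if (cp > MAX_UNICODE_CODE_POINT)
        return 0;               /* would not survive dump and reload */

    if (cp <= 0x7F)
    {
        buf[0] = (unsigned char) cp;
        return 1;
    }
    if (cp <= 0x7FF)
    {
        buf[0] = (unsigned char) (0xC0 | (cp >> 6));
        buf[1] = (unsigned char) (0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp <= 0xFFFF)
    {
        buf[0] = (unsigned char) (0xE0 | (cp >> 12));
        buf[1] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
        buf[2] = (unsigned char) (0x80 | (cp & 0x3F));
        return 3;
    }

    /* U+10000 .. U+10FFFF: four bytes, lead byte at most 0xF4 */
    buf[0] = (unsigned char) (0xF0 | (cp >> 18));
    buf[1] = (unsigned char) (0x80 | ((cp >> 12) & 0x3F));
    buf[2] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
    buf[3] = (unsigned char) (0x80 | (cp & 0x3F));
    return 4;
}
```

With a check like this, U+10FFFF still encodes (as 0xf4 0x8f 0xbf 0xbf), and
anything from 0x110000 upward is refused before a byte sequence is built.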

cheers

andrew

In response to

Re: chr() is still too loose about UTF8 code points at 2014-05-16 16:43:55 from Heikki Linnakangas
