
chr() is still too loose about UTF8 code points

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	pgsql-hackers(at)postgreSQL(dot)org
Subject:	chr() is still too loose about UTF8 code points
Date:	2014-05-16 15:05:08
Message-ID:	4709.1400252708@sss.pgh.pa.us
Lists:	pgsql-hackers

Quite some time ago, we made the chr() function accept Unicode code points
up to U+1FFFFF, which is the largest value that will fit in a 4-byte UTF8
string. It was pointed out to me though that RFC3629 restricted the
original definition of UTF8 to only allow code points up to U+10FFFF (for
compatibility with UTF16). While that might not be something we feel we
need to follow exactly, pg_utf8_islegal implements the checking algorithm
specified by RFC3629, and will therefore reject points above U+10FFFF.

This means you can use chr() to create values that will be rejected on
dump and reload:

u8=# create table tt (f1 text);
CREATE TABLE
u8=# insert into tt values(chr('x001fffff'::bit(32)::int));
INSERT 0 1
u8=# select * from tt;
f1
----

(1 row)

u8=# \copy tt to 'junk'
COPY 1
u8=# \copy tt from 'junk'
ERROR: 22021: invalid byte sequence for encoding "UTF8": 0xf7 0xbf 0xbf 0xbf
CONTEXT: COPY tt, line 1
LOCATION: report_invalid_encoding, wchar.c:2011
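The bytes in that error message are what a plain 4-byte UTF-8 encoding of 0x1FFFFF produces. The following is a minimal sketch of the generic bit layout, not the actual PostgreSQL code; the function name sketch_utf8_encode is hypothetical, and it deliberately omits any RFC 3629 range check (and any surrogate check) so the out-of-range result is visible.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (an assumption for illustration, not PostgreSQL code):
 * encode a code point with the generic 1- to 4-byte UTF-8 bit layout.
 * No RFC 3629 range check is applied here.
 * Returns the number of bytes written, or 0 if cp needs more than 21 bits.
 */
size_t
sketch_utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp <= 0x7F)
    {
        out[0] = (unsigned char) cp;
        return 1;
    }
    if (cp <= 0x7FF)
    {
        out[0] = (unsigned char) (0xC0 | (cp >> 6));
        out[1] = (unsigned char) (0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp <= 0xFFFF)
    {
        out[0] = (unsigned char) (0xE0 | (cp >> 12));
        out[1] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char) (0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x1FFFFF)
    {
        /* 3 payload bits in the lead byte + 3 * 6 bits = 21 bits total */
        out[0] = (unsigned char) (0xF0 | (cp >> 18));
        out[1] = (unsigned char) (0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char) (0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

With this layout, 0x1FFFFF encodes as 0xf7 0xbf 0xbf 0xbf, matching the rejected sequence above, while the RFC 3629 maximum 0x10FFFF encodes as 0xf4 0x8f 0xbf 0xbf.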

I think this probably means we need to change chr() to reject code points
above U+10FFFF. Should we back-patch that, or just do it in HEAD?
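For illustration only, the kind of range check being proposed might look like the sketch below. The names SKETCH_MAX_UNICODE and sketch_chr_arg_ok are hypothetical and are not taken from the PostgreSQL source; how chr() would actually report the error is left out.

```c
#include <stdbool.h>
#include <stdint.h>

/* Upper limit on code points allowed by RFC 3629 (name is hypothetical) */
#define SKETCH_MAX_UNICODE 0x10FFFF

/*
 * Hypothetical check (an assumption, not PostgreSQL code): returns true
 * only when the requested code point is within the RFC 3629 range, so a
 * caller could reject anything larger before building the UTF-8 string.
 */
bool
sketch_chr_arg_ok(uint32_t cp)
{
    return cp <= SKETCH_MAX_UNICODE;
}
```

Under this check the value used in the example, 0x1FFFFF, would be refused up front instead of producing a string that fails on reload.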

regards, tom lane

Responses

Re: chr() is still too loose about UTF8 code points at 2014-05-16 16:43:55 from Heikki Linnakangas
Re: chr() is still too loose about UTF8 code points at 2014-05-16 17:39:09 from Noah Misch
