From: | Marko Kreen <markokr(at)gmail(dot)com> |
---|---|
To: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
Cc: | Sam Mason <sam(at)samason(dot)me(dot)uk>, pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: Unicode string literals versus the world |
Date: | 2009-04-16 15:50:30 |
Message-ID: | e51f66da0904160850p36636d7dja68e6280d77f00f1@mail.gmail.com |
Lists: | pgsql-hackers |
On 4/16/09, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Sam Mason <sam(at)samason(dot)me(dot)uk> writes:
> > I'd never heard of UTF-16 surrogate pairs before this discussion and
> > hence didn't realise that it's valid to have a surrogate pair in place
> > of a single code point. The docs say that <D800 DF02> corresponds to
> > U+10302, Python would appear to follow my intuitions in that:
>
> > ord(u'\uD800\uDF02')
>
> > results in an error instead of giving back 66306, as I'd expect. Is
> > this a bug in Python, my understanding, or something else?
>
>
> I might be wrong, but I think surrogate pairs are expressly forbidden in
> all representations other than UTF16/UCS2. We definitely forbid them
> when validating UTF-8 strings --- that's per an RFC recommendation.
> It sounds like Python is doing the same.
The point here is that Python/Java/C# accept surrogate pairs in string
escapes as a way to write non-BMP Unicode values, irrespective of their
internal encoding.
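
For reference, a minimal sketch of the arithmetic that maps a surrogate pair to its code point, using Sam's <D800 DF02> example. The helper name combine_surrogates is made up for illustration and is not part of any library; the formula is the standard UTF-16 one.

```python
def combine_surrogates(high, low):
    """Map a UTF-16 surrogate pair to the code point it represents.

    Hypothetical helper for illustration only.
    """
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError("not a high (leading) surrogate: %#x" % high)
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError("not a low (trailing) surrogate: %#x" % low)
    # Each surrogate carries 10 bits; the result is offset by 0x10000
    # because the pair only encodes code points above the BMP.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


code_point = combine_surrogates(0xD800, 0xDF02)
print(hex(code_point), code_point)  # 0x10302 66306
```

An escape parser that wants to accept the pair form only needs this step when it sees a high-surrogate escape immediately followed by a low-surrogate escape; the internal encoding of the resulting string does not matter.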
--
marko