From: | Sam Mason <sam(at)samason(dot)me(dot)uk> |
---|---|
To: | pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: Unicode string literals versus the world |
Date: | 2009-04-16 16:08:08 |
Message-ID: | 20090416160808.GO12225@frubble.xen.chris-lamb.co.uk |
Lists: | pgsql-hackers |
On Thu, Apr 16, 2009 at 06:34:06PM +0300, Marko Kreen wrote:
> Which hints that you can as well enter the pairs directly: \uxx\uxx.
> If I'd be language designer, I would not see any reason to disallow it.
>
> And anyway, at least mono seems to support it:
>
> using System;
> public class HelloWorld {
> public static void Main() {
> Console.WriteLine("<\uD800\uDF02>\n");
> }
> }
>
> It will output single UTF8 character. I think this should settle it.
I don't have any .net stuff installed so can't test; but C# is defined
to use UTF-16 as its internal representation so it would make sense if
the above gets treated as a single character internally. However, if it
used any other encoding the above should be treated as an error.
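As a rough illustration of why the two escapes above can be read as one character under UTF-16, here is a minimal Python sketch of the standard surrogate-pair arithmetic. The function name combine_surrogates is hypothetical, and this is not a claim about how mono or C# implements it.

```python
def combine_surrogates(high, low):
    """Combine a UTF-16 surrogate pair into a single Unicode code point."""
    # High (lead) surrogates occupy U+D800..U+DBFF.
    if not (0xD800 <= high <= 0xDBFF):
        raise ValueError("not a high surrogate: %#x" % high)
    # Low (trail) surrogates occupy U+DC00..U+DFFF.
    if not (0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a low surrogate: %#x" % low)
    # Each surrogate carries 10 bits; the result is offset by 0x10000.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


code_point = combine_surrogates(0xD800, 0xDF02)
print(hex(code_point))  # 0x10302
```

A lone surrogate, or a pair in the wrong order, is rejected here, which matches the view that such input should be an error outside UTF-16.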
> The de-facto about Postgres is stdstr=off. Even if not, E'' strings
> are still better for various things, so it would be good if they also
> acquired unicode-capabilities.
OK, this seems independent of the U&'lit' discussion that started the
thread. Note that PG already supports UTF8; if you want the character
I've been using in my examples up-thread, you can do:
SELECT E'\xF0\x90\x8C\x82';
I have a feeling that this is predicated on the server_encoding being
set to "utf8" and this can only be done at database creation time.
Another alternative would be to use the convert_from function, e.g.:
SELECT convert_from(E'\xF0\x90\x8C\x82', 'UTF8');
Never had to do this though, so there may be better options available.
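To check that the four bytes in the SQL examples really are the UTF-8 encoding of the character in question, a small Python 3 sketch (run outside the database, so it says nothing about server_encoding) can decode them and encode the result back:

```python
# The same four bytes used in the E'\xF0\x90\x8C\x82' literal.
utf8_bytes = bytes([0xF0, 0x90, 0x8C, 0x82])

# Decoding as UTF-8 should give exactly one character.
char = utf8_bytes.decode("utf-8")
print(len(char), hex(ord(char)))  # 1 0x10302

# Encoding that character again gives back the original bytes.
round_trip = char.encode("utf-8")
print(round_trip == utf8_bytes)  # True
```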
> Python's internal representation is *not* UTF-16, but plain UCS2/UCS4,
> that is - plain 16 or 32-bit values. Seems your python is compiled with
> UCS2, not UCS4.
Cool, I didn't know that. I believe mine is UCS4 as I can do:
ord(u'\U00010302')
and I get 66306 back rather than an error.
> As I understand, in UCS2 mode it simply takes surrogate
> values as-is.
UCS2 doesn't have surrogate pairs, or at least I believe it's considered
a bug if you don't get an error when you present it with one.
> From ord() docs:
>
> If a unicode argument is given and Python was built with UCS2 Unicode,
> then the character’s code point must be in the range [0..65535]
> inclusive; otherwise the string length is two, and a TypeError will
> be raised.
>
> So only in UCS4 mode it detects surrogates and converts them to internal
> representation. (Which in Postgres case would be UTF8.)
I think you mean UTF-16 instead of UCS4; but otherwise, yes.
> Or perhaps it is partially UTF16 aware - eg. I/O routines do understand
> UTF16 but low-level string routines do not:
>
> print "<%s>" % u'\uD800\uDF02'
>
> seems to handle it properly.
Yes, I get this as well. It's all a bit weird, which is why I was
asking if "this a bug in Python, my understanding, or something else".
When I do:
python <<EOF | hexdump -C
print u"\uD800\uDF02"
EOF
to see what it's doing I get an error which I'm not expecting, hence I
think it's probably my understanding.
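One possible explanation, offered as an assumption rather than something verified in this thread: when stdout is a pipe, Python 2 has no terminal encoding to consult and falls back to ASCII, so printing any non-ASCII unicode string fails regardless of surrogates. The Python 3 sketch below only imitates that situation by encoding explicitly. It does not reproduce the Python 2 print path itself.

```python
text = "\U00010302"  # the character used in the examples up-thread

# Assumption: with output going to a pipe, ASCII is the codec in effect.
try:
    text.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# Encoding explicitly to UTF-8 gives the four bytes hexdump would show.
utf8_hex = text.encode("utf-8").hex()
print(ascii_failed, utf8_hex)  # True f0908c82
```

If that assumption holds, encoding the string explicitly before printing it should avoid the error.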
--
Sam http://samason.me.uk/