Re: Unicode string literals versus the world

From: Sam Mason <sam(at)samason(dot)me(dot)uk>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Unicode string literals versus the world
Date: 2009-04-16 16:08:08
Message-ID: 20090416160808.GO12225@frubble.xen.chris-lamb.co.uk
Lists: pgsql-hackers

On Thu, Apr 16, 2009 at 06:34:06PM +0300, Marko Kreen wrote:
> Which hints that you can as well enter the pairs directly: \uxx\uxx.
> If I were a language designer, I would not see any reason to disallow it.
>
> And anyway, at least mono seems to support it:
>
> using System;
> public class HelloWorld {
>     public static void Main() {
>         Console.WriteLine("<\uD800\uDF02>\n");
>     }
> }
>
> It will output a single UTF-8 character. I think this should settle it.

I don't have any .NET stuff installed, so I can't test; but C# is defined
to use UTF-16 as its internal representation so it would make sense if
the above gets treated as a single character internally. However, if it
used any other encoding the above should be treated as an error.
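
For reference, the arithmetic that turns a UTF-16 surrogate pair into
one code point can be sketched in a few lines. This is only an
illustration written for Python 3; the helper name combine_surrogates
is made up for the example and is not part of any library.

```python
def combine_surrogates(high, low):
    """Combine a UTF-16 surrogate pair into one Unicode code point."""
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError("not a high surrogate: %#06x" % high)
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError("not a low surrogate: %#06x" % low)
    # Each surrogate carries 10 bits; the pair addresses code points
    # starting at U+10000.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


code_point = combine_surrogates(0xD800, 0xDF02)
print("U+%04X" % code_point)  # U+10302
```

So \uD800\uDF02 only makes sense as a single character when the two
values are interpreted as UTF-16 code units.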

> The de-facto about Postgres is stdstr=off. Even if not, E'' strings
> are still better for various things, so it would be good if they also
> acquired Unicode capabilities.

OK, this seems independent of the U&'lit' discussion that started the
thread. Note that PG already supports UTF8; if you want the character
I've been using in my examples up-thread, you can do:

SELECT E'\xF0\x90\x8C\x82';

I believe this only works when server_encoding is set to "utf8", and
that can only be done at database creation time. Another alternative
would be to use the convert_from function, i.e.:

SELECT convert_from(E'\xF0\x90\x8C\x82', 'UTF8');

Never had to do this though, so there may be better options available.
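
As a cross-check on the byte values in the two queries above, here is
a small Python 3 sketch (not something I have run against PG itself)
that encodes U+10302 as UTF-8 and prints the resulting bytes. The
variable names are just for illustration.

```python
ch = "\U00010302"           # the character used in the examples
utf8 = ch.encode("utf-8")   # its UTF-8 byte sequence

print(utf8.hex())           # f0908c82

# Round trip: the four bytes decode back to the same single character.
decoded = utf8.decode("utf-8")
```

These are the same four bytes, F0 90 8C 82, that appear in the E''
literal.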

> Python's internal representation is *not* UTF-16, but plain UCS2/UCS4,
> that is - plain 16 or 32-bit values. Seems your python is compiled with
> UCS2, not UCS4.

Cool, I didn't know that. I believe mine is UCS4 as I can do:

ord(u'\U00010302')

and I get 66306 back rather than an error.

> As I understand, in UCS2 mode it simply takes surrogate
> values as-is.

UCS2 doesn't have surrogate pairs, or at least I believe it's considered
a bug if you don't get an error when you present it with one.

> From ord() docs:
>
> If a unicode argument is given and Python was built with UCS2 Unicode,
> then the character’s code point must be in the range [0..65535]
> inclusive; otherwise the string length is two, and a TypeError will
> be raised.
>
> So only in UCS4 mode it detects surrogates and converts them to internal
> representation. (Which in Postgres case would be UTF8.)

I think you mean UTF-16 instead of UCS4; but otherwise, yes.
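
To make the difference concrete, here is a sketch in Python 3, which
(since 3.3) no longer has separate UCS2 and UCS4 builds. It assumes
that counting UTF-16 code units is a fair stand-in for what a narrow
build would report as the string length.

```python
ch = "\U00010302"

# One code point, regardless of how it is stored internally.
code_point = ord(ch)
print(code_point)  # 66306

# In UTF-16 the same character needs two 16-bit code units, which is
# the "string length is two" case the ord() docs describe.
utf16_units = len(ch.encode("utf-16-le")) // 2
print(utf16_units)  # 2
```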

> Or perhaps it is partially UTF16 aware - eg. I/O routines do understand
> UTF16 but low-level string routines do not:
>
> print "<%s>" % u'\uD800\uDF02'
>
> seems to handle it properly.

Yes, I get this as well. It's all a bit weird, which is why I was
asking whether this is "a bug in Python, my understanding, or something
else".

When I do:

python <<EOF | hexdump -C
print u"\uD800\uDF02"
EOF

to see what it's doing, I get an error that I wasn't expecting, hence I
think it's probably my understanding that is at fault.
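
For comparison only, and not as an explanation of the error above,
here is how Python 3 treats the same literal. This sketch assumes
Python 3.4 or later, where the "surrogatepass" error handler works
with the UTF-16 codecs.

```python
pair = "\ud800\udf02"  # in Python 3 this is two lone surrogates

# A strict UTF-8 encode refuses lone surrogates.
try:
    pair.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# Passing the surrogates through UTF-16 and decoding again joins them
# into the single character they stand for.
recombined = pair.encode("utf-16-le", "surrogatepass").decode("utf-16-le")
print(len(pair), len(recombined), hex(ord(recombined)))
```

Here the pair is rejected when written out as UTF-8 and only becomes
one character once it is explicitly interpreted as UTF-16.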

--
Sam http://samason.me.uk/
