Re: verifying unicode locale support

Lists: pgsql-general
From: Holger Klawitter <lists(at)klawitter(dot)de>
To: Postgres Mailing List <pgsql-general(at)postgresql(dot)org>
Subject: verifying unicode locale support
Date: 2004-04-13 09:11:11
Message-ID: 200404131111.16961.lists@klawitter.de

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hi there,

triggered by the recent questions about sorting, I started digging into my
problems with upper('ä')='ä' when using LC_CTYPE and LANG = de_DE.UTF-8.

I have checked with Java (toUpperCase()) and C (see attached program, might
help others) that my locale is working, but postgres (initdb and postmaster
running with LANG=de_DE.utf8, -E UNICODE) still insists that upper('ä')
equals 'ä'. What else can be wrong?

Mit freundlichem Gruß / With kind regards
Holger Klawitter
- --
lists <at> klawitter <dot> de

- ------snip------
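#include <stdio.h>
#include <locale.h>
#include <wchar.h>
#include <wctype.h>   /* declares towupper() */

int main(void)
{
    /* Pick up the locale from LC_ALL, LC_CTYPE, or LANG */
    if (!setlocale(LC_CTYPE, "")) {
        fprintf(stderr, "Can't set the specified locale! "
                        "Check LANG, LC_CTYPE, LC_ALL.\n");
        return 1;
    }
    const wchar_t *text = L"ä";
    /* Cast for %x: wchar_t and wint_t need not be unsigned int */
    printf("is: towupper(%x) = %x\n",
           (unsigned) text[0], (unsigned) towupper((wint_t) text[0]));
    return 0;
}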
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.2 (GNU/Linux)

iD8DBQFAe6601Xdt0HKSwgYRAvtlAJ9nfZHVHLcDeCCok/ylgr1jtZrXBQCff29h
bKiclwE2ahspLQZSBKJWIuo=
=1IaE
-----END PGP SIGNATURE-----


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Holger Klawitter <lists(at)klawitter(dot)de>
Cc: Postgres Mailing List <pgsql-general(at)postgresql(dot)org>
Subject: Re: verifying unicode locale support
Date: 2004-04-13 14:36:05
Message-ID: 22550.1081866965@sss.pgh.pa.us

Holger Klawitter <lists(at)klawitter(dot)de> writes:
> I have checked with Java (toUpperCase()) and C (see attached program, might
> help others) that my locale is working, but postgres (initdb and postmaster
> running with LANG=de_DE.utf8, -E UNICODE) still insists that upper('ä')
> equals 'ä'. What else can be wrong?

What byte string are you really entering here? What's coming through in
your email is \344 ... which is not valid UTF8. But I suspect something
may have translated it before it got to my inbox.

regards, tom lane


From: Peter Eisentraut <peter_e(at)gmx(dot)net>
To: Holger Klawitter <lists(at)klawitter(dot)de>, Postgres Mailing List <pgsql-general(at)postgresql(dot)org>
Subject: Re: verifying unicode locale support
Date: 2004-04-13 15:02:46
Message-ID: 200404131702.46039.peter_e@gmx.net

Holger Klawitter wrote:
> I have checked with Java (toUpperCase()) and C (see attached program,
> might help others) that my locale is working, but postgres (initdb
> and postmaster running with LANG=de_DE.utf8, -E UNICODE) still
> insists that upper('ä') equals 'ä'. What else can be wrong?

PostgreSQL, case conversion, and Unicode don't work together. Pick any
two. :-)


From: Holger Klawitter <lists(at)klawitter(dot)de>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Postgres Mailing List <pgsql-general(at)postgresql(dot)org>
Subject: Re: verifying unicode locale support
Date: 2004-04-13 15:55:43
Message-ID: 200404131752.43382.lists@klawitter.de

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

> What byte string are you really entering here? What's coming through in
> your email is \344 ... which is not valid UTF8. But I suspect something
> may have translated it before it got to my inbox.

Damn charsets :-) The character was indeed \344 aka "&auml;", but my mailer
sends Latin-1, not Unicode.

To avoid interference from gcc, cat, and other tools, I've written a new
program that reads its input from a file.
    gcc -o unicode unicode.c
    LC_CTYPE=de_DE.utf8 ./unicode uni.data
should yield (with xterm -u8; LC_CTYPE=en_US.utf8 works as well)
    uni.out

Mit freundlichem Gruß / With kind regards
Holger Klawitter
- --
lists <at> klawitter <dot> de

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.2 (GNU/Linux)

iD8DBQFAfA1/1Xdt0HKSwgYRAhldAJoCcNrZ7BGnG1m2SXX/lR1ngqGooQCcDYOF
SlzlbLAJk7/e6rzYZyL7yE4=
=/3bH
-----END PGP SIGNATURE-----

Attachment: unicode-testcase.tar.gz (application/x-tgz, 549 bytes)

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Holger Klawitter <lists(at)klawitter(dot)de>
Cc: Postgres Mailing List <pgsql-general(at)postgresql(dot)org>
Subject: Re: verifying unicode locale support
Date: 2004-04-13 16:32:17
Message-ID: 23711.1081873937@sss.pgh.pa.us

Holger Klawitter <lists(at)klawitter(dot)de> writes:
> In order to avoid interaction with gcc, cat and others else I've written a
> new program, reading from a file.

After setting up the test case and duplicating your problem, I realized
I was being dense :-( ... this is a well-known issue. Need more
caffeine before answering bug reports obviously ...

The problem is that PG's upper() and lower() functions are based on
the C library's <ctype.h> functions (toupper() and tolower()), which of
course only work for single-byte character sets. So they cannot work on
UTF8 data.

There has been some talk of rewriting these functions to use the
<wctype.h> API where available, but no one's actually stepped up to the
plate and done it. IIRC the main sticking point was figuring out how to
get from whatever character encoding the database is using into the wide
character set representation the C library wants. There doesn't seem to
be a portable way of discovering exactly what the wchar encoding is
supposed to be for the current locale setting.

If you're interested in trying to fix this, check the pgsql-hackers
archives for the previous discussions. Searching for "wctype" would
probably find the relevant threads.

If you just want to get your work done, I'd suggest adopting a
single-byte encoding such as Latin1 for the database.

regards, tom lane


From: Karel Zak <zakkr(at)zf(dot)jcu(dot)cz>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Holger Klawitter <lists(at)klawitter(dot)de>, Postgres Mailing List <pgsql-general(at)postgresql(dot)org>
Subject: Re: verifying unicode locale support
Date: 2004-04-14 08:34:21
Message-ID: 20040414083420.GB26417@zf.jcu.cz

On Tue, Apr 13, 2004 at 12:32:17PM -0400, Tom Lane wrote:
> Holger Klawitter <lists(at)klawitter(dot)de> writes:
> > In order to avoid interaction with gcc, cat and others else I've written a
> > new program, reading from a file.
>
> After setting up the test case and duplicating your problem, I realized
> I was being dense :-( ... this is a well-known issue. Need more
> caffeine before answering bug reports obviously ...
>
> The problem is that PG's upper() and lower() functions are based on
> the C library's <ctype.h> functions (toupper() and tolower()), which of
> course only work for single-byte character sets. So they cannot work on
> UTF8 data.
>
> There has been some talk of rewriting these functions to use the
> <wctype.h> API where available, but no one's actually stepped up to the
> plate and done it. IIRC the main sticking point was figuring out how to
> get from whatever character encoding the database is using into the wide
> character set representation the C library wants. There doesn't seem to
> be a portable way of discovering exactly what the wchar encoding is
> supposed to be for the current locale setting.

There is libcharset, the "portable character set determination
library". But maintaining such a library, with its large amount of
OS-dependent code, is probably not simple. It's used in the standard iconv.

http://www.haible.de/bruno/packages-libcharset.html

But I'm not sure it would solve anything, because there is no
guarantee of any connection between the current locale setting and
the string encoding.

SELECT upper( convert('foo', 'X', 'Y') );

IMHO the solution is to add to "struct varlena" a pointer to a pg_encname
that carries the PostgreSQL encoding information, making each PostgreSQL
string independent and self-describing. Or is there some reason why
this would be useless?

Karel

--
Karel Zak <zakkr(at)zf(dot)jcu(dot)cz>
http://home.zf.jcu.cz/~zakkr/