Re: verifying unicode locale support

From: Karel Zak <zakkr(at)zf(dot)jcu(dot)cz>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Holger Klawitter <lists(at)klawitter(dot)de>, Postgres Mailing List <pgsql-general(at)postgresql(dot)org>
Subject: Re: verifying unicode locale support
Date: 2004-04-14 08:34:21
Message-ID: 20040414083420.GB26417@zf.jcu.cz
Lists: pgsql-general

On Tue, Apr 13, 2004 at 12:32:17PM -0400, Tom Lane wrote:
> Holger Klawitter <lists(at)klawitter(dot)de> writes:
> > In order to avoid interaction with gcc, cat and others, I've written a
> > new program that reads from a file.
>
> After setting up the test case and duplicating your problem, I realized
> I was being dense :-( ... this is a well-known issue. Need more
> caffeine before answering bug reports obviously ...
>
> The problem is that PG's upper() and lower() functions are based on
> the C library's <ctype.h> functions (toupper() and tolower()), which of
> course only work for single-byte character sets. So they cannot work on
> UTF8 data.
>
> There has been some talk of rewriting these functions to use the
> <wctype.h> API where available, but no one's actually stepped up to the
> plate and done it. IIRC the main sticking point was figuring out how to
> get from whatever character encoding the database is using into the wide
> character set representation the C library wants. There doesn't seem to
> be a portable way of discovering exactly what the wchar encoding is
> supposed to be for the current locale setting.

There is "libcharset", a portable character set determination library.
But maintaining a library with that much OS-dependent code is probably
not simple. It's used in the standard iconv.

http://www.haible.de/bruno/packages-libcharset.html

But I'm not sure it would solve anything, because there is no
guarantee of any connection between the current locale setting and
the string encoding. For example:

SELECT upper( convert('foo', 'X', 'Y') );
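
To make the two approaches concrete, here is a rough C sketch (not
PostgreSQL code; the function names bytewise_upper and wide_upper are
made up for illustration). The first function works byte by byte as
<ctype.h> does. The second goes through <wctype.h> and simply assumes
that the input bytes are in the charset of the current LC_CTYPE locale,
which is exactly the assumption that cannot be checked:

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

/* Byte-wise upper-casing, the <ctype.h> way: every byte is treated as a
 * complete character, so the bytes of a multibyte UTF-8 sequence are
 * never mapped as one character. */
void bytewise_upper(char *s)
{
    assert(s != NULL);
    for (; *s != '\0'; s++)
        *s = (char) toupper((unsigned char) *s);
}

/* Upper-casing through <wctype.h>. Hypothetical helper: it ASSUMES that
 * src is encoded in the charset of the current LC_CTYPE locale.
 * Returns 0 on success, -1 if a conversion fails or dst is too small. */
int wide_upper(const char *src, char *dst, size_t dstsize)
{
    assert(src != NULL && dst != NULL);

    /* A string never has more wide characters than bytes. */
    size_t cap = strlen(src) + 1;
    wchar_t *w = malloc(cap * sizeof(wchar_t));
    if (w == NULL)
        return -1;

    size_t n = mbstowcs(w, src, cap);
    if (n == (size_t) -1) {          /* invalid sequence for this locale */
        free(w);
        return -1;
    }

    for (size_t i = 0; i < n; i++)
        w[i] = (wchar_t) towupper((wint_t) w[i]);

    /* wcstombs returns the byte count without the terminating NUL; a
     * result equal to dstsize means there was no room for the NUL. */
    size_t out = wcstombs(dst, w, dstsize);
    free(w);
    if (out == (size_t) -1 || out >= dstsize)
        return -1;
    return 0;
}
```

In the default "C" locale bytewise_upper() leaves every byte of a
non-ASCII UTF-8 character untouched. wide_upper() gives a meaningful
result only after setlocale() has selected a locale whose charset
really matches the data, and nothing in the string itself says whether
it does.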

IMHO the solution is to add to "struct varlena" a pointer to a
pg_encname that knows how to handle PostgreSQL encoding information,
and so make each PostgreSQL string independent and self-describing. Or
is there some reason why this would be useless?
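
A very rough sketch of what I mean, in plain C. All names here
(demo_encname, demo_text, demo_text_new) are hypothetical and only stand
in for pg_encname and varlena; the real PostgreSQL structures look
different and are not reproduced here:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical encoding descriptor, a stand-in for pg_encname. */
typedef struct demo_encname {
    const char *name;               /* encoding name, e.g. "UTF8" */
    int         max_bytes_per_char; /* illustrative value only */
} demo_encname;

static const demo_encname DEMO_UTF8   = { "UTF8",   4 };
static const demo_encname DEMO_LATIN1 = { "LATIN1", 1 };

/* Hypothetical self-describing string: length-prefixed data, as in a
 * varlena, plus a pointer to the encoding the bytes are stored in. */
typedef struct demo_text {
    size_t              len;    /* number of data bytes */
    const demo_encname *enc;    /* encoding of data[] */
    char                data[]; /* C99 flexible array member */
} demo_text;

/* Allocate a string value that carries its own encoding information.
 * Returns NULL if the allocation fails; the caller frees the result. */
demo_text *demo_text_new(const char *bytes, size_t len,
                         const demo_encname *enc)
{
    assert(bytes != NULL && enc != NULL);

    demo_text *t = malloc(sizeof(demo_text) + len);
    if (t == NULL)
        return NULL;
    t->len = len;
    t->enc = enc;
    memcpy(t->data, bytes, len);
    return t;
}

/* A function like upper() could then look at the value itself, not at
 * the locale, to decide whether byte-wise processing is safe. */
int demo_text_is_single_byte(const demo_text *t)
{
    assert(t != NULL);
    return t->enc->max_bytes_per_char == 1;
}
```

With something like this, the result of convert() in the example above
would carry the target encoding with it, and upper() would not need to
guess it from the locale.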

Karel

--
Karel Zak <zakkr(at)zf(dot)jcu(dot)cz>
http://home.zf.jcu.cz/~zakkr/
