Re: upper()/lower() truncates the result under Japanese Windows

Lists: pgsql-hackers
From: Hiroshi Inoue <inoue(at)tpf(dot)co(dot)jp>
To: pgsql-hackers(at)postgresql(dot)org
Subject: upper()/lower() truncates the result under Japanese Windows
Date: 2008-12-14 10:22:02
Message-ID: 4944DE4A.8050001@tpf.co.jp

Hi,

The upper(), lower(), and initcap() functions truncate their results
under Japanese Windows when, for example, the server encoding is UTF-8
and LC_CTYPE is set to Japanese_Japan.932.

Below is an example.

$ psql
psql (8.4devel)
Type "help" for help.

inoue=# \encoding sjis

inoue=# show server_encoding;
server_encoding
-----------------
UTF8
(1 row)

inoue=# show LC_CTYPE;
lc_ctype
--------------------
Japanese_Japan.932
(1 row)

inoue=# \set jpnstr '''カタカナ'''
inoue=# select char_length(:jpnstr);
char_length
-------------
4
(1 row)

inoue=# select upper(:jpnstr);
upper
--------
カタカ
(1 row)

inoue=# select char_length(upper(:jpnstr));
char_length
-------------
3
(1 row)

The output of the last command should be 4, not 3.
Attached is a patch to fix the bug.
After applying the patch, the result is:

inoue=# select upper(:jpnstr);
upper
----------
カタカナ
(1 row)

inoue=# select char_length(upper(:jpnstr));
char_length
-------------
4
(1 row)

regards,
Hiroshi Inoue

Attachment        Content-Type  Size
formatting.patch  text/plain    3.4 KB

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Hiroshi Inoue <inoue(at)tpf(dot)co(dot)jp>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: upper()/lower() truncates the result under Japanese Windows
Date: 2008-12-14 16:59:35
Message-ID: 25758.1229273975@sss.pgh.pa.us

Hiroshi Inoue <inoue(at)tpf(dot)co(dot)jp> writes:
> Upper(), lower() or initcap() function truncates the result
> under Japanese Windows with e.g. the server encoding=UTF-8
> and the LC_CTYPE setting Japanese_japan.932 .

Hmm, I guess that makes sense, since the LC_CTYPE implies an encoding
other than UTF-8; MB_CUR_MAX should be set according to LC_CTYPE.

The proposed patch seems pretty ugly though. Why don't we just stop
using MB_CUR_MAX altogether? These three functions are the only
references to it AFAICS.

regards, tom lane
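
The point about MB_CUR_MAX following LC_CTYPE rather than the server encoding can be checked with some byte counting. The Python sketch below is illustrative only and is not taken from the patch: it assumes that MB_CUR_MAX is 2 under code page 932 and that the result buffer is sized as nchars * MB_CUR_MAX + 1. The helper whole_chars_that_fit() is hypothetical. Under those assumptions, a buffer sized for the locale's encoding is too small for the UTF-8 form of the test string.

```python
# Sketch only: the MB_CUR_MAX value and the sizing rule below are assumptions,
# not taken from the patch or from the PostgreSQL source.
text = "カタカナ"

utf8_bytes = text.encode("utf-8")    # server encoding in the report
cp932_bytes = text.encode("cp932")   # encoding implied by Japanese_Japan.932

assumed_mb_cur_max = 2                               # assumed for code page 932
assumed_buffer = len(text) * assumed_mb_cur_max + 1  # assumed sizing rule


def whole_chars_that_fit(s: str, encoding: str, limit: int) -> str:
    """Return the longest prefix of s whose encoded form fits in limit bytes."""
    used = 0
    kept = []
    for ch in s:
        size = len(ch.encode(encoding))
        if used + size > limit:
            break
        used += size
        kept.append(ch)
    return "".join(kept)


# Byte length in each encoding, then the assumed buffer size.
print(len(utf8_bytes), len(cp932_bytes), assumed_buffer)

# Number of whole characters that fit when the output is UTF-8.
print(len(whole_chars_that_fit(text, "utf-8", assumed_buffer)))
```

With these assumed numbers only three of the four characters fit as UTF-8, which is consistent with the truncated output in the report, although the actual code path on Windows may differ in detail.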


From: Hiroshi Inoue <inoue(at)tpf(dot)co(dot)jp>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: upper()/lower() truncates the result under Japanese Windows
Date: 2008-12-15 22:19:30
Message-ID: 4946D7F2.3070908@tpf.co.jp

Tom Lane wrote:
> Hiroshi Inoue <inoue(at)tpf(dot)co(dot)jp> writes:
>> Upper(), lower() or initcap() function truncates the result
>> under Japanese Windows with e.g. the server encoding=UTF-8
>> and the LC_CTYPE setting Japanese_japan.932 .
>
> Hmm, I guess that makes sense, since the LC_CTYPE implies an encoding
> other than UTF-8; MB_CUR_MAX should be set according to LC_CTYPE.
>
> The proposed patch seems pretty ugly though. Why don't we just stop
> using MB_CUR_MAX altogether? These three functions are the only
> references to it AFAICS.

Although it looks ugly, the patch only follows what wchar2char() does.
I don't like using MB_CUR_MAX, but it seems safe as long as
wchar2char() calls wcstombs().

regards,
Hiroshi Inoue
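
One way to read the "safe as long as wchar2char() calls wcstombs()" remark: a per-character bound is only valid for the encoding the conversion actually writes. The Python sketch below illustrates that reading and is not the patch itself. result_buffer_size() is a hypothetical helper, the value 2 for MB_CUR_MAX under code page 932 is an assumption, and 4 is the general upper bound for one code point in UTF-8.

```python
UTF8_MAX_BYTES_PER_CHAR = 4  # upper bound for any code point in UTF-8
ASSUMED_MB_CUR_MAX = 2       # assumed value under code page 932


def result_buffer_size(nchars: int, target_max_bytes: int) -> int:
    """Bytes for nchars characters plus a terminator (hypothetical helper)."""
    return nchars * target_max_bytes + 1


text = "カタカナ"

# Output in the locale's own encoding (the wcstombs() case):
# the assumed MB_CUR_MAX bound is large enough.
locale_size = result_buffer_size(len(text), ASSUMED_MB_CUR_MAX)
fits_locale = len(text.encode("cp932")) + 1 <= locale_size

# Output in UTF-8 with the same locale-derived bound: too small.
fits_utf8_with_locale_bound = len(text.encode("utf-8")) + 1 <= locale_size

# Output in UTF-8 with a bound taken from the target encoding: large enough.
utf8_size = result_buffer_size(len(text), UTF8_MAX_BYTES_PER_CHAR)
fits_utf8 = len(text.encode("utf-8")) + 1 <= utf8_size

print(fits_locale, fits_utf8_with_locale_bound, fits_utf8)
```

The design point is simply that the multiplier should come from whichever encoding the conversion targets; which conversion path the server takes on a given platform is not shown here.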