Bug concerning regular expressions and UTF-8

Lists: pgsql-bugs
From: Helmar Spangenberg <hspangenberg(at)frey(dot)de>
To: pgsql-bugs(at)postgresql(dot)org
Subject: Bug concerning regular expressions and UTF-8
Date: 2006-01-20 17:03:17
Message-ID: 200601201803.17706.hspangenberg@frey.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs

Hello folks,

my system is a SuSE 10.0 Linux and a plain PostgreSQL 8.1.2 (compiled by
myself, NLS enabled). LOCALE is set to de_DE.UTF-8.

The bug shows up using the operator '~*' with umlauts. An easy way to produce
a faulty result is

select 'XXXMÜLLERYyyy' ~* '.*müller.*';

The result should be "TRUE", however Postgres thinks, it's "FALSE" (see also
discussion in www.pg-forum.de, subject "Konfiguration", thread "Umlaute bei
Regular Expressions"). It seems that this problem does not exist in Windows
based installations.

It seems to me that this bug is originated in the file
src/backend/regex/regc_locale.c. The functions pg_wc_tolower(pg_wchar) and
pg_wc_toupper(pg_wchar) rely on the C-functions toupper(unsigned char) and
tolower(unsigned char) which definitely are the wrong choice for UTF8
characters beyond the ASCII coding.

To check my estimation, I replaced the bodies of pg_wc_tolower and
pg_wc_toupper simply by "return towlower(c);" and "return towupper(c);",
which lead to the correct results of
select 'XXXMÜLLERYyyy' ~* '.*müller.*';

Since I don't have any idea concerning the side effects of this change, please
let me know as soon as an "official" patch is available - I definitely do
need regular expressions handling UTF8 correctly...

Thanks,
Helmar Spangenberg
e-mail: hspangenberg(at)frey(dot)de