Regexps vs. locale

Lists: pgsql-hackers
From: Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Regexps vs. locale
Date: 2008-12-08 08:11:58
Message-ID: 87ljurozld.fsf@news-spur.riddles.org.uk

This came up on irc:

postgres=# show lc_ctype;
lc_ctype
-------------
fr_FR.UTF-8

postgres=# show server_encoding;
server_encoding
-----------------
UTF8
(1 row)

postgres=# select E'\303\201' ILIKE E'\303\241';
?column?
----------
t
(1 row)

postgres=# select E'\303\201' ~* E'\303\241';
?column?
----------
f
(1 row)

Obviously, this happens because the locale support functions in
backend/regex/regc_locale.c are (presumably intentionally) crippled so
as not to support non-ascii chars, despite all the code there using
wide chars for everything otherwise.

Why is this? It does not appear to be a documented restriction.

--
Andrew (irc:RhodiumToad)
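
[Editor's illustration] The behaviour reported above can be reproduced with a case fold that only knows about ASCII. The following is a minimal sketch under that assumption, not the actual code in backend/regex/regc_locale.c; the names wide_code, ascii_only_tolower and ascii_only_ci_equal are hypothetical. The two strings in the report decode to U+00C1 and U+00E1, which such a fold leaves unchanged, so a case-insensitive comparison built on it reports a mismatch.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a wide-character code (not PostgreSQL's own type). */
typedef unsigned int wide_code;

/* Fold case for ASCII letters only; every other code passes through unchanged. */
wide_code
ascii_only_tolower(wide_code c)
{
    if (c >= 'A' && c <= 'Z')
        return c + ('a' - 'A');
    return c;
}

/* Compare two code arrays of equal length n, ignoring ASCII case only. */
bool
ascii_only_ci_equal(const wide_code *a, const wide_code *b, size_t n)
{
    for (size_t i = 0; i < n; i++)
    {
        if (ascii_only_tolower(a[i]) != ascii_only_tolower(b[i]))
            return false;
    }
    return true;
}
```

With this fold, 'A' and 'a' compare equal, while U+00C1 and U+00E1 do not, which matches the ~* result shown in the report.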


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Regexps vs. locale
Date: 2008-12-08 13:18:56
Message-ID: 23275.1228742336@sss.pgh.pa.us

Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk> writes:
> Obviously, this happens because the locale support functions in
> backend/regex/regc_locale.c are (presumably intentionally) crippled so
> as not to support non-ascii chars, despite all the code there using
> wide chars for everything otherwise.

It's not so much intentional as that no one has gotten around to making
it work. The difficulty is that the wide-char codes we are using might
not match what the <wctype.h> functions expect, and it's unclear what
we could do to fix that.

regards, tom lane


From: Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Regexps vs. locale
Date: 2008-12-08 17:39:31
Message-ID: 87vdtuo9bg.fsf@news-spur.riddles.org.uk

>>>>> "Tom" == Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:

> Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk> writes:
>> Obviously, this happens because the locale support functions in
>> backend/regex/regc_locale.c are (presumably intentionally)
>> crippled so as not to support non-ascii chars, despite all the
>> code there using wide chars for everything otherwise.

Tom> It's not so much intentional as that no one has gotten around to
Tom> making it work. The difficulty is that the wide-char codes we
Tom> are using might not match what the <wctype.h> functions expect,
Tom> and it's unclear what we could do to fix that.

Couldn't we follow the example of lower(), and convert the string to
wchar_t using mbstowcs (rather than pg_wchar and pg_mb2wchar)?

This obviously requires that we have a matching lc_ctype for the
encoding, but we insist on that now anyway, no?

--
Andrew.
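
[Editor's illustration] The suggestion above could look roughly like the following: convert both multibyte strings with mbstowcs() so that the resulting wchar_t values are what the <wctype.h> functions expect, then compare them through towlower(). This is only a sketch of the idea in standalone C, not PostgreSQL code; locale_ci_equal, use_utf8_ctype and MAX_WCHARS are hypothetical names, and the locale names tried are assumptions about what the host system provides. As noted in the message, it only works when LC_CTYPE matches the encoding of the input.

```c
#include <locale.h>
#include <stdlib.h>
#include <wchar.h>
#include <wctype.h>

#define MAX_WCHARS 256

/*
 * Returns 1 if a and b match case-insensitively under the active LC_CTYPE,
 * 0 if they differ, and -1 if either string cannot be converted or is too
 * long for the fixed buffers.
 */
int
locale_ci_equal(const char *a, const char *b)
{
    wchar_t wa[MAX_WCHARS];
    wchar_t wb[MAX_WCHARS];
    size_t  na = mbstowcs(wa, a, MAX_WCHARS);
    size_t  nb = mbstowcs(wb, b, MAX_WCHARS);

    if (na == (size_t) -1 || nb == (size_t) -1)
        return -1;              /* invalid multibyte sequence for this locale */
    if (na == MAX_WCHARS || nb == MAX_WCHARS)
        return -1;              /* input did not fit in the buffer */
    if (na != nb)
        return 0;

    for (size_t i = 0; i < na; i++)
    {
        if (towlower((wint_t) wa[i]) != towlower((wint_t) wb[i]))
            return 0;
    }
    return 1;
}

/* Try a few common UTF-8 locale names; returns 1 if one could be set. */
int
use_utf8_ctype(void)
{
    static const char *const names[] = {"fr_FR.UTF-8", "en_US.UTF-8", "C.UTF-8"};

    for (size_t i = 0; i < sizeof names / sizeof names[0]; i++)
    {
        if (setlocale(LC_CTYPE, names[i]) != NULL)
            return 1;
    }
    return 0;
}
```

If use_utf8_ctype() succeeds, locale_ci_equal("\303\201", "\303\241") is expected to report a match, since towlower() then operates on the code points the C library itself produced.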


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Regexps vs. locale
Date: 2008-12-10 17:52:32
Message-ID: 4189.1228931552@sss.pgh.pa.us

Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk> writes:
> "Tom" == Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:
> Tom> It's not so much intentional as that no one has gotten around to
> Tom> making it work. The difficulty is that the wide-char codes we
> Tom> are using might not match what the <wctype.h> functions expect,
> Tom> and it's unclear what we could do to fix that.

> Couldn't we follow the example of lower(), and convert the string to
> wchar_t using mbstowcs (rather than pg_wchar and pg_mb2wchar)?

Possibly. I think we did not have the char2wchar() infrastructure
when the regexp stuff was last gone over, so it might be more practical
to do that now.

regards, tom lane


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Regexps vs. locale
Date: 2009-01-07 04:44:24
Message-ID: 200901070444.n074iOM19932@momjian.us


Added to TODO:

Add ability to use case-insensitive regular expressions on multi-byte
characters

ILIKE already works with multi-byte characters

* http://archives.postgresql.org/pgsql-hackers/2008-12/msg00433.php

---------------------------------------------------------------------------

Andrew Gierth wrote:
> This came up on irc:
>
> postgres=# show lc_ctype;
> lc_ctype
> -------------
> fr_FR.UTF-8
>
> postgres=# show server_encoding;
> server_encoding
> -----------------
> UTF8
> (1 row)
>
> postgres=# select E'\303\201' ILIKE E'\303\241';
> ?column?
> ----------
> t
> (1 row)
>
> postgres=# select E'\303\201' ~* E'\303\241';
> ?column?
> ----------
> f
> (1 row)
>
> Obviously, this happens because the locale support functions in
> backend/regex/regc_locale.c are (presumably intentionally) crippled so
> as not to support non-ascii chars, despite all the code there using
> wide chars for everything otherwise.
>
> Why is this? It does not appear to be a documented restriction.
>
> --
> Andrew (irc:RhodiumToad)

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +