Lists: | pgsql-hackers |
---|
From: | Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk> |
---|---|
To: | pgsql-hackers(at)postgresql(dot)org |
Subject: | Regexps vs. locale |
Date: | 2008-12-08 08:11:58 |
Message-ID: | 87ljurozld.fsf@news-spur.riddles.org.uk |
This came up on irc:
postgres=# show lc_ctype;
lc_ctype
-------------
fr_FR.UTF-8
postgres=# show server_encoding;
server_encoding
-----------------
UTF8
(1 row)
postgres=# select E'\303\201' ILIKE E'\303\241';
?column?
----------
t
(1 row)
postgres=# select E'\303\201' ~* E'\303\241';
?column?
----------
f
(1 row)
Obviously, this happens because the locale support functions in
backend/regex/regc_locale.c are (presumably intentionally) crippled so
as not to support non-ascii chars, despite all the code there using
wide chars for everything otherwise.
Why is this? It does not appear to be a documented restriction.
--
Andrew (irc:RhodiumToad)
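The two results above are what one would expect if the regex engine folds case only for ASCII letters. The following is a minimal sketch of that behaviour, not the actual code in backend/regex/regc_locale.c; the type codepoint_t and both function names are hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a wide-char type that holds Unicode code points. */
typedef unsigned int codepoint_t;

/* Fold case for ASCII letters only; any other code point is returned unchanged. */
codepoint_t
ascii_only_tolower(codepoint_t c)
{
    if (c >= 'A' && c <= 'Z')
        return c + ('a' - 'A');
    return c;
}

/*
 * Compare two code point arrays of length n, ignoring case for ASCII
 * letters only. Returns 1 if they match, 0 otherwise.
 */
int
ascii_only_ci_equal(const codepoint_t *a, const codepoint_t *b, size_t n)
{
    size_t i;

    assert(a != NULL && b != NULL);
    for (i = 0; i < n; i++)
    {
        if (ascii_only_tolower(a[i]) != ascii_only_tolower(b[i]))
            return 0;
    }
    return 1;
}
```

With this kind of folding, U+00C1 and U+00E1 (the two characters in the queries above) stay distinct, while 'A' and 'a' compare equal.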
From: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
---|---|
To: | Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk> |
Cc: | pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: Regexps vs. locale |
Date: | 2008-12-08 13:18:56 |
Message-ID: | 23275.1228742336@sss.pgh.pa.us |
Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk> writes:
> Obviously, this happens because the locale support functions in
> backend/regex/regc_locale.c are (presumably intentionally) crippled so
> as not to support non-ascii chars, despite all the code there using
> wide chars for everything otherwise.
It's not so much intentional as that no one has gotten around to making
it work. The difficulty is that the wide-char codes we are using might
not match what the <wctype.h> functions expect, and it's unclear what
we could do to fix that.
regards, tom lane
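To illustrate the mismatch concern: the sketch below decodes the two-byte UTF-8 sequences from the original example into code points, and separately checks whether the C implementation promises that wchar_t values are Unicode code points (the standard __STDC_ISO_10646__ macro). It assumes, for illustration only, that the server's internal wide-char value for UTF8 is the Unicode code point; the function names are hypothetical.

```c
#include <assert.h>

/* Decode a two-byte UTF-8 sequence into its Unicode code point. */
unsigned int
utf8_decode_2byte(unsigned char lead, unsigned char trail)
{
    /* The lead byte must look like 110xxxxx, the trail byte like 10xxxxxx. */
    assert((lead & 0xE0) == 0xC0);
    assert((trail & 0xC0) == 0x80);
    return ((unsigned int) (lead & 0x1F) << 6) | (unsigned int) (trail & 0x3F);
}

/*
 * Returns 1 when the C implementation defines __STDC_ISO_10646__, meaning
 * wchar_t values are ISO 10646 (Unicode) code points; returns 0 otherwise.
 */
int
wchar_is_unicode(void)
{
#ifdef __STDC_ISO_10646__
    return 1;
#else
    return 0;
#endif
}
```

Only when wchar_is_unicode() returns 1 would it be reasonable to hand such a decoded code point straight to the <wctype.h> functions; on other platforms, and for other server encodings, the values may not line up.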
From: | Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk> |
---|---|
To: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
Cc: | pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: Regexps vs. locale |
Date: | 2008-12-08 17:39:31 |
Message-ID: | 87vdtuo9bg.fsf@news-spur.riddles.org.uk |
>>>>> "Tom" == Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:
> Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk> writes:
>> Obviously, this happens because the locale support functions in
>> backend/regex/regc_locale.c are (presumably intentionally)
>> crippled so as not to support non-ascii chars, despite all the
>> code there using wide chars for everything otherwise.
Tom> It's not so much intentional as that no one has gotten around to
Tom> making it work. The difficulty is that the wide-char codes we
Tom> are using might not match what the <wctype.h> functions expect,
Tom> and it's unclear what we could do to fix that.
Couldn't we follow the example of lower(), and convert the string to
wchar_t using mbstowcs (rather than pg_wchar and pg_mb2wchar)?
This obviously requires that we have a matching lc_ctype for the
encoding, but we insist on that now anyway, no?
--
Andrew.
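A rough sketch of the idea, outside the backend: convert both strings with mbstowcs() under the current LC_CTYPE locale, then compare them through towlower(). The function name mb_ci_equal and the fixed buffer size are assumptions made for illustration; this is not how lower() or the regex code is actually written.

```c
#include <assert.h>
#include <stdlib.h>
#include <wchar.h>
#include <wctype.h>

#define MAX_WCHARS 256

/*
 * Compare two multibyte strings case-insensitively under the current
 * LC_CTYPE locale. Returns 1 if equal, 0 if different, and -1 if either
 * string cannot be converted or does not fit in the fixed buffers.
 */
int
mb_ci_equal(const char *a, const char *b)
{
    wchar_t wa[MAX_WCHARS];
    wchar_t wb[MAX_WCHARS];
    size_t  na, nb, i;

    assert(a != NULL && b != NULL);

    na = mbstowcs(wa, a, MAX_WCHARS);
    nb = mbstowcs(wb, b, MAX_WCHARS);
    if (na == (size_t) -1 || nb == (size_t) -1)
        return -1;              /* invalid multibyte sequence for this locale */
    if (na == MAX_WCHARS || nb == MAX_WCHARS)
        return -1;              /* buffer filled without a terminator: too long */
    if (na != nb)
        return 0;

    for (i = 0; i < na; i++)
    {
        if (towlower((wint_t) wa[i]) != towlower((wint_t) wb[i]))
            return 0;
    }
    return 1;
}
```

A caller would first select the locale with setlocale(LC_CTYPE, ""). Under a UTF-8 locale such as fr_FR.UTF-8, the two strings from the original example would be expected to compare equal, provided the platform's locale data covers those characters.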
From: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
---|---|
To: | Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk> |
Cc: | pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: Regexps vs. locale |
Date: | 2008-12-10 17:52:32 |
Message-ID: | 4189.1228931552@sss.pgh.pa.us |
Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk> writes:
> "Tom" == Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:
> Tom> It's not so much intentional as that no one has gotten around to
> Tom> making it work. The difficulty is that the wide-char codes we
> Tom> are using might not match what the <wctype.h> functions expect,
> Tom> and it's unclear what we could do to fix that.
> Couldn't we follow the example of lower(), and convert the string to
> wchar_t using mbstowcs (rather than pg_wchar and pg_mb2wchar)?
Possibly. I think we did not have the char2wchar() infrastructure
when the regexp stuff was last gone over, so it might be more practical
to do that now.
regards, tom lane
From: | Bruce Momjian <bruce(at)momjian(dot)us> |
---|---|
To: | Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk> |
Cc: | pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: Regexps vs. locale |
Date: | 2009-01-07 04:44:24 |
Message-ID: | 200901070444.n074iOM19932@momjian.us |
Added to TODO:
Add ability to use case-insensitive regular expressions on multi-byte
characters
ILIKE already works with multi-byte characters
* http://archives.postgresql.org/pgsql-hackers/2008-12/msg00433.php
---------------------------------------------------------------------------
Andrew Gierth wrote:
> This came up on irc:
>
> postgres=# show lc_ctype;
> lc_ctype
> -------------
> fr_FR.UTF-8
>
> postgres=# show server_encoding;
> server_encoding
> -----------------
> UTF8
> (1 row)
>
> postgres=# select E'\303\201' ILIKE E'\303\241';
> ?column?
> ----------
> t
> (1 row)
>
> postgres=# select E'\303\201' ~* E'\303\241';
> ?column?
> ----------
> f
> (1 row)
>
> Obviously, this happens because the locale support functions in
> backend/regex/regc_locale.c are (presumably intentionally) crippled so
> as not to support non-ascii chars, despite all the code there using
> wide chars for everything otherwise.
>
> Why is this? It does not appear to be a documented restriction.
>
> --
> Andrew (irc:RhodiumToad)
--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ If your life is a hard drive, Christ can be your backup. +