Re: BUG #6457: Regexp not processing word (with special characters on ends) correctly (UTF-8)

Lists: pgsql-bugs
From: albert(dot)cieszkowski(at)cc(dot)com(dot)pl
To: pgsql-bugs(at)postgresql(dot)org
Subject: BUG #6457: Regexp not processing word (with special characters on ends) correctly (UTF-8)
Date: 2012-02-14 13:20:43
Message-ID: E1RxIJH-000504-Ti@wrigleys.postgresql.org

The following bug has been logged on the website:

Bug reference: 6457
Logged by: Albert Cieszkowski
Email address: albert(dot)cieszkowski(at)cc(dot)com(dot)pl
PostgreSQL version: 9.0.6
Operating system: CentOS 5.x
Description:

OS, database, and client encoding are all UTF-8:

peimp=> select 'Świnoujście' ~* '\mŚwinoujście\M';
?column?
----------
f
(1 row)

peimp=> select 'Świnoujście' ~* '\AŚwinoujście\Z';
?column?
----------
t
(1 row)

but:

peimp=> select 'Mróz' ~* '\mmróZ\M';
?column?
----------
t
(1 row)

peimp=> select 'Mróz' ~* '\AmróZ\Z';
?column?
----------
t
(1 row)

I believe it is connected with bugs #5766 and #3433.


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: albert(dot)cieszkowski(at)cc(dot)com(dot)pl
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: BUG #6457: Regexp not processing word (with special characters on ends) correctly (UTF-8)
Date: 2012-02-14 15:27:05
Message-ID: 7784.1329233225@sss.pgh.pa.us

albert(dot)cieszkowski(at)cc(dot)com(dot)pl writes:
> OS, base and client encoding UTF-8:

What's your lc_collate/lc_ctype settings?

regards, tom lane


From: Albert Cieszkowski <albert(dot)cieszkowski(at)cc(dot)com(dot)pl>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: BUG #6457: Regexp not processing word (with special characters on ends) correctly (UTF-8)
Date: 2012-02-14 15:39:52
Message-ID: 4F3A8048.1030501@cc.com.pl

Hello Tom,

Every lc_x value is pl_PL.UTF8 (corresponding to the word's language).
Database was created with --locale=pl_PL.UTF8.
OS (CentOS 5.x) uses: en_US.UTF-8

Best regards,
Albert Cieszkowski

On 2012-02-14 16:27, Tom Lane wrote:
> albert(dot)cieszkowski(at)cc(dot)com(dot)pl writes:
>> OS, base and client encoding UTF-8:
>
> What's your lc_collate/lc_ctype settings?
>
> regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: albert(dot)cieszkowski(at)cc(dot)com(dot)pl
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: BUG #6457: Regexp not processing word (with special characters on ends) correctly (UTF-8)
Date: 2012-02-14 18:28:11
Message-ID: 11041.1329244091@sss.pgh.pa.us

albert(dot)cieszkowski(at)cc(dot)com(dot)pl writes:
> peimp=> select 'Świnoujście' ~* '\mŚwinoujście\M';
> ?column?
> ----------
> f
> (1 row)

Oh, I see the reason for this: the code in cclass() in regc_locale.c
doesn't go further up than U+00FF, so no codes above that will be
thought to be letters (or members of any other character class).
Clearly we need to go further when we are dealing with UTF8.
I'm not sure what a sane limit would be though.
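
To illustrate the effect described above, here is a minimal Python sketch. It is
not the PostgreSQL code: the function names are hypothetical, and Python's
Unicode tables (str.isalnum) stand in for the locale-based checks the server
would use. It only shows why a scan that stops at U+00FF accepts 'Mróz' as a
whole word but rejects 'Świnoujście'.

```python
def word_chars_up_to(limit):
    """Code points in [0, limit] that count as word characters.

    Assumption: Python's Unicode tables (str.isalnum) stand in for the
    locale-based checks the server would use; '_' is added by hand.
    """
    return {cp for cp in range(limit + 1) if chr(cp).isalnum() or chr(cp) == "_"}


def matches_as_whole_word(text, word_chars):
    """Stand-in for matching text against '\\m' + text + '\\M'.

    The literal part matches itself, so only the constraints remain:
    \\m needs a word character right after it, \\M needs one right before it.
    """
    return bool(text) and ord(text[0]) in word_chars and ord(text[-1]) in word_chars


SWINOUJSCIE = "\u015awinouj\u015bcie"  # 'Świnoujście', starts with U+015A
MROZ = "Mr\u00f3z"                     # 'Mróz', all characters <= U+00FF

scan_to_ff = word_chars_up_to(0x00FF)    # the limit described above
scan_to_ffff = word_chars_up_to(0xFFFF)  # a wider, hypothetical limit

print(matches_as_whole_word(SWINOUJSCIE, scan_to_ff))    # False
print(matches_as_whole_word(MROZ, scan_to_ff))           # True
print(matches_as_whole_word(SWINOUJSCIE, scan_to_ffff))  # True
```

The first result corresponds to the 'f' in the report: the leading 'Ś' is
U+015A, so a scan ending at U+00FF never classifies it as a word character
and \m cannot match in front of it.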

(It would be nice if there were a more efficient way to get this
information than laboriously iterating through all the possible
character codes. It doesn't look like we're even trying to cache
the results, ick.)
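
The caching idea could look roughly like the following Python sketch. This is
an assumption about one possible shape, not what regc_locale.c does or would
do: char_class and its class names are hypothetical, and functools.lru_cache
stands in for whatever cache the server might keep.

```python
from functools import lru_cache

# Hypothetical mapping from class name to a per-character test.
_PREDICATES = {
    "alpha": str.isalpha,
    "digit": str.isdigit,
    "space": str.isspace,
}


@lru_cache(maxsize=None)
def char_class(name, limit):
    """Scan [0, limit] once per (name, limit) pair and remember the result."""
    predicate = _PREDICATES[name]
    return frozenset(cp for cp in range(limit + 1) if predicate(chr(cp)))
```

The first call for a given class and limit pays for the full scan; later calls
with the same arguments return the stored frozenset without iterating again.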

regards, tom lane


From: Duncan Rance <duncan(at)dunquino(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: albert(dot)cieszkowski(at)cc(dot)com(dot)pl, pgsql-bugs(at)postgresql(dot)org
Subject: Re: BUG #6457: Regexp not processing word (with special characters on ends) correctly (UTF-8)
Date: 2012-02-15 09:18:56
Message-ID: 4EFDC1C4-F0C7-42A3-859C-6E98867CC614@dunquino.com

On 14 Feb 2012, at 18:28, Tom Lane wrote:
>
> Oh, I see the reason for this: the code in cclass() in regc_locale.c
> doesn't go further up than U+00FF, so no codes above that will be
> thought to be letters (or members of any other character class).
> Clearly we need to go further when we are dealing with UTF8.
> I'm not sure what a sane limit would be though.

The Basic Multilingual Plane goes up to FFFF:

https://en.wikipedia.org/wiki/Mapping_of_Unicode_characters#Planes
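
As a quick check of what each limit would cover, a small Python sketch (the
names LATIN1_LIMIT, BMP_LIMIT and beyond are made up for this example) lists
the characters of a string whose code points lie above a given bound:

```python
LATIN1_LIMIT = 0x00FF  # the limit the current code is said to stop at
BMP_LIMIT = 0xFFFF     # end of the Basic Multilingual Plane


def beyond(text, limit):
    """Return the characters of text whose code point is above limit."""
    return [ch for ch in text if ord(ch) > limit]


word = "\u015awinouj\u015bcie"  # 'Świnoujście'

print(beyond(word, LATIN1_LIMIT))       # ['Ś', 'ś']
print(beyond(word, BMP_LIMIT))          # []
print(beyond("Mr\u00f3z", LATIN1_LIMIT))  # []
```

So the word from the report fits inside the BMP, but characters in the
supplementary planes (above U+FFFF) would still fall outside such a limit.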

