Re: Notes about fixing regexes and UTF-8 (yet again)

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Vik Reykja <vikreykja(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Notes about fixing regexes and UTF-8 (yet again)
Date: 2012-02-19 04:41:55
Message-ID: 13378.1329626515@sss.pgh.pa.us
Lists: pgsql-hackers

Vik Reykja <vikreykja(at)gmail(dot)com> writes:
> On Sun, Feb 19, 2012 at 05:03, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>> On Sat, Feb 18, 2012 at 10:38 PM, Vik Reykja <vikreykja(at)gmail(dot)com> wrote:
>>> Does it make sense for regexps to have collations?

>> As I understand it, collations determine the sort-ordering of strings.
>> Regular expressions don't care about that. Why do you ask?

> Perhaps I used the wrong term, but I was thinking the locale could tell us
> what alphabet we're dealing with. So a regexp using en_US would give
> different word-boundary results from one using zh_CN.

Our interpretation of a "collation" is that it sets both LC_COLLATE and
LC_CTYPE. Regexps may not care about the first but they definitely care
about the second. This is why the stuff in regc_pg_locale.c pays
attention to collation.

regards, tom lane
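
[Editor's illustration, not part of the original message.] The split described above, where one "collation" carries both an ordering facet (LC_COLLATE) and a character-classification facet (LC_CTYPE), can be sketched roughly as follows. This is a toy model in Python, not PostgreSQL code: `ToyCollation`, `find_words`, and the two sample collations are hypothetical names invented for this example, and the classification rules are simplified assumptions rather than what regc_pg_locale.c actually does.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class ToyCollation:
    """Hypothetical stand-in for a collation with two independent facets."""
    name: str
    sort_key: Callable[[str], object]    # LC_COLLATE-like: how strings are ordered
    is_word_char: Callable[[str], bool]  # LC_CTYPE-like: how characters are classified


def find_words(text: str, coll: ToyCollation) -> List[str]:
    """Split text into maximal runs of word characters.

    Only coll.is_word_char is consulted; coll.sort_key is never used,
    mirroring the point that regexps care about ctype, not ordering.
    """
    words: List[str] = []
    current: List[str] = []
    for ch in text:
        if coll.is_word_char(ch):
            current.append(ch)
        elif current:
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words


# Assumption: a "C"-like collation that treats only ASCII letters, digits
# and underscore as word characters.
toy_c = ToyCollation(
    name="toy_C",
    sort_key=lambda s: s.encode("utf-8"),
    is_word_char=lambda ch: ch.isascii() and (ch.isalnum() or ch == "_"),
)

# Assumption: a Unicode-aware collation that uses Python's own character
# classification for letters and digits.
toy_unicode = ToyCollation(
    name="toy_unicode",
    sort_key=str.casefold,
    is_word_char=lambda ch: ch.isalnum() or ch == "_",
)

sample = "na\u00efve caf\u00e9"  # "naïve café" with precomposed characters

words_c = find_words(sample, toy_c)
words_unicode = find_words(sample, toy_unicode)
```

Under these assumptions the same text yields different word boundaries depending only on the classification facet: the ASCII-only collation breaks words at the accented letters, while the Unicode-aware one keeps each word whole.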
