Re: Notes about fixing regexes and UTF-8 (yet again)

Lists: pgsql-hackers
From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql-hackers(at)postgreSQL(dot)org
Subject: Notes about fixing regexes and UTF-8 (yet again)
Date: 2012-02-15 23:06:36
Message-ID: 24241.1329347196@sss.pgh.pa.us

In bug #6457 it's pointed out that we *still* don't have full
functionality for locale-dependent regexp behavior with UTF8 encoding.
The reason is that there's old crufty code in regc_locale.c that only
considers character codes up to 255 when searching for characters that
should be considered "letters", "digits", etc. We could fix that, for
some value of "fix", by iterating up to perhaps 0xFFFF when dealing with
UTF8 encoding, but the time that would take is unappealing. Especially
so considering that this code is executed afresh anytime we compile a
regex that requires locale knowledge.

I looked into the upstream Tcl code and observed that they deal with
this by having hard-wired tables of which Unicode code points are to be
considered letters etc. The tables are directly traceable to the
Unicode standard (they provide a script to regenerate them from files
available from unicode.org). Nonetheless, I do not find that approach
appealing, mainly because we'd be risking deviating from the libc locale
code's behavior within regexes when we follow it everywhere else.
It seems entirely likely to me that a particular locale setting might
consider only some of what Unicode says are letters to be letters.

However, we could possibly compromise by using Unicode-derived tables
as a guide to which code points are worth probing libc for. That is,
assume that a utf8-based locale will never claim that some code is a
letter that unicode.org doesn't think is a letter. That would cut the
number of required probes by a pretty large factor.

The other thing that seems worth doing is to install some caching.
We could presumably assume that the behavior of iswupper() et al is
fixed for the duration of a database session, so that we only need to
run the probe loop once when first asked to create a cvec for a
particular category.
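
To make that concrete, here's a rough sketch of the shape of thing I
have in mind (made-up names, not the actual regc_locale.c structures;
the Unicode-table filtering is only indicated by a comment):

    #include <stdbool.h>
    #include <string.h>
    #include <wctype.h>

    #define MAX_CACHED_CHR 0xFFFF

    static bool alpha_cache_valid = false;
    static unsigned char alpha_map[(MAX_CACHED_CHR + 1) / 8];

    /*
     * Probe libc once per session.  With a Unicode-derived candidate
     * list we'd iterate over just that list instead of the whole range,
     * cutting the number of iswalpha() probes by a large factor.
     */
    static void
    build_alpha_cache(void)
    {
        memset(alpha_map, 0, sizeof(alpha_map));
        for (wint_t c = 0; c <= MAX_CACHED_CHR; c++)
        {
            if (iswalpha(c))
                alpha_map[c / 8] |= 1 << (c % 8);
        }
        alpha_cache_valid = true;
    }

    static bool
    cached_iswalpha(wint_t c)
    {
        if (!alpha_cache_valid)
            build_alpha_cache();    /* happens at most once per session */
        if (c > MAX_CACHED_CHR)
            return false;           /* outside the cached range: punt */
        return (alpha_map[c / 8] & (1 << (c % 8))) != 0;
    }

The other categories (digit, punct, etc.) would get the same treatment.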

Thoughts, better ideas?

regards, tom lane


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: Notes about fixing regexes and UTF-8 (yet again)
Date: 2012-02-17 08:48:50
Message-ID: 4F3E1472.6080403@enterprisedb.com

On 16.02.2012 01:06, Tom Lane wrote:
> In bug #6457 it's pointed out that we *still* don't have full
> functionality for locale-dependent regexp behavior with UTF8 encoding.
> The reason is that there's old crufty code in regc_locale.c that only
> considers character codes up to 255 when searching for characters that
> should be considered "letters", "digits", etc. We could fix that, for
> some value of "fix", by iterating up to perhaps 0xFFFF when dealing with
> UTF8 encoding, but the time that would take is unappealing. Especially
> so considering that this code is executed afresh anytime we compile a
> regex that requires locale knowledge.
>
> I looked into the upstream Tcl code and observed that they deal with
> this by having hard-wired tables of which Unicode code points are to be
> considered letters etc. The tables are directly traceable to the
> Unicode standard (they provide a script to regenerate them from files
> available from unicode.org). Nonetheless, I do not find that approach
> appealing, mainly because we'd be risking deviating from the libc locale
> code's behavior within regexes when we follow it everywhere else.
> It seems entirely likely to me that a particular locale setting might
> consider only some of what Unicode says are letters to be letters.
>
> However, we could possibly compromise by using Unicode-derived tables
> as a guide to which code points are worth probing libc for. That is,
> assume that a utf8-based locale will never claim that some code is a
> letter that unicode.org doesn't think is a letter. That would cut the
> number of required probes by a pretty large factor.
>
> The other thing that seems worth doing is to install some caching.
> We could presumably assume that the behavior of iswupper() et al is
> fixed for the duration of a database session, so that we only need to
> run the probe loop once when first asked to create a cvec for a
> particular category.
>
> Thoughts, better ideas?

Here's a wild idea: keep the class of each codepoint in a hash table.
Initialize it with all codepoints up to 0xFFFF. After that, whenever a
string contains a character that's not in the hash table yet, query the
class of that character, and add it to the hash table. Then recompile
the whole regex and restart the matching engine.

Recompiling is expensive, but if you cache the results for the session,
it would probably be acceptable.
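
The cache part might look roughly like this (just a sketch: the
recompile-and-restart machinery and a real hash table for code points
above the BMP are left out):

    #include <stdbool.h>
    #include <wctype.h>

    #define BMP_LIMIT 0x10000

    typedef unsigned char classbits;    /* one bit per character class */
    #define CLS_ALPHA 0x01
    #define CLS_DIGIT 0x02

    static classbits bmp_class[BMP_LIMIT];
    static bool bmp_filled = false;

    static classbits
    probe_class(wint_t c)
    {
        classbits b = 0;

        if (iswalpha(c))
            b |= CLS_ALPHA;
        if (iswdigit(c))
            b |= CLS_DIGIT;
        return b;
    }

    static classbits
    get_class(wint_t c)
    {
        if (!bmp_filled)
        {
            /* initialize with all code points up to 0xFFFF, once */
            for (wint_t i = 0; i < BMP_LIMIT; i++)
                bmp_class[i] = probe_class(i);
            bmp_filled = true;
        }
        if (c < BMP_LIMIT)
            return bmp_class[c];

        /*
         * Above the BMP: this is where the hash table lookup would go,
         * and where a miss would trigger the recompile and restart.
         */
        return probe_class(c);
    }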

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: Notes about fixing regexes and UTF-8 (yet again)
Date: 2012-02-17 14:39:43
Message-ID: 25151.1329489583@sss.pgh.pa.us

Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com> writes:
> Here's a wild idea: keep the class of each codepoint in a hash table.
> Initialize it with all codepoints up to 0xFFFF. After that, whenever a
> string contains a character that's not in the hash table yet, query the
> class of that character, and add it to the hash table. Then recompile
> the whole regex and restart the matching engine.

> Recompiling is expensive, but if you cache the results for the session,
> it would probably be acceptable.

Dunno ... recompiling is so expensive that I can't see this being a win;
not to mention that it would require fundamental surgery on the regex
code.

In the Tcl implementation, no codepoints above U+FFFF have any locale
properties (alpha/digit/punct/etc), period. Personally I'd not have a
problem imposing the same limitation, so that dealing with stuff above
that range isn't really a consideration anyway.

regards, tom lane


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: Notes about fixing regexes and UTF-8 (yet again)
Date: 2012-02-17 14:56:05
Message-ID: 4F3E6A85.3090606@dunslane.net

On 02/17/2012 09:39 AM, Tom Lane wrote:
> Heikki Linnakangas<heikki(dot)linnakangas(at)enterprisedb(dot)com> writes:
>> Here's a wild idea: keep the class of each codepoint in a hash table.
>> Initialize it with all codepoints up to 0xFFFF. After that, whenever a
>> string contains a character that's not in the hash table yet, query the
>> class of that character, and add it to the hash table. Then recompile
>> the whole regex and restart the matching engine.
>> Recompiling is expensive, but if you cache the results for the session,
>> it would probably be acceptable.
> Dunno ... recompiling is so expensive that I can't see this being a win;
> not to mention that it would require fundamental surgery on the regex
> code.
>
> In the Tcl implementation, no codepoints above U+FFFF have any locale
> properties (alpha/digit/punct/etc), period. Personally I'd not have a
> problem imposing the same limitation, so that dealing with stuff above
> that range isn't really a consideration anyway.

Up to U+FFFF is the BMP, which is described as containing "characters for
almost all modern languages, and a large number of special characters."
It seems very likely to be acceptable not to bother about the locale
properties of code points in the supplementary planes.

See <http://en.wikipedia.org/wiki/Plane_%28Unicode%29> for descriptions
of which sets of characters are involved.

cheers

andrew


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Notes about fixing regexes and UTF-8 (yet again)
Date: 2012-02-17 15:12:22
Message-ID: CA+TgmoZEDfak7tSUZw8bGjGBGMTjW2tnxc+P-1sv3Ldaq3V=Hw@mail.gmail.com

On Fri, Feb 17, 2012 at 3:48 AM, Heikki Linnakangas
<heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
> Here's a wild idea: keep the class of each codepoint in a hash table.
> Initialize it with all codepoints up to 0xFFFF. After that, whenever a
> string contains a character that's not in the hash table yet, query the
> class of that character, and add it to the hash table. Then recompile the
> whole regex and restart the matching engine.
>
> Recompiling is expensive, but if you cache the results for the session, it
> would probably be acceptable.

What if you did this ONCE and wrote the results to a file someplace?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Notes about fixing regexes and UTF-8 (yet again)
Date: 2012-02-17 15:19:43
Message-ID: 26105.1329491983@sss.pgh.pa.us

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> On Fri, Feb 17, 2012 at 3:48 AM, Heikki Linnakangas
> <heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
>> Recompiling is expensive, but if you cache the results for the session, it
>> would probably be acceptable.

> What if you did this ONCE and wrote the results to a file someplace?

That's still a cache, you've just defaulted on your obligation to think
about what conditions require the cache to be flushed. (In the case at
hand, the trigger for a cache rebuild would probably need to be a glibc
package update, which we have no way of knowing about.)

Before going much further with this, we should probably do some timings
of 64K calls of iswupper and friends, just to see how bad a dumb
implementation will be.
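
A standalone loop along these lines should give a ballpark figure (not
PG code; it assumes an en_US.utf8 locale is installed):

    #include <locale.h>
    #include <stdio.h>
    #include <time.h>
    #include <wctype.h>

    int
    main(void)
    {
        struct timespec t0, t1;
        int hits = 0;

        if (setlocale(LC_ALL, "en_US.utf8") == NULL)
            return 1;               /* locale not installed */

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (wint_t c = 0; c <= 0xFFFF; c++)
            if (iswupper(c))
                hits++;
        clock_gettime(CLOCK_MONOTONIC, &t1);

        printf("%d uppercase, %.3f ms\n", hits,
               (t1.tv_sec - t0.tv_sec) * 1000.0 +
               (t1.tv_nsec - t0.tv_nsec) / 1000000.0);
        return 0;
    }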

regards, tom lane


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Notes about fixing regexes and UTF-8 (yet again)
Date: 2012-02-17 17:09:25
Message-ID: CA+TgmoY1wXqtuHaq+T1M-2MsU=w4c7wpiVg1SzJVQVT-dqZjZw@mail.gmail.com

On Fri, Feb 17, 2012 at 10:19 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> What if you did this ONCE and wrote the results to a file someplace?
>
> That's still a cache, you've just defaulted on your obligation to think
> about what conditions require the cache to be flushed.

Yep. Unfortunately, I don't have a good idea how to handle that; I
was hoping someone else did.

> Before going much further with this, we should probably do some timings
> of 64K calls of iswupper and friends, just to see how bad a dumb
> implementation will be.

Can't hurt.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Notes about fixing regexes and UTF-8 (yet again)
Date: 2012-02-17 23:06:19
Message-ID: 6555.1329519979@sss.pgh.pa.us

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> On Fri, Feb 17, 2012 at 10:19 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> Before going much further with this, we should probably do some timings
>> of 64K calls of iswupper and friends, just to see how bad a dumb
>> implementation will be.

> Can't hurt.

The answer, on a reasonably new desktop machine (2.0GHz Xeon E5503)
running Fedora 16 in en_US.utf8 locale, is that 64K iterations of
pg_wc_isalpha or sibling functions requires a shade under 2ms.
So this definitely justifies caching the values to avoid computing
them more than once per session, but I'm not convinced there are
grounds for trying harder than that.

BTW, I am also a bit surprised to find out that this locale considers
48342 of those characters to satisfy isalpha(). Seems like a heck of
a lot. But anyway we can forget my idea of trying to save work by
incorporating a-priori assumptions about which Unicode codepoints are
which --- it'll be faster to just iterate through them all, at least
for that case. Maybe we should hard-wire some cases like digits, not
sure.

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Notes about fixing regexes and UTF-8 (yet again)
Date: 2012-02-18 02:17:27
Message-ID: 12169.1329531447@sss.pgh.pa.us

I wrote:
> The answer, on a reasonably new desktop machine (2.0GHz Xeon E5503)
> running Fedora 16 in en_US.utf8 locale, is that 64K iterations of
> pg_wc_isalpha or sibling functions requires a shade under 2ms.
> So this definitely justifies caching the values to avoid computing
> them more than once per session, but I'm not convinced there are
> grounds for trying harder than that.

And here's a poorly-tested draft patch for that.

regards, tom lane

Attachment: fix-regex-locales-one-more-time.patch (text/x-patch, 13.3 KB)

From: NISHIYAMA Tomoaki <tomoakin(at)staff(dot)kanazawa-u(dot)ac(dot)jp>
To: pgsql-hackers(at)postgreSQL(dot)org
Cc: NISHIYAMA Tomoaki <tomoakin(at)staff(dot)kanazawa-u(dot)ac(dot)jp>
Subject: Re: Notes about fixing regexes and UTF-8 (yet again)
Date: 2012-02-18 09:29:57
Message-ID: E4F0A52A-AA30-40CB-86A4-D795AB33DC64@staff.kanazawa-u.ac.jp


I don't believe it is valid to ignore CJK characters above U+20000.
They are used in names, so they will be stored in the database.
If their behaviour differs from that of characters below U+FFFF, you
will get a bug report sooner or later.

see
CJK Extension B, C, and D
from
http://www.unicode.org/charts/

Also, there are some code points above U+FFFF that could be regarded as
letters and numbers:
http://en.wikipedia.org/wiki/Mathematical_alphanumeric_symbols

On the other hand, it is OK if processing of characters above U+10000 is
very slow, as long as they are processed properly, because such
characters are rare.

On 2012/02/17, at 23:56, Andrew Dunstan wrote:

>
>
> On 02/17/2012 09:39 AM, Tom Lane wrote:
>> Heikki Linnakangas<heikki(dot)linnakangas(at)enterprisedb(dot)com> writes:
>>> Here's a wild idea: keep the class of each codepoint in a hash table.
>>> Initialize it with all codepoints up to 0xFFFF. After that, whenever a
>>> string contains a character that's not in the hash table yet, query the
>>> class of that character, and add it to the hash table. Then recompile
>>> the whole regex and restart the matching engine.
>>> Recompiling is expensive, but if you cache the results for the session,
>>> it would probably be acceptable.
>> Dunno ... recompiling is so expensive that I can't see this being a win;
>> not to mention that it would require fundamental surgery on the regex
>> code.
>>
>> In the Tcl implementation, no codepoints above U+FFFF have any locale
>> properties (alpha/digit/punct/etc), period. Personally I'd not have a
>> problem imposing the same limitation, so that dealing with stuff above
>> that range isn't really a consideration anyway.
>
>
> Up to U+FFFF is the BMP, which is described as containing "characters for almost all modern languages, and a large number of special characters." It seems very likely to be acceptable not to bother about the locale properties of code points in the supplementary planes.
>
> See <http://en.wikipedia.org/wiki/Plane_%28Unicode%29> for descriptions of which sets of characters are involved.
>
>
> cheers
>
> andrew


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: NISHIYAMA Tomoaki <tomoakin(at)staff(dot)kanazawa-u(dot)ac(dot)jp>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Notes about fixing regexes and UTF-8 (yet again)
Date: 2012-02-18 17:15:14
Message-ID: 28618.1329585314@sss.pgh.pa.us

NISHIYAMA Tomoaki <tomoakin(at)staff(dot)kanazawa-u(dot)ac(dot)jp> writes:
> I don't believe it is valid to ignore CJK characters above U+20000.
> They are used in names, so they will be stored in the database.
> If their behaviour differs from that of characters below U+FFFF, you
> will get a bug report sooner or later.

I am skeptical that there is enough usage of such things to justify
slowing regexp operations down for everybody. Note that it's not only
the initial probe of libc behavior that's at stake here --- the more
character codes are treated as letters, the larger the DFA transition
maps get and the more time it takes to build them. So I'm unexcited
about just cranking up the loop limit in pg_ctype_get_cache.

> On the other hand, it is OK if processing of characters above U+10000
> is very slow, as long as they are processed properly, because such
> characters are rare.

Yeah, it's conceivable that we could implement something whereby
characters with codes above some cutoff point are handled via runtime
calls to iswalpha() and friends, rather than being included in the
statically-constructed DFA maps. The cutoff point could likely be a lot
less than U+FFFF, too, thereby saving storage and map build time all
round.
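
In outline, the per-character test at match time might look like this
(names invented for illustration; the map would be built at
regex-compile time):

    #include <stdbool.h>
    #include <wctype.h>

    #define DFA_MAP_CUTOFF 0x7FF    /* hypothetical cutoff */

    /* built at regex-compile time, covering only 0..DFA_MAP_CUTOFF */
    static unsigned char dfa_alpha_map[(DFA_MAP_CUTOFF + 1) / 8];

    static bool
    char_is_alpha(wint_t c)
    {
        if (c <= DFA_MAP_CUTOFF)    /* common case: consult the static map */
            return (dfa_alpha_map[c / 8] & (1 << (c % 8))) != 0;
        return iswalpha(c) != 0;    /* rare case: ask libc at runtime */
    }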

However, that "we" above is the editorial "we". *I* am not going to
do this. Somebody who actually has a need for it should step up.

regards, tom lane


From: Dimitri Fontaine <dimitri(at)2ndQuadrant(dot)fr>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: NISHIYAMA Tomoaki <tomoakin(at)staff(dot)kanazawa-u(dot)ac(dot)jp>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Notes about fixing regexes and UTF-8 (yet again)
Date: 2012-02-18 23:01:37
Message-ID: m2hayng2zy.fsf@2ndQuadrant.fr

Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:
> Yeah, it's conceivable that we could implement something whereby
> characters with codes above some cutoff point are handled via runtime
> calls to iswalpha() and friends, rather than being included in the
> statically-constructed DFA maps. The cutoff point could likely be a lot
> less than U+FFFF, too, thereby saving storage and map build time all
> round.

It's been proposed to build a “regexp” type in PostgreSQL which would
store the DFA directly and provide some way to run that DFA out of its
“storage” without recompiling.

Would such a mechanism be useful here? Would it be useful only when
storing the regexp in a column somewhere and then applying it in the
query from there (so most probably adding a join or subquery somewhere)?

Regards,
--
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Dimitri Fontaine <dimitri(at)2ndQuadrant(dot)fr>
Cc: NISHIYAMA Tomoaki <tomoakin(at)staff(dot)kanazawa-u(dot)ac(dot)jp>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Notes about fixing regexes and UTF-8 (yet again)
Date: 2012-02-18 23:45:10
Message-ID: 7392.1329608710@sss.pgh.pa.us

Dimitri Fontaine <dimitri(at)2ndQuadrant(dot)fr> writes:
> Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:
>> Yeah, it's conceivable that we could implement something whereby
>> characters with codes above some cutoff point are handled via runtime
>> calls to iswalpha() and friends, rather than being included in the
>> statically-constructed DFA maps. The cutoff point could likely be a lot
>> less than U+FFFF, too, thereby saving storage and map build time all
>> round.

> It's been proposed to build a regexp type in PostgreSQL which would
> store the DFA directly and provide some way to run that DFA out of its
> storage without recompiling.

> Would such a mechanism be useful here?

No, this is about what goes into the DFA representation in the first
place, not about how we store it and reuse it.

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Notes about fixing regexes and UTF-8 (yet again)
Date: 2012-02-19 00:29:33
Message-ID: 8291.1329611373@sss.pgh.pa.us

I wrote:
> And here's a poorly-tested draft patch for that.

I've done some more testing now, and am satisfied that this works as
intended. However, some crude performance testing suggests that people
might be annoyed with it. As an example, in 9.1 with pl_PL.utf8 locale,
I see this:
select 'aaaaaaaaaa' ~ '\w\w\w\w\w\w\w\w\w\w\w';
taking perhaps 0.75 ms on first execution and 0.4 ms on subsequent
executions, the difference being the time needed to compile and cache
the DFA representation of the regexp. With the patch, the numbers are
more like 5 ms and 0.4 ms, meaning the compilation time has gone up by
something near a factor of 10, though AFAICT execution time hasn't
moved. It's hard to tell how significant that would be to real-world
queries, but in the worst case where our caching of regexps doesn't help
much, it could be disastrous.

All of the extra time is in manipulation of the much larger number of
DFA arcs required to represent all the additional character codes that
are being considered to be letters.

Perhaps I'm being overly ASCII-centric, but I'm afraid to commit this
as-is; I think the number of people who are hurt by the performance
degradation will be much larger than the number who are glad because
characters in $random_alphabet are now seen to be letters. I think an
actually workable solution will require something like what I speculated
about earlier:

> Yeah, it's conceivable that we could implement something whereby
> characters with codes above some cutoff point are handled via runtime
> calls to iswalpha() and friends, rather than being included in the
> statically-constructed DFA maps. The cutoff point could likely be a lot
> less than U+FFFF, too, thereby saving storage and map build time all
> round.

In the meantime, I still think the caching logic is worth having, and
we could at least make some people happy if we selected a cutoff point
somewhere between U+FF and U+FFFF. I don't have any strong ideas about
what a good compromise cutoff would be. One possibility is U+7FF, which
corresponds to the limit of what fits in 2-byte UTF8; but I don't know
if that corresponds to any significant dropoff in frequency of usage.
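
Spelling out the arithmetic behind that boundary (a throwaway sketch,
not PG code):

    #include <stdio.h>

    /*
     * A two-byte UTF8 sequence is 110xxxxx 10xxxxxx, which carries
     * 5 + 6 = 11 payload bits, so the largest two-byte code point is
     * (1 << 11) - 1 = 0x7FF.
     */
    static int
    utf8_encode(unsigned int cp, unsigned char *buf)
    {
        if (cp <= 0x7F)
        {
            buf[0] = (unsigned char) cp;
            return 1;
        }
        if (cp <= 0x7FF)
        {
            buf[0] = 0xC0 | (cp >> 6);
            buf[1] = 0x80 | (cp & 0x3F);
            return 2;
        }
        return 0;                   /* three- and four-byte forms omitted */
    }

    int
    main(void)
    {
        unsigned char b[2];
        int n = utf8_encode(0x7FF, b);

        printf("U+07FF -> %d bytes: %02X %02X\n", n, b[0], b[1]);
        /* prints: U+07FF -> 2 bytes: DF BF */
        return 0;
    }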

regards, tom lane


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Notes about fixing regexes and UTF-8 (yet again)
Date: 2012-02-19 03:33:07
Message-ID: CA+TgmoZtsKi8oQyjn=HMQW8JYnhYmDjzSwAA1-uZNWZBRK+g-A@mail.gmail.com

On Sat, Feb 18, 2012 at 7:29 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> Yeah, it's conceivable that we could implement something whereby
>> characters with codes above some cutoff point are handled via runtime
>> calls to iswalpha() and friends, rather than being included in the
>> statically-constructed DFA maps.  The cutoff point could likely be a lot
>> less than U+FFFF, too, thereby saving storage and map build time all
>> round.
>
> In the meantime, I still think the caching logic is worth having, and
> we could at least make some people happy if we selected a cutoff point
> somewhere between U+FF and U+FFFF.  I don't have any strong ideas about
> what a good compromise cutoff would be.  One possibility is U+7FF, which
> corresponds to the limit of what fits in 2-byte UTF8; but I don't know
> if that corresponds to any significant dropoff in frequency of usage.

The problem, of course, is that this probably depends quite a bit on
what language you happen to be using. For some languages, it won't
matter whether you cut it off at U+FF or U+7FF; while for others even
U+FFFF might not be enough. So I think this is one of those cases
where it's somewhat meaningless to talk about frequency of usage.

In theory you can imagine a regular expression engine where these
decisions can be postponed until we see the string we're matching
against. IOW, your DFA ends up with state transitions for characters
specifically named, plus a state transition for "anything else that's
a letter", plus a state transition for "anything else not otherwise
specified". Then you only need to test the letters that actually
appear in the target string, rather than all of the ones that might
appear there.

But implementing that could be quite a lot of work.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Vik Reykja <vikreykja(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Notes about fixing regexes and UTF-8 (yet again)
Date: 2012-02-19 03:38:31
Message-ID: CALDgxVtk41fkTcF+24b1DbytbwD=kO+K-HGbMyOwjT45TRRkiQ@mail.gmail.com

On Sun, Feb 19, 2012 at 04:33, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

> On Sat, Feb 18, 2012 at 7:29 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> >> Yeah, it's conceivable that we could implement something whereby
> >> characters with codes above some cutoff point are handled via runtime
> >> calls to iswalpha() and friends, rather than being included in the
> >> statically-constructed DFA maps. The cutoff point could likely be a lot
> >> less than U+FFFF, too, thereby saving storage and map build time all
> >> round.
> >
> > In the meantime, I still think the caching logic is worth having, and
> > we could at least make some people happy if we selected a cutoff point
> > somewhere between U+FF and U+FFFF. I don't have any strong ideas about
> > what a good compromise cutoff would be. One possibility is U+7FF, which
> > corresponds to the limit of what fits in 2-byte UTF8; but I don't know
> > if that corresponds to any significant dropoff in frequency of usage.
>
> The problem, of course, is that this probably depends quite a bit on
> what language you happen to be using. For some languages, it won't
> matter whether you cut it off at U+FF or U+7FF; while for others even
> U+FFFF might not be enough. So I think this is one of those cases
> where it's somewhat meaningless to talk about frequency of usage.
>

Does it make sense for regexps to have collations?


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Vik Reykja <vikreykja(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Notes about fixing regexes and UTF-8 (yet again)
Date: 2012-02-19 04:03:55
Message-ID: CA+Tgmoa0hJMbga=KjBFgr4veAMGfK9gDNvH-NCEDp4nmaXTJJA@mail.gmail.com

On Sat, Feb 18, 2012 at 10:38 PM, Vik Reykja <vikreykja(at)gmail(dot)com> wrote:
> Does it make sense for regexps to have collations?

As I understand it, collations determine the sort-ordering of strings.
Regular expressions don't care about that. Why do you ask?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Vik Reykja <vikreykja(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Notes about fixing regexes and UTF-8 (yet again)
Date: 2012-02-19 04:11:42
Message-ID: CALDgxVvHVZuq9ZZX_tTHLGQW89LhZr2x0NUaa-eMALCkbfBDiw@mail.gmail.com

On Sun, Feb 19, 2012 at 05:03, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

> On Sat, Feb 18, 2012 at 10:38 PM, Vik Reykja <vikreykja(at)gmail(dot)com> wrote:
> > Does it make sense for regexps to have collations?
>
> As I understand it, collations determine the sort-ordering of strings.
> Regular expressions don't care about that. Why do you ask?
>

Perhaps I used the wrong term, but I was thinking the locale could tell us
what alphabet we're dealing with. So a regexp using en_US would give
different word-boundary results from one using zh_CN.


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Notes about fixing regexes and UTF-8 (yet again)
Date: 2012-02-19 04:16:53
Message-ID: 12859.1329625013@sss.pgh.pa.us

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> In theory you can imagine a regular expression engine where these
> decisions can be postponed until we see the string we're matching
> against. IOW, your DFA ends up with state transitions for characters
> specifically named, plus a state transition for "anything else that's
> a letter", plus a state transition for "anything else not otherwise
> specified". Then you only need to test the letters that actually
> appear in the target string, rather than all of the ones that might
> appear there.

> But implementing that could be quite a lot of work.

Yeah, not to mention slow. The difficulty is overlapping sets of
characters. As a simple example, if your regex refers to 3, 7,
[[:digit:]], X, and [[:alnum:]], then you end up needing five distinct
"colors": 3, 7, X, all digits that aren't 3 or 7, all alphanumerics
that aren't any of the preceding. And state transitions for the digit
and alnum cases had better mention all and only the correct colors.
I've been tracing through the logic this evening, and it works pretty
simply given that all named character classes are immediately expanded
out to their component characters. If we are going to try to keep
the classes in some kind of symbolic form, it's a lot messier. In
particular, I think your sketch above would lead to having to test
every character against iswdigit and iswalnum at runtime, which would
be disastrous performancewise. I'd like to at least avoid that for the
shorter (and presumably more common) UTF8 codes.
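
Here's a throwaway ASCII-only sketch of that partitioning, just to
count the colors in the example:

    #include <ctype.h>
    #include <stdio.h>

    /*
     * The five sets from the example: 3, 7, [[:digit:]], X, [[:alnum:]].
     * A character's signature records which sets contain it; each
     * distinct nonzero signature becomes one color.
     */
    static unsigned
    signature(int c)
    {
        unsigned s = 0;

        if (c == '3')
            s |= 1;
        if (c == '7')
            s |= 2;
        if (isdigit(c))
            s |= 4;
        if (c == 'X')
            s |= 8;
        if (isalnum(c))
            s |= 16;
        return s;
    }

    int
    main(void)
    {
        unsigned seen[32];
        int ncolors = 0;

        for (int c = 1; c < 128; c++)
        {
            unsigned s = signature(c);
            int i;

            if (s == 0)
                continue;           /* the "everything else" pseudo-color */
            for (i = 0; i < ncolors; i++)
                if (seen[i] == s)
                    break;
            if (i == ncolors)
                seen[ncolors++] = s;
        }
        printf("%d colors\n", ncolors);     /* prints: 5 colors */
        return 0;
    }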

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Vik Reykja <vikreykja(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Notes about fixing regexes and UTF-8 (yet again)
Date: 2012-02-19 04:41:55
Message-ID: 13378.1329626515@sss.pgh.pa.us

Vik Reykja <vikreykja(at)gmail(dot)com> writes:
> On Sun, Feb 19, 2012 at 05:03, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>> On Sat, Feb 18, 2012 at 10:38 PM, Vik Reykja <vikreykja(at)gmail(dot)com> wrote:
>>> Does it make sense for regexps to have collations?

>> As I understand it, collations determine the sort-ordering of strings.
>> Regular expressions don't care about that. Why do you ask?

> Perhaps I used the wrong term, but I was thinking the locale could tell us
> what alphabet we're dealing with. So a regexp using en_US would give
> different word-boundary results from one using zh_CN.

Our interpretation of a "collation" is that it sets both LC_COLLATE and
LC_CTYPE. Regexps may not care about the first but they definitely care
about the second. This is why the stuff in regc_pg_locale.c pays
attention to collation.

regards, tom lane


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Notes about fixing regexes and UTF-8 (yet again)
Date: 2012-02-19 04:49:32
Message-ID: CA+TgmoYs86AawgrUnEfE=HsMGm8r2dN-x-2W-eL-8dUurQD7cQ@mail.gmail.com

On Sat, Feb 18, 2012 at 11:16 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> In theory you can imagine a regular expression engine where these
>> decisions can be postponed until we see the string we're matching
>> against.  IOW, your DFA ends up with state transitions for characters
>> specifically named, plus a state transition for "anything else that's
>> a letter", plus a state transition for "anything else not otherwise
>> specified".  Then you only need to test the letters that actually
>> appear in the target string, rather than all of the ones that might
>> appear there.
>
>> But implementing that could be quite a lot of work.
>
> Yeah, not to mention slow.  The difficulty is overlapping sets of
> characters.  As a simple example, if your regex refers to 3, 7,
> [[:digit:]], X, and [[:alnum:]], then you end up needing five distinct
> "colors": 3, 7, X, all digits that aren't 3 or 7, all alphanumerics
> that aren't any of the preceding.  And state transitions for the digit
> and alnum cases had better mention all and only the correct colors.

Yeah, that's unfortunate. On the other hand, if you don't use colors
for this case, aren't you going to need, for each DFA state, a
gigantic lookup table that includes every character in the server
encoding? Even if you've got plenty of memory, initializing such a
beast seems awfully expensive, and it might not do very good things
for cache locality, either.

> I've been tracing through the logic this evening, and it works pretty
> simply given that all named character classes are immediately expanded
> out to their component characters.  If we are going to try to keep
> the classes in some kind of symbolic form, it's a lot messier.  In
> particular, I think your sketch above would lead to having to test
> every character against iswdigit and iswalnum at runtime, which would
> be disastrous performancewise.  I'd like to at least avoid that for the
> shorter (and presumably more common) UTF8 codes.

Hmm, but you could cache that information. Instead of building a
cache that covers every possible character that might appear in the
target string, you can just cache the results for the code points that
you actually see.

Yet another option would be to dictate that the cache can't have holes -
will always include information for every code point from 0 up to some
value X. If we see a code point in the target string which is greater
than X, then we extend the cache out as far as that code point. That
way, people who are using only code points out to U+FF (or even U+7F)
don't pay the cost of building a large cache, but people who need it
can get correct behavior.
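
Something like this, say (made-up names, and no attention paid to
where the memory comes from):

    #include <stdlib.h>
    #include <wctype.h>

    /* The cache always covers code points [0, cache_limit), no holes. */
    static unsigned char *alpha_cache;
    static unsigned int cache_limit;

    static void
    extend_cache(unsigned int needed)
    {
        unsigned char *newcache = realloc(alpha_cache, needed);

        if (newcache == NULL)
            abort();                /* a real version would report the error */
        alpha_cache = newcache;
        for (unsigned int c = cache_limit; c < needed; c++)
            alpha_cache[c] = iswalpha((wint_t) c) != 0;
        cache_limit = needed;
    }

    static int
    cached_is_alpha(unsigned int c)
    {
        if (c >= cache_limit)
            extend_cache(c + 1);    /* fill everything up to c: no holes */
        return alpha_cache[c];
    }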

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Peter Eisentraut <peter_e(at)gmx(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Notes about fixing regexes and UTF-8 (yet again)
Date: 2012-02-23 23:16:30
Message-ID: 1330038990.6474.24.camel@vanquo.pezone.net

On Fri, 2012-02-17 at 10:19 -0500, Tom Lane wrote:
> > What if you did this ONCE and wrote the results to a file someplace?
>
> That's still a cache, you've just defaulted on your obligation to think
> about what conditions require the cache to be flushed. (In the case at
> hand, the trigger for a cache rebuild would probably need to be a glibc
> package update, which we have no way of knowing about.)

We basically hardwire locale behavior at initdb time, so computing this
then and storing it somewhere for eternity would be consistent.


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Peter Eisentraut <peter_e(at)gmx(dot)net>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Notes about fixing regexes and UTF-8 (yet again)
Date: 2012-02-24 00:10:31
Message-ID: 26174.1330042231@sss.pgh.pa.us

Peter Eisentraut <peter_e(at)gmx(dot)net> writes:
> On Fri, 2012-02-17 at 10:19 -0500, Tom Lane wrote:
>> That's still a cache, you've just defaulted on your obligation to think
>> about what conditions require the cache to be flushed. (In the case at
>> hand, the trigger for a cache rebuild would probably need to be a glibc
>> package update, which we have no way of knowing about.)

> We basically hardwire locale behavior at initdb time, so computing this
> then and storing it somewhere for eternity would be consistent.

Well, only if we could cache every locale-related libc inquiry we ever
make. Locking down only part of the behavior does not sound like a
plan.

regards, tom lane