Future of our regular expression code

Lists: pgsql-hackers
From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql-hackers(at)postgreSQL(dot)org
Subject: Future of our regular expression code
Date: 2012-02-18 18:15:28
Message-ID: 29776.1329588928@sss.pgh.pa.us

As those who've been paying attention to it know, our regular expression
library is based on code originally developed by Henry Spencer and
contributed by him to the Tcl project. We adopted it out of Tcl in
2003. Henry intended to package the code as a standalone library as
well, but that never happened --- AFAICT, Henry dropped off the net
around 2002, and I have no idea what happened to him.

Since then, we've been acting as though the Tcl guys are upstream
maintainers for the regex code, but in point of fact there does not
appear to be anybody there with more than the first clue about that
code. This was brought home to me a few days ago when I started talking
to them about possible ways to fix the quantified-backrefs problem that
depesz recently complained of (which turns out to have been an open bug
in their tracker since 2005). As soon as I betrayed any indication of
knowing the difference between a DFA and an NFA, they offered me commit
privileges :-(. And they haven't fixed any other significant bugs in
the engine in years, either.

So I think it's time to face facts and accept that Tcl are not a useful
upstream for the regex code. And we can't just let it sit quietly,
because we have bugs to fix (at least the one) as well as enhancement
requests such as the nearby discussion about recognizing high Unicode
code points as letters.

A radical response to this would be to drop the Spencer regex engine and
use something else instead --- probably PCRE, because there are not all
that many alternatives out there. I do not care much for this idea
though. It would be a significant amount of work in itself, and there's
no real guarantee that PCRE will continue to be maintained either, and
there would be some user-visible compatibility issues because the regex
flavor is a bit different. A larger point is that it'd be a real shame
for the Spencer regex engine to die off, because it is in fact one of
the best pieces of regex technology on the planet. See Jeffrey Friedl's
"Mastering Regular Expressions" (O'Reilly) --- at least, that's what he
thought in the 2002 edition I have, and it's unlikely that things have
changed much.

So I'm feeling that we gotta suck it up and start acting like we are
the lead maintainers for this code, not just consumers.

Another possible long-term answer is to finish the work Henry never did,
that is make the code into a standalone library. That would make it
available to more projects and perhaps attract other people to help
maintain it. However, that looks like a lot of work too, with distant
and uncertain payoff.

Comments, other ideas?

regards, tom lane
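
For readers unfamiliar with the term: a "quantified backref" is a backreference such as \1 that itself carries a quantifier. The sketch below is an illustration only. It uses Python's re module, which is a different (backtracking) engine from the Spencer code, and the patterns are hypothetical examples, not the ones from the bug report mentioned above.

```python
import re

# Illustration only: Python's "re" is a different engine from the Spencer code,
# and these patterns are hypothetical, not the ones from the bug report.

def full_match(pattern, text):
    """Return True if the whole of text matches pattern."""
    return re.fullmatch(pattern, text) is not None

# (.)\1+ : one character, then that same character one or more times.
print(full_match(r"(.)\1+", "aaaa"))    # True
print(full_match(r"(.)\1+", "abab"))    # False

# (ab)\1* : "ab", then zero or more further copies of "ab".
print(full_match(r"(ab)\1*", "ababab"))  # True

# (a+)b\1 : the text after "b" must repeat exactly what the group captured,
# so the matcher has to remember an arbitrary-length string.
print(full_match(r"(a+)b\1", "aabaa"))   # True
print(full_match(r"(a+)b\1", "aabaaa"))  # False
```

The last pair shows why backrefs are the awkward case: the engine must compare against text captured earlier in the same match, which a pure finite automaton cannot track.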


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Future of our regular expression code
Date: 2012-02-18 18:34:50
Message-ID: CA+U5nMJ9Nc+6s2AtROXvMx8XbARpjcHw8qih+AC1fp6A0SovPA@mail.gmail.com

On Sat, Feb 18, 2012 at 6:15 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> So I'm feeling that we gotta suck it up and start acting like we are
> the lead maintainers for this code, not just consumers.

By "we", I take it you mean you personally?

There are many requests I might make for allocations of your time and
that wouldn't even be a lower priority item on such a list. Of course,
your time allocation is not my affair, so please take my words as a
suggestion and a compliment.

Do we have volunteers that might save Tom from taking on this task?
It's not something that requires too much knowledge and experience of
PostgreSQL, so is an easier task for a newcomer.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


From: Stephen Frost <sfrost(at)snowman(dot)net>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Future of our regular expression code
Date: 2012-02-18 19:25:47
Message-ID: 20120218192547.GJ17355@tamriel.snowman.net

* Simon Riggs (simon(at)2ndQuadrant(dot)com) wrote:
> On Sat, Feb 18, 2012 at 6:15 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> > So I'm feeling that we gotta suck it up and start acting like we are
> > the lead maintainers for this code, not just consumers.
>
> By "we", I take it you mean you personally?

I'm pretty sure he meant the PG project, and I'd agree with him- we're
going to have to do it if no one else is. I suspect the Tcl folks will
be happy to look at incorporating anything we fix, if they can, but it
doesn't sound like they'll be able to help with fixing things much.

> Do we have volunteers that might save Tom from taking on this task?
> It's not something that requires too much knowledge and experience of
> PostgreSQL, so is an easier task for a newcomer.

Sure, it doesn't require knowledge of PG, but I dare say there aren't
very many newcomers who are going to walk in knowing how to manage
complex regex code.. I haven't seen too many who can update gram.y,
much less make our regex code handle Unicode better. I'm all for
getting other people to help with the code, of course, but I wouldn't
hold my breath and leave existing bugs open on the hopes that someone's
gonna show up.

Thanks,

Stephen


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Stephen Frost <sfrost(at)snowman(dot)net>
Cc: Simon Riggs <simon(at)2ndQuadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Future of our regular expression code
Date: 2012-02-18 19:52:41
Message-ID: 1863.1329594761@sss.pgh.pa.us

Stephen Frost <sfrost(at)snowman(dot)net> writes:
> * Simon Riggs (simon(at)2ndQuadrant(dot)com) wrote:
>> Do we have volunteers that might save Tom from taking on this task?
>> It's not something that requires too much knowledge and experience of
>> PostgreSQL, so is an easier task for a newcomer.

> Sure, it doesn't require knowledge of PG, but I dare say there aren't
> very many newcomers who are going to walk in knowing how to manage
> complex regex code.. I haven't seen too many who can update gram.y,
> much less make our regex code handle Unicode better. I'm all for
> getting other people to help with the code, of course, but I wouldn't
> hold my breath and leave existing bugs open on the hopes that someone's
> gonna show up.

Yeah ... if you *don't* know the difference between a DFA and an NFA,
you're likely to find yourself in over your head. Having said that,
this is eminently learnable stuff and pretty self-contained, so somebody
who had the time and interest could make themselves into an expert in
a reasonable amount of time. I'm not really eager to become the
project's regex guru, but only because I have ninety-nine other things
to do, not because I don't find it interesting. Right at the moment I'm
probably far enough up the learning curve that I can fix the backref
problem faster than anyone else, so I'm kind of inclined to go do that.
But I'd be entirely happy to let someone else become the lead hacker in
this area going forward. What we can't do is just pretend that it
doesn't need attention.

In the long run I do wish that Spencer's code would become a standalone
package and have more users than just us and Tcl, but that is definitely
work I don't have time for now. I think somebody would need to commit
significant amounts of time over multiple years to give it any real hope
of success.

One immediate consequence of deciding that we are lead maintainers and
not just consumers is that we should put in some regression tests,
instead of taking the attitude that the Tcl guys are in charge of that.
I have a head cold today and am not firing on enough cylinders to do
anything actually complicated, so I was thinking of spending the
afternoon transliterating the Tcl regex test cases into SQL as a
starting point.

regards, tom lane


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Stephen Frost <sfrost(at)snowman(dot)net>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Future of our regular expression code
Date: 2012-02-18 20:04:53
Message-ID: CA+U5nMJu2UOsWXG3BGcw8q0xkPyS7OcjRMrkvCfY4pNca7O95w@mail.gmail.com

On Sat, Feb 18, 2012 at 7:52 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> One immediate consequence of deciding that we are lead maintainers and
> not just consumers is that we should put in some regression tests,
> instead of taking the attitude that the Tcl guys are in charge of that.
> I have a head cold today and am not firing on enough cylinders to do
> anything actually complicated, so I was thinking of spending the
> afternoon transliterating the Tcl regex test cases into SQL as a
> starting point.

Having just had that brand of virus, I'd skip it and take the time
off, like I should have.

Translating the test cases is a great way in for a volunteer, so
please leave a few easy things to get people started on the road to
maintaining that.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


From: Vik Reykja <vikreykja(at)gmail(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Stephen Frost <sfrost(at)snowman(dot)net>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Future of our regular expression code
Date: 2012-02-18 20:16:00
Message-ID: CALDgxVvKb0cggpuKRnsQ-JH7PBdTKdM79nn9aEqQVwJ2=fxtWA@mail.gmail.com

On Sat, Feb 18, 2012 at 21:04, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:

> On Sat, Feb 18, 2012 at 7:52 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>
> > One immediate consequence of deciding that we are lead maintainers and
> > not just consumers is that we should put in some regression tests,
> > instead of taking the attitude that the Tcl guys are in charge of that.
> > I have a head cold today and am not firing on enough cylinders to do
> > anything actually complicated, so I was thinking of spending the
> > afternoon transliterating the Tcl regex test cases into SQL as a
> > starting point.
>
> Having just had that brand of virus, I'd skip it and take the time
> off, like I should have.
>
> Translating the test cases is a great way in for a volunteer, so
> please leave a few easy things to get people started on the road to
> maintaining that.
>
>
I would be willing to have a go at translating test cases. I do not (yet)
have the C knowledge to maintain the regex code, though.


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Stephen Frost <sfrost(at)snowman(dot)net>
Cc: Simon Riggs <simon(at)2ndQuadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Future of our regular expression code
Date: 2012-02-18 20:17:53
Message-ID: 4F400771.1000909@dunslane.net

On 02/18/2012 02:25 PM, Stephen Frost wrote:
>> Do we have volunteers that might save Tom from taking on this task?
>> It's not something that requires too much knowledge and experience of
>> PostgreSQL, so is an easier task for a newcomer.
> Sure, it doesn't require knowledge of PG, but I dare say there aren't
> very many newcomers who are going to walk in knowing how to manage
> complex regex code.. I haven't seen too many who can update gram.y,
> much less make our regex code handle Unicode better. I'm all for
> getting other people to help with the code, of course, but I wouldn't
> hold my breath and leave existing bugs open on the hopes that someone's
> gonna show up.

Indeed, the number of people in the community who can hit the ground
running with this is probably vanishingly small, sadly. (I haven't
touched any formal DFA/NFA code in a couple of decades.)

cheers

andrew


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Vik Reykja <vikreykja(at)gmail(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Stephen Frost <sfrost(at)snowman(dot)net>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Future of our regular expression code
Date: 2012-02-18 20:37:10
Message-ID: 2791.1329597430@sss.pgh.pa.us

Vik Reykja <vikreykja(at)gmail(dot)com> writes:
> On Sat, Feb 18, 2012 at 21:04, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
>> Translating the test cases is a great way in for a volunteer, so
>> please leave a few easy things to get people started on the road to
>> maintaining that.

> I would be willing to have a go at translating test cases. I do not (yet)
> have the C knowledge to maintain the regex code, though.

Sure, have at it. I was thinking that we should make a new regex.sql
test file for any cases that are locale-independent. If they have any
that are dependent on recognizing non-ASCII characters as letters,
we could perhaps drop those into collate.linux.utf8.sql --- note that
we might need my draft patch from yesterday before anything outside the
LATIN1 character set would pass.

regards, tom lane
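
A minimal sketch of how such a transliteration could be mechanized, assuming the test cases have already been reduced to (pattern, subject, expected) tuples. That tuple layout and the function names are hypothetical; the actual Tcl test file format is not reproduced here. The generated statements use PostgreSQL's ~ operator and assume standard_conforming_strings is on, so that backslashes in literals need no doubling.

```python
def sql_literal(s):
    """Quote s as a SQL string literal by doubling embedded single quotes."""
    return "'" + s.replace("'", "''") + "'"

def to_sql(cases):
    """Turn (pattern, subject, expected) tuples into regex.sql-style statements."""
    lines = []
    for pattern, subject, expected in cases:
        # Record the expected boolean result as a comment for the reviewer.
        lines.append("-- expect %s" % ("t" if expected else "f"))
        lines.append("SELECT %s ~ %s AS matched;"
                     % (sql_literal(subject), sql_literal(pattern)))
    return "\n".join(lines)

# Hypothetical cases, for illustration only.
cases = [
    ("a+b", "aab", True),
    ("^x$", "it's", False),
]
print(to_sql(cases))
```

In a real regression test the expected results would live in the matching expected-output file rather than in comments; the comments here only make the sketch self-describing.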


From: Dimitri Fontaine <dimitri(at)2ndQuadrant(dot)fr>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Stephen Frost <sfrost(at)snowman(dot)net>, Simon Riggs <simon(at)2ndQuadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Future of our regular expression code
Date: 2012-02-18 23:12:09
Message-ID: m2aa4fg2ie.fsf@2ndQuadrant.fr

Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:
> Yeah ... if you *don't* know the difference between a DFA and an NFA,
> you're likely to find yourself in over your head. Having said that,

So, here's a paper I found very nice to get started into this subject:

http://swtch.com/~rsc/regexp/regexp1.html

If anyone's interested in becoming our PostgreSQL regexp hero and
still needs a good kicker, I would recommend starting here :)

I see this paper mentions the regexp code from Plan9, which supports both
UTF8 and other multi-byte encodings, and is released as a library under
the MIT licence:

http://swtch.com/plan9port/unix/

Regards,
--
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Dimitri Fontaine <dimitri(at)2ndQuadrant(dot)fr>
Cc: Stephen Frost <sfrost(at)snowman(dot)net>, Simon Riggs <simon(at)2ndQuadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Future of our regular expression code
Date: 2012-02-18 23:55:39
Message-ID: 7598.1329609339@sss.pgh.pa.us

Dimitri Fontaine <dimitri(at)2ndQuadrant(dot)fr> writes:
> Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:
>> Yeah ... if you *don't* know the difference between a DFA and an NFA,
>> you're likely to find yourself in over your head. Having said that,

> So, here's a paper I found very nice to get started into this subject:
> http://swtch.com/~rsc/regexp/regexp1.html

Yeah, I just found that this afternoon myself; it's a great intro.

If you follow the whole sequence of papers (there are 4) you'll find out
that this guy built a new regexp engine for Google, and these papers are
basically introducing/defending its design. It turns out they've
released it under a BSD-ish license, so for about half a minute I was
thinking there might be a new contender for something we could adopt.
But there turn out to be at least two killer reasons why we won't:
* it's in C++ not C
* it doesn't support backrefs, as well as a few other features that
maybe aren't as interesting but still would represent compatibility
gotchas if they went away.
Too bad. But the papers are well worth reading. One thing I took away
from them is that it's possible to do capturing parens, though not
backrefs, without back-tracking. Spencer's code treats both of those
features as "messy" (ie, slow, because they force use of the NFA-style
backtracking search code). So there might be reason to reimplement
the parens-but-no-backrefs case using some ideas from these papers.

regards, tom lane
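
To make the "without back-tracking" idea concrete, here is a minimal sketch of a lockstep state-set simulation in the spirit of the article linked above. It handles only a deliberately tiny regex subset (literal characters, '.', and the postfix operators ?, * and + applied to a single character) and does whole-string matching; there is no grouping, alternation, capturing or backref support. The function names are hypothetical, and this is not how Spencer's code or the Google engine is actually written.

```python
def parse(pattern):
    """Split a pattern into (char, op) items; op is '', '?', '*' or '+'."""
    items = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        op = ''
        if i + 1 < len(pattern) and pattern[i + 1] in '?*+':
            op = pattern[i + 1]
            i += 1
        items.append((ch, op))
        i += 1
    return items

def closure(items, states):
    """Add every state reachable by skipping items that may match nothing."""
    result = set()
    for s in states:
        while True:
            result.add(s)
            if s < len(items) and items[s][1] in ('?', '*'):
                s += 1
            else:
                break
    return result

def nfa_match(pattern, text):
    """Whole-string match, tracking all possible NFA states at once."""
    items = parse(pattern)
    current = closure(items, {0})
    for c in text:
        nxt = set()
        for s in current:
            if s == len(items):
                continue  # accepting state has no outgoing transitions
            ch, op = items[s]
            if ch == '.' or ch == c:
                nxt.add(s + 1)
                if op in ('*', '+'):
                    nxt.add(s)  # repetition: may consume more of the same
        current = closure(items, nxt)
        if not current:
            return False
    return len(items) in current

# The pattern family a?^n a^n is awkward for backtracking matchers, but the
# state set here never holds more than len(items) + 1 entries.
n = 25
print(nfa_match('a?' * n + 'a' * n, 'a' * n))
```

Because every input character is examined once against a bounded set of states, the running time grows with pattern length times text length rather than exponentially. The price is that nothing here remembers which text a subexpression matched, which is exactly what backrefs need.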


From: Marko Kreen <markokr(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Dimitri Fontaine <dimitri(at)2ndquadrant(dot)fr>, Stephen Frost <sfrost(at)snowman(dot)net>, Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Future of our regular expression code
Date: 2012-02-19 00:24:50
Message-ID: CACMqXCLt1+kfpOzjaQH5ZGjEVGiPeA4NAfZy1u2fYDvOG8RZzg@mail.gmail.com

On Sun, Feb 19, 2012 at 1:55 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Dimitri Fontaine <dimitri(at)2ndQuadrant(dot)fr> writes:
>> Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:
>>> Yeah ... if you *don't* know the difference between a DFA and an NFA,
>>> you're likely to find yourself in over your head.  Having said that,
>
>> So, here's a paper I found very nice to get started into this subject:
>>   http://swtch.com/~rsc/regexp/regexp1.html
>
> Yeah, I just found that this afternoon myself; it's a great intro.
>
> If you follow the whole sequence of papers (there are 4) you'll find out
> that this guy built a new regexp engine for Google, and these papers are
> basically introducing/defending its design.  It turns out they've
> released it under a BSD-ish license, so for about half a minute I was
> thinking there might be a new contender for something we could adopt.
> But there turn out to be at least two killer reasons why we won't:
> * it's in C++ not C
> * it doesn't support backrefs, as well as a few other features that
>  maybe aren't as interesting but still would represent compatibility
>  gotchas if they went away.

Another interesting library, technology-wise, is libtre:

http://laurikari.net/tre/about/
http://laurikari.net/tre/documentation/

NetBSD plans to replace the libc regex with it:

http://netbsd-soc.sourceforge.net/projects/widechar-regex/
http://groups.google.com/group/muc.lists.netbsd.current-users/browse_thread/thread/db5628e2e8f810e5/a99c368a6d22b6f8?lnk=gst&q=libtre#a99c368a6d22b6f8

Another useful project - AT&T regex tests:

http://www2.research.att.com/~gsf/testregex/

About our Spencer code - if we don't have resources (not called Tom)
to clean it up and make it available as a library (in the short term, at
least to the Tcl folks), we should drop it, because that means it's a dead
end, however good it is.

--
marko


From: Christopher Browne <cbbrowne(at)gmail(dot)com>
To: Marko Kreen <markokr(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Future of our regular expression code
Date: 2012-02-19 01:41:55
Message-ID: CAFNqd5VHkOofeDxFCOOnTvcXZ9316aBokx0Y7tH8b0BZfXxs4Q@mail.gmail.com

On Sat, Feb 18, 2012 at 7:24 PM, Marko Kreen <markokr(at)gmail(dot)com> wrote:
> About our Spencer code - if we don't have resources (not called Tom)

Is there anything that would be worth talking about directly with
Henry? He's in one of my circles of colleagues; had dinner with a
group that included him on Thursday.
--
When confronted by a difficult problem, solve it by reducing it to the
question, "How would the Lone Ranger handle this?"


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Christopher Browne <cbbrowne(at)gmail(dot)com>
Cc: Marko Kreen <markokr(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Future of our regular expression code
Date: 2012-02-19 01:57:34
Message-ID: 9911.1329616654@sss.pgh.pa.us

Christopher Browne <cbbrowne(at)gmail(dot)com> writes:
> On Sat, Feb 18, 2012 at 7:24 PM, Marko Kreen <markokr(at)gmail(dot)com> wrote:
>> About our Spencer code - if we don't have resources (not called Tom)

> Is there anything that would be worth talking about directly with
> Henry? He's in one of my circles of colleagues; had dinner with a
> group that included him on Thursday.

Really!? I had about come to the conclusion he was dead, because he's
sure been damn invisible as far as I could find. Is he still interested
in what happens with his regex code, or willing to answer questions
about it?

regards, tom lane


From: Brendan Jurd <direvus(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Stephen Frost <sfrost(at)snowman(dot)net>, Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Future of our regular expression code
Date: 2012-02-19 02:58:35
Message-ID: CADxJZo25fv2yMOGaiGGhGiT949wFMKL4go-1J6URfCJiXPCrLg@mail.gmail.com

On 19 February 2012 06:52, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Yeah ... if you *don't* know the difference between a DFA and an NFA,
> you're likely to find yourself in over your head.  Having said that,
> this is eminently learnable stuff and pretty self-contained, so somebody
> who had the time and interest could make themselves into an expert in
> a reasonable amount of time.

I find myself in possession of both time and interest. I have to
admit up-front that I don't have experience with regex code, but I do
have some experience with parsers generally, and I'd like to think
some of that skillset would transfer to this problem. I also find
regexes fascinating and extremely useful, so learning more about them
will be no hardship.

I'd happily cede to an expert, should one appear, but otherwise I'm
all for moving the regex code into a discrete library, and I'm
volunteering to take a swing at it.

Cheers,
BJ


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Brendan Jurd <direvus(at)gmail(dot)com>
Cc: Stephen Frost <sfrost(at)snowman(dot)net>, Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Future of our regular expression code
Date: 2012-02-19 04:49:10
Message-ID: 13530.1329626950@sss.pgh.pa.us

Brendan Jurd <direvus(at)gmail(dot)com> writes:
> On 19 February 2012 06:52, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> Yeah ... if you *don't* know the difference between a DFA and an NFA,
>> you're likely to find yourself in over your head. Having said that,
>> this is eminently learnable stuff and pretty self-contained, so somebody
>> who had the time and interest could make themselves into an expert in
>> a reasonable amount of time.

> I find myself in possession of both time and interest. I have to
> admit up-front that I don't have experience with regex code, but I do
> have some experience with parsers generally, and I'd like to think
> some of that skillset would transfer to this problem. I also find
> regexes fascinating and extremely useful, so learning more about them
> will be no hardship.

> I'd happily cede to an expert, should one appear, but otherwise I'm
> all for moving the regex code into a discrete library, and I'm
> volunteering to take a swing at it.

That sounds great.

BTW, if you don't have it already, I'd highly recommend getting a copy
of Friedl's "Mastering Regular Expressions". It's aimed at users not
implementers, but there is a wealth of valuable context information in
there, as well as a really good not-too-technical overview of typical
implementation techniques for RE engines. You'd probably still want one
of the more academic presentations such as the dragon book for
reference, but I think Friedl's take on it is extremely useful.

regards, tom lane


From: Brendan Jurd <direvus(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Stephen Frost <sfrost(at)snowman(dot)net>, Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Future of our regular expression code
Date: 2012-02-19 22:51:36
Message-ID: CADxJZo2STcrAB=UevhahykBUFs8P27u3uPNWfXQGpiiXzMf6Wg@mail.gmail.com

On 19 February 2012 15:49, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> That sounds great.
>
> BTW, if you don't have it already, I'd highly recommend getting a copy
> of Friedl's "Mastering Regular Expressions".  It's aimed at users not
> implementers, but there is a wealth of valuable context information in
> there, as well as a really good not-too-technical overview of typical
> implementation techniques for RE engines.  You'd probably still want one
> of the more academic presentations such as the dragon book for
> reference, but I think Friedl's take on it is extremely useful.

Thanks for the recommendations Tom. I've now got Friedl, and there's
a dead-tree copy of 'Compilers' making its gradual way to me (no
ebook).
I've also been reading the article series by Russ Cox linked upthread
-- it's good stuff.

Are you far enough into the backrefs bug that you'd prefer to see it
through, or would you like me to pick it up?

Cheers,
BJ


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Brendan Jurd <direvus(at)gmail(dot)com>
Cc: Stephen Frost <sfrost(at)snowman(dot)net>, Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Future of our regular expression code
Date: 2012-02-19 23:42:03
Message-ID: 5843.1329694923@sss.pgh.pa.us

Brendan Jurd <direvus(at)gmail(dot)com> writes:
> Are you far enough into the backrefs bug that you'd prefer to see it
> through, or would you like me to pick it up?

Actually, what I've been doing today is a brain dump. This code is
never going to be maintainable by anybody except its original author
without some internals documentation, so I've been trying to write
some based on what I've managed to reverse-engineer so far. It's
not very complete, but I do have some words about the DFA/NFA stuff,
which I will probably revise and fill in some more as I work on the
backref fix, because that's where that bug lives. I have also got
a bunch of text about the colormap management code, which I think
is interesting right now because that is what we are going to have
to fix if we want decent performance for Unicode \w and related
classes (cf the other current -hackers thread about regexes).
I was hoping to prevail on you to pick that part up as your first
project. I will commit what I've got in a few minutes --- look
for src/backend/regex/README in that commit. I encourage you to
add to that file as you figure stuff out. We could stand to upgrade
a lot of the code comments too, of course, but I think a narrative
description is pretty useful before diving into code.

regards, tom lane
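
As a rough illustration of the colormap idea: the sketch below assumes (this is not spelled out in the thread) that a "color" is an equivalence class of characters that the regex never needs to tell apart, so that automaton arcs can be labeled with colors instead of individual characters. The function name and the tiny alphabet are hypothetical, and the real code uses a far more compact data structure than a per-character dictionary.

```python
def build_colormap(alphabet, char_sets):
    """Assign the same color to characters that fall in the same sets.

    alphabet  -- iterable of characters to classify
    char_sets -- list of sets, one per character class used by the regex
    """
    signature_to_color = {}
    colormap = {}
    for ch in alphabet:
        # Two characters are interchangeable if every class treats them alike.
        signature = tuple(ch in s for s in char_sets)
        color = signature_to_color.setdefault(signature, len(signature_to_color))
        colormap[ch] = color
    return colormap

# Hypothetical example: a "letters" class and a "word characters" class.
letters = set("abc")
word_chars = set("abc1_")
cmap = build_colormap("abc1_ ", [letters, word_chars])
print(cmap)
print(len(set(cmap.values())), "colors")
```

The performance concern with Unicode classes follows from this picture: a class like \w covers a very large number of code points, so building and storing the mapping one character at a time becomes expensive.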


From: Brendan Jurd <direvus(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Stephen Frost <sfrost(at)snowman(dot)net>, Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Future of our regular expression code
Date: 2012-02-20 00:27:24
Message-ID: CADxJZo3Yi8D1qrEgtoa7QTHtt5ViXcVXzadbn5yyp_kYR99k3w@mail.gmail.com

On 20 February 2012 10:42, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> I have also got
> a bunch of text about the colormap management code, which I think
> is interesting right now because that is what we are going to have
> to fix if we want decent performance for Unicode \w and related
> classes (cf the other current -hackers thread about regexes).
> I was hoping to prevail on you to pick that part up as your first
> project.  I will commit what I've got in a few minutes --- look
> for src/backend/regex/README in that commit.

Okay, I've read through your README content, it was very helpful.
I'll now go chew through some more reading material and then start
studying our existing regex source code. Once I'm firing on all
cylinders with this stuff, I'll begin to tackle the colormap.

Cheers,
BJ


From: Billy Earney <billy(dot)earney(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Brendan Jurd <direvus(at)gmail(dot)com>, Stephen Frost <sfrost(at)snowman(dot)net>, Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Future of our regular expression code
Date: 2012-02-20 00:32:55
Message-ID: CAB1ii-d1o0O2ch_9hofFdMGaZ63oyEnrCL948eqsTxvxb_8BQQ@mail.gmail.com

Tom,

I did a google search, and found the following:
http://www.arglist.com/regex/

Which states that Tcl uses the same library from Henry. Maybe someone
involved with that project would help explain the library? Also, I noticed
that the URL above lists a few ports people did of Henry's code. I didn't
download and analyze their code, but maybe they have made some comments
that could help, or maybe have some improvements to the code..

Just a thought.. :)

Billy Earney

On Sun, Feb 19, 2012 at 5:42 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> Brendan Jurd <direvus(at)gmail(dot)com> writes:
> > Are you far enough into the backrefs bug that you'd prefer to see it
> > through, or would you like me to pick it up?
>
> Actually, what I've been doing today is a brain dump. This code is
> never going to be maintainable by anybody except its original author
> without some internals documentation, so I've been trying to write
> some based on what I've managed to reverse-engineer so far. It's
> not very complete, but I do have some words about the DFA/NFA stuff,
> which I will probably revise and fill in some more as I work on the
> backref fix, because that's where that bug lives. I have also got
> a bunch of text about the colormap management code, which I think
> is interesting right now because that is what we are going to have
> to fix if we want decent performance for Unicode \w and related
> classes (cf the other current -hackers thread about regexes).
> I was hoping to prevail on you to pick that part up as your first
> project. I will commit what I've got in a few minutes --- look
> for src/backend/regex/README in that commit. I encourage you to
> add to that file as you figure stuff out. We could stand to upgrade
> a lot of the code comments too, of course, but I think a narrative
> description is pretty useful before diving into code.
>
> regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Billy Earney <billy(dot)earney(at)gmail(dot)com>
Cc: Brendan Jurd <direvus(at)gmail(dot)com>, Stephen Frost <sfrost(at)snowman(dot)net>, Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Future of our regular expression code
Date: 2012-02-20 01:40:17
Message-ID: 15863.1329702017@sss.pgh.pa.us

Billy Earney <billy(dot)earney(at)gmail(dot)com> writes:
> I did a google search, and found the following:
> http://www.arglist.com/regex/

Hmm ... might be worth looking at those two pre-existing attempts at
making a standalone library from Henry's code, just to see what choices
they made.

> Which states that Tcl uses the same library from Henry. Maybe someone
> involved with that project would help explain the library?

Um ... did you see the head message in this thread?

regards, tom lane


From: Billy Earney <billy(dot)earney(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Brendan Jurd <direvus(at)gmail(dot)com>, Stephen Frost <sfrost(at)snowman(dot)net>, Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Future of our regular expression code
Date: 2012-02-20 02:13:14
Message-ID: CAB1ii-f5OLZqvx57BTE+RD7jr26NVXtEQtBm8UcFzUvBPyw-Xw@mail.gmail.com
Lists: pgsql-hackers

Thanks Tom. I looked at the code in the libraries I referred to earlier,
and it looks like the code in the regex directory is exactly the same as
Walter Waldo's version, which has at least one comment from the middle of
last decade (~ 2003). Have people thought about migrating to the pcre
library? It seems to have a lot of neat features, and also has a jit, and
it looks like it is being actively maintained and has decent comments.


From: Stephen Frost <sfrost(at)snowman(dot)net>
To: Billy Earney <billy(dot)earney(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Brendan Jurd <direvus(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Future of our regular expression code
Date: 2012-02-20 02:15:43
Message-ID: 20120220021543.GL17355@tamriel.snowman.net
Lists: pgsql-hackers

Billy,

* Billy Earney (billy(dot)earney(at)gmail(dot)com) wrote:
> Thanks Tom. I looked at the code in the libraries I referred to earlier,
> and it looks like the code in the regex directory is exactly the same as
> Walter Waldo's version, which has at least one comment from the middle of
> last decade (~ 2003). Have people thought about migrating to the pcre
> library? It seems to have a lot of neat features, and also has a jit, and
> it looks like it is being actively maintained and has decent comments.

It strikes me that you might benefit from reading the full thread. As
Tom mentioned previously, pcre would require user-visible changes in
behavior, including cases where things which work today wouldn't work.
That requires a pretty high bar and I don't think we're anywhere near
there with this.

Thanks,

Stephen


From: Greg Stark <stark(at)mit(dot)edu>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Future of our regular expression code
Date: 2012-02-20 03:28:16
Message-ID: CAM-w4HN1abmWjaPD7i0jqBYC2FOiq--W=f=QdCkggfttWGnH3g@mail.gmail.com
Lists: pgsql-hackers

On Sat, Feb 18, 2012 at 6:15 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>  A larger point is that it'd be a real shame
> for the Spencer regex engine to die off, because it is in fact one of
> the best pieces of regex technology on the planet.
...
> Another possible long-term answer is to finish the work Henry never did,
> that is make the code into a standalone library.  That would make it
> available to more projects and perhaps attract other people to help
> maintain it.  However, that looks like a lot of work too, with distant
> and uncertain payoff.

I can't see how your first claim that the Spencer code is worth
keeping around because it's just a superior regex implementation has
much force unless we can accomplish the latter. If the library can be
split off into a standalone library then it might have some longevity.
But if we're the only ones maintaining it then it's just prolonging
the inevitable. I can't see Postgres having its own special brand of
regexes that nobody else uses being an acceptable situation forever.

One thing that concerns me more and more is that most sufficiently
powerful regex implementations are susceptible to DOS attacks. A
database application is quite likely to allow users to decide directly
or indirectly what regexes to apply and it can be hard to predict
which regexes will cause which implementations to blow up their CPU or
memory requirements. We need a library that can be used to defend
against malicious regexes and I suspect neither Perl's nor Python's
library will suffice for this.
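
To make the blow-up concrete, here is a minimal sketch of a naive backtracking matcher hard-wired to the pattern ^(a|aa)*$, next to a one-pass version that tracks the set of reachable positions. It is a toy written for illustration, not code from Spencer's engine, Perl, Python, or any other library mentioned in this thread, and the function names (bt_match, bt_steps, set_steps) are made up.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Number of calls made by the backtracking matcher (reset by bt_steps). */
static unsigned long bt_calls;

/*
 * Naive backtracking matcher hard-wired to the pattern ^(a|aa)*$.
 * At each position it tries "a", then "aa", and so re-explores the
 * same suffixes many times when the overall match fails.
 */
static int
bt_match(const char *s)
{
    bt_calls++;
    if (*s == '\0')
        return 1;
    if (s[0] == 'a' && bt_match(s + 1))
        return 1;
    if (s[0] == 'a' && s[1] == 'a' && bt_match(s + 2))
        return 1;
    return 0;
}

/* Run bt_match and report how many calls it took. */
static unsigned long
bt_steps(const char *s, int *matched)
{
    assert(s != NULL && matched != NULL);
    bt_calls = 0;
    *matched = bt_match(s);
    return bt_calls;
}

/*
 * Same pattern, but tracking which input positions are reachable
 * instead of backtracking: one pass, one unit of work per character.
 */
static unsigned long
set_steps(const char *s, int *matched)
{
    size_t        n, i;
    int           r0 = 1;   /* position i is reachable */
    int           r1 = 0;   /* position i + 1 is reachable */
    unsigned long work = 0;

    assert(s != NULL && matched != NULL);
    n = strlen(s);
    for (i = 0; i < n; i++)
    {
        int r2 = 0;         /* position i + 2 is reachable */

        work++;
        if (r0 && s[i] == 'a')
        {
            r1 = 1;
            /* s[i + 1] is at worst the terminating NUL, so this is safe */
            if (s[i + 1] == 'a')
                r2 = 1;
        }
        r0 = r1;
        r1 = r2;
    }
    *matched = r0;
    return work;
}
```

On five a's followed by a b, bt_steps reports 20 calls before giving up; the count grows Fibonacci-style with each additional a, while set_steps stays at one step per input character.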

--
greg


From: Stephen Frost <sfrost(at)snowman(dot)net>
To: Greg Stark <stark(at)mit(dot)edu>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Future of our regular expression code
Date: 2012-02-20 03:38:05
Message-ID: 20120220033805.GM17355@tamriel.snowman.net
Lists: pgsql-hackers

Greg,

* Greg Stark (stark(at)mit(dot)edu) wrote:
> I can't see how your first claim that the Spencer code is worth
> keeping around because it's just a superior regex implementation has
> much force unless we can accomplish the latter. If the library can be
> split off into a standalone library then it might have some longevity.
> But if we're the only ones maintaining it then it's just prolonging
> the inevitable. I can't see Postgres having its own special brand of
> regexes that nobody else uses being an acceptable situation forever.
>
> One thing that concerns me more and more is that most sufficiently
> powerful regex implementations are susceptible to DOS attacks. A
> database application is quite likely to allow users to decide directly
> or indirectly what regexes to apply and it can be hard to predict
> which regexes will cause which implementations to blow up their CPU or
> memory requirements. We need a library that can be used to defend
> against malicious regexes and I suspect neither Perl's nor Python's
> library will suffice for this.

Alright, I'll bite.. Which existing regexp implementation that's well
written, well maintained, and which is well protected against malicious
regexes should we be considering then?

While we might not be able to formalize the regex code as a stand-alone
library, my bet would be that the Tcl folks (and anyone else using this
code..) will be paying attention to the changes and improvements we're
making. Sure, it'd be easier for them to incorporate those changes if
they could just pull in a new version of the library, but we can't all
have our cake and eat it too.

Thanks,

Stephen


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Greg Stark <stark(at)mit(dot)edu>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Future of our regular expression code
Date: 2012-02-20 03:43:18
Message-ID: 18230.1329709398@sss.pgh.pa.us
Lists: pgsql-hackers

Greg Stark <stark(at)mit(dot)edu> writes:
> ... We need a library that can be used to defend
> against malicious regexes and I suspect neither Perl's nor Python's
> library will suffice for this.

Yeah. Did you read the Russ Cox papers referenced upthread? One of the
things Google wanted was provably limited resource consumption, which
motivated them going with a pure-DFA-no-exceptions implementation.
However, they gave up backrefs to get that, which is probably a
compromise we're not willing to make.
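
As an illustration of what is being given up: a pattern such as \(a*\)b\1 accepts exactly the strings of n a's, a b, and the same n a's again, which is not a regular language and so cannot be recognized by any fixed DFA. The sketch below uses the POSIX regcomp/regexec interface from <regex.h>, not the code in src/backend/regex, and the helper name matches_bre is made up for the example.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/*
 * Return 1 if subject matches the POSIX basic regular expression,
 * 0 if it does not, and -1 if the pattern fails to compile.
 * Hypothetical helper, for illustration only.
 */
static int
matches_bre(const char *pattern, const char *subject)
{
    regex_t re;
    int     rc;

    assert(pattern != NULL && subject != NULL);

    /* cflags = 0 selects BRE syntax: \( \) groups, \1 is a backreference */
    if (regcomp(&re, pattern, 0) != 0)
        return -1;

    /* nmatch = 0: we only want a yes/no answer, no match offsets */
    rc = regexec(&re, subject, 0, NULL, 0);
    regfree(&re);
    return (rc == 0) ? 1 : 0;
}
```

With the anchored pattern "^\\(a*\\)b\\1$" (C string escaping), "aabaa" matches and "aabaaa" does not, because the text after the b must repeat whatever the group captured.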

One thing that's been bothering me for a while is that we don't have any
CHECK_FOR_INTERRUPTS or equivalent in the library's NFA search loops.
It wouldn't be hard to add one but that'd be putting PG-specific code
into the very heart of the library, which is something I've tried to
resist. One of the issues we'll have to face if we do try to split it
out as a standalone library is how that type of requirement can be met.
(And, BTW, that's the kind of hack that we would probably not get to
make at all with any other library, so the need for it is not evidence
that getting away from Spencer's code would be a good thing.)
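
One way a standalone library might meet that kind of requirement without carrying PG-specific code is a caller-supplied cancel hook that the search loops poll. The following is only a sketch under that assumption: none of these names (rx_cancel_fn, rx_hooks, rx_search, RX_CANCELLED) exist in Spencer's code, and the loop body is a trivial stand-in for real NFA work.

```c
#include <stddef.h>

/* Hypothetical result codes for this sketch. */
enum { RX_NOMATCH = 0, RX_MATCH = 1, RX_CANCELLED = -1 };

/* Caller-supplied callback: return nonzero to ask the search to stop. */
typedef int (*rx_cancel_fn) (void *arg);

typedef struct rx_hooks
{
    rx_cancel_fn cancel;        /* may be NULL: never cancel */
    void        *cancel_arg;    /* passed through to cancel() untouched */
    unsigned     poll_interval; /* poll every N iterations; 0 means 1 */
} rx_hooks;

/*
 * Stand-in for an expensive search loop: looks for the character "want"
 * in subject[0..len).  The point is the shape of the loop, which polls
 * the hook instead of calling anything application-specific.
 */
static int
rx_search(const char *subject, size_t len, char want, const rx_hooks *hooks)
{
    unsigned interval = (hooks && hooks->poll_interval) ? hooks->poll_interval : 1;
    size_t   i;

    for (i = 0; i < len; i++)
    {
        if (hooks && hooks->cancel && (i % interval) == 0 &&
            hooks->cancel(hooks->cancel_arg))
            return RX_CANCELLED;
        if (subject[i] == want)
            return RX_MATCH;
    }
    return RX_NOMATCH;
}

/* Example hook: cancels once a caller-owned iteration budget runs out. */
static int
budget_exhausted(void *arg)
{
    int *budget = arg;

    return (*budget)-- <= 0;
}
```

An embedding application would pass a hook that tests its own pending-interrupt state, and other users of the library could pass NULL; the library itself never needs to know what the hook does.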

regards, tom lane


From: Greg Smith <greg(at)2ndQuadrant(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Future of our regular expression code
Date: 2012-02-20 05:04:14
Message-ID: 4F41D44E.8080601@2ndQuadrant.com
Lists: pgsql-hackers

On 02/19/2012 10:28 PM, Greg Stark wrote:
> One thing that concerns me more and more is that most sufficiently
> powerful regex implementations are susceptible to DOS attacks.

There's a list of "evil regexes" at http://en.wikipedia.org/wiki/ReDoS

The Perl community's reaction to Russ Cox's regex papers has some
interesting comments along these lines too:
http://www.perlmonks.org/?node_id=597262

That brings up the backreferences concerns Tom already mentioned.
Someone also points out the Thompson NFA that Cox advocates in his first
article can use an excessive amount of memory when processing Unicode:
http://www.perlmonks.org/?node_id=597312

Aside--Cox's "Regular Expression Matching with a Trigram Index" is an
interesting intro to trigram use for FTS purposes, and might have some
inspirational ideas for further progress in that area:
http://swtch.com/~rsc/regexp/regexp4.html

--
Greg Smith 2ndQuadrant US greg(at)2ndQuadrant(dot)com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


From: Jay Levitt <jay(dot)levitt(at)gmail(dot)com>
To: Stephen Frost <sfrost(at)snowman(dot)net>
Cc: Greg Stark <stark(at)mit(dot)edu>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Future of our regular expression code
Date: 2012-02-20 06:09:31
Message-ID: 4F41E39B.8010502@gmail.com
Lists: pgsql-hackers

Stephen Frost wrote:
> Alright, I'll bite.. Which existing regexp implementation that's well
> written, well maintained, and which is well protected against malicious
> regexes should we be considering then?

FWIW, there's a benchmark here that compares a number of regexp engines,
including PCRE, TRE and Russ Cox's RE2:

http://lh3lh3.users.sourceforge.net/reb.shtml

The fastest backtracking-style engine seems to be oniguruma, which is native
to Ruby 1.9 and thus not only supports Unicode but I'd bet performs pretty
well on it, on account of it's developed in Japan. But it goes pathological
on regexen containing '|'; the only safe choice among PCRE-style engines is
RE2, but of course that doesn't support backreferences.

Russ's page on re2 (http://code.google.com/p/re2/) says:

"If you absolutely need backreferences and generalized assertions, then RE2
is not for you, but you might be interested in irregexp, Google Chrome's
regular expression engine."

That's here:

http://blog.chromium.org/2009/02/irregexp-google-chromes-new-regexp.html

Sadly, it's in Javascript. Seems like if you need a safe, performant regexp
implementation, your choice is (a) finish PLv8 and support it on all
platforms, or (b) add backreferences to RE2 and precompile it to C with
Comeau (if that's still around), or...

Jay


From: Billy Earney <billy(dot)earney(at)gmail(dot)com>
To: Jay Levitt <jay(dot)levitt(at)gmail(dot)com>
Cc: Stephen Frost <sfrost(at)snowman(dot)net>, Greg Stark <stark(at)mit(dot)edu>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Future of our regular expression code
Date: 2012-02-20 21:25:28
Message-ID: CAB1ii-f83hQvC7mpbQQa5UuuvYdgCSpw6E1+wXghXEzWf=_YZg@mail.gmail.com
Lists: pgsql-hackers

Jay,

Good links, and I've also looked at a few others with benchmarks. I
believe most of the benchmarks were run before PCRE implemented JIT. I
haven't found a benchmark with JIT enabled, so I'm not sure whether it
would make a difference. I'm also not sure how accurately the benchmarks
predict performance in an RDBMS environment. The optimizer is probably a
very important variable in many complex queries. I'm leaning towards
trying to implement RE2 and PCRE and running some benchmarks to see which
performs best.

Also, would it be possible to add a session variable (let's say PGREGEXTYPE)
that can be set to ARE (current alg), RE2, or PCRE, so that users could choose
which implementation they want (unless we find a single implementation that
beats the others in almost all categories)? Or is this a bad idea?

Just a thought.


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Billy Earney <billy(dot)earney(at)gmail(dot)com>
Cc: Jay Levitt <jay(dot)levitt(at)gmail(dot)com>, Stephen Frost <sfrost(at)snowman(dot)net>, Greg Stark <stark(at)mit(dot)edu>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Future of our regular expression code
Date: 2012-02-20 21:35:25
Message-ID: 2475.1329773725@sss.pgh.pa.us
Lists: pgsql-hackers

Billy Earney <billy(dot)earney(at)gmail(dot)com> writes:
> Also would it be possible to set a session variable (let's say PGREGEXTYPE)
> and set it to ARE (current alg), RE2, or PCRE, that way users could choose
> which implementation they want (unless we find a single implementation that
> beats the others in almost all categories)? Or is this a bad idea?

We used to have a GUC that selected the default mode for Spencer's
package (ARE, ERE, or BRE), and eventually gave it up on the grounds
that it did more harm than good. In particular, you really cannot treat
the regex operators as immutable if their behavior varies depending on
a GUC, which is more or less fatal from an optimization standpoint.
So I'd say a GUC that switches engines, and thereby brings in subtler
but no less real incompatibilities than the old one did, would be a
pretty bad idea.
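
The immutability problem can be shown in a few lines. In the sketch below the same subject and pattern give different answers depending on a global setting, so a result computed once (say, folded to a constant at plan time or stored in an index) could be wrong later. It uses the POSIX <regex.h> interface rather than PostgreSQL's regex code; flavor_cflags and regex_op are hypothetical names standing in for a flavor setting and a regex match operator.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical stand-in for a flavor setting:
 * REG_EXTENDED selects ERE syntax, 0 selects BRE syntax.
 */
static int flavor_cflags = REG_EXTENDED;

/*
 * A "regex operator" whose result depends on hidden global state and
 * not only on its two arguments.  Returns 1 on match, 0 on no match,
 * -1 if the pattern does not compile.
 */
static int
regex_op(const char *subject, const char *pattern)
{
    regex_t re;
    int     rc;

    assert(subject != NULL && pattern != NULL);

    if (regcomp(&re, pattern, flavor_cflags | REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, subject, 0, NULL, 0);
    regfree(&re);
    return (rc == 0) ? 1 : 0;
}
```

Under ERE, the pattern a+ means "one or more a" and matches "aaa". Under BRE the + is an ordinary character, so the same call no longer matches "aaa" but does match the literal string "a+".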

Also, TBH I have exactly zero interest in supporting pluggable regex
engines in Postgres. Regex is not sufficiently central to what we do
to justify the work of coping with N different APIs and sets of
idiosyncrasies. (Perl evidently sees that differently, and with some
reason.)

regards, tom lane


From: Billy Earney <billy(dot)earney(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Jay Levitt <jay(dot)levitt(at)gmail(dot)com>, Stephen Frost <sfrost(at)snowman(dot)net>, Greg Stark <stark(at)mit(dot)edu>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Future of our regular expression code
Date: 2012-02-20 21:59:00
Message-ID: CAB1ii-cmib-SXUX2DuNYX1epTxRK87A_5T=v1WGkvVWbC=_QkA@mail.gmail.com
Lists: pgsql-hackers

Tom,

Thanks for your reply. So is the group leaning towards just maintaining
the current regex code base, or looking into introducing a new library
(RE2, PCRE, etc)? Or is this still open for discussion?

Thanks!

Billy


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Billy Earney <billy(dot)earney(at)gmail(dot)com>
Cc: Jay Levitt <jay(dot)levitt(at)gmail(dot)com>, Stephen Frost <sfrost(at)snowman(dot)net>, Greg Stark <stark(at)mit(dot)edu>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Future of our regular expression code
Date: 2012-02-20 22:20:55
Message-ID: 3493.1329776455@sss.pgh.pa.us
Lists: pgsql-hackers

Billy Earney <billy(dot)earney(at)gmail(dot)com> writes:
> Thanks for your reply. So is the group leaning towards just maintaining
> the current regex code base, or looking into introducing a new library
> (RE2, PCRE, etc)? Or is this still open for discussion?

Well, introducing a new library would create compatibility issues that
we'd just as soon not deal with, so I think that that's only likely
to be seriously entertained if we decide that Spencer's code is
unmaintainable. That's not a foregone conclusion; IMO the only fact
in evidence is that the Tcl community isn't getting it done.

Since Brendan Jurd has volunteered to try to split that code out into a
standalone library, I think such a decision really has to wait until we
see if (a) he's successful and (b) the result attracts some kind of
community around it. So in short, let's give him a couple years and
then if things are no better we'll revisit this issue.

regards, tom lane


From: Vik Reykja <vikreykja(at)gmail(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Stephen Frost <sfrost(at)snowman(dot)net>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Future of our regular expression code
Date: 2012-03-10 11:11:38
Message-ID: CALDgxVtxJ4Y7qrPobN9wY6Cd8t+Ycw-rp-=fJrxwb8KyHVW+0w@mail.gmail.com
Lists: pgsql-hackers

On Sat, Feb 18, 2012 at 21:16, Vik Reykja <vikreykja(at)gmail(dot)com> wrote:

> I would be willing to have a go at translating test cases. I do not (yet)
> have the C knowledge to maintain the regex code, though.

I got suddenly swamped and forgot I had signed up for this. I'm still
pretty swamped and I would like these regression tests to be in for 9.2 so
if someone else would like to pick them up, I would be grateful.

If they're still not done by the time I resurface, I will attack them again.


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Vik Reykja <vikreykja(at)gmail(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Stephen Frost <sfrost(at)snowman(dot)net>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Future of our regular expression code
Date: 2012-03-10 16:26:44
Message-ID: 22164.1331396804@sss.pgh.pa.us
Lists: pgsql-hackers

Vik Reykja <vikreykja(at)gmail(dot)com> writes:
> On Sat, Feb 18, 2012 at 21:16, Vik Reykja <vikreykja(at)gmail(dot)com> wrote:
>> I would be willing to have a go at translating test cases. I do not (yet)
>> have the C knowledge to maintain the regex code, though.

> I got suddenly swamped and forgot I had signed up for this. I'm still
> pretty swamped and I would like these regression tests to be in for 9.2 so
> if someone else would like to pick them up, I would be grateful.

FWIW, I spent a few minutes looking at the Tcl regression tests and
realized that they are not in a form that's tremendously useful to us.
What they are, unsurprisingly, are Tcl scripts, and a lot of the
specific test cases are couched as calls to special-purpose Tcl
functions. I tried inserting some hooks that would print out the
arguments/results of the underlying regexp and regsub calls, but didn't
get far (my Tcl is way too rusty :-(). I also found that quite a few of
the test cases are concerned with features that are not accessible, or
at least not accessible in the same way, from our SQL functions. Those
test cases would still be worthwhile for a standalone library package,
but they won't be much use in a Postgres regression test.

regards, tom lane