Re: Future of our regular expression code

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Dimitri Fontaine <dimitri(at)2ndQuadrant(dot)fr>
Cc: Stephen Frost <sfrost(at)snowman(dot)net>, Simon Riggs <simon(at)2ndQuadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Future of our regular expression code
Date: 2012-02-18 23:55:39
Message-ID: 7598.1329609339@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Dimitri Fontaine <dimitri(at)2ndQuadrant(dot)fr> writes:
> Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:
>> Yeah ... if you *don't* know the difference between a DFA and an NFA,
>> you're likely to find yourself in over your head. Having said that,

> So, here's a paper I found very nice to get started into this subject:
> http://swtch.com/~rsc/regexp/regexp1.html

Yeah, I just found that this afternoon myself; it's a great intro.

If you follow the whole sequence of papers (there are 4) you'll find out
that this guy built a new regexp engine for Google, and these papers are
basically introducing/defending its design. It turns out they've
released it under a BSD-ish license, so for about half a minute I was
thinking there might be a new contender for something we could adopt.
But there turn out to be at least two killer reasons why we won't:
* it's in C++ not C
* it doesn't support backrefs, as well as a few other features that
maybe aren't as interesting but still would represent compatibility
gotchas if they went away.
Too bad. But the papers are well worth reading. One thing I took away
from them is that it's possible to do capturing parens, though not
backrefs, without back-tracking. Spencer's code treats both of those
features as "messy" (ie, slow, because they force use of the NFA-style
backtracking search code). So there might be reason to reimplement
the parens-but-no-backrefs case using some ideas from these papers.

regards, tom lane

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Marko Kreen 2012-02-19 00:24:50 Re: Future of our regular expression code
Previous Message Tom Lane 2012-02-18 23:45:10 Re: Notes about fixing regexes and UTF-8 (yet again)