Future of our regular expression code

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql-hackers(at)postgreSQL(dot)org
Subject: Future of our regular expression code
Date: 2012-02-18 18:15:28
Message-ID: 29776.1329588928@sss.pgh.pa.us
Lists: pgsql-hackers

As those who've been paying attention to it know, our regular expression
library is based on code originally developed by Henry Spencer and
contributed by him to the Tcl project. We adopted it out of Tcl in
2003. Henry intended to package the code as a standalone library as
well, but that never happened --- AFAICT, Henry dropped off the net
around 2002, and I have no idea what happened to him.

Since then, we've been acting as though the Tcl guys are upstream
maintainers for the regex code, but in point of fact there does not
appear to be anybody there with more than the first clue about that
code. This was brought home to me a few days ago when I started talking
to them about possible ways to fix the quantified-backrefs problem that
depesz recently complained of (which turns out to have been an open bug
in their tracker since 2005). As soon as I betrayed any indication of
knowing the difference between a DFA and an NFA, they offered me commit
privileges :-(. And they haven't fixed any other significant bugs in
the engine in years, either.

So I think it's time to face facts and accept that Tcl are not a useful
upstream for the regex code. And we can't just let it sit quietly,
because we have bugs to fix (at least the one) as well as enhancement
requests such as the nearby discussion about recognizing high Unicode
code points as letters.

A radical response to this would be to drop the Spencer regex engine and
use something else instead --- probably PCRE, because there are not all
that many alternatives out there. I do not care much for this idea
though. It would be a significant amount of work in itself, there's
no real guarantee that PCRE will continue to be maintained either, and
there would be some user-visible compatibility issues because the regex
flavor is a bit different. A larger point is that it'd be a real shame
for the Spencer regex engine to die off, because it is in fact one of
the best pieces of regex technology on the planet. See Jeffrey Friedl's
"Mastering Regular Expressions" (O'Reilly) --- at least, that's what he
thought in the 2002 edition I have, and it's unlikely that things have
changed much.

So I'm feeling that we gotta suck it up and start acting like we are
the lead maintainers for this code, not just consumers.

Another possible long-term answer is to finish the work Henry never did,
that is, make the code into a standalone library. That would make it
available to more projects and perhaps attract other people to help
maintain it. However, that looks like a lot of work too, with distant
and uncertain payoff.

Comments, other ideas?

regards, tom lane
