Re: Our regex vs. POSIX on "longest match"

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Brendan Jurd <direvus(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Our regex vs. POSIX on "longest match"
Date: 2012-03-04 17:34:22
Message-ID: 16069.1330882462@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Brendan Jurd <direvus(at)gmail(dot)com> writes:
> On 4 March 2012 17:53, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> Brendan Jurd <direvus(at)gmail(dot)com> writes:
>>> I'll admit that this is a pretty obscure point, but we do appear to be
>>> in direct violation of POSIX here.

>> How so? POSIX doesn't contain any non-greedy constructs. If you use
>> only the POSIX-compatible greedy constructs, the behavior is compliant.

> While it's true that POSIX doesn't contemplate non-greed, after
> reading the spec I would have expected an expression *as a whole* to
> still prefer the longest match, insofar as that is possible while
> respecting non-greedy particles.

It's not apparent to me that a construct that is outside the spec
shouldn't be expected to change the overall behavior. In particular,
what if the RE contains only non-greedy quantifiers --- would you still
expect longest match overall? Henry certainly didn't.

> I just ran some quick experiments in
> Perl, Python and PHP, using 'xy1234' ~ 'y*?\d+'. All returned
> 'y1234', which to my mind is the most obvious answer, and one which
> still makes sense in light of what POSIX has to say. Whereas Postgres
> (and presumably Tcl) return 'y1'.

Well, that's just an arbitrary example. The cases I remember people
complaining about in practice were the other way round: greedy
quantifier followed by non-greedy, and they were unhappy that the
non-greediness was effectively not respected (because the overall RE was
taken as greedy). So you can't fix the issue by pointing to POSIX and
saying "overall greedy is always the right thing".

Another thing I've seen people complain about is that an alternation
(anything with top-level "|") is always taken as greedy overall.
This seems a bit surprising to me --- I'd have expected a rule like
"inherits its greediness from the leftmost branch that has one",
though I'm not sure whether that would really be much of an improvement
in practice. (It would at least fix the problem that a leading "(a|b)"
determines greediness of the containing RE whereas the logically
equivalent "[ab]" does not.) Again, pointing to POSIX and saying
"overall greedy is the right thing" doesn't help.

regards, tom lane

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2012-03-04 17:40:04 Re: Patch: improve selectivity estimation for IN/NOT IN
Previous Message Simon Riggs 2012-03-04 16:39:58 Re: RFC: Making TRUNCATE more "MVCC-safe"