Re: Our regex vs. POSIX on "longest match"

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: depesz(at)depesz(dot)com, Brendan Jurd <direvus(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Our regex vs. POSIX on "longest match"
Date: 2012-03-05 17:28:24
Message-ID: CA+TgmoZ7n3Nh3DwDULDdX7Yt=MMfqpmmf+1JAznPt1N+gB=5Eg@mail.gmail.com
Lists: pgsql-hackers

On Mon, Mar 5, 2012 at 11:28 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> I think the right way to imagine this is as though the regular
>> expression were being matched to the source text in left-to-right
>> fashion.
>
> No, it isn't.  You are headed down the garden path that leads to a
> Perl-style definition-by-implementation, and in particular you are going
> to end up with an implementation that fails to satisfy the POSIX
> standard.  POSIX requires an *overall longest* match (at least for cases
> where all quantifiers are greedy), and that sometimes means that the
> quantifiers can't be processed strictly left-to-right greedy.  An
> example of this is
>
> regression=# select substring('aaaaaabab' from '(a*(ab)*)');
>  substring
> -----------
>  aaaaaabab
> (1 row)
>
> If the a* is allowed to match as much as it wants, the (ab)* will not be
> able to match at all, and then you fail to find the longest possible
> overall match.

Oh. Right.
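
For anyone following along, the difference is easy to reproduce outside the server. The following is a minimal sketch, assuming Python's re module as a stand-in for a Perl-style backtracking matcher (it is not PostgreSQL's regex engine). The helper longest_match_at_start is a hypothetical brute-force function written for this illustration; it only considers matches that begin at offset 0, which is enough here because the pattern can match at the very start of the string.

```python
import re

PATTERN = r'(a*(ab)*)'
TEXT = 'aaaaaabab'


def backtracking_match(pattern, text):
    """Return what a left-to-right greedy backtracking engine matches at offset 0."""
    m = re.match(pattern, text)
    return m.group(0) if m else None


def longest_match_at_start(pattern, text):
    """Brute force: return the longest prefix of text that the pattern matches in full.

    Hypothetical helper for illustration only; a real POSIX engine does not
    work this way, it just has to produce the same answer.
    """
    compiled = re.compile(pattern)
    # Try the longest candidate first and shrink until one matches completely.
    for end in range(len(text), -1, -1):
        if compiled.fullmatch(text, 0, end):
            return text[:end]
    return None


if __name__ == '__main__':
    # a* grabs all six leading a's, (ab)* then matches zero times.
    print(backtracking_match(PATTERN, TEXT))      # aaaaaa
    # The overall longest match needs a* to stop one 'a' early.
    print(longest_match_at_start(PATTERN, TEXT))  # aaaaaabab
```

The second result agrees with the substring() output quoted above; the first is what strict left-to-right greedy processing would give.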

> I suspect that it is possible to construct similar cases where, for an
> all-non-greedy pattern, finding the overall shortest match sometimes
> requires that individual quantifiers eat more than the local minimum.
> I've not absorbed enough caffeine yet this morning to produce an example
> though.

Probably true.  I guess, then, that the issue here is that there
isn't really any principled way to decide whether the RE overall
should be greedy or non-greedy, and the same goes for every sub-RE.  The
problem with the "non-greedy" quantifiers is that they apply only to
the quantified bit specifically, which leaves us guessing as to the
user's intent with regard to everything else.
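
One possible instance of the all-non-greedy case suspected above, offered as a sketch rather than a statement about what PostgreSQL does: the pattern and input below are made up for illustration, Python's re module again stands in for a Perl-style backtracking matcher, and shortest_match_at_start is a hypothetical brute-force helper that only looks at matches beginning at offset 0.

```python
import re

# Every quantifier here is non-greedy.
PATTERN = r'a*?(abc)??b'
TEXT = 'abcb'


def lazy_backtracking_match(pattern, text):
    """Return what a backtracking engine matches at offset 0 with lazy quantifiers."""
    m = re.match(pattern, text)
    return m.group(0) if m else None


def shortest_match_at_start(pattern, text):
    """Brute force: return the shortest prefix of text that the pattern matches in full.

    Hypothetical helper for illustration only.
    """
    compiled = re.compile(pattern)
    # Try the shortest candidate first and grow until one matches completely.
    for end in range(len(text) + 1):
        if compiled.fullmatch(text, 0, end):
            return text[:end]
    return None


if __name__ == '__main__':
    # a*? stays at its local minimum (zero), so (abc)?? has to consume 'abc'.
    print(lazy_backtracking_match(PATTERN, TEXT))  # abcb
    # The overall shortest match needs a*? to eat one 'a' more than its minimum.
    print(shortest_match_at_start(PATTERN, TEXT))  # ab
```

Here the overall shortest match is only reachable if a*? consumes more than its local minimum, which mirrors the greedy example with the roles reversed.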

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
