Our regex vs. POSIX on "longest match"

Lists: pgsql-hackers
From: Brendan Jurd <direvus(at)gmail(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Our regex vs. POSIX on "longest match"
Date: 2012-03-04 06:36:58
Message-ID: CADxJZo2p29wHjvk3AL0w4LAhFCQ_-f-gPJcWLyTe+hK2gWevyA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hello folks,

I am in the process of accelerating down the rabbit hole of regex
internals. Something that came up during my reading, is that a POSIX
compliant regex engine ought to always prefer the longest possible
match, when multiple matches are possible beginning from the same
location in the string. [1]

I wasn't sure that that was how our regex engine worked, and indeed,
on checking the manual [2] I found that our regex engine uses a
strange sort of "inductive greediness" to determine whether the
longest or the shortest possible match ought to be preferred. The
greediness of individual particles in the regex are taken into
account, and at the top level the entire expression is concluded to be
either greedy, or non-greedy.

I'll admit that this is a pretty obscure point, but we do appear to be
in direct violation of POSIX here. I am getting the impression that
our engine works this way to route around some of the performance
issues that can arise in trying out every possible match with an
NFA-style engine.

I find it a bit unfortunate that POSIX is so unambiguous about this,
while our engine's treatment is, well ... quite arcane by comparison.
At minimum, I think we should be more explicit in the manual that this
behaviour flips POSIX the proverbial bird. Several paragraphs south,
there is a sentence reading "Incompatibilities of note include ... the
longest/shortest-match (rather than first-match) matching semantics",
but in the context it seems as though the author is talking about an
incompatibility with Perl, not with POSIX.

Thoughts?

Cheers,
BJ

[1] http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html
[2] http://www.postgresql.org/docs/9.0/static/functions-matching.html#FUNCTIONS-POSIX-REGEXP


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Brendan Jurd <direvus(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Our regex vs. POSIX on "longest match"
Date: 2012-03-04 06:53:33
Message-ID: 6599.1330844013@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Brendan Jurd <direvus(at)gmail(dot)com> writes:
> I am in the process of accelerating down the rabbit hole of regex
> internals. Something that came up during my reading, is that a POSIX
> compliant regex engine ought to always prefer the longest possible
> match, when multiple matches are possible beginning from the same
> location in the string. [1]

> I wasn't sure that that was how our regex engine worked, and indeed,
> on checking the manual [2] I found that our regex engine uses a
> strange sort of "inductive greediness" to determine whether the
> longest or the shortest possible match ought to be preferred. The
> greediness of individual particles in the regex are taken into
> account, and at the top level the entire expression is concluded to be
> either greedy, or non-greedy.

> I'll admit that this is a pretty obscure point, but we do appear to be
> in direct violation of POSIX here.

How so? POSIX doesn't contain any non-greedy constructs. If you use
only the POSIX-compatible greedy constructs, the behavior is compliant.

The issue that is obscure is, once you define some non-greedy
constructs, how to define how they should act in combination with greedy
ones. I'm not sure to what extent the engine's behavior is driven by
implementation restrictions and to what extent it's really the sanest
behavior Henry could think of. I found a comment from him about it:
http://groups.google.com/group/comp.lang.tcl/msg/c493317cc0d10d50
but it's short on details as to what alternatives he considered.

regards, tom lane


From: Brendan Jurd <direvus(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Our regex vs. POSIX on "longest match"
Date: 2012-03-04 11:55:33
Message-ID: CADxJZo3qSSOwT5aMbzLtp+uP6+HzoDjjvxYL_L2MvOKnt1oiWQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 4 March 2012 17:53, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Brendan Jurd <direvus(at)gmail(dot)com> writes:
>> I'll admit that this is a pretty obscure point, but we do appear to be
>> in direct violation of POSIX here.
>
> How so?  POSIX doesn't contain any non-greedy constructs.  If you use
> only the POSIX-compatible greedy constructs, the behavior is compliant.

While it's true that POSIX doesn't contemplate non-greed, after
reading the spec I would have expected an expression *as a whole* to
still prefer the longest match, insofar as that is possible while
respecting non-greedy particles. I just ran some quick experiments in
Perl, Python and PHP, using 'xy1234' ~ 'y*?\d+'. All returned
'y1234', which to my mind is the most obvious answer, and one which
still makes sense in light of what POSIX has to say. Whereas Postgres
(and presumably Tcl) return 'y1'.

What I found surprising here is that our implementation allows an
earlier quantifier to clobber the greediness of a later quantifier.
There's a certain disregard for the intentions of the author of the
regex. If I had wanted the whole expression to be non-greedy, I could
have written 'y*?\d+?'. But since I wrote 'y*?\d+', it is clear that
I meant for the first atom to be non-greedy, and for the second to be
greedy.

> The issue that is obscure is, once you define some non-greedy
> constructs, how to define how they should act in combination with greedy
> ones.  I'm not sure to what extent the engine's behavior is driven by
> implementation restrictions and to what extent it's really the sanest
> behavior Henry could think of.  I found a comment from him about it:
> http://groups.google.com/group/comp.lang.tcl/msg/c493317cc0d10d50
> but it's short on details as to what alternatives he considered.

Thanks for the link; it is always good to get more insight into
Henry's approach. I'm beginning to think I should just start reading
everything he ever posted to comp.lang.tcl ...

Cheers,
BJ


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Brendan Jurd <direvus(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Our regex vs. POSIX on "longest match"
Date: 2012-03-04 17:34:22
Message-ID: 16069.1330882462@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Brendan Jurd <direvus(at)gmail(dot)com> writes:
> On 4 March 2012 17:53, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> Brendan Jurd <direvus(at)gmail(dot)com> writes:
>>> I'll admit that this is a pretty obscure point, but we do appear to be
>>> in direct violation of POSIX here.

>> How so? POSIX doesn't contain any non-greedy constructs. If you use
>> only the POSIX-compatible greedy constructs, the behavior is compliant.

> While it's true that POSIX doesn't contemplate non-greed, after
> reading the spec I would have expected an expression *as a whole* to
> still prefer the longest match, insofar as that is possible while
> respecting non-greedy particles.

It's not apparent to me that a construct that is outside the spec
shouldn't be expected to change the overall behavior. In particular,
what if the RE contains only non-greedy quantifiers --- would you still
expect longest match overall? Henry certainly didn't.

> I just ran some quick experiments in
> Perl, Python and PHP, using 'xy1234' ~ 'y*?\d+'. All returned
> 'y1234', which to my mind is the most obvious answer, and one which
> still makes sense in light of what POSIX has to say. Whereas Postgres
> (and presumably Tcl) return 'y1'.

Well, that's just an arbitrary example. The cases I remember people
complaining about in practice were the other way round: greedy
quantifier followed by non-greedy, and they were unhappy that the
non-greediness was effectively not respected (because the overall RE was
taken as greedy). So you can't fix the issue by pointing to POSIX and
saying "overall greedy is always the right thing".

Another thing I've seen people complain about is that an alternation
(anything with top-level "|") is always taken as greedy overall.
This seems a bit surprising to me --- I'd have expected a rule like
"inherits its greediness from the leftmost branch that has one",
though I'm not sure whether that would really be much of an improvement
in practice. (It would at least fix the problem that a leading "(a|b)"
determines greediness of the containing RE whereas the logically
equivalent "[ab]" does not.) Again, pointing to POSIX and saying
"overall greedy is the right thing" doesn't help.

regards, tom lane


From: hubert depesz lubaczewski <depesz(at)depesz(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Brendan Jurd <direvus(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Our regex vs. POSIX on "longest match"
Date: 2012-03-04 18:37:19
Message-ID: 20120304183719.GA10935@depesz.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sun, Mar 04, 2012 at 12:34:22PM -0500, Tom Lane wrote:
> Well, that's just an arbitrary example. The cases I remember people
> complaining about in practice were the other way round: greedy
> quantifier followed by non-greedy, and they were unhappy that the
> non-greediness was effectively not respected (because the overall RE was
> taken as greedy). So you can't fix the issue by pointing to POSIX and
> saying "overall greedy is always the right thing".

I was one of the complaining, and my point was that deciding for whole
regexp whether it's greedy or non-greedy is a bug (well, it might be
documented, but it's still *very* unexpected).

I stand on position that mixing greedy and non-greedy operators should
be possible, and that it should work according to POLA - i.e. greedines
of given atom shouldn't be influenced by other atoms.

Best regards,

depesz

--
The best thing about modern society is how easy it is to avoid contact with it.
http://depesz.com/


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: depesz(at)depesz(dot)com
Cc: Brendan Jurd <direvus(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Our regex vs. POSIX on "longest match"
Date: 2012-03-04 20:20:03
Message-ID: 2033.1330892403@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

hubert depesz lubaczewski <depesz(at)depesz(dot)com> writes:
> I stand on position that mixing greedy and non-greedy operators should
> be possible, and that it should work according to POLA - i.e. greedines
> of given atom shouldn't be influenced by other atoms.

[ shrug... ] That sounds good, but it's pretty much vacuous as far as
defining a principled alternative behavior goes. It's easy to
demonstrate cases where atoms *must* be influenced by other ones.
A trivial example is
(.*)(.*)
It doesn't matter whether the second atom is greedy or not: it's not
going to get to eat anything because the first one does instead.
IOW this is just the same as
(.*)(.*?)
--- they are both overall-greedy.

regards, tom lane


From: Brendan Jurd <direvus(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Our regex vs. POSIX on "longest match"
Date: 2012-03-05 03:55:17
Message-ID: CADxJZo39+Ms61LgXNm0Tmi3uALNO8t4c5OmcujK+BCFUeHqjxQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 5 March 2012 04:34, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Brendan Jurd <direvus(at)gmail(dot)com> writes:
>> On 4 March 2012 17:53, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>>> Brendan Jurd <direvus(at)gmail(dot)com> writes:
>> While it's true that POSIX doesn't contemplate non-greed, after
>> reading the spec I would have expected an expression *as a whole* to
>> still prefer the longest match, insofar as that is possible while
>> respecting non-greedy particles.
>
> It's not apparent to me that a construct that is outside the spec
> shouldn't be expected to change the overall behavior.  In particular,
> what if the RE contains only non-greedy quantifiers --- would you still
> expect longest match overall?  Henry certainly didn't.

Well, no, but then that wouldn't be "prefer the longest match, insofar
as that is possible while
respecting non-greedy particles". If all the quantifiers in an
expression are non-greedy, then it is trivially true that the only way
to respect the author's intention is to return the shortest match.

> Well, that's just an arbitrary example.  The cases I remember people
> complaining about in practice were the other way round: greedy
> quantifier followed by non-greedy, and they were unhappy that the
> non-greediness was effectively not respected (because the overall RE was
> taken as greedy).

I am unhappy about the reverse example too, and for the same reasons.

If we look at 'xy1234' ~ 'y*\d+?', Postgres gives us 'y1234'
(disregarding the non-greediness of the latter quantifier), and Python
gives us 'y1' (respecting both quantifiers).

So in Postgres, no matter how you mix the greediness up, some of your
quantifiers will not be respected.

> So you can't fix the issue by pointing to POSIX and
> saying "overall greedy is always the right thing".

"... insofar as that is possible while respecting non-greedy particles".

I will take Henry's word for it that this problem is harder than it
looks, but in any case, I think we may not presume to know better than
the author of the regex about the greediness of his quantifiers.

> Another thing I've seen people complain about is that an alternation
> (anything with top-level "|") is always taken as greedy overall.

Yeah, that is quirky.

Cheers,
BJ


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: depesz(at)depesz(dot)com, Brendan Jurd <direvus(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Our regex vs. POSIX on "longest match"
Date: 2012-03-05 06:23:10
Message-ID: CA+TgmoYiYQyJ-2e-erykmn-z128V8zpGrPxe8XKU4+sCA41nVQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sun, Mar 4, 2012 at 3:20 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> hubert depesz lubaczewski <depesz(at)depesz(dot)com> writes:
>> I stand on position that mixing greedy and non-greedy operators should
>> be possible, and that it should work according to POLA - i.e. greedines
>> of given atom shouldn't be influenced by other atoms.
>
> [ shrug... ]  That sounds good, but it's pretty much vacuous as far as
> defining a principled alternative behavior goes.  It's easy to
> demonstrate cases where atoms *must* be influenced by other ones.
> A trivial example is
>        (.*)(.*)
> It doesn't matter whether the second atom is greedy or not: it's not
> going to get to eat anything because the first one does instead.
> IOW this is just the same as
>        (.*)(.*?)
> --- they are both overall-greedy.

But I don't think this makes Brendan's example not a bug. In this
case, we can't eat anything because there's nothing to eat. In his
example, there's something to eat, and it's sitting in a place where
it logically seems to line up with a greedy quantifier, and yet the
quantifier doesn't eat it. Now in fairness, I've seen my fair share
of apparently-buggy behavior from Perl's regex engine over the years
as well.

I think the right way to imagine this is as though the regular
expression were being matched to the source text in left-to-right
fashion. For example, suppose the RE is [ab]+?[bc]+ and the source
text is aabbcc. Clearly there's a match at position 1, but the match
could be anywhere between 3 and 6 characters in length, depending on
how greedy we think the RE should be, and the exact source text that
gets matched to each atom is up for grabs. Enumerating all the
possibilities where each atom matches a string that is consistent with
its definition, leaving greediness aside, we get this:

[aa,b]
[aa,bb]
[aa,bbc]
[aa,bbcc]
[aab,b]
[aab,bc]
[aab,bcc]
[aabb,c]
[aabb,cc]

To distinguish among these, we look first at [ab]+? and decide that,
since it is non-greedy, the right overall answer must be one of the
ones that assigns to [ab]+? a match of minimal length within the space
of possible matches. That narrows the field to just the first four
choices listed above. Then we look at [bc]+ and, since it is greedy,
give it a match of maximal length within the space of possible
matches, leading to [ab]+? = aa and [bc]+ = bbcc. Had the RE been
[ab]+?[bc]+?, the same algorithm would assign [ab]+? = aa and [bc]+? =
b; had it been [ab]+[bc]+? the same algorithm would assign [ab]+ =
aabb and [bc]+? = c. Testing, I find that this matches what Perl does
in all of those cases.

Things get a bit more complicated when you throw alternation into the
picture, but I think it's basically right to view it as a greedy
operation. Anything else seems rather unprincipled. It may seem
surprising that a+?|b+? matched against aaa will match the first
branch to aaa rather than just a, but there's no real reason to
suppose that the ? which applies to the plus should somehow bleed up
and affect the interpretation of the alternation. The RE might be
something like a+b+?|b+?a+ where the sub-REs have different greediness
interpretations and there's no obvious principled way to decide which
one should apply to the parent; or even something like cat|catfish
where the sub-REs don't contain any greedy/non-greedy operators at all
and yet we still have to assign some greediness to the alternation. I
think it's right to view every RE construct as greedy unless it's got
an explicit "not-greedy" flag attached to it; after all, that's the
traditional behavior of REs from time immemorial. Someone could
invent a non-greedy form of alternation if they were so inclined.
This is different from what Perl does, but I think Perl's behavior
here is batty: given a+|a+b+ and the string aaabbb, it picks the first
branch and matches only aaa. My whole being recoils: that surely
looks like a non-greedy interpretation of a regex with only greedy
quantifiers. It turns out that if you write a+b+|a+ then it matches
the whole string; apparently, it selects the syntactically first
branch that can match, regardless of the length of the match, which
strikes me as nearly pure evil. PostgreSQL - correctly, IMHO -
matches the longest substring available regardless of the syntactic
ordering; AIUI, the original charter for REs was to always match the
longest possible substring; non-greedy quantifiers were added later,
and should not be taken as a license to change the semantics of REs
that don't use them.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Brendan Jurd <direvus(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, depesz(at)depesz(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Our regex vs. POSIX on "longest match"
Date: 2012-03-05 09:22:43
Message-ID: CADxJZo1fbE9FA+pW89dNqqiPpLstSxYKug9TLQcS_q+J7wF+_A@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 5 March 2012 17:23, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> This is different from what Perl does, but I think Perl's behavior
> here is batty: given a+|a+b+ and the string aaabbb, it picks the first
> branch and matches only aaa.

Yeah, this is sometimes referred to as "ordered alternation",
basically that the branches of the alternation are prioritised in the
same order in which they are described. It is fairly commonplace
among regex implementations.

> apparently, it selects the syntactically first
> branch that can match, regardless of the length of the match, which
> strikes me as nearly pure evil.

As long as it's documented that alternation prioritises in this way, I
don't feel upset about it. At least it still provides you with a
sensible way to get whatever you want from your RE; if you want a
shorter alternative to be preferred, put it up the front. Ordered
alternation also gives you a way to specify which of several
same-length alternatives you would prefer to be matched, which can
come in handy. It also means you can specify less-complex
alternatives before more-complex ones, which can have performance
advantages.

I do agree with you that if you *don't* do ordered alternation, then
it is right to treat alternation as greedy by default.

Cheers,
BJ


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: depesz(at)depesz(dot)com, Brendan Jurd <direvus(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Our regex vs. POSIX on "longest match"
Date: 2012-03-05 16:28:24
Message-ID: 12400.1330964904@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> I think the right way to imagine this is as though the regular
> expression were being matched to the source text in left-to-right
> fashion.

No, it isn't. You are headed down the garden path that leads to a
Perl-style definition-by-implementation, and in particular you are going
to end up with an implementation that fails to satisfy the POSIX
standard. POSIX requires an *overall longest* match (at least for cases
where all quantifiers are greedy), and that sometimes means that the
quantifiers can't be processed strictly left-to-right greedy. An
example of this is

regression=# select substring('aaaaaabab' from '(a*(ab)*)');
substring
-----------
aaaaaabab
(1 row)

If the a* is allowed to match as much as it wants, the (ab)* will not be
able to match at all, and then you fail to find the longest possible
overall match.

I suspect that it is possible to construct similar cases where, for an
all-non-greedy pattern, finding the overall shortest match sometimes
requires that individual quantifiers eat more than the local minimum.
I've not absorbed enough caffeine yet this morning to produce an example
though.

I probably shouldn't guess too much at Henry Spencer's thought
processes, but I think that he was looking for an extension of this
POSIX concept to mixed-greediness cases, ie you first define what the
overall RE matches and then let the individual quantifiers fight it out
as to which one gets how much of that. The particular way he did that
is obviously leaving a lot of people unsatisfied, but I think we need to
keep looking for rules of that sort, and not revert to defining the
behavior by a search algorithm.

> I think it's right to view every RE construct as greedy unless it's got
> an explicit "not-greedy" flag attached to it; after all, that's the
> traditional behavior of REs from time immemorial. Someone could
> invent a non-greedy form of alternation if they were so inclined.

I think you can do that already: "(foo|bar){1,1}?" (if this doesn't
result in a non-greedy alternation then it's a bug). The notation is
a bit ugly admittedly.

regards, tom lane


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: depesz(at)depesz(dot)com, Brendan Jurd <direvus(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Our regex vs. POSIX on "longest match"
Date: 2012-03-05 17:28:24
Message-ID: CA+TgmoZ7n3Nh3DwDULDdX7Yt=MMfqpmmf+1JAznPt1N+gB=5Eg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Mar 5, 2012 at 11:28 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> I think the right way to imagine this is as though the regular
>> expression were being matched to the source text in left-to-right
>> fashion.
>
> No, it isn't.  You are headed down the garden path that leads to a
> Perl-style definition-by-implementation, and in particular you are going
> to end up with an implementation that fails to satisfy the POSIX
> standard.  POSIX requires an *overall longest* match (at least for cases
> where all quantifiers are greedy), and that sometimes means that the
> quantifiers can't be processed strictly left-to-right greedy.  An
> example of this is
>
> regression=# select substring('aaaaaabab' from '(a*(ab)*)');
>  substring
> -----------
>  aaaaaabab
> (1 row)
>
> If the a* is allowed to match as much as it wants, the (ab)* will not be
> able to match at all, and then you fail to find the longest possible
> overall match.

Oh. Right.

> I suspect that it is possible to construct similar cases where, for an
> all-non-greedy pattern, finding the overall shortest match sometimes
> requires that individual quantifiers eat more than the local minimum.
> I've not absorbed enough caffeine yet this morning to produce an example
> though.

Probably true. I guess, then, that the issue here is that there
isn't really any principled way to decide whether the RE overall
should be greedy or non-greedy. And similarly with every sub-RE. The
problem with the "non-greedy" quantifiers is that they apply only to
the quantified bit specifically, which leaves us guessing as to the
user's intent with regards to everything else.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Martijn van Oosterhout <kleptog(at)svana(dot)org>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, depesz(at)depesz(dot)com, Brendan Jurd <direvus(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Our regex vs. POSIX on "longest match"
Date: 2012-03-05 20:06:11
Message-ID: 20120305200611.GA6569@svana.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Mar 05, 2012 at 11:28:24AM -0500, Tom Lane wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> > I think the right way to imagine this is as though the regular
> > expression were being matched to the source text in left-to-right
> > fashion.
>
> No, it isn't. You are headed down the garden path that leads to a
> Perl-style definition-by-implementation, and in particular you are going
> to end up with an implementation that fails to satisfy the POSIX
> standard. POSIX requires an *overall longest* match (at least for cases
> where all quantifiers are greedy), and that sometimes means that the
> quantifiers can't be processed strictly left-to-right greedy. An
> example of this is

On the otherhand, I think requiring an "overall longest match" makes
your implementation non-polynomial complexity. The simplest example I
can think of is the knapsack problem, where given weights x_n and a
total W, can be converted to a regex problem as matching a string with
W a's against the regex:

a{x_1}?a{x_2}?a{x_3}? etc...

Yes, Perl (and others) don't guarentee an overall longest match. I
think they want you to consider regular expressions as a specialised
parsing language where you can configure a state machine to process
your strings. Not ideal, but predicatable.

The question is, what are users expecting of the PostgreSQL regex
implementation?

Have a nice day,
--
Martijn van Oosterhout <kleptog(at)svana(dot)org> http://svana.org/kleptog/
> He who writes carelessly confesses thereby at the very outset that he does
> not attach much importance to his own thoughts.
-- Arthur Schopenhauer


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Martijn van Oosterhout <kleptog(at)svana(dot)org>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, depesz(at)depesz(dot)com, Brendan Jurd <direvus(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Our regex vs. POSIX on "longest match"
Date: 2012-03-05 20:30:13
Message-ID: 8837.1330979413@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Martijn van Oosterhout <kleptog(at)svana(dot)org> writes:
> On the otherhand, I think requiring an "overall longest match" makes
> your implementation non-polynomial complexity.

Only if you don't know how to implement it -- a DFA-based implementation
doesn't have much trouble with this.

> [ equivalence of knapsack problem to regexes with bounded repetition ]

Interesting, but note that neither the POSIX spec nor our implementation
permit arbitrarily large repetition counts, so the theoretical
NP-completeness is only theoretical.

> The question is, what are users expecting of the PostgreSQL regex
> implementation?

I think a minimum expectation is that we adhere to the POSIX
specification.

regards, tom lane