Re: Regex pattern with shorter back reference does NOT work as expected

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Jeevan Chalke <jeevan(dot)chalke(at)enterprisedb(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Regex pattern with shorter back reference does NOT work as expected
Date: 2013-07-13 17:13:21
Message-ID: 19880.1373735601@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

I wrote:
> Jeevan Chalke <jeevan(dot)chalke(at)enterprisedb(dot)com> writes:
>> Following example does not work as expected:
>>
>> -- Should return TRUE but returning FALSE
>> SELECT 'Programmer' ~ '(\w).*?\1' as t;

> This is clearly broken, but I'm uncomfortable with the proposed patch.
> As written, it changes behavior for both the shortest-match-preferred
> and longest-match-preferred cases; but you've offered no evidence that
> the longest-match case is broken. Maybe it is --- it's sure not
> obvious why it's okay to abandon the search early in this case. But I
> think we'd have been likely to hear about it before now if there were
> a matching failure in that path, since longest-preferred is so much
> more widely used.

After reflecting on this for awhile, I think that the longest-preferred
case is indeed also wrong in theory, but it's unreachable, which
explains the lack of bug reports. To get to the "no point in trying
again" code, we have to have a success of the DFA match followed by a
failure of the cdissect match, which essentially means that the string
would match if we didn't have to constrain some backref to exactly match
the string matched by its referent. Now, in the longest-preferred case,
the supposed early exit test is "end == begin", ie the tentative match
was of zero length overall. I can't envision any situation in which a
backref constraint could fail given that, because both the referent
pattern piece and the backref piece would have to be matching
zero-length substrings as well. There could be anchor constraints,
lookahead constraints, and so forth in play, but those would all have
been checked by the DFA, so they're not going to result in cdissect
failures. Hence, the "end == begin" test will simply never succeed.

On the other hand, the check made in the shortest-preferred case is
going to succeed if the tentative match was of maximal not minimal
length, and that's certainly a possible case.

So I think your patch is right, although I'd be inclined to refactor
the code to have just one test on "shorter", like this:

/* go around and try again, if possible */
if (shorter)
{
if (end == estop)
break;
estart = end + 1;
}
else
{
if (end == begin)
break;
estop = end - 1;
}

so as to make it clearer that we're just defending against selecting an
impossible new estart or estop location. (Although I argued above that
the "end == begin" test can't succeed, I wouldn't care to remove it
entirely here.)

regards, tom lane

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message David Fetter 2013-07-13 20:54:28 Re: GSOC13 proposal - extend RETURNING syntax
Previous Message Fabien COELHO 2013-07-13 16:13:13 Re: [PATCH] pgbench --throttle (submission 7 - with lag measurement)