Quick Links

Pathological regexp match

Lists:	pgsql-hackers

From:	Michael Glaesemann <michael(dot)glaesemann(at)myyearbook(dot)com>
To:	pgsql-hackers(at)postgresql(dot)org
Cc:	Michael Glaesemann <michael(dot)glaesemann(at)myyearbook(dot)com>
Subject:	Pathological regexp match
Date:	2010-01-28 22:14:04
Message-ID:	A12AF57F-4FB1-4AC4-A331-0F2DDB265433@myyearbook.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

We came across a regexp that takes very much longer than expected.

PostgreSQL 8.4.1 on x86_64-unknown-linux-gnu, compiled by GCC gcc (GCC) 4.1.2 20080704 (Red Hat 4.1.2-44), 64-bit

SELECT 'ooo...' ~ $r$Z(Q)[^Q]*A.*?(\1)$r$; -- omitted for email brevity

?column?
----------
t
(1 row)

Time: 90525.148 ms

The full query is below.

The same match in perl takes less than 0.01 seconds on the same hardware.

#!/bin/env perl
use warnings;
use strict;

my $sample = 'ooo...'; # omitted for email brevity

if ($sample =~ /Z(Q)[^Q]*A.*?(\1)/) {
print 'matches';
}
else {
print 'does not match';
}

This is a simplified version of a match that finally finished after 18 hours.

Given the nearly 4 orders of magnitude difference between the Perl script and the Postgres version, is there something that could be improved in the Postgres regex engine?

Cheers,

Michael Glaesemann
michael(dot)glaesemann(at)myyearbook(dot)com

SELECT 'ooooooooooQooQoQooooooooQoooooooooooooooooooooooooooooooooooooooQoooooQoooooooooooooooooooooooooooooooooooooooooooooooooooQooooooooooooooooZQooooooooooooooooooooooooooQooQoQooooooooooooQoQoooooooooooooooooooooooooooooooooooooQoooooooooooooooooooooooooooooooooooQoooooooQooooooQooooZQoooooooooooAooooQoooooooQooooooooooooooooooooooQoooooooQooooooooooooooooooooooQoooooQoooooooooooAooooooooooooooooooooooooooooooooooooooooooooQooooooooQoQooooooooooooooooooooooooooQoQooooooQoooooooQoooooooQoooooooQooooooooooooooooooooooooooooooooooooooooooZQoooooooooooooooooQooQoQoooooooQooooooooooooooooooooooooooooooooooooooooooQoooooQoooooooooooooooooooooooooooooooooooooooooooooooooooQooooooooQoQooooooooooooooooZQoooooooooooooooooQooQoQooooooooooooQoQoooooooooooooooooooooZQooooooooooooooooooooooooooQooQoQooooooooQoooooooooooooooooooooooooooooooooooooooQoooooQooooooooooooooooooooooooooooooooooooooooooooooQooooooooooooooooZQooooooooooooooooooooooooooQooQoQooooooooooooQoQooooooooooooooooooooooooooooooZQoooooooooooooQooQoQooooooooooQoooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooQooQoooooooooooooooooooooooooQoooooooQooooooooooooooooooooooooooooooooooooooQooooooooQoQoooooooooooooooooooooooooooooZQooooooooooooQooQoQooooooooooooooooooooooooooooooooooooooooooooooooooooooooZQooooooooooooooooooooooooooooooQooQoQoooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooQooooooooQooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooQooooooooQooooZQooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooQooooooooooQooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooQooooooQooooooooooQoooooooQoooooooooooQoooooooooooooo' ~ $r$Z(Q)[^Q]*A.*?(\1)$r$;

From:	Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To:	Michael Glaesemann <michael(dot)glaesemann(at)myyearbook(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Pathological regexp match
Date:	2010-01-29 02:59:22
Message-ID:	20100129025922.GA1793@alvh.no-ip.org
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Hi Michael,

Michael Glaesemann wrote:
> We came across a regexp that takes very much longer than expected.
>
> PostgreSQL 8.4.1 on x86_64-unknown-linux-gnu, compiled by GCC gcc (GCC) 4.1.2 20080704 (Red Hat 4.1.2-44), 64-bit
>
> SELECT 'ooo...' ~ $r$Z(Q)[^Q]*A.*?(\1)$r$; -- omitted for email brevity

The ? after .* is pointless. If you remove it, the query returns
immediately.

(There's a badly needed CHECK_FOR_INTERRUPTS in this code BTW)

--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

From:	Michael Glaesemann <michael(dot)glaesemann(at)myyearbook(dot)com>
To:	Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Pathological regexp match
Date:	2010-01-29 03:37:39
Message-ID:	B53EF768-0C04-4D75-BE6A-1BC2934E0500@myyearbook.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Jan 28, 2010, at 21:59 , Alvaro Herrera wrote:

> Hi Michael,
>
> Michael Glaesemann wrote:
>> We came across a regexp that takes very much longer than expected.
>>
>> PostgreSQL 8.4.1 on x86_64-unknown-linux-gnu, compiled by GCC gcc
>> (GCC) 4.1.2 20080704 (Red Hat 4.1.2-44), 64-bit
>>
>> SELECT 'ooo...' ~ $r$Z(Q)[^Q]*A.*?(\1)$r$; -- omitted for email
>> brevity
>
> The ? after .* is pointless.

Interesting. I would expect that *? would be the non-greedy version of
*, meaning match up to the first \1 (in this case the first Q
following A), rather than as much as possible.

For example, in Perl:
$ perl -e " if ('oooZQoooAoooQooQooQooo' =~ /Z(Q)[^Q]*A.*(\1)/)
{ print \$&; } else { print 'NO'; }" && echo
ZQoooAoooQooQooQ
$ perl -e " if ('oooZQoooAoooQooQooQooo' =~ /Z(Q)[^Q]*A.*?(\1)/)
{ print \$&; } else { print 'NO'; }" && echo
ZQoooAoooQ

If I'm reading the docs right, Postgres does support non-greedy * as *?:

<http://www.postgresql.org/docs/8.4/interactive/functions-matching.html#POSIX-QUANTIFIERS-TABLE
>

However, as you point out, Postgres doesn't appear to take this into
account:

postgres=# select regexp_replace('oooZQoooAoooQooQooQooo', $r$(Z(Q)
[^Q]*A.*(\2))$r$, $s$X$s$);
regexp_replace
----------------
oooXooo
(1 row)

postgres=# select regexp_replace('oooZQoooAoooQooQooQooo', $r$(Z(Q)
[^Q]*A.*?(\2))$r$, $s$X$s$);
regexp_replace
----------------
oooXooo
(1 row)

Michael Glaesemann
michael(dot)glaesemann(at)myyearbook(dot)com

From:	Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To:	Michael Glaesemann <michael(dot)glaesemann(at)myyearbook(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Pathological regexp match
Date:	2010-01-29 03:58:46
Message-ID:	20100129035846.GE1793@alvh.no-ip.org
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Michael Glaesemann wrote:
>
> On Jan 28, 2010, at 21:59 , Alvaro Herrera wrote:
>
> >Hi Michael,
> >
> >Michael Glaesemann wrote:
> >>We came across a regexp that takes very much longer than expected.
> >>
> >>PostgreSQL 8.4.1 on x86_64-unknown-linux-gnu, compiled by GCC
> >>gcc (GCC) 4.1.2 20080704 (Red Hat 4.1.2-44), 64-bit
> >>
> >>SELECT 'ooo...' ~ $r$Z(Q)[^Q]*A.*?(\1)$r$; -- omitted for email
> >>brevity
> >
> >The ? after .* is pointless.
>
> Interesting. I would expect that *? would be the non-greedy version
> of *, meaning match up to the first \1 (in this case the first Q
> following A), rather than as much as possible.

Huh, you are right, *? is the non-greedy version. I keep forgetting
those. Note that they only work if you have regex_flavor set to
advanced, though (which is the default).

> However, as you point out, Postgres doesn't appear to take this into
> account:
>
> postgres=# select regexp_replace('oooZQoooAoooQooQooQooo', $r$(Z(Q)
> [^Q]*A.*(\2))$r$, $s$X$s$);
> regexp_replace
> ----------------
> oooXooo
> (1 row)
>
> postgres=# select regexp_replace('oooZQoooAoooQooQooQooo', $r$(Z(Q)
> [^Q]*A.*?(\2))$r$, $s$X$s$);
> regexp_replace
> ----------------
> oooXooo
> (1 row)

Hmm, that's strange ...

--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

From:	Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To:	Michael Glaesemann <michael(dot)glaesemann(at)myyearbook(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Pathological regexp match
Date:	2010-01-29 04:21:42
Message-ID:	20100129042142.GF1793@alvh.no-ip.org
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Michael Glaesemann wrote:

I think the reason for this is that the first * is greedy and thus the
entire expression is considered greedy. The fact that you've made the
second * non-greedy does not ungreedify the RE ... Note the docs say:

The above rules associate greediness attributes not only with
individual quantified atoms, but with branches and entire REs
that contain quantified atoms. What that means is that the
matching is done in such a way that the branch, or whole RE,
matches the longest or shortest possible substring as a whole.

It's late here so I'm not sure if this is what you're looking for:

alvherre=# select regexp_replace('oooZQoooAoooQooQooQooo', $r$(Z(Q)[^Q]*?A.*(\2))$r$, $s$X$s$);
regexp_replace
----------------
oooXooQooQooo
(1 fila)

(Obviously the non-greediness has moved somewhere else) :-(

--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

From:	Michael Glaesemann <michael(dot)glaesemann(at)myyearbook(dot)com>
To:	Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Pathological regexp match
Date:	2010-01-29 04:36:58
Message-ID:	9B3645EE-2DC2-4379-AC00-B71F8C5D3E55@myyearbook.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Jan 28, 2010, at 23:21 , Alvaro Herrera wrote:

> I think the reason for this is that the first * is greedy and thus the
> entire expression is considered greedy. The fact that you've made the
> second * non-greedy does not ungreedify the RE ... Note the docs say:
>
> The above rules associate greediness attributes not only with
> individual quantified atoms, but with branches and entire REs
> that contain quantified atoms. What that means is that the
> matching is done in such a way that the branch, or whole RE,
> matches the longest or shortest possible substring as a whole.

Interesting. Thanks for pointing out this section of the docs. I
wasn't aware of this twist.

> It's late here so I'm not sure if this is what you're looking for:

I'm not actually looking for a regexp that works: I was able to
accomplish the task I had at hand with a different regexp. I'm just
reporting the particular unexpected nastiness we ran into. :)

Michael Glaesemann
michael(dot)glaesemann(at)myyearbook(dot)com

From:	Magnus Hagander <magnus(at)hagander(dot)net>
To:	Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc:	Michael Glaesemann <michael(dot)glaesemann(at)myyearbook(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Pathological regexp match
Date:	2010-01-29 13:14:22
Message-ID:	9837222c1001290514q3978889as3bba94b84714f69b@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

2010/1/29 Alvaro Herrera <alvherre(at)commandprompt(dot)com>:
> Hi Michael,
>
> Michael Glaesemann wrote:
>> We came across a regexp that takes very much longer than expected.
>>
>> PostgreSQL 8.4.1 on x86_64-unknown-linux-gnu, compiled by GCC gcc (GCC) 4.1.2 20080704 (Red Hat 4.1.2-44), 64-bit
>>
>> SELECT 'ooo...' ~ $r$Z(Q)[^Q]*A.*?(\1)$r$; -- omitted for email brevity
>
> The ? after .* is pointless. If you remove it, the query returns
> immediately.
>
> (There's a badly needed CHECK_FOR_INTERRUPTS in this code BTW)

Incidentally, I ran across the exact same issue with a non-greedy
regexp with a client earlier this week, and put on my TODO to figure
out a good place to stick a check for interrupts. Does this mean I
don't have to, because you're on it? ;)

--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/

From:	Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To:	Magnus Hagander <magnus(at)hagander(dot)net>
Cc:	Michael Glaesemann <michael(dot)glaesemann(at)myyearbook(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Pathological regexp match
Date:	2010-01-29 15:51:03
Message-ID:	20100129155103.GA1982@alvh.no-ip.org
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Magnus Hagander wrote:
> 2010/1/29 Alvaro Herrera <alvherre(at)commandprompt(dot)com>:

> > (There's a badly needed CHECK_FOR_INTERRUPTS in this code BTW)
>
> Incidentally, I ran across the exact same issue with a non-greedy
> regexp with a client earlier this week, and put on my TODO to figure
> out a good place to stick a check for interrupts. Does this mean I
> don't have to, because you're on it? ;)

No, sorry :-(

--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Michael Glaesemann <michael(dot)glaesemann(at)myyearbook(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Pathological regexp match
Date:	2010-01-30 07:07:46
Message-ID:	25719.1264835266@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Michael Glaesemann <michael(dot)glaesemann(at)myyearbook(dot)com> writes:
> We came across a regexp that takes very much longer than expected.

I poked into this a little bit. What seems to be happening is that the
use of non-greedy quantifiers plus backreferences turns off most of the
optimization that the regexp engine usually does, leaving the RE as a
tree of possibilities that is explored in a fairly dumb fashion. In
particular, there is a loop in crevdissect() that tries to locate a
feasible division point between two concatenated sub-patterns, and
for each try, a recursive call to crevdissect() tries to locate a
feasible division point between *its* two sub-patterns, resulting in
O(N^2) runtime. With a not very pleasant constant factor, too, because
of repeated startup/shutdown costs for the DFA matching engine.

I found a possible patch that seems to improve matters: AFAICS the DFA
matching is independent of the checking that cdissect() and friends do,
so we can apply that check first instead of second. This brings the
runtime down from minutes to sub-second on my machine. However I don't
have a whole lot of faith either in the correctness of this change, or
that it might not make some other cases slower instead of faster.
Has anyone got a reasonably messy collection of regexps they'd like
to try this patch on?

BTW, I filed the problem upstream with the Tcl project
https://sourceforge.net/tracker/?func=detail&aid=2942697&group_id=10894&atid=110894
but I'm not going to hold my breath waiting for useful advice from
them. Since Henry Spencer dropped off the net, there doesn't seem
to be anybody there who understands that code much better than we do.

Also, we likely still ought to toss some CHECK_FOR_INTERRUPTS calls
in there ...

regards, tom lane

Attachment	Content-Type	Size
regexp.patch	text/x-patch	2.2 KB

From:	Magnus Hagander <magnus(at)hagander(dot)net>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Michael Glaesemann <michael(dot)glaesemann(at)myyearbook(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Pathological regexp match
Date:	2010-01-31 17:02:56
Message-ID:	9837222c1001310902g68b9c352i4bff1ed494dc47ac@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Sat, Jan 30, 2010 at 08:07, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Michael Glaesemann <michael(dot)glaesemann(at)myyearbook(dot)com> writes:
>> We came across a regexp that takes very much longer than expected.
>
> I poked into this a little bit. What seems to be happening is that the
> use of non-greedy quantifiers plus backreferences turns off most of the
> optimization that the regexp engine usually does, leaving the RE as a
> tree of possibilities that is explored in a fairly dumb fashion. In
> particular, there is a loop in crevdissect() that tries to locate a
> feasible division point between two concatenated sub-patterns, and
> for each try, a recursive call to crevdissect() tries to locate a
> feasible division point between *its* two sub-patterns, resulting in
> O(N^2) runtime. With a not very pleasant constant factor, too, because
> of repeated startup/shutdown costs for the DFA matching engine.
>
> I found a possible patch that seems to improve matters: AFAICS the DFA
> matching is independent of the checking that cdissect() and friends do,
> so we can apply that check first instead of second. This brings the
> runtime down from minutes to sub-second on my machine. However I don't
> have a whole lot of faith either in the correctness of this change, or
> that it might not make some other cases slower instead of faster.
> Has anyone got a reasonably messy collection of regexps they'd like
> to try this patch on?

I only have the one, so I don't think I can help all that much with that.

> BTW, I filed the problem upstream with the Tcl project
> https://sourceforge.net/tracker/?func=detail&aid=2942697&group_id=10894&atid=110894
> but I'm not going to hold my breath waiting for useful advice from
> them. Since Henry Spencer dropped off the net, there doesn't seem
> to be anybody there who understands that code much better than we do.
>
> Also, we likely still ought to toss some CHECK_FOR_INTERRUPTS calls
> in there ...

Yeah. I have zero experience around that code, so if oyu have at least
some, I'd much appreciate it if you (or someone who does) could look
at it. Likely to cause a lot less breakage than me :D

--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Michael Glaesemann <michael(dot)glaesemann(at)myyearbook(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Pathological regexp match
Date:	2010-02-01 03:14:54
Message-ID:	27946.1264994094@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

I wrote:
> I found a possible patch that seems to improve matters: AFAICS the DFA
> matching is independent of the checking that cdissect() and friends do,
> so we can apply that check first instead of second. This brings the
> runtime down from minutes to sub-second on my machine. However I don't
> have a whole lot of faith either in the correctness of this change, or
> that it might not make some other cases slower instead of faster.
> Has anyone got a reasonably messy collection of regexps they'd like
> to try this patch on?

The Tcl folk accepted that patch, so I went ahead and applied it to
our code. It would still be a good idea for us to do any testing we
can on it, though.

> Also, we likely still ought to toss some CHECK_FOR_INTERRUPTS calls
> in there ...

I looked at this a bit and decided that it would be messier than seems
justified in the absence of known problems. The regex code uses malloc
not palloc for transient data structures, so we can't just stick a
CHECK_FOR_INTERRUPTS() into some inner loop hotspot --- throwing a
longjmp would mean a permanent memory leak. I looked at integrating
into the error mechanism it does have, but it turns out that that's
not particularly designed to force immediate exit on failure either;
they just set a flag bit that will be tested whenever control does
return from the match. I think that a good fix would require first
changing the mallocs to pallocs, then add CHECK_FOR_INTERRUPTS.
But that's not something I have time to mess with at the moment.

regards, tom lane

From:	Michael Glaesemann <michael(dot)glaesemann(at)myyearbook(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Pathological regexp match
Date:	2010-02-01 18:57:31
Message-ID:	9855CF60-CFD5-4921-809B-694FC53DB9BA@myyearbook.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Jan 31, 2010, at 22:14 , Tom Lane wrote:

> The Tcl folk accepted that patch, so I went ahead and applied it to
> our code. It would still be a good idea for us to do any testing we
> can on it, though.

I applied the patch and ran both the test query I submitted as well as
original problematic query that triggered the report, and it runs much
faster. Thanks for the fix!

Michael Glaesemann
michael(dot)glaesemann(at)myyearbook(dot)com

From:	Magnus Hagander <magnus(at)hagander(dot)net>
To:	Michael Glaesemann <michael(dot)glaesemann(at)myyearbook(dot)com>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Pathological regexp match
Date:	2010-02-08 13:15:32
Message-ID:	9837222c1002080515v5fee82baid8e88e9be0b457d4@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

2010/2/1 Michael Glaesemann <michael(dot)glaesemann(at)myyearbook(dot)com>:
>
> On Jan 31, 2010, at 22:14 , Tom Lane wrote:
>
>> The Tcl folk accepted that patch, so I went ahead and applied it to
>> our code. It would still be a good idea for us to do any testing we
>> can on it, though.
>
> I applied the patch and ran both the test query I submitted as well as original problematic query that triggered the report, and it runs much faster. Thanks for the fix!

I did the same, and it does not help in my case. FWIW, the regexp I'm
matching is:
<pre .*?>(.*?)</pre>

(yes, the production system has already been fixed to use a smarter
regexp that solves the same problem)

The text is about 180Kb. PostgreSQL takes ~40 seconds without the
patch, ~36 seconds with it, to extract the match from it. Perl takes
0.016 seconds.

--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/

From:	"David E(dot) Wheeler" <david(at)kineticode(dot)com>
To:	Magnus Hagander <magnus(at)hagander(dot)net>
Cc:	Michael Glaesemann <michael(dot)glaesemann(at)myyearbook(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Pathological regexp match
Date:	2010-02-08 17:07:04
Message-ID:	93227435-49FA-4915-8FA6-6976062E65DD@kineticode.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Feb 8, 2010, at 5:15 AM, Magnus Hagander wrote:

> The text is about 180Kb. PostgreSQL takes ~40 seconds without the
> patch, ~36 seconds with it, to extract the match from it. Perl takes
> 0.016 seconds.

Obviously we need to support Perl regular expressions in core. Not PCRE, but Perl. ;-P

Best,

David

From:	Joachim Wieland <joe(at)mcknight(dot)de>
To:	"David E(dot) Wheeler" <david(at)kineticode(dot)com>
Cc:	Magnus Hagander <magnus(at)hagander(dot)net>, Michael Glaesemann <michael(dot)glaesemann(at)myyearbook(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Pathological regexp match
Date:	2010-02-09 15:33:23
Message-ID:	dc7b844e1002090733s2bee3891x809a4f4541e054c5@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Mon, Feb 8, 2010 at 6:07 PM, David E. Wheeler <david(at)kineticode(dot)com> wrote:
> On Feb 8, 2010, at 5:15 AM, Magnus Hagander wrote:
>
>> The text is about 180Kb. PostgreSQL takes ~40 seconds without the
>> patch, ~36 seconds with it, to extract the match from it. Perl takes
>> 0.016 seconds.
>
> Obviously we need to support Perl regular expressions in core. Not PCRE, but Perl. ;-P

FWIW, PCRE is BSD licensed, so we could actually include it... Would
be interesting to see how it performs on the pattern at hand.

Joachim