regexp_matches and regexp_split are inconsistent

Lists: pgsql-hackers
From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql-hackers(at)postgreSQL(dot)org
Subject: regexp_matches and regexp_split are inconsistent
Date: 2007-08-11 01:25:34
Message-ID: 17867.1186795534@sss.pgh.pa.us

I noticed the following behavior in CVS HEAD, using a pattern that is
capable of matching no characters:

regression=# SELECT foo FROM regexp_matches('ab cde', $re$\s*$re$, 'g') AS foo;
foo
-------
{""}
{""}
{" "}
{""}
{""}
{""}
{""}
(7 rows)

regression=# SELECT foo FROM regexp_split_to_table('ab cde', $re$\s*$re$) AS foo;
foo
-----
a
b
c
d
e
(5 rows)

If you count carefully, you will see that regexp_matches() reports a
match of the pattern at the start of the string and at the end of the
string, and also just before 'c' (after the match to the single space).
However, regexp_split() disregards these "degenerate" matches of the
same pattern.
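
To make the difference concrete outside the backend, here is a minimal Python sketch of the two behaviors described above. It is an illustration under stated assumptions, not PostgreSQL's implementation: Python's re module stands in for the Postgres regex engine, and the names all_matches and split_ignoring_degenerate are hypothetical. The split helper skips a zero-length match at the start of the string, at the end of the string, or immediately after the previous match.

```python
import re


def all_matches(pattern, text):
    """Return (start, end) spans of every match, zero-length ones included."""
    rx = re.compile(pattern)
    spans, pos = [], 0
    while pos <= len(text):
        m = rx.search(text, pos)
        if m is None:
            break
        spans.append(m.span())
        # After a zero-length match, step one character to avoid looping forever.
        pos = m.end() if m.end() > m.start() else m.end() + 1
    return spans


def split_ignoring_degenerate(pattern, text):
    """Split text on pattern, disregarding "degenerate" zero-length matches
    at the string start, at the string end, or right after the previous match."""
    pieces = []
    piece_start = 0   # where the current output piece begins
    prev_end = None   # end offset of the previous match of any kind
    for start, end in all_matches(pattern, text):
        degenerate = start == end and (
            start == 0 or start == len(text) or start == prev_end
        )
        prev_end = end
        if degenerate:
            continue
        pieces.append(text[piece_start:start])
        piece_start = end
    pieces.append(text[piece_start:])
    return pieces


text = "ab cde"
matches = [text[s:e] for s, e in all_matches(r"\s*", text)]
fields = split_ignoring_degenerate(r"\s*", text)
print(matches)  # ['', '', ' ', '', '', '', '']  (7 matches)
print(fields)   # ['a', 'b', 'c', 'd', 'e']      (5 fields)
```

The three skipped matches are the ones at offset 0, at offset 6 (end of string), and at offset 3, just before 'c'.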

Is this what we want? Arguably regexp_split is doing the most
reasonable thing for its intended usage, but the strict definition of
regexp matching seems to require what regexp_matches does. I think
we need to either change one function to match the other, or else
document the inconsistency.

Thoughts?

regards, tom lane


From: "Pavel Stehule" <pavel(dot)stehule(at)gmail(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: regexp_matches and regexp_split are inconsistent
Date: 2007-08-11 05:44:26
Message-ID: 162867790708102244q667d682ak5b37ac03c20b78d7@mail.gmail.com

>
> If you count carefully, you will see that regexp_matches() reports a
> match of the pattern at the start of the string and at the end of the
> string, and also just before 'c' (after the match to the single space).
> However, regexp_split() disregards these "degenerate" matches of the
> same pattern.
>
> Is this what we want? Arguably regexp_split is doing the most
> reasonable thing for its intended usage, but the strict definition of
> regexp matching seems to require what regexp_matches does. I think
> we need to either change one function to match the other, or else
> document the inconsistency.
>

The behavior of regexp_matches is correct, but less usable. I think the
empty match between the virtual start of the string and the first
character, and the one between the last character and the virtual end,
can be eliminated.

Regards
Pavel Stehule


From: Stephan Szabo <sszabo(at)megazone(dot)bigpanda(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: regexp_matches and regexp_split are inconsistent
Date: 2007-08-11 16:59:03
Message-ID: 20070810193026.J20339@megazone.bigpanda.com


On Fri, 10 Aug 2007, Tom Lane wrote:

> I noticed the following behavior in CVS HEAD, using a pattern that is
> capable of matching no characters:
>
> regression=# SELECT foo FROM regexp_matches('ab cde', $re$\s*$re$, 'g') AS foo;
> foo
> -------
> {""}
> {""}
> {" "}
> {""}
> {""}
> {""}
> {""}
> (7 rows)
>
> regression=# SELECT foo FROM regexp_split_to_table('ab cde', $re$\s*$re$) AS foo;
> foo
> -----
> a
> b
> c
> d
> e
> (5 rows)
>
> If you count carefully, you will see that regexp_matches() reports a
> match of the pattern at the start of the string and at the end of the
> string, and also just before 'c' (after the match to the single space).
> However, regexp_split() disregards these "degenerate" matches of the
> same pattern.
>
> Is this what we want? Arguably regexp_split is doing the most
> reasonable thing for its intended usage, but the strict definition of
> regexp matching seems to require what regexp_matches does. I think
> we need to either change one function to match the other, or else
> document the inconsistency.
>
> Thoughts?

I'm not sure how many languages do this, but at least perl seems to work
similarly, which makes me guess that it's probably similar in a bunch of
languages. If it is, then we should probably just document the
inconsistency.

Perl seems to document the split behavior with "Empty leading (or
trailing) fields are produced when there are positive width matches at the
beginning (or end) of the string; a zero-width match at the beginning (or
end) of the string does not produce an empty field."
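
The quoted rule can be sketched in a few lines of Python. This models only the rule as quoted (plus skipping a zero-width match directly after the previous match, as in the regexp_split_to_table output above); it does not reproduce other aspects of Perl's split, such as its limit argument. The helper name split_fields is hypothetical.

```python
import re


def split_fields(pattern, text):
    """Split text on pattern, skipping zero-width matches at either end of
    the string or directly after the previous match (hypothetical helper)."""
    rx = re.compile(pattern)
    fields, field_start, prev_end, pos = [], 0, None, 0
    while pos <= len(text):
        m = rx.search(text, pos)
        if m is None:
            break
        start, end = m.span()
        zero_width = start == end
        skip = zero_width and (start in (0, len(text)) or start == prev_end)
        if not skip:
            fields.append(text[field_start:start])
            field_start = end
        prev_end = end
        # Step past a zero-width match so the scan always makes progress.
        pos = end + 1 if zero_width else end
    fields.append(text[field_start:])
    return fields


# Positive-width matches at both ends produce empty leading/trailing fields.
print(split_fields(r"\s+", " ab "))  # ['', 'ab', '']
# Zero-width matches at the ends produce no empty fields.
print(split_fields(r"\s*", "ab"))    # ['a', 'b']
```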


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Stephan Szabo <sszabo(at)megazone(dot)bigpanda(dot)com>
Cc: pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: regexp_matches and regexp_split are inconsistent
Date: 2007-08-13 01:20:03
Message-ID: 12786.1186968003@sss.pgh.pa.us

Stephan Szabo <sszabo(at)megazone(dot)bigpanda(dot)com> writes:
> On Fri, 10 Aug 2007, Tom Lane wrote:
>> Is this what we want? Arguably regexp_split is doing the most
>> reasonable thing for its intended usage, but the strict definition of
>> regexp matching seems to require what regexp_matches does. I think
>> we need to either change one function to match the other, or else
>> document the inconsistency.

> I'm not sure how many languages do this, but at least perl seems to work
> similarly, which makes me guess that it's probably similar in a bunch of
> languages. If it is, then we should probably just document the
> inconsistency.

The Perl precedent is good enough for me. Documented...

regards, tom lane