Re: BUG #5469: regexp_matches() has poor behaviour and more poor documentation

From: Daniele Varrazzo <daniele(dot)varrazzo(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: BUG #5469: regexp_matches() has poor behaviour and more poor documentation
Date: 2010-05-26 11:58:15
Message-ID: AANLkTil7bJBR-pH_tDUCZqKtSTypFg_BqjTYpPC1cHL1@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

On Wed, May 26, 2010 at 4:51 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Mon, May 24, 2010 at 9:16 AM, Daniele Varrazzo
> <daniele(dot)varrazzo(at)gmail(dot)com> wrote:
>> regexp_matches() has been recently discussed
>> (http://archives.postgresql.org/pgsql-bugs/2010-04/msg00026.php): it is a
>> setof function and as such it can drop results.
>>
>> Unfortunately it is an useful function to newcomers who use SQL, use regexps
>> but don't arrive to read the chapter 34.4.7. (i.e. Extending SQL -> Query
>> Language (SQL) Functions -> SQL Functions Returning Sets) and are not so
>> enlightened to know that "setof text[]" means "if it doesn't match, it drops
>> the record". They just expect the function to be a LIKE on steroids.
>>
>> Please describe the behavior in the documentation of the function (i.e.
>> table 9-6. and section 9.7.3), possibly provide a function with a saner
>> interface, i.e. returning a text[] of the first match or NULL on no match,
>> or document a workaround (suitable for an user knowing regexps but not
>> setof-returning functions) to make the function not dropping record (e.g. I
>> fixed the "bug" adding a "|" at the end of the pattern, so that the function
>> returns an array of NULL in case of no match: I don't think it is a trivial
>> workaround).
>
> I'm not sure that it's very productive to refer to the behavior of our
> code as insane.  We do document this in section 9.7.3, pretty
> explicitly:
>
> "The regexp_matches function returns all of the captured substrings
> resulting from matching a POSIX regular expression pattern. It has the
> syntax regexp_matches(string, pattern  [, flags  ]). If there is no
> match to the pattern, the function returns no rows. If there is a
> match, the function returns a text array whose n'th element is the
> substring matching the n'th parenthesized subexpression of the pattern
> (not counting "non-capturing" parentheses; see below for details)."
>
> I think that's pretty clear.  Your mileage may vary, of course.

"If there is no match to the pattern, the function returns no rows" is
easily overlooked as "it returns null", or some other behaviour that
don't change the returned set. The point is, because the function is
listed in the string function, you would expect the function to
manipulate text, not the dataset. The function as it is is not safe to
be used in a construct

SELECT foo, bar, regexp_matches(bar, pattern) FROM table;

unless you really wanted:

SELECT foo, bar, regexp_matches(bar, pattern) FROM table WHERE bar
~ pattern;

otherwise you have to take measures to be able to deal with records in
which the pattern is not matched, for example:

SELECT foo, bar, regexp_matches(bar, pattern || '|') FROM table;

the latter still doesn't work when bar is NULL: in this case the
record is dropped anyway, so I don't think it can be proposed as
general solution.

The characteristics of returning a set of text[] is useful when the
user wants all the matches, not only the first one: the behaviour is
selected specifying the flag 'g' as third argument.

From this point of view, I hope it can be stated that in its current
form the regexp_matches() has not the most optimal interface. Please
accept my apology for the tone being too rude in my previous message.

> I'm less confident than what we have in table 9-6 (other string
> functions, in section 9.4, string functions and operators) is clear on
> first reading, but neither do I immediately know how to improve it.
> Perhaps instead of critiquing our insanity you could provide some
> specific suggestions for improvement.
>
> Similarly, if you think we should have another function besides
> regexp_matches(), rather than just complaining that we don't, it would
> be more useful to suggest a name and a specific behavior and ideally
> maybe even provide a patch (or just the docs portion of a patch) -
> especially if you can point to a specific use-case that is hard to do
> with the SRF but would be easier with a function with a different
> interface.

Below I assume an alternative function is provided. I have problems in
finding a name for the function, as regexp_matches() is already used.
I would call it regexp_match() in reference to the fact that it
returns a single value (being an array) and not a list of matches as
potentially regexp_matches() could. The quite similar names could be a
problem though.

Because table 9-6 is the index people look for when they have a task
related to strings, I would say wording should be:

[regexp_matches:] Return all groups of captured substrings resulting
from matching a POSIX regular expression against the string. Warning:
in case of no match, tested record is dropped. See Section 9.7.3 for
more information.
[regexp_match:] Return the first group of captured substrings
resulting from matching a POSIX regular expression against the string.
In case of no match, return NULL. See Section 9.7.3 for more
information.

In section 9.7.3, after "If there is no match to the pattern, the
function returns no rows." I would add "This means that if the
function is used in a SELECT, records where the string don't match the
pattern are discarded from the dataset. If such records are required,
use regexp_match() instead".

Reference implementation for regexp_match() is:

CREATE OR REPLACE FUNCTION regexp_match(s text, pattern text)
RETURNS text[] AS
$$
DECLARE
rv text[];
BEGIN
SELECT * INTO rv FROM regexp_matches(s, pattern);
IF FOUND THEN
RETURN rv;
ELSE
RETURN NULL;
END IF;
END;
$$
LANGUAGE 'plpgsql'
IMMUTABLE STRICT ;

The reference implementation is rather inefficient: a more efficient
one can be easily provided in C, sharing code with
regexp_{matches,split}.

A minimal test case is:

test=> SELECT regexp_match('foobarbequebaz', '(bar)(beque)');
regexp_match | {bar,beque}

test=> SELECT regexp_match('foobarbaz', '(bar)(beque)');
regexp_match |

test=> SELECT regexp_match(NULL, '(bar)(beque)');
regexp_match |

test=> SELECT regexp_match('foobarbequebaz', NULL);
regexp_match |

If the problem is acknowledged, I'd be happy to provide relevant patches.

-- Daniele

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Nelson da Silva 2010-05-26 12:04:51 BUG #5474: Installation
Previous Message Craig Ringer 2010-05-26 05:30:55 Re: BUG #5468: Pg doesn't send accepted root CA list to client during SSL client cert request