Re: Patch: regexp_matches variant returning an array of matching positions

From: David Johnston <polobo(at)yahoo(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Patch: regexp_matches variant returning an array of matching positions
Date: 2014-01-29 04:16:27
Message-ID: 1390968987243-5789414.post@n5.nabble.com
Lists: pgsql-hackers

Alvaro Herrera-9 wrote
> Björn Harrtell wrote:
>> I've written a variant of regexp_matches called regexp_matches_positions
>> which instead of returning matching substrings will return matching
>> positions. I found use of this when processing OCR scanned text and wanted
>> to prioritize matches based on their position.
>
> Interesting. I didn't read the patch but I wonder if it would be of
> more general applicability to return more info in a fell swoop a
> function returning a set (position, length, text of match), rather than
> an array. So instead of first calling one function to get the match and
> then their positions, do it all in one pass.
>
> (See pg_event_trigger_dropped_objects for a simple example of a function
> that returns in that fashion. There are several others but AFAIR that's
> the simplest one.)

I'm confused by your thinking. Like regexp_matches, this returns "SETOF
type[]": integer[] in this case, versus text[] for the matches. I could see
adding a generic function that returns a SETOF named composite (match
varchar[], position int[], length int[]) along with the corresponding type.
I can't imagine a situation where you'd want the position but not the text,
so having to evaluate the regexp twice seems wasteful. The length is probably
unnecessary, though, since it can readily be derived from the text and is
less often needed. But if it's pre-calculated anyway...
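
To make the composite idea concrete, here is a rough sketch. The type name
regexp_match_info is made up for illustration and is not part of the patch;
the field names just follow the list above. The query underneath uses only
the existing regexp_matches, and shows why the length column is redundant:
it can be rebuilt from the matched text alone. WITH ORDINALITY assumes 9.4 or
later.

```sql
-- Hypothetical composite type; name and fields are illustrative only.
CREATE TYPE regexp_match_info AS (
    match      text[],     -- captured substrings, as regexp_matches returns them
    "position" integer[],  -- start of each capture (quoted: POSITION is a keyword)
    length     integer[]   -- length of each capture, derivable from match
);

-- Deriving the lengths from the text, with no second regexp evaluation.
SELECT m AS match,
       ARRAY(SELECT length(u.x)
             FROM unnest(m) WITH ORDINALITY AS u(x, ord)
             ORDER BY u.ord) AS length
FROM regexp_matches('abcabc', '((a)(b)(c))', 'g') AS m;
```

The positions, by contrast, cannot be recovered from the text this way, which
is why they would have to come out of the same regexp evaluation.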

My question is: what position is returned in a multiple-match situation? The
supplied test only covers the simple, non-global case. It needs to
exercise empty sub-matches and global searches. One theory is that the
first array slot should hold the global position of match zero (i.e., the
entire pattern) within the larger document, while sub-matches would be
relative offsets within that single match. This conflicts, though, with the
fact that _matches only returns array elements for () items and never for
the full match, the goal of this function being parallel un-nesting. But
since groups can nest, a capture can still span the entire match, so the
question remains.

How does this resolve in the patch?

SELECT regexp_matches('abcabc','((a)(b)(c))','g');
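
For reference, this is what the existing function returns for that query,
followed by two readings of what a positions variant could return. Both
readings are my own guesses, assuming 1-based positions as elsewhere in SQL;
I have not checked which one, if either, the patch implements.

```sql
SELECT regexp_matches('abcabc', '((a)(b)(c))', 'g');
--  {abc,a,b,c}
--  {abc,a,b,c}

-- Reading (a): every capture reported as an absolute position in the input.
--  {1,1,2,3}
--  {4,4,5,6}

-- Reading (b): a leading slot for match zero's absolute position, then each
-- capture as an offset relative to that match.
--  {1,1,1,2,3}
--  {4,1,1,2,3}
```

Reading (b) yields five elements per row against the four that
regexp_matches returns, so the two results could no longer be un-nested in
parallel.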

David J.

--
View this message in context: http://postgresql.1045698.n5.nabble.com/Patch-regexp-matches-variant-returning-an-array-of-matching-positions-tp5789321p5789414.html
Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.
