Re: WIP: index support for regexp search

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Erik Rijkers <er(at)xs4all(dot)nl>, Tomas Vondra <tv(at)fuzzy(dot)cz>, pgsql-hackers(at)postgresql(dot)org, pavel(dot)stehule(at)gmail(dot)com
Subject: Re: WIP: index support for regexp search
Date: 2013-03-24 21:50:44
Message-ID: 5280.1364161844@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Alexander Korotkov <aekorotkov(at)gmail(dot)com> writes:
> Now I have working implemetation of this API. Comments still need rework.
> Could you give me any feedback?

I looked at this a little bit, but it's not very far along at all
towards resolving my API worries. The basic point that I'm concerned
about is that we would like to split off the regex library (ie,
backend/regex/) as a standalone project someday. There are already
some problems to be resolved to make that possible, and I don't want
this patch to introduce new ones. From that standpoint, the proposed
regextract.c file is a disaster, because it depends on a boatload of
Postgres-specific stuff (at least StringInfo, List, HTAB, and pg_wchar;
to say nothing of palloc). We can't consider that to be on the regex
side of the fence.

Similarly, pushing PG-specific declarations like RE_compile_and_cache()
into regex/regex.h is completely not the right thing for preserving a
clear library boundary (even positing that we want to expose that
function outside adt/regexp.c, which I'd rather we didn't).

Given that you've already got a notion of callbacks provided by
contrib/pg_trgm, perhaps this can be fixed by pushing more of the work
into those callbacks, so that the heavy-duty data structures like the
hash table live over there and the API exposed by backend/regex/ is at
a much simpler level than what you have here. But right now I don't
see any usable library API here.

Perhaps we could avoid these issues by defining a library API that
provides accessors like these for the opaque regex_t struct:

* get the number of states in the CNFA

* get the numbers of the initial and final states

* get the number of out-arcs for the N'th state

* get the out-arcs for the N'th state into a caller-provided array
(sized using the previous function), where each out-arc is represented
by a color and an end-state

* get the number of character codes represented by color C

* get the wchar codes for color C into a caller-provided array

(The reason for letting the caller allocate the result arrays is so we
can use palloc for that; if we allocate it in backend/regex/ we must
use malloc, which will greatly increase the risk of leakages. Also,
as far as the color API goes, the above lets the caller decide how
many characters is "too many" to bother with.)

regards, tom lane

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Michael Paquier 2013-03-24 22:06:23 Re: Interesting post-mortem on a near disaster with git
Previous Message Simon Riggs 2013-03-24 21:08:24 Re: Limiting setting of hint bits by read-only queries; vacuum_delay