Re: WIP: index support for regexp search

From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Erik Rijkers <er(at)xs4all(dot)nl>, Tomas Vondra <tv(at)fuzzy(dot)cz>, pgsql-hackers(at)postgresql(dot)org, pavel(dot)stehule(at)gmail(dot)com
Subject: Re: WIP: index support for regexp search
Date: 2013-03-24 22:24:33
Message-ID: CAPpHfdsRdG1M9HU1=71WM1eXYHGDXSooyuAKPXYZQmrfmnKcGA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Mar 25, 2013 at 1:50 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> Alexander Korotkov <aekorotkov(at)gmail(dot)com> writes:
> > Now I have working implemetation of this API. Comments still need rework.
> > Could you give me any feedback?
>
> I looked at this a little bit, but it's not very far along at all
> towards resolving my API worries. The basic point that I'm concerned
> about is that we would like to split off the regex library (ie,
> backend/regex/) as a standalone project someday. There are already
> some problems to be resolved to make that possible, and I don't want
> this patch to introduce new ones. From that standpoint, the proposed
> regextract.c file is a disaster, because it depends on a boatload of
> Postgres-specific stuff (at least StringInfo, List, HTAB, and pg_wchar;
> to say nothing of palloc). We can't consider that to be on the regex
> side of the fence.
>

Now I can see your position about API much more clearly. Previously I
thought only about algorithmic side of things.

Similarly, pushing PG-specific declarations like RE_compile_and_cache()
> into regex/regex.h is completely not the right thing for preserving a
> clear library boundary (even positing that we want to expose that
> function outside adt/regexp.c, which I'd rather we didn't).
>
> Given that you've already got a notion of callbacks provided by
> contrib/pg_trgm, perhaps this can be fixed by pushing more of the work
> into those callbacks, so that the heavy-duty data structures like the
> hash table live over there and the API exposed by backend/regex/ is at
> a much simpler level than what you have here. But right now I don't
> see any usable library API here.
>
> Perhaps we could avoid these issues by defining a library API that
> provides accessors like these for the opaque regex_t struct:
>
> * get the number of states in the CNFA
>
> * get the numbers of the initial and final states
>
> * get the number of out-arcs for the N'th state
>
> * get the out-arcs for the N'th state into a caller-provided array
> (sized using the previous function), where each out-arc is represented
> by a color and an end-state
>
> * get the number of character codes represented by color C
>
> * get the wchar codes for color C into a caller-provided array
>
> (The reason for letting the caller allocate the result arrays is so we
> can use palloc for that; if we allocate it in backend/regex/ we must
> use malloc, which will greatly increase the risk of leakages. Also,
> as far as the color API goes, the above lets the caller decide how
> many characters is "too many" to bother with.)
>

I like the this idea. Seems like clear and not over-engineered API. I can
implement it. Could you propose something particular to do with
RE_compile_and_cache in this patch? With this API we still need a way to
get regex_t or equivalent from string.

------
With best regards,
Alexander Korotkov.

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2013-03-24 22:38:47 Re: WIP: index support for regexp search
Previous Message Andrew Dunstan 2013-03-24 22:22:12 Re: Interesting post-mortem on a near disaster with git