Re: WIP: index support for regexp search

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Erik Rijkers <er(at)xs4all(dot)nl>, Tomas Vondra <tv(at)fuzzy(dot)cz>, pgsql-hackers(at)postgresql(dot)org, pavel(dot)stehule(at)gmail(dot)com
Subject: Re: WIP: index support for regexp search
Date: 2013-03-24 22:38:47
Message-ID: 6275.1364164727@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Alexander Korotkov <aekorotkov(at)gmail(dot)com> writes:
> On Mon, Mar 25, 2013 at 1:50 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> Similarly, pushing PG-specific declarations like RE_compile_and_cache()
>> into regex/regex.h is completely not the right thing for preserving a
>> clear library boundary (even positing that we want to expose that
>> function outside adt/regexp.c, which I'd rather we didn't).
>>
>> Perhaps we could avoid these issues by defining a library API that
>> provides accessors like these for the opaque regex_t struct:
>>
>> * get the number of states in the CNFA
>>
>> * get the numbers of the initial and final states
>>
>> * get the number of out-arcs for the N'th state
>>
>> * get the out-arcs for the N'th state into a caller-provided array
>> (sized using the previous function), where each out-arc is represented
>> by a color and an end-state
>>
>> * get the number of character codes represented by color C
>>
>> * get the wchar codes for color C into a caller-provided array
>>
>> (The reason for letting the caller allocate the result arrays is so we
>> can use palloc for that; if we allocate it in backend/regex/ we must
>> use malloc, which will greatly increase the risk of leakages. Also,
>> as far as the color API goes, the above lets the caller decide how
>> many characters is "too many" to bother with.)

> I like the this idea. Seems like clear and not over-engineered API. I can
> implement it. Could you propose something particular to do with
> RE_compile_and_cache in this patch? With this API we still need a way to
> get regex_t or equivalent from string.

Well, the brute force answer is for pg_trgm to not go through
RE_compile_and_cache at all, but just call the regex library for itself.
That would lose the ability to cache regex compilations, but I'm not
sure we care. The motivation for RE_compile_and_cache is mainly to
prevent having to compile the regex again for *every row*, which is what
would otherwise happen in a simple SELECT using a regex function.
Unless I'm misunderstanding something, pg_trgm would only need to
compile the regex once per indexscan, which is probably tolerable.

Another approach we could take is for adt/regexp.c to expose a function
that collects the deconstructed CNFA data into a palloc'd structure
using the above-proposed functions. This'd be a reasonable way to go if
we decide the caching behavior is actually needed for this purpose; but
otherwise it seems like it's just involving one more module than
necessary.

BTW, one reason I don't like exposing RE_compile_and_cache directly is
that then you have the question of how long is the regex_t it hands back
good for. The caller doesn't own that struct, and at some point
regexp.c is likely to recycle it, so it's a bit risky to just assume
it'll stay around "long enough" without any further interaction with
regexp.c.

regards, tom lane

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Nicholas White 2013-03-25 00:15:10 Re: Request for Patch Feedback: Lag & Lead Window Functions Can Ignore Nulls
Previous Message Alexander Korotkov 2013-03-24 22:24:33 Re: WIP: index support for regexp search