Re: WIP: index support for regexp search

From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: Erik Rijkers <er(at)xs4all(dot)nl>, Tomas Vondra <tv(at)fuzzy(dot)cz>, pgsql-hackers(at)postgresql(dot)org, pavel(dot)stehule(at)gmail(dot)com
Subject: Re: WIP: index support for regexp search
Date: 2013-03-21 15:07:02
Message-ID: CAPpHfdtTxBiwGX5CuwQn2NKQJTYLsCghQDAcUR8D-BF=_hoSMw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Mar 14, 2013 at 9:40 PM, Alexander Korotkov <aekorotkov(at)gmail(dot)com>wrote:

> On Wed, Jan 23, 2013 at 7:29 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>
>> Heikki Linnakangas <hlinnakangas(at)vmware(dot)com> writes:
>> > On 23.01.2013 09:36, Alexander Korotkov wrote:
>> >> On Wed, Jan 23, 2013 at 6:08 AM, Tom Lane<tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> >>> The biggest problem is that I really don't care for the idea of
>> >>> contrib/pg_trgm being this cozy with the innards of regex_t.
>>
>> >> The only option I see now is to provide a method like "export_cnfa"
>> which
>> >> would export corresponding CNFA in fixed format.
>>
>> > Yeah, I think that makes sense. The transformation code in trgm_regexp.c
>> > would probably be more readable too, if it didn't have to deal with the
>> > regex guts representation of the CNFA. Also, once you have intermediate
>> > representation of the original CNFA, you could do some of the
>> > transformation work on that representation, before building the
>> > "tranformed graph" containing trigrams. You could eliminate any
>> > non-alphanumeric characters, joining states connected by arcs with
>> > non-alphanumeric characters, for example.
>>
>> It's not just the CNFA though; the other big API problem is with mapping
>> colors back to characters. Right now, that not only knows way too much
>> about a part of the regex internals we have ambitions to change soon,
>> but it also requires pg_wchar2mb_with_len() and lowerstr(), neither of
>> which should be known to the regex library IMO. So I'm not sure how we
>> divvy that up sanely. To be clear: I'm not going to insist that we have
>> to have a clean API factorization before we commit this at all. But it
>> worries me if we don't even know how we could get to that, because we
>> are going to need it eventually.
>>
>
> Now I have following idea about API.
> Put code of stage 2 (transform the original CNFA into an automaton-like
> graph) into regex engine. It would use API which describes what exactly are
> we going to extract from CNFA. This API could look like this.
>
> typedef char *Prefix;
> typedef char *ArcLabel;
>
> typedef struct
> {
> Prefix newPrefix;
> ArcLabel label;
> } ArcInfo;
>
> typedef struct
> {
> Prefix (*getInitialPrefix) ();
> bool (*prefixContains) (Prefix prefix1, Prefix prefix2);
> Prefix * (*getPrefixes) (Prefix prefix, color c, int *n);
> ArcInfo * (*getArcs) (Prefix prefix, color c, int *n);
> void (*freePrefix) (Prefix prefix);
> void (*freeArcLabel) (ArcLabel arcLabel);
> } CFNATransformAPI;
>
> getInitialPrefix returns initial prefix value like now this code does:
> > initkey.prefix.colors[0] = UNKNOWN_COLOR;
> > initkey.prefix.colors[1] = UNKNOWN_COLOR;
> prefixContains are exactly same as function with this name.
> getPrefixes and getArcs cycle step work of addKeys an addArcs.
> freePrefix and freeArcLabel frees used memory of Prefix and ArcLabel
> strutures.
>
> Additionally regex engine should provide correct way to examine colormap.
> int getColorCharsCount(colormap *cm, color c);
> pg_wchar *getColorChars(colormap *cm, color c);
> getColorCharsCount would return -1 if this color should be considered as
> unexpandable.
>

Now I have working implemetation of this API. Comments still need rework.
Could you give me any feedback?

------
With best regards,
Alexander Korotkov.

Attachment Content-Type Size
trgm-regexp-0.13.patch.gz application/x-gzip 23.2 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Alvaro Herrera 2013-03-21 15:15:30 Re: SIGHUP not received by custom bgworkers if postmaster is notified
Previous Message Kevin Grittner 2013-03-21 14:46:38 hstore compiler warnings