Re: WIP: index support for regexp search

From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: Erik Rijkers <er(at)xs4all(dot)nl>, Tomas Vondra <tv(at)fuzzy(dot)cz>, pgsql-hackers(at)postgresql(dot)org, pavel(dot)stehule(at)gmail(dot)com
Subject: Re: WIP: index support for regexp search
Date: 2013-03-14 17:40:06
Message-ID: CAPpHfdsuW_87rN9BxZ70aP_TeK2esovFxS0b4ep--aoLP06rmw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Jan 23, 2013 at 7:29 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> Heikki Linnakangas <hlinnakangas(at)vmware(dot)com> writes:
> > On 23.01.2013 09:36, Alexander Korotkov wrote:
> >> On Wed, Jan 23, 2013 at 6:08 AM, Tom Lane<tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> >>> The biggest problem is that I really don't care for the idea of
> >>> contrib/pg_trgm being this cozy with the innards of regex_t.
>
> >> The only option I see now is to provide a method like "export_cnfa"
> which
> >> would export corresponding CNFA in fixed format.
>
> > Yeah, I think that makes sense. The transformation code in trgm_regexp.c
> > would probably be more readable too, if it didn't have to deal with the
> > regex guts representation of the CNFA. Also, once you have intermediate
> > representation of the original CNFA, you could do some of the
> > transformation work on that representation, before building the
> > "tranformed graph" containing trigrams. You could eliminate any
> > non-alphanumeric characters, joining states connected by arcs with
> > non-alphanumeric characters, for example.
>
> It's not just the CNFA though; the other big API problem is with mapping
> colors back to characters. Right now, that not only knows way too much
> about a part of the regex internals we have ambitions to change soon,
> but it also requires pg_wchar2mb_with_len() and lowerstr(), neither of
> which should be known to the regex library IMO. So I'm not sure how we
> divvy that up sanely. To be clear: I'm not going to insist that we have
> to have a clean API factorization before we commit this at all. But it
> worries me if we don't even know how we could get to that, because we
> are going to need it eventually.
>

Now I have following idea about API.
Put code of stage 2 (transform the original CNFA into an automaton-like
graph) into regex engine. It would use API which describes what exactly are
we going to extract from CNFA. This API could look like this.

typedef char *Prefix;
typedef char *ArcLabel;

typedef struct
{
Prefix newPrefix;
ArcLabel label;
} ArcInfo;

typedef struct
{
Prefix (*getInitialPrefix) ();
bool (*prefixContains) (Prefix prefix1, Prefix prefix2);
Prefix * (*getPrefixes) (Prefix prefix, color c, int *n);
ArcInfo * (*getArcs) (Prefix prefix, color c, int *n);
void (*freePrefix) (Prefix prefix);
void (*freeArcLabel) (ArcLabel arcLabel);
} CFNATransformAPI;

getInitialPrefix returns initial prefix value like now this code does:
> initkey.prefix.colors[0] = UNKNOWN_COLOR;
> initkey.prefix.colors[1] = UNKNOWN_COLOR;
prefixContains are exactly same as function with this name.
getPrefixes and getArcs cycle step work of addKeys an addArcs.
freePrefix and freeArcLabel frees used memory of Prefix and ArcLabel
strutures.

Additionally regex engine should provide correct way to examine colormap.
int getColorCharsCount(colormap *cm, color c);
pg_wchar *getColorChars(colormap *cm, color c);
getColorCharsCount would return -1 if this color should be considered as
unexpandable.

Any thoughts?

------
With best regards,
Alexander Korotkov.

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Kevin Grittner 2013-03-14 18:40:47 Re: matview join view error
Previous Message Tom Lane 2013-03-14 17:30:46 Re: scanner/parser minimization