Re: WIP: index support for regexp search

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: Alexander Korotkov <aekorotkov(at)gmail(dot)com>, Erik Rijkers <er(at)xs4all(dot)nl>, Tomas Vondra <tv(at)fuzzy(dot)cz>, pgsql-hackers(at)postgresql(dot)org, pavel(dot)stehule(at)gmail(dot)com
Subject: Re: WIP: index support for regexp search
Date: 2013-01-23 15:29:56
Message-ID: 19081.1358954996@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Heikki Linnakangas <hlinnakangas(at)vmware(dot)com> writes:
> On 23.01.2013 09:36, Alexander Korotkov wrote:
>> On Wed, Jan 23, 2013 at 6:08 AM, Tom Lane<tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>>> The biggest problem is that I really don't care for the idea of
>>> contrib/pg_trgm being this cozy with the innards of regex_t.

>> The only option I see now is to provide a method like "export_cnfa" which
>> would export corresponding CNFA in fixed format.

> Yeah, I think that makes sense. The transformation code in trgm_regexp.c
> would probably be more readable too, if it didn't have to deal with the
> regex guts representation of the CNFA. Also, once you have intermediate
> representation of the original CNFA, you could do some of the
> transformation work on that representation, before building the
> "tranformed graph" containing trigrams. You could eliminate any
> non-alphanumeric characters, joining states connected by arcs with
> non-alphanumeric characters, for example.

It's not just the CNFA though; the other big API problem is with mapping
colors back to characters. Right now, that not only knows way too much
about a part of the regex internals we have ambitions to change soon,
but it also requires pg_wchar2mb_with_len() and lowerstr(), neither of
which should be known to the regex library IMO. So I'm not sure how we
divvy that up sanely. To be clear: I'm not going to insist that we have
to have a clean API factorization before we commit this at all. But it
worries me if we don't even know how we could get to that, because we
are going to need it eventually.

Anyway, I had another thought in the shower this morning, which is that
solving this problem in terms of color trigrams is really the Wrong
Thing. ISTM it'd be better to think of the CNFA traversal logic as
looking for required sequences of colors of unspecified length, which
we'd then chop into trigrams after the fact. This might produce
slightly better (more complete) trigram sets, but the real reason I'm
suggesting it is that I think the CNFA traversal code might be subject
to Polya's Inventor's Paradox: "the more general problem may be easier
to solve". It seems like casting the goal of that code as being to
find variable-length sequences, rather than exactly trigrams, might
lead to simpler data structures and more readable algorithms. The
still-to-be-designed regex API for this also seems like it would be
better off if decoupled from the notion of trigrams.

It's quite possible that this idea is all wet and no meaningful
improvement can be gotten this way. But I offer it for your
consideration.

regards, tom lane

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2013-01-23 15:30:42 Re: unlogged tables vs. GIST
Previous Message Andres Freund 2013-01-23 15:26:16 Re: logical changeset generation v4