Re: WIP: index support for regexp search

From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Alexander Korotkov <aekorotkov(at)gmail(dot)com>, Tomas Vondra <tv(at)fuzzy(dot)cz>, Erik Rijkers <er(at)xs4all(dot)nl>, pgsql-hackers(at)postgresql(dot)org, pavel(dot)stehule(at)gmail(dot)com
Subject: Re: WIP: index support for regexp search
Date: 2012-12-03 10:05:16
Message-ID: 50BC795C.5000509@vmware.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 02.12.2012 20:19, Tom Lane wrote:
> Alexander Korotkov<aekorotkov(at)gmail(dot)com> writes:
>> Nice idea to delay expanding colors to characters! Obviously, we should
>> delay expanding inly alphanumerical characters. Because non-alphanumberical
>> characters influence graph structure. Trying to implement...
>
> Uh, why would that be? Colors are colors. The regexp machinery doesn't
> care whether they represent alphanumerics or not. (Or to be more
> precise, if there is a situation where it makes a difference, separate
> colors will have been created for each set of characters that need to be
> distinguished.)

The regexp machinery doesn't care, but the trigrams that pg_trgm
extracts only contain alphanumeric characters. So if by looking at the
CNFA graph produced by the regexp machinery you conclude that any
matching strings must contain three-letter sequences "%oo" and "#oo",
you can just club them together into " oo" trigram.

I think you can run a pre-processing step to the colors, and merge
colors that are equivalent as far as trigrams are considered. For
example, if you have a color that contains only character '%', and
another that contains character '#', you can treat them as the same
hcolor. You might then be able to simplify the CNFA. Actually, it would
be even better if you could apply the pre-processing to the regexp
before the regexp machinery turns it into a CNFA. Not sure how easy it
would be to do such pre-processing.

BTW, why create the path matrix? You could check the "check" array of
trigrams in the consistent function directly against the graph.
Consistent should return true, iff there is a path through the graph
following only arcs that contain trigrams present in the check array.
Finding a path through a complex graph could be expensive, O(|E|), but
if the path is complex, the path matrix would be large as well, and
checking against a large matrix isn't exactly free either. It would
allow you to avoid "overflows" caused by having too many paths through
the graph.

- Heikki

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Dimitri Fontaine 2012-12-03 10:27:40 Re: ALTER command reworks
Previous Message Simon Riggs 2012-12-03 09:56:51 Re: Enabling Checksums