Re: WIP: index support for regexp search

From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc: Tomas Vondra <tv(at)fuzzy(dot)cz>, Erik Rijkers <er(at)xs4all(dot)nl>, pgsql-hackers(at)postgresql(dot)org, pavel(dot)stehule(at)gmail(dot)com
Subject: Re: WIP: index support for regexp search
Date: 2012-11-26 12:55:19
Message-ID: 50B366B7.5060309@vmware.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 25.11.2012 22:55, Alexander Korotkov wrote:
> On Tue, Nov 20, 2012 at 1:43 PM, Heikki Linnakangas<hlinnakangas(at)vmware(dot)com
>> wrote:
>
>> Glad to see this patch hasn't been totally forgotten. Being able to use
>> indexes for regular expressions would be really cool!
>>
>> Back in January, I asked for some high-level description of how the
>> algorithm works (http://archives.postgresql.**
>> org/message-id/4F187D5C.30701@**enterprisedb.com<http://archives.postgresql.org/message-id/4F187D5C.30701@enterprisedb.com>).
>> That's still sorely needed. Googling around, I found the slides for your
>> presentation on this from PGConf.EU - it would be great to have the
>> information from that presentation included in the patch.
>
>
> New version of patch is attached. The changes are following:
> 1) A big comment with high-level description of what is going on.
> 2) Regression tests.
> 3) Documetation update.
> 4) Some more refactoring.

Great, that top-level comment helped tremendously! I feel enlightened.

I fixed some spelling, formatting etc. trivial stuff while reading
through the patch, see attached. Below is some feedback on the details:

* I don't like the PG_TRY/CATCH trick. It's not generally safe to catch
an error, without propagating it further or rolling back the whole
(sub)transation. It might work in this case, as you're only suppressing
errors with the special sqlcode that are used in the same file, but it
nevertheless feels naughty. I believe none of the limits that are being
checked are strict; it's OK to exceed the limits somewhat, as long as
you terminate the processing in a reasonable time, in case of
pathological input. I'd suggest putting an explicit check for the limits
somewhere, and not rely on ereport(). Something like this, in the code
that recurses:

if (trgmCNFA->arcsCount > MAX_RESULT_ARCS ||
hash_get_num_entries(trgmCNFA->states) > MAX_RESULT_STATES)
{
trgmCNFA->overflowed = true;
return;
}

And then check for the overflowed flag at the top level.

* This part of the high-level comment was not clear to me:

> * States of the graph produced in the first stage are marked with "keys". Key is a pair
> * of a "prefix" and a state of the original automaton. "Prefix" is a last
> * characters. So, knowing the prefix is enough to know what is a trigram when we read some new
> * character. However, we can know single character of prefix or don't know any
> * characters of it. Each state of resulting graph have an "enter key" (with that
> * key we've entered this state) and a set of keys which are reachable without
> * reading any predictable trigram. The algorithm of processing each state
> * of resulting graph are so:
> * 1) Add all keys which achievable without reading of any predictable trigram.
> * 2) Add outgoing arcs labeled with trigrams.
> * Step 2 leads to creation of new states and recursively processing them. So,
> * we use depth-first algorithm.

I didn't understand that. Can you elaborate? It might help to work
through an example, with some ascii art depicting the graph.

* It would be nice to add some comments to TrgmCNFA struct, explaining
which fields are valid at which stages. For example, it seems that
'trgms' array is calculated only after building the CNFA, by
getTrgmVector() function, while arcsCount is updated on the fly, while
recursing in the getState() function.

* What is the representation used for the path matrix? Needs a comment.

* What do the getColorinfo() and scanColorMap() functions do? What
exactly does a color represent? What's the tradeoff in choosing
MAX_COLOR_CHARS?

- Heikki

Attachment Content-Type Size
trgm-regexp-0.5-heikki1.patch text/x-diff 44.8 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2012-11-26 13:07:17 Re: Materialized views WIP patch
Previous Message Amit Kapila 2012-11-26 12:53:49 Re: Plugging fd leaks (was Re: Switching timeline over streaming replication)