Re: WIP: index support for regexp search

From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: Tomas Vondra <tv(at)fuzzy(dot)cz>, Erik Rijkers <er(at)xs4all(dot)nl>, pgsql-hackers(at)postgresql(dot)org, pavel(dot)stehule(at)gmail(dot)com
Subject: Re: WIP: index support for regexp search
Date: 2012-11-26 19:49:23
Message-ID: CAPpHfdvwGTEXHnxJXe-kne-Sr-DmJ0EiWJVYukgQR9Ophj3z=A@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Nov 26, 2012 at 4:55 PM, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com
> wrote:
>
> Great, that top-level comment helped tremendously! I feel enlightened.
>
> I fixed some spelling, formatting etc. trivial stuff while reading through
> the patch, see attached. Below is some feedback on the details:
>
> * I don't like the PG_TRY/CATCH trick. It's not generally safe to catch an
> error, without propagating it further or rolling back the whole
> (sub)transation. It might work in this case, as you're only suppressing
> errors with the special sqlcode that are used in the same file, but it
> nevertheless feels naughty. I believe none of the limits that are being
> checked are strict; it's OK to exceed the limits somewhat, as long as you
> terminate the processing in a reasonable time, in case of pathological
> input. I'd suggest putting an explicit check for the limits somewhere, and
> not rely on ereport(). Something like this, in the code that recurses:
>
> if (trgmCNFA->arcsCount > MAX_RESULT_ARCS ||
> hash_get_num_entries(trgmCNFA-**>states) > MAX_RESULT_STATES)
> {
> trgmCNFA->overflowed = true;
> return;
> }
>

PG_TRY/CATCH trick is replaced with some number of if/return. Code becomes
a bit more complicated, but your notes does matter.

And then check for the overflowed flag at the top level.
>
> * This part of the high-level comment was not clear to me:
>
> * States of the graph produced in the first stage are marked with
>> "keys". Key is a pair
>> * of a "prefix" and a state of the original automaton. "Prefix" is a last
>> * characters. So, knowing the prefix is enough to know what is a trigram
>> when we read some new
>> * character. However, we can know single character of prefix or don't
>> know any
>> * characters of it. Each state of resulting graph have an "enter key"
>> (with that
>> * key we've entered this state) and a set of keys which are reachable
>> without
>> * reading any predictable trigram. The algorithm of processing each state
>> * of resulting graph are so:
>> * 1) Add all keys which achievable without reading of any predictable
>> trigram.
>> * 2) Add outgoing arcs labeled with trigrams.
>> * Step 2 leads to creation of new states and recursively processing
>> them. So,
>> * we use depth-first algorithm.
>>
>
> I didn't understand that. Can you elaborate? It might help to work through
> an example, with some ascii art depicting the graph.
>

This comment is extended with example.

> * It would be nice to add some comments to TrgmCNFA struct, explaining
> which fields are valid at which stages. For example, it seems that 'trgms'
> array is calculated only after building the CNFA, by getTrgmVector()
> function, while arcsCount is updated on the fly, while recursing in the
> getState() function.
>

TrgmCNFA comment is extended with this.

> * What is the representation used for the path matrix? Needs a comment.
>

Comments are added to PackedTrgmPaths and TrgmStatePath.

> * What do the getColorinfo()

getColorInfo comment now references to ColorInfo struct which have comment.

and scanColorMap() functions do?

scanColorMap comment now have description of colormap structure.

> What exactly does a color represent?

This is added to the top comment.

> What's the tradeoff in choosing MAX_COLOR_CHARS?

Comment is added to the macro.

------
With best regards,
Alexander Korotkov.

Attachment Content-Type Size
trgm-regexp-0.6.patch.gz application/x-gzip 14.5 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andrew Dunstan 2012-11-26 19:54:19 Re: WIP json generation enhancements
Previous Message Hannu Krosing 2012-11-26 19:46:36 Re: WIP json generation enhancements