Quick Links

Re: WIP: index support for regexp search

Lists:	pgsql-hackers

From:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To:	pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	WIP: index support for regexp search
Date:	2011-11-22 19:38:37
Message-ID:	CAPpHfduD6EGNise5codBz0KcdDahp7--MhFz_JDD_FRPC7-i=A@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Hackers,

WIP patch with index support for regexp search for pg_trgm contrib is
attached.
In spite of techniques which extracts continuous text parts from regexp,
this patch presents technique of automatum transformation. That allows more
comprehensive trigrams extraction.

A little example of possible perfomance benefit.

test=# explain analyze select * from words where s ~ 'a[bc]+[de]';
QUERY PLAN

----------------------------------------------------------------------------------------------------
---
Seq Scan on words (cost=0.00..1703.11 rows=10 width=9) (actual
time=3.092..242.303 rows=662 loops=1)
Filter: (s ~ 'a[bc]+[de]'::text)
Rows Removed by Filter: 97907
Total runtime: 243.213 ms
(4 rows)

test=# explain analyze select * from words where s ~ 'a[bc]+[de]';
QUERY PLAN

----------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on words (cost=260.08..295.83 rows=10 width=9) (actual
time=4.166..7.506 rows=662 loops=1)
Recheck Cond: (s ~ 'a[bc]+[de]'::text)
Rows Removed by Index Recheck: 18
-> Bitmap Index Scan on words_trgm_idx (cost=0.00..260.07 rows=10
width=0) (actual time=4.076..4.076 rows=680 loops=1)
Index Cond: (s ~ 'a[bc]+[de]'::text)
Total runtime: 8.424 ms
(6 rows)

Current version of patch have some limitations:
1) Algorithm of logical expression extraction on trigrams have high
computational complexity. So, it can become really slow on regexp with many
branches. Probably, improvements of this algorithm is possible.
2) Surely, no perfomance benefit if no trigrams can be extracted from
regexp. It's inevitably.
3) Currently, only GIN index is supported. There are no serious problems,
GiST code for it just not written yet.
4) It appear to be some kind of problem to extract multibyte encoded
character from pg_wchar. I've posted question about it here:
http://archives.postgresql.org/pgsql-hackers/2011-11/msg01222.php
While I've hardcoded some dirty solution. So
PG_EUC_JP, PG_EUC_CN, PG_EUC_KR, PG_EUC_TW, PG_EUC_JIS_2004 are not
supported yet.

------
With best regards,
Alexander Korotkov.

Attachment	Content-Type	Size
trgm-regexp-0.1.patch.gz	application/x-gzip	6.0 KB

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc:	pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: WIP: index support for regexp search
Date:	2011-11-30 20:29:02
Message-ID:	CA+TgmobdJ5Vg4dGRJGjcJ3g6GdTvBLMnPjPjrZEjRU4c0mt5Kg@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Tue, Nov 22, 2011 at 2:38 PM, Alexander Korotkov
<aekorotkov(at)gmail(dot)com> wrote:
> WIP patch with index support for regexp search for pg_trgm contrib is
> attached.
> In spite of techniques which extracts continuous text parts from regexp,
> this patch presents technique of automatum transformation. That allows more
> comprehensive trigrams extraction.

Please add this patch here so it does not get lost in the shuffle:

https://commitfest.postgresql.org/action/commitfest_view/open

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: WIP: index support for regexp search
Date:	2011-12-01 13:34:58
Message-ID:	CAPpHfdsusG3rFVCxwvh6b8P6_=mNw5DU0qgm0wNwbt=BE89t9A@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Thu, Dec 1, 2011 at 12:29 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

> Please add this patch here so it does not get lost in the shuffle:
>
> https://commitfest.postgresql.org/action/commitfest_view/open

Done.

------
With best regards,
Alexander Korotkov.

From:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc:	pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: WIP: index support for regexp search
Date:	2012-01-19 20:30:20
Message-ID:	4F187D5C.30701@enterprisedb.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 22.11.2011 21:38, Alexander Korotkov wrote:
> WIP patch with index support for regexp search for pg_trgm contrib is
> attached.
> In spite of techniques which extracts continuous text parts from regexp,
> this patch presents technique of automatum transformation. That allows more
> comprehensive trigrams extraction.

Nice!

> Current version of patch have some limitations:
> 1) Algorithm of logical expression extraction on trigrams have high
> computational complexity. So, it can become really slow on regexp with many
> branches. Probably, improvements of this algorithm is possible.
> 2) Surely, no perfomance benefit if no trigrams can be extracted from
> regexp. It's inevitably.
> 3) Currently, only GIN index is supported. There are no serious problems,
> GiST code for it just not written yet.
> 4) It appear to be some kind of problem to extract multibyte encoded
> character from pg_wchar. I've posted question about it here:
> http://archives.postgresql.org/pgsql-hackers/2011-11/msg01222.php
> While I've hardcoded some dirty solution. So
> PG_EUC_JP, PG_EUC_CN, PG_EUC_KR, PG_EUC_TW, PG_EUC_JIS_2004 are not
> supported yet.

This is pretty far from being in committable state, so I'm going to mark
this as "returned with feedback" in the commitfest app. The feedback:

The code badly needs comments. There is no explanation of how the
trigram extraction code in trgm_regexp.c works. Guessing from the
variable names, it seems to be some sort of a coloring algorithm that
works on a graph, but that all needs to be explained. Can this algorithm
be found somewhere in literature, perhaps? A link to a paper would be nice.

Apart from that, the multibyte issue seems like the big one. Any way
around that?

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

From:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc:	pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: WIP: index support for regexp search
Date:	2012-01-19 20:54:24
Message-ID:	CAPpHfdvVu9J-2sf4T0NMh_fGdiMqXvdMQBY9P=doFw=Q6c0CUA@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Fri, Jan 20, 2012 at 12:30 AM, Heikki Linnakangas <
heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:

> The code badly needs comments. There is no explanation of how the trigram
>> extraction code in trgm_regexp.c works.
>
> Sure. I hoped to find a time for comments before commitfest starts.
Unfortunately I didn't, sorry.

> Guessing from the variable names, it seems to be some sort of a coloring
> algorithm that works on a graph, but that all needs to be explained. Can
> this algorithm be found somewhere in literature, perhaps? A link to a paper
> would be nice.
>
I hope it's truly novel. At least application to regular expressions. I'm
going to write a paper about it.

> Apart from that, the multibyte issue seems like the big one. Any way
> around that?

Conversion of pg_wchar to multibyte character is the only way I found to
avoid serious hacking of existing regexp code. Do you think additional
function in pg_wchar_tbl which converts pg_wchar back to multibyte
character is possible solution?

------
With best regards,
Alexander Korotkov.

From:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc:	pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: WIP: index support for regexp search
Date:	2012-01-19 21:07:06
Message-ID:	CAPpHfdukmmFoQ3HBSsAnQ=ovuwxDkUTC4Zcfd+9yH0cwOaT0ng@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

I also have a question about pg_wchar.

/*
*-------------------------------------------------------------------
* encoding info table
* XXX must be sorted by the same order as enum pg_enc (in mb/pg_wchar.h)
*-------------------------------------------------------------------
*/
pg_wchar_tbl pg_wchar_table[] = {
{pg_ascii2wchar_with_len, pg_ascii_mblen, pg_ascii_dsplen,
pg_ascii_verifier, 1}, /* PG_SQL_ASCII */
{pg_eucjp2wchar_with_len, pg_eucjp_mblen, pg_eucjp_dsplen,
pg_eucjp_verifier, 3}, /* PG_EUC_JP */
{pg_euccn2wchar_with_len, pg_euccn_mblen, pg_euccn_dsplen,
pg_euccn_verifier, 2}, /* PG_EUC_CN */
{pg_euckr2wchar_with_len, pg_euckr_mblen, pg_euckr_dsplen,
pg_euckr_verifier, 3}, /* PG_EUC_KR */
{pg_euctw2wchar_with_len, pg_euctw_mblen, pg_euctw_dsplen,
pg_euctw_verifier, 4}, /* PG_EUC_TW */
{pg_eucjp2wchar_with_len, pg_eucjp_mblen, pg_eucjp_dsplen,
pg_eucjp_verifier, 3}, /* PG_EUC_JIS_2004 */
{pg_utf2wchar_with_len, pg_utf_mblen, pg_utf_dsplen, pg_utf8_verifier, 4}, /*
PG_UTF8 */
{pg_mule2wchar_with_len, pg_mule_mblen, pg_mule_dsplen, pg_mule_verifier,
4}, /* PG_MULE_INTERNAL */
{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen,
pg_latin1_verifier, 1}, /* PG_LATIN1 */
{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen,
pg_latin1_verifier, 1}, /* PG_LATIN2 */
{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen,
pg_latin1_verifier, 1}, /* PG_LATIN3 */
{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen,
pg_latin1_verifier, 1}, /* PG_LATIN4 */
{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen,
pg_latin1_verifier, 1}, /* PG_LATIN5 */
{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen,
pg_latin1_verifier, 1}, /* PG_LATIN6 */
{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen,
pg_latin1_verifier, 1}, /* PG_LATIN7 */
{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen,
pg_latin1_verifier, 1}, /* PG_LATIN8 */
{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen,
pg_latin1_verifier, 1}, /* PG_LATIN9 */
{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen,
pg_latin1_verifier, 1}, /* PG_LATIN10 */
{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen,
pg_latin1_verifier, 1}, /* PG_WIN1256 */
{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen,
pg_latin1_verifier, 1}, /* PG_WIN1258 */
{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen,
pg_latin1_verifier, 1}, /* PG_WIN866 */
{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen,
pg_latin1_verifier, 1}, /* PG_WIN874 */
{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen,
pg_latin1_verifier, 1}, /* PG_KOI8R */
{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen,
pg_latin1_verifier, 1}, /* PG_WIN1251 */
{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen,
pg_latin1_verifier, 1}, /* PG_WIN1252 */
{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen,
pg_latin1_verifier, 1}, /* ISO-8859-5 */
{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen,
pg_latin1_verifier, 1}, /* ISO-8859-6 */
{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen,
pg_latin1_verifier, 1}, /* ISO-8859-7 */
{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen,
pg_latin1_verifier, 1}, /* ISO-8859-8 */
{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen,
pg_latin1_verifier, 1}, /* PG_WIN1250 */
{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen,
pg_latin1_verifier, 1}, /* PG_WIN1253 */
{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen,
pg_latin1_verifier, 1}, /* PG_WIN1254 */
{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen,
pg_latin1_verifier, 1}, /* PG_WIN1255 */
{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen,
pg_latin1_verifier, 1}, /* PG_WIN1257 */
{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen,
pg_latin1_verifier, 1}, /* PG_KOI8U */
{0, pg_sjis_mblen, pg_sjis_dsplen, pg_sjis_verifier, 2}, /* PG_SJIS */
{0, pg_big5_mblen, pg_big5_dsplen, pg_big5_verifier, 2}, /* PG_BIG5 */
{0, pg_gbk_mblen, pg_gbk_dsplen, pg_gbk_verifier, 2}, /* PG_GBK */
{0, pg_uhc_mblen, pg_uhc_dsplen, pg_uhc_verifier, 2}, /* PG_UHC */
{0, pg_gb18030_mblen, pg_gb18030_dsplen, pg_gb18030_verifier, 4}, /*
PG_GB18030 */
{0, pg_johab_mblen, pg_johab_dsplen, pg_johab_verifier, 3}, /* PG_JOHAB */
{0, pg_sjis_mblen, pg_sjis_dsplen, pg_sjis_verifier, 2} /*
PG_SHIFT_JIS_2004 */
};

What does last 7 zeros in the first column means? No conversion to pg_wchar
is possible from these encodings?

------
With best regards,
Alexander Korotkov.

From:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc:	pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: WIP: index support for regexp search
Date:	2012-01-19 21:11:50
Message-ID:	CAPpHfdt60SVcy75yN_LmD-LM8Jgb0wXMvvnxLjE=n85fWt_3uA@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Fri, Jan 20, 2012 at 1:07 AM, Alexander Korotkov <aekorotkov(at)gmail(dot)com>wrote:

> What does last 7 zeros in the first column means? No conversion to
> pg_wchar is possible from these encodings?
>
Uh, I see. These encodings is not supported as server encodings.

------
With best regards,
Alexander Korotkov.

From:	"Erik Rijkers" <er(at)xs4all(dot)nl>
To:	"Alexander Korotkov" <aekorotkov(at)gmail(dot)com>
Cc:	"Heikki Linnakangas" <heikki(dot)linnakangas(at)enterprisedb(dot)com>, "pgsql-hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: WIP: index support for regexp search
Date:	2012-01-19 23:33:55
Message-ID:	1ce6dc4baa5e8293bd44ee1458b98d2c.squirrel@webmail.xs4all.nl
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Thu, January 19, 2012 21:30, Heikki Linnakangas wrote:
> On 22.11.2011 21:38, Alexander Korotkov wrote:
>> WIP patch with index support for regexp search for pg_trgm contrib is
>> attached.
>> In spite of techniques which extracts continuous text parts from regexp,
>> this patch presents technique of automatum transformation. That allows more
>> comprehensive trigrams extraction.
>
> Nice!
>

Yes, wonderful stuff; I tested quite a bit with this patch; FWIW, here's what I found.

The patch yields spectacular speedups with small, simple-enough regexen. But it does not do a
good enough job when guessing where to use the index and where fall back to Seq Scan. This can
lead to (also spectacular) slow-downs, compared to Seq Scan.

I used the following to generate 3 test tables with lines of 80 random chars (just in case it's
handy for others to use):

$ cat create_data.sh
#!/bin/sh

for power in 4 5 6
do
table=azjunk${power}
index=${table}_trgmrgx_txt_01_idx
echo "-- generating table $table with index $index";
time perl -E'
sub ss{ join"",@_[ map{rand @_} 1 .. shift ] };
say(ss(80,"a".."g"," ","h".."m"," ","n".."s"," ","t".."z"))
for 1 .. 1e'"${power};" \
| psql -aqXc "drop table if exists $table;
create table $table(txt text); copy $table from stdin;
-- set session maintenance_work_mem='20GB';
create index $index on $table using gin(txt gin_trgm_ops);
analyze $table;";
done

I guessed that MAX_COLOR_CHARS limits the character class size (to 4, in your patch), is that
true? I can understand you want that value to be low to limit the above risk, but now it reduces
the usability of the feature a bit: one has to split up larger char-classes into several smaller
ones to make a statement use the index: i.e.:
txt ~ 'f[aeio]n' OR txt ~ 'f[uy]n'
instead of
txt ~ 'f[aeiouy]n'

I made compiled instances with larger values for MAX_COLOR_CHARS (6 and 9), and sure enough they
used the index for larger classes such as the above, but of course also got into problems easier
when quantifiers are added (*, +, {n,m}).

A better calculation to decide index-use seems necessary, and ideally one that allows for a larger
MAX_COLOR_CHARS than 4. Especially quantifiers could perhaps be inspected wrt that decision.
IMHO, the functionality would still be very useful when only very simple regexen were considered.

Btw, it seems impossible to Ctrl-C out of a search once it is submitted; I suppose this is
normally necessary for perfomance reasons, but it would be useful te be able to compile a test
version that allows it. I don't know how hard that would be.

There is also a minor bug, I think, when running with 'set enable_seqscan=off' in combination
with a too-large regex:

$ cat fail.sql:

set enable_bitmapscan=on;
set enable_seqscan=on;
explain analyze select txt from azjunk4 where txt ~ 'f[aeio]n'; -- OK
explain analyze select txt from azjunk4 where txt ~ 'f[aeiou]n'; -- OK

set enable_bitmapscan=on;
set enable_seqscan=off;
explain analyze select txt from azjunk4 where txt ~ 'f[aeio]n'; -- OK
explain analyze select txt from azjunk4 where txt ~ 'f[aeiou]n'; -- crashes

$ psql -f fail.sql
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on azjunk4 (cost=52.01..56.02 rows=1 width=81) (actual time=1.011..5.291
rows=131 loops=1)
Recheck Cond: (txt ~ 'f[aeio]n'::text)
-> Bitmap Index Scan on azjunk4_trgmrgx_txt_01_idx (cost=0.00..52.01 rows=1 width=0) (actual
time=0.880..0.880 rows=131 loops=1)
Index Cond: (txt ~ 'f[aeio]n'::text)
Total runtime: 5.700 ms
(5 rows)

QUERY PLAN
-------------------------------------------------------------------------------------------------------
Seq Scan on azjunk4 (cost=0.00..268.00 rows=1 width=81) (actual time=1.491..36.049 rows=164
loops=1)
Filter: (txt ~ 'f[aeiou]n'::text)
Rows Removed by Filter: 9836
Total runtime: 36.112 ms
(4 rows)

QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on azjunk4 (cost=52.01..56.02 rows=1 width=81) (actual time=0.346..0.927
rows=131 loops=1)
Recheck Cond: (txt ~ 'f[aeio]n'::text)
-> Bitmap Index Scan on azjunk4_trgmrgx_txt_01_idx (cost=0.00..52.01 rows=1 width=0) (actual
time=0.316..0.316 rows=131 loops=1)
Index Cond: (txt ~ 'f[aeio]n'::text)
Total runtime: 0.996 ms
(5 rows)

psql:fail.sql:24: connection to server was lost

Sorry: I found the code, esp. the regex engine hard to understand sofar, so all the above are
somewhat inconclusive remarks; I hope nonetheless they are useful.

Thanks,

Erik Rijkers

From:	Marti Raudsepp <marti(at)juffo(dot)org>
To:	Erik Rijkers <er(at)xs4all(dot)nl>
Cc:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: WIP: index support for regexp search
Date:	2012-01-20 16:45:53
Message-ID:	CABRT9RBmFkcPgjKarnkTJjnDSyL45pJG9pTmp0Fcod_-6rE89w@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Fri, Jan 20, 2012 at 01:33, Erik Rijkers <er(at)xs4all(dot)nl> wrote:
> Btw, it seems impossible to Ctrl-C out of a search once it is submitted; I suppose this is
> normally necessary for perfomance reasons, but it would be useful te be able to compile a test
> version that allows it.

I believe being interruptible is a requirement for the patch to be accepted.

CHECK_FOR_INTERRUPTS(); should be added to the indeterminate loops.

Regards,
Marti

From:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To:	Erik Rijkers <er(at)xs4all(dot)nl>
Cc:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: WIP: index support for regexp search
Date:	2012-01-21 05:26:36
Message-ID:	CAPpHfdv5xmBoTCkxuFBnD8LFtD8-DALQaqrGT7tRLLHKYg-OSQ@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Hi!

Thank you for your feedback!

On Fri, Jan 20, 2012 at 3:33 AM, Erik Rijkers <er(at)xs4all(dot)nl> wrote:

> The patch yields spectacular speedups with small, simple-enough regexen.
> But it does not do a
> good enough job when guessing where to use the index and where fall back
> to Seq Scan. This can
> lead to (also spectacular) slow-downs, compared to Seq Scan.
>
Could you give some examples of regexes where index scan becomes slower
than seq scan?

> I guessed that MAX_COLOR_CHARS limits the character class size (to 4, in
> your patch), is that
> true? I can understand you want that value to be low to limit the above
> risk, but now it reduces
> the usability of the feature a bit: one has to split up larger
> char-classes into several smaller
> ones to make a statement use the index: i.e.:
>
Yes, MAX_COLOR_CHARS is number of maximum character in automata color when
that color is divided to a separated characters. And it's likely there
could be better solution than just have this hard limit.

> Btw, it seems impossible to Ctrl-C out of a search once it is submitted; I
> suppose this is
> normally necessary for perfomance reasons, but it would be useful te be
> able to compile a test
> version that allows it. I don't know how hard that would be.
>
I seems that Ctrl-C was impossible because procedure of trigrams
exctraction becomes so long while it is not breakable. It's not difficult
to make this procedure breakable, but actually it just shouldn't take so
long.

> There is also a minor bug, I think, when running with 'set
> enable_seqscan=off' in combination
> with a too-large regex:
>
Thanks for pointing. Will be fixed.

------
With best regards,
Alexander Korotkov.

From:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To:	Marti Raudsepp <marti(at)juffo(dot)org>
Cc:	Erik Rijkers <er(at)xs4all(dot)nl>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: WIP: index support for regexp search
Date:	2012-01-21 05:28:20
Message-ID:	CAPpHfdsV2znvS-Lhv7_zck17KeGYoqMwr1DrC_6Pn2m__n4NLw@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Fri, Jan 20, 2012 at 8:45 PM, Marti Raudsepp <marti(at)juffo(dot)org> wrote:

> On Fri, Jan 20, 2012 at 01:33, Erik Rijkers <er(at)xs4all(dot)nl> wrote:
> > Btw, it seems impossible to Ctrl-C out of a search once it is submitted;
> I suppose this is
> > normally necessary for perfomance reasons, but it would be useful te be
> able to compile a test
> > version that allows it.
>
> I believe being interruptible is a requirement for the patch to be
> accepted.
>
> CHECK_FOR_INTERRUPTS(); should be added to the indeterminate loops.
>
Sure. It's easy to fix. But it seems that in this case gin extract_query
method becomes slow (because index scan itself is breakable). So, it just
shouldn't work so long.

------
With best regards,
Alexander Korotkov.

From:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc:	pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: WIP: index support for regexp search
Date:	2012-01-21 05:29:56
Message-ID:	CAPpHfduygsmDjmiX=9crav4Y=Nc4MVL1Yj-goWihJ38hgs+LxQ@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Fri, Jan 20, 2012 at 12:54 AM, Alexander Korotkov
<aekorotkov(at)gmail(dot)com>wrote:

> On Fri, Jan 20, 2012 at 12:30 AM, Heikki Linnakangas <
> heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
>
>> Apart from that, the multibyte issue seems like the big one. Any way
>>> around that?
>>
>> Conversion of pg_wchar to multibyte character is the only way I found to
> avoid serious hacking of existing regexp code. Do you think additional
> function in pg_wchar_tbl which converts pg_wchar back to multibyte
> character is possible solution?
>
Do you have any notes on it? I could make the patch which adds such
function into core.

------
With best regards,
Alexander Korotkov.

From:	"Erik Rijkers" <er(at)xs4all(dot)nl>
To:	"Alexander Korotkov" <aekorotkov(at)gmail(dot)com>
Cc:	"Heikki Linnakangas" <heikki(dot)linnakangas(at)enterprisedb(dot)com>, "pgsql-hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: WIP: index support for regexp search
Date:	2012-01-21 07:03:04
Message-ID:	d9141ddf8c4e763f57f41661c9af699d.squirrel@webmail.xs4all.nl
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Sat, January 21, 2012 06:26, Alexander Korotkov wrote:
> Hi!
>
> Thank you for your feedback!
>
> On Fri, Jan 20, 2012 at 3:33 AM, Erik Rijkers <er(at)xs4all(dot)nl> wrote:
>
>> The patch yields spectacular speedups with small, simple-enough regexen.
>> But it does not do a
>> good enough job when guessing where to use the index and where fall back
>> to Seq Scan. This can
>> lead to (also spectacular) slow-downs, compared to Seq Scan.
>>
> Could you give some examples of regexes where index scan becomes slower
> than seq scan?
>

x[aeio]+q takes many minutes, uninterruptible. It's now running for almost 30 minutes, I'm sure
it will come back eventually, but I think I'll kill it later on; I suppose you get the point ;-)

Even with {n,m} quantifiers it's easy to hit slowdowns:

(table azjunk6 is 112 MB, the index 693 MB.)
(MAX_COLOR_CHARS=4 <= your original patch)

instance table regex plan time

HEAD azjunk6 x[aeio]{1,3}q Seq Scan 3566.088 ms
HEAD azjunk6 x[aeio]{1,3}q Seq Scan 3540.606 ms
HEAD azjunk6 x[aeio]{1,3}q Seq Scan 3495.034 ms
HEAD azjunk6 x[aeio]{1,3}q Seq Scan 3510.403 ms
trgm_regex azjunk6 x[aeio]{1,3}q Bitmap Heap Scan 3724.131 ms
trgm_regex azjunk6 x[aeio]{1,3}q Bitmap Heap Scan 3844.999 ms
trgm_regex azjunk6 x[aeio]{1,3}q Bitmap Heap Scan 3835.190 ms
trgm_regex azjunk6 x[aeio]{1,3}q Bitmap Heap Scan 3724.016 ms

HEAD azjunk6 x[aeio]{1,4}q Seq Scan 3617.997 ms
HEAD azjunk6 x[aeio]{1,4}q Seq Scan 3644.215 ms
HEAD azjunk6 x[aeio]{1,4}q Seq Scan 3636.976 ms
HEAD azjunk6 x[aeio]{1,4}q Seq Scan 3625.493 ms
trgm_regex azjunk6 x[aeio]{1,4}q Bitmap Heap Scan 7885.247 ms
trgm_regex azjunk6 x[aeio]{1,4}q Bitmap Heap Scan 8799.082 ms
trgm_regex azjunk6 x[aeio]{1,4}q Bitmap Heap Scan 7754.152 ms
trgm_regex azjunk6 x[aeio]{1,4}q Bitmap Heap Scan 7721.332 ms

This is with your patch as is; in instances compiled with higher MAX_COLOR_CHARS (I did 6 and 9),
it is of course even easier to dream up a slow regex...

Erik Rijkers

From:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To:	pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Cc:	Erik Rijkers <er(at)xs4all(dot)nl>, hlinnakangas(at)vmware(dot)com, tv(at)fuzzy(dot)cz
Subject:	Re: WIP: index support for regexp search
Date:	2012-11-19 21:58:19
Message-ID:	CAPpHfduDAwWBE+mL9cNKh4eDVMzHLxzwFN-NCcFFsThK=c9c+Q@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Hi!

New version of patch is attached. Changes are following:
1) Right way to convert from pg_wchar to multibyte.
2) Optimization of producing CFNA-like graph on trigrams (produce smaller,
but equivalent, graphs in less time).
3) Comments and refactoring.

------
With best regards,
Alexander Korotkov.

Attachment	Content-Type	Size
trgm-regexp-0.2.patch.gz	application/x-gzip	9.3 KB

From:	"Erik Rijkers" <er(at)xs4all(dot)nl>
To:	"Alexander Korotkov" <aekorotkov(at)gmail(dot)com>
Cc:	"pgsql-hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: WIP: index support for regexp search
Date:	2012-11-19 22:51:55
Message-ID:	9472db31869cee3a8305bb30d969a6c0.squirrel@webmail.xs4all.nl
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Mon, November 19, 2012 22:58, Alexander Korotkov wrote:

> New version of patch is attached.

Hi Alexander,

I get some compile-errors:

(Centos 6.3, Linux 2.6.32-279.14.1.el6.x86_64 GNU/Linux, gcc (GCC) 4.7.2)

make contrib
trgm_regexp.c:73:2: error: unknown type name TrgmStateKey
make[1]: *** [trgm_regexp.o] Error 1
make: *** [all-pg_trgm-recurse] Error 2
trgm_regexp.c:73:2: error: unknown type name TrgmStateKey
make[1]: *** [trgm_regexp.o] Error 1
make: *** [install-pg_trgm-recurse] Error 2

Did I forget something?

Thanks,

Erik Rijkers

From:	Tomas Vondra <tv(at)fuzzy(dot)cz>
To:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: WIP: index support for regexp search
Date:	2012-11-19 23:02:19
Message-ID:	50AABA7B.8060302@fuzzy.cz
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 19.11.2012 22:58, Alexander Korotkov wrote:
> Hi!
>
> New version of patch is attached. Changes are following:
> 1) Right way to convert from pg_wchar to multibyte.
> 2) Optimization of producing CFNA-like graph on trigrams (produce
> smaller, but equivalent, graphs in less time).
> 3) Comments and refactoring.

Hi,

thanks for the updated message-id. I've done the initial review:

1) Patch applies fine to the master.

2) It's common to use upper-case names for macros, but trgm.h defines
macro "iswordchr" - I see it's moved from trgm_op.c but maybe we
could make it a bit more correct?

3) I see there are two '#ifdef KEEPONLYALNUM" blocks right next to each
other in trgm.h - maybe it'd be a good idea to join them?

4) The two new method prototypes at the end of trgm.h use different
indendation than the rest (spaces only instead of tabs).

5) There are no regression tests / updated docs (yet).

6) It does not compile - I do get a bunch of errors like this

trgm_regexp.c:73:2: error: expected specifier-qualifier-list before
‘TrgmStateKey’
trgm_regexp.c: In function ‘addKeys’:
trgm_regexp.c:291:24: error: ‘TrgmState’ has no member named ‘keys’
trgm_regexp.c:304:10: error: ‘TrgmState’ has no member named ‘keys’
...

It seems this is cause by the order of typedefs in trgm_regexp.c, namely
TrgmState referencing TrgmStateKey before it's defined. Moving the
TrgmStateKey before TrgmState fixed the issue (I'm using gcc-4.5.4 but
I think it's not compiler-dependent.)

7) Once fixed, it seems to work

CREATE EXTENSION pg_trgm ;
CREATE TABLE TEST (val TEXT);
INSERT INTO test
SELECT md5(i::text) FROM generate_series(1,1000000) s(i);
CREATE INDEX trgm_idx ON test USING gin (val gin_trgm_ops);
ANALYZE test;

EXPLAIN SELECT * FROM test WHERE val ~ '.*qqq.*';

QUERY PLAN
---------------------------------------------------------------------
Bitmap Heap Scan on test (cost=16.77..385.16 rows=100 width=33)
Recheck Cond: (val ~ '.*qqq.*'::text)
-> Bitmap Index Scan on trgm_idx (cost=0.00..16.75 rows=100
width=0)
Index Cond: (val ~ '.*qqq.*'::text)
(4 rows)

but I do get a bunch of NOTICE messages with debugging info (no matter
if the GIN index is used or not, so it's somewhere in the common regexp
code). But I guess that's due to WIP status.

regards
Tomas

From:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To:	Tomas Vondra <tv(at)fuzzy(dot)cz>, Erik Rijkers <er(at)xs4all(dot)nl>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: WIP: index support for regexp search
Date:	2012-11-19 23:08:34
Message-ID:	CAPpHfdvPiGh0b0u=iuvKAOXbuX49EuN-NBLrEF3MFA6xYDh93g@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Some quick comments.

On Tue, Nov 20, 2012 at 3:02 AM, Tomas Vondra <tv(at)fuzzy(dot)cz> wrote:

> 6) It does not compile - I do get a bunch of errors like this
>
Fixed.

7) Once fixed, it seems to work
>
> CREATE EXTENSION pg_trgm ;
> CREATE TABLE TEST (val TEXT);
> INSERT INTO test
> SELECT md5(i::text) FROM generate_series(1,1000000) s(i);
> CREATE INDEX trgm_idx ON test USING gin (val gin_trgm_ops);
> ANALYZE test;
>
> EXPLAIN SELECT * FROM test WHERE val ~ '.*qqq.*';
>
>
> QUERY PLAN
> ---------------------------------------------------------------------
> Bitmap Heap Scan on test (cost=16.77..385.16 rows=100 width=33)
> Recheck Cond: (val ~ '.*qqq.*'::text)
> -> Bitmap Index Scan on trgm_idx (cost=0.00..16.75 rows=100
> width=0)
> Index Cond: (val ~ '.*qqq.*'::text)
> (4 rows)
>
> but I do get a bunch of NOTICE messages with debugging info (no matter
> if the GIN index is used or not, so it's somewhere in the common regexp
> code). But I guess that's due to WIP status.
>
It's due to TRGM_REGEXP_DEBUG macro. I disabled it by default. But I think
pieces of code hidden by that macro could be useful for debug even after
WIP status.

------
With best regards,
Alexander Korotkov.

Attachment	Content-Type	Size
trgm-regexp-0.3.patch.gz	application/x-gzip	9.2 KB

From:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc:	Tomas Vondra <tv(at)fuzzy(dot)cz>, Erik Rijkers <er(at)xs4all(dot)nl>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: WIP: index support for regexp search
Date:	2012-11-20 09:43:42
Message-ID:	50AB50CE.8030902@vmware.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Glad to see this patch hasn't been totally forgotten. Being able to use
indexes for regular expressions would be really cool!

Back in January, I asked for some high-level description of how the
algorithm works
(http://archives.postgresql.org/message-id/4F187D5C.30701@enterprisedb.com).
That's still sorely needed. Googling around, I found the slides for your
presentation on this from PGConf.EU - it would be great to have the
information from that presentation included in the patch.

- Heikki

From:	Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
To:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>, Tomas Vondra <tv(at)fuzzy(dot)cz>, Erik Rijkers <er(at)xs4all(dot)nl>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: WIP: index support for regexp search
Date:	2012-11-20 14:25:36
Message-ID:	CAFj8pRCtkbsE1fPRVo5S=V6DUhXT_4JDmmsbDjkgQoF_DU6=YQ@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Hello

I tested it now and it working very well!

tested utf8, case sensitive, case insensitive

Regards

Pavel Stehule

2012/11/20 Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>:
> Glad to see this patch hasn't been totally forgotten. Being able to use
> indexes for regular expressions would be really cool!
>
> Back in January, I asked for some high-level description of how the
> algorithm works
> (http://archives.postgresql.org/message-id/4F187D5C.30701@enterprisedb.com).
> That's still sorely needed. Googling around, I found the slides for your
> presentation on this from PGConf.EU - it would be great to have the
> information from that presentation included in the patch.
>
> - Heikki
>
>
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers(at)postgresql(dot)org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers

From:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To:	Tomas Vondra <tv(at)fuzzy(dot)cz>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: WIP: index support for regexp search
Date:	2012-11-20 20:01:51
Message-ID:	CAPpHfdtEhR=vhkPdTVguWK94d1rFPuDtAT1W2FO-xfm21uZw=w@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Tue, Nov 20, 2012 at 3:02 AM, Tomas Vondra <tv(at)fuzzy(dot)cz> wrote:

> 2) It's common to use upper-case names for macros, but trgm.h defines
> macro "iswordchr" - I see it's moved from trgm_op.c but maybe we
> could make it a bit more correct?
>
> 3) I see there are two '#ifdef KEEPONLYALNUM" blocks right next to each
> other in trgm.h - maybe it'd be a good idea to join them?
>
> 4) The two new method prototypes at the end of trgm.h use different
> indendation than the rest (spaces only instead of tabs).
>
These issues are fixed in attached patch.
Additionally 3 new macros are
introduced: MAX_RESULT_STATES, MAX_RESULT_ARCS, MAX_RESULT_PATHS. They are
limiting resources usage during regex processing.

------
With best regards,
Alexander Korotkov.

Attachment	Content-Type	Size
trgm-regexp-0.4.patch.gz	application/x-gzip	9.9 KB

From:	Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
To:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc:	Tomas Vondra <tv(at)fuzzy(dot)cz>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: WIP: index support for regexp search
Date:	2012-11-20 20:51:08
Message-ID:	CAFj8pRBB0U26fK5iuXSZkOrGR3SEWq3iJRjR7zmXHTNKHR5fnA@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

hello

do you plan to support GiST?

Regards

Pavel

2012/11/20 Alexander Korotkov <aekorotkov(at)gmail(dot)com>:
> On Tue, Nov 20, 2012 at 3:02 AM, Tomas Vondra <tv(at)fuzzy(dot)cz> wrote:
>>
>> 2) It's common to use upper-case names for macros, but trgm.h defines
>> macro "iswordchr" - I see it's moved from trgm_op.c but maybe we
>> could make it a bit more correct?
>>
>> 3) I see there are two '#ifdef KEEPONLYALNUM" blocks right next to each
>> other in trgm.h - maybe it'd be a good idea to join them?
>>
>> 4) The two new method prototypes at the end of trgm.h use different
>> indendation than the rest (spaces only instead of tabs).
>
> These issues are fixed in attached patch.
> Additionally 3 new macros are introduced: MAX_RESULT_STATES,
> MAX_RESULT_ARCS, MAX_RESULT_PATHS. They are limiting resources usage during
> regex processing.
>
> ------
> With best regards,
> Alexander Korotkov.
>
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers(at)postgresql(dot)org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers
>

From:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc:	Tomas Vondra <tv(at)fuzzy(dot)cz>, Erik Rijkers <er(at)xs4all(dot)nl>, pgsql-hackers(at)postgresql(dot)org, pavel(dot)stehule(at)gmail(dot)com
Subject:	Re: WIP: index support for regexp search
Date:	2012-11-25 20:55:38
Message-ID:	CAPpHfdsrok2PE9hcwmXNCOe8NiihM1_BZDgoV1L1HLVX1+GC9A@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Tue, Nov 20, 2012 at 1:43 PM, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com
> wrote:

> Glad to see this patch hasn't been totally forgotten. Being able to use
> indexes for regular expressions would be really cool!
>
> Back in January, I asked for some high-level description of how the
> algorithm works (http://archives.postgresql.**
> org/message-id/4F187D5C.30701@**enterprisedb.com<http://archives.postgresql.org/message-id/4F187D5C.30701@enterprisedb.com>).
> That's still sorely needed. Googling around, I found the slides for your
> presentation on this from PGConf.EU - it would be great to have the
> information from that presentation included in the patch.

New version of patch is attached. The changes are following:
1) A big comment with high-level description of what is going on.
2) Regression tests.
3) Documetation update.
4) Some more refactoring.

------
With best regards,
Alexander Korotkov.

Attachment	Content-Type	Size
trgm-regexp-0.5.patch.gz	application/x-gzip	13.5 KB

From:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To:	Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
Cc:	Tomas Vondra <tv(at)fuzzy(dot)cz>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Subject:	Re: WIP: index support for regexp search
Date:	2012-11-25 21:03:17
Message-ID:	CAPpHfdvYtQXNm-vMXCuGGEnh4a1x_JgHV_ZFpC+8mmZnquN9ig@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Hi!

On Wed, Nov 21, 2012 at 12:51 AM, Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>wrote:

> do you plan to support GiST?
>

At first, I would note that pg_trgm GiST opclass is quite ridiculous for
support regex search (and, actually for LIKE/ILIKE search which is already
implemented too). Because in GiST opclass we store set of trigrams in leaf
pages. In was designed for trigram similarity search and have sense for it
because of elimination of trigram set computation. But for regex or
LIKE/ILIKE search this representation is both lossy and bigger than just
original string. Probably we could think about another opclass for GiST
focusing on regex and LIKE/ILIKE search?

However, amyway I can create additional patch for current GiST opclass.

------
With best regards,
Alexander Korotkov.

From:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc:	Tomas Vondra <tv(at)fuzzy(dot)cz>, Erik Rijkers <er(at)xs4all(dot)nl>, pgsql-hackers(at)postgresql(dot)org, pavel(dot)stehule(at)gmail(dot)com
Subject:	Re: WIP: index support for regexp search
Date:	2012-11-26 12:55:19
Message-ID:	50B366B7.5060309@vmware.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 25.11.2012 22:55, Alexander Korotkov wrote:
> On Tue, Nov 20, 2012 at 1:43 PM, Heikki Linnakangas<hlinnakangas(at)vmware(dot)com
>> wrote:
>
>> Glad to see this patch hasn't been totally forgotten. Being able to use
>> indexes for regular expressions would be really cool!
>>
>> Back in January, I asked for some high-level description of how the
>> algorithm works (http://archives.postgresql.**
>> org/message-id/4F187D5C.30701@**enterprisedb.com<http://archives.postgresql.org/message-id/4F187D5C.30701@enterprisedb.com>).
>> That's still sorely needed. Googling around, I found the slides for your
>> presentation on this from PGConf.EU - it would be great to have the
>> information from that presentation included in the patch.
>
>
> New version of patch is attached. The changes are following:
> 1) A big comment with high-level description of what is going on.
> 2) Regression tests.
> 3) Documetation update.
> 4) Some more refactoring.

Great, that top-level comment helped tremendously! I feel enlightened.

I fixed some spelling, formatting etc. trivial stuff while reading
through the patch, see attached. Below is some feedback on the details:

* I don't like the PG_TRY/CATCH trick. It's not generally safe to catch
an error, without propagating it further or rolling back the whole
(sub)transation. It might work in this case, as you're only suppressing
errors with the special sqlcode that are used in the same file, but it
nevertheless feels naughty. I believe none of the limits that are being
checked are strict; it's OK to exceed the limits somewhat, as long as
you terminate the processing in a reasonable time, in case of
pathological input. I'd suggest putting an explicit check for the limits
somewhere, and not rely on ereport(). Something like this, in the code
that recurses:

if (trgmCNFA->arcsCount > MAX_RESULT_ARCS ||
hash_get_num_entries(trgmCNFA->states) > MAX_RESULT_STATES)
{
trgmCNFA->overflowed = true;
return;
}

And then check for the overflowed flag at the top level.

* This part of the high-level comment was not clear to me:

> * States of the graph produced in the first stage are marked with "keys". Key is a pair
> * of a "prefix" and a state of the original automaton. "Prefix" is a last
> * characters. So, knowing the prefix is enough to know what is a trigram when we read some new
> * character. However, we can know single character of prefix or don't know any
> * characters of it. Each state of resulting graph have an "enter key" (with that
> * key we've entered this state) and a set of keys which are reachable without
> * reading any predictable trigram. The algorithm of processing each state
> * of resulting graph are so:
> * 1) Add all keys which achievable without reading of any predictable trigram.
> * 2) Add outgoing arcs labeled with trigrams.
> * Step 2 leads to creation of new states and recursively processing them. So,
> * we use depth-first algorithm.

I didn't understand that. Can you elaborate? It might help to work
through an example, with some ascii art depicting the graph.

* It would be nice to add some comments to TrgmCNFA struct, explaining
which fields are valid at which stages. For example, it seems that
'trgms' array is calculated only after building the CNFA, by
getTrgmVector() function, while arcsCount is updated on the fly, while
recursing in the getState() function.

* What is the representation used for the path matrix? Needs a comment.

* What do the getColorinfo() and scanColorMap() functions do? What
exactly does a color represent? What's the tradeoff in choosing
MAX_COLOR_CHARS?

- Heikki

Attachment	Content-Type	Size
trgm-regexp-0.5-heikki1.patch	text/x-diff	44.8 KB

From:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc:	Tomas Vondra <tv(at)fuzzy(dot)cz>, Erik Rijkers <er(at)xs4all(dot)nl>, pgsql-hackers(at)postgresql(dot)org, pavel(dot)stehule(at)gmail(dot)com
Subject:	Re: WIP: index support for regexp search
Date:	2012-11-26 19:49:23
Message-ID:	CAPpHfdvwGTEXHnxJXe-kne-Sr-DmJ0EiWJVYukgQR9Ophj3z=A@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Mon, Nov 26, 2012 at 4:55 PM, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com
> wrote:
>
> Great, that top-level comment helped tremendously! I feel enlightened.
>
> I fixed some spelling, formatting etc. trivial stuff while reading through
> the patch, see attached. Below is some feedback on the details:
>
> * I don't like the PG_TRY/CATCH trick. It's not generally safe to catch an
> error, without propagating it further or rolling back the whole
> (sub)transation. It might work in this case, as you're only suppressing
> errors with the special sqlcode that are used in the same file, but it
> nevertheless feels naughty. I believe none of the limits that are being
> checked are strict; it's OK to exceed the limits somewhat, as long as you
> terminate the processing in a reasonable time, in case of pathological
> input. I'd suggest putting an explicit check for the limits somewhere, and
> not rely on ereport(). Something like this, in the code that recurses:
>
> if (trgmCNFA->arcsCount > MAX_RESULT_ARCS ||
> hash_get_num_entries(trgmCNFA-**>states) > MAX_RESULT_STATES)
> {
> trgmCNFA->overflowed = true;
> return;
> }
>

PG_TRY/CATCH trick is replaced with some number of if/return. Code becomes
a bit more complicated, but your notes does matter.

And then check for the overflowed flag at the top level.
>
> * This part of the high-level comment was not clear to me:
>
> * States of the graph produced in the first stage are marked with
>> "keys". Key is a pair
>> * of a "prefix" and a state of the original automaton. "Prefix" is a last
>> * characters. So, knowing the prefix is enough to know what is a trigram
>> when we read some new
>> * character. However, we can know single character of prefix or don't
>> know any
>> * characters of it. Each state of resulting graph have an "enter key"
>> (with that
>> * key we've entered this state) and a set of keys which are reachable
>> without
>> * reading any predictable trigram. The algorithm of processing each state
>> * of resulting graph are so:
>> * 1) Add all keys which achievable without reading of any predictable
>> trigram.
>> * 2) Add outgoing arcs labeled with trigrams.
>> * Step 2 leads to creation of new states and recursively processing
>> them. So,
>> * we use depth-first algorithm.
>>
>
> I didn't understand that. Can you elaborate? It might help to work through
> an example, with some ascii art depicting the graph.
>

This comment is extended with example.

> * It would be nice to add some comments to TrgmCNFA struct, explaining
> which fields are valid at which stages. For example, it seems that 'trgms'
> array is calculated only after building the CNFA, by getTrgmVector()
> function, while arcsCount is updated on the fly, while recursing in the
> getState() function.
>

TrgmCNFA comment is extended with this.

> * What is the representation used for the path matrix? Needs a comment.
>

Comments are added to PackedTrgmPaths and TrgmStatePath.

> * What do the getColorinfo()

getColorInfo comment now references to ColorInfo struct which have comment.

and scanColorMap() functions do?

scanColorMap comment now have description of colormap structure.

> What exactly does a color represent?

This is added to the top comment.

> What's the tradeoff in choosing MAX_COLOR_CHARS?

Comment is added to the macro.

------
With best regards,
Alexander Korotkov.

Attachment	Content-Type	Size
trgm-regexp-0.6.patch.gz	application/x-gzip	14.5 KB

From:	"er" <er(at)xs4all(dot)nl>
To:	"Alexander Korotkov" <aekorotkov(at)gmail(dot)com>
Cc:	"Heikki Linnakangas" <hlinnakangas(at)vmware(dot)com>, "Tomas Vondra" <tv(at)fuzzy(dot)cz>, pgsql-hackers(at)postgresql(dot)org, pavel(dot)stehule(at)gmail(dot)com
Subject:	Re: WIP: index support for regexp search
Date:	2012-11-29 08:58:52
Message-ID:	e752b18ec8a2a95a200ddea5b68fd27b.squirrel@webmail.xs4all.nl
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Mon, November 26, 2012 20:49, Alexander Korotkov wrote:

> trgm-regexp-0.6.patch.gz

I ran the simple-minded tests against generated data (similar to the ones I did in January 2012).
The problems of that older version seem pretty much all removed. (although I didn't do much work
on it -- just reran these tests).

I used two 2 instances, 'HEAD' and 'trgm_regex', which were both compiled with

'--enable-depend' '--with-openssl' '--with-perl' '--with-libxml'

Tables used:

rowcount size table size index (trgm)

azjunk4 10^4 rows 1,171,456 | 9,781,248
azjunk5 10^5 rows 11,706,368 | 65,093,632
azjunk6 10^6 rows 117,030,912 | 726,310,912
azjunk7 10^7 rows 1,170,292,736 | 4,976,189,440

(See my previous emails for a generating script)

Tables contain random generated text:

table azjunk7 limit 5;
txt
----------------------------------------------------------------------------------
i kzzhv ssaa zv x xlepzxsgbdkxev v wn dmvqkuwb qxkyvgab gpidaosaqbewqimmai jxj
bvwn zbevtzyhibbn hoctxurutn pvlatjjyxf f runa owpltbcunrbq ux peoook rxwoscbytz
bbjlbbhhkivjivklgbh tvapzogh rj ky ahvgkvvlfudotvqapznludohdoyqrp kvothyclbckbxu
hvic gomewbp izsjifqggyqgzcghdat lb kud ltfqaxqxjjom qkw wqggikgvph yi sftmbjt
edbjfl vtcasudjpgfgjaf swooxygsse flnqd pxzsdmesqhqbzgirswysote muakq agk p w uq
(5 rows)

with index on column 'txt':

create index az7_idx on azjunk7 using gin (txt gin_trgm_ops);

Queries were of the form:

explain analyze select txt from azjunkXX where txt ~ '$REGEX';

The main problem with the January version was that it chose to use the trgm index even when it
could take a long time (hours). This has been resolved as far as I can see, and the results are
now very attractive.

(There does seem to be a very slight regression on the seqscan, but it's so small that I'm not yet
sure it's not noise)

Hardware: AMD FX-8120 with Linux 2.6.32-279.14.1.el6.x86_64 x86_64 GNU/Linux

PostgreSQL 9.3devel-trgm_regex-20121127_2353-e78d288c895bd296e3cb1ca29c7fe2431eef3fcd on
x86_64-unknown-linux-gnu, compiled by gcc (GCC) 4.7.2, 64-bit

port instance table regex rows method expl.analyze timing

6543 HEAD azjunk4 x[ae]q 46 Seq Scan 12.962 ms
6554 trgm_regex azjunk4 x[ae]q 46 Bitmap Heap Scan 0.800 ms

6543 HEAD azjunk4 x[ae]{1}q 46 Seq Scan 12.487 ms
6554 trgm_regex azjunk4 x[ae]{1}q 46 Bitmap Heap Scan 0.209 ms

6543 HEAD azjunk4 x[ae]{1,1}q 46 Seq Scan 12.266 ms
6554 trgm_regex azjunk4 x[ae]{1,1}q 46 Bitmap Heap Scan 0.210 ms

6543 HEAD azjunk4 x[ae]{,2}q 0 Seq Scan 14.322 ms
6554 trgm_regex azjunk4 x[ae]{,2}q 0 Bitmap Heap Scan 0.610 ms

6543 HEAD azjunk4 x[ae]{,10}q 0 Seq Scan 20.503 ms
6554 trgm_regex azjunk4 x[ae]{,10}q 0 Bitmap Heap Scan 0.511 ms

6543 HEAD azjunk4 x[ae]{1,2}q 49 Seq Scan 13.060 ms
6554 trgm_regex azjunk4 x[ae]{1,2}q 49 Bitmap Heap Scan 0.429 ms

6543 HEAD azjunk4 x[aei]q 81 Seq Scan 12.487 ms
6554 trgm_regex azjunk4 x[aei]q 81 Bitmap Heap Scan 0.367 ms

6543 HEAD azjunk4 x[aei]{1}q 81 Seq Scan 12.132 ms
6554 trgm_regex azjunk4 x[aei]{1}q 81 Bitmap Heap Scan 0.336 ms

6543 HEAD azjunk4 x[aei]{1,1}q 81 Seq Scan 12.168 ms
6554 trgm_regex azjunk4 x[aei]{1,1}q 81 Bitmap Heap Scan 0.319 ms

6543 HEAD azjunk4 x[aei]{,2}q 0 Seq Scan 14.586 ms
6554 trgm_regex azjunk4 x[aei]{,2}q 0 Bitmap Heap Scan 0.621 ms

6543 HEAD azjunk4 x[aei]{,10}q 0 Seq Scan 20.134 ms
6554 trgm_regex azjunk4 x[aei]{,10}q 0 Bitmap Heap Scan 0.552 ms

6543 HEAD azjunk4 x[aei]{1,2}q 89 Seq Scan 12.553 ms
6554 trgm_regex azjunk4 x[aei]{1,2}q 89 Bitmap Heap Scan 0.916 ms

6543 HEAD azjunk4 x[aei]{1,3}q 89 Seq Scan 13.055 ms
6554 trgm_regex azjunk4 x[aei]{1,3}q 89 Seq Scan 13.064 ms

6543 HEAD azjunk4 x[aei]q 81 Seq Scan 11.856 ms
6554 trgm_regex azjunk4 x[aei]q 81 Bitmap Heap Scan 0.398 ms

6543 HEAD azjunk4 x[aei]{1}q 81 Seq Scan 11.951 ms
6554 trgm_regex azjunk4 x[aei]{1}q 81 Bitmap Heap Scan 0.369 ms

6543 HEAD azjunk4 x[aei]{1,1}q 81 Seq Scan 12.750 ms
6554 trgm_regex azjunk4 x[aei]{1,1}q 81 Bitmap Heap Scan 0.355 ms

6543 HEAD azjunk4 x[aei]{,2}q 0 Seq Scan 14.032 ms
6554 trgm_regex azjunk4 x[aei]{,2}q 0 Bitmap Heap Scan 0.540 ms

6543 HEAD azjunk4 x[aei]{,10}q 0 Seq Scan 20.377 ms
6554 trgm_regex azjunk4 x[aei]{,10}q 0 Bitmap Heap Scan 0.550 ms

6543 HEAD azjunk4 x[aei]{1,2}q 89 Seq Scan 12.706 ms
6554 trgm_regex azjunk4 x[aei]{1,2}q 89 Bitmap Heap Scan 0.969 ms

6543 HEAD azjunk4 x[aei]{1,3}q 89 Seq Scan 13.127 ms
6554 trgm_regex azjunk4 x[aei]{1,3}q 89 Seq Scan 13.025 ms

6543 HEAD azjunk4 x[aeio]q 105 Seq Scan 12.533 ms
6554 trgm_regex azjunk4 x[aeio]q 105 Bitmap Heap Scan 0.391 ms

6543 HEAD azjunk4 x[aeio]{1}q 105 Seq Scan 12.532 ms
6554 trgm_regex azjunk4 x[aeio]{1}q 105 Bitmap Heap Scan 0.362 ms

6543 HEAD azjunk4 x[aeio]{1,1}q 105 Seq Scan 12.323 ms
6554 trgm_regex azjunk4 x[aeio]{1,1}q 105 Bitmap Heap Scan 0.449 ms

6543 HEAD azjunk4 x[aeio]{,2}q 0 Seq Scan 14.417 ms
6554 trgm_regex azjunk4 x[aeio]{,2}q 0 Bitmap Heap Scan 0.844 ms

6543 HEAD azjunk4 x[aeio]{,10}q 0 Seq Scan 23.056 ms
6554 trgm_regex azjunk4 x[aeio]{,10}q 0 Bitmap Heap Scan 0.668 ms

6543 HEAD azjunk4 x[aeio]{1,2}q 121 Seq Scan 13.072 ms
6554 trgm_regex azjunk4 x[aeio]{1,2}q 121 Seq Scan 13.750 ms

6543 HEAD azjunk4 x[aeio]{1,3}q 123 Seq Scan 12.916 ms
6554 trgm_regex azjunk4 x[aeio]{1,3}q 123 Seq Scan 13.078 ms

6543 HEAD azjunk4 x[aeio]{1,4}q 124 Seq Scan 13.478 ms
6554 trgm_regex azjunk4 x[aeio]{1,4}q 124 Seq Scan 14.334 ms

6543 HEAD azjunk4 x[aeio]{2,4}q 19 Seq Scan 13.922 ms
6554 trgm_regex azjunk4 x[aeio]{2,4}q 19 Seq Scan 13.503 ms

6543 HEAD azjunk4 x[aeio]{3,4}q 3 Seq Scan 14.325 ms
6554 trgm_regex azjunk4 x[aeio]{3,4}q 3 Seq Scan 13.429 ms

6543 HEAD azjunk4 x[aeiou]q 134 Seq Scan 12.356 ms
6554 trgm_regex azjunk4 x[aeiou]q 134 Seq Scan 13.215 ms

6543 HEAD azjunk4 x[aeiou]{1}q 134 Seq Scan 13.005 ms
6554 trgm_regex azjunk4 x[aeiou]{1}q 134 Seq Scan 12.893 ms

6543 HEAD azjunk4 x[aeiou]{1,1}q 134 Seq Scan 12.430 ms
6554 trgm_regex azjunk4 x[aeiou]{1,1}q 134 Seq Scan 13.108 ms

6543 HEAD azjunk4 x[aeiou]{,2}q 0 Seq Scan 14.486 ms
6554 trgm_regex azjunk4 x[aeiou]{,2}q 0 Bitmap Heap Scan 0.349 ms

6543 HEAD azjunk4 x[aeiou]{,10}q 0 Seq Scan 21.597 ms
6554 trgm_regex azjunk4 x[aeiou]{,10}q 0 Bitmap Heap Scan 0.363 ms

6543 HEAD azjunk4 x[aeiou]{1,2}q 156 Seq Scan 13.069 ms
6554 trgm_regex azjunk4 x[aeiou]{1,2}q 156 Seq Scan 13.879 ms

6543 HEAD azjunk4 x[aeiou]{1,3}q 160 Seq Scan 13.005 ms
6554 trgm_regex azjunk4 x[aeiou]{1,3}q 160 Seq Scan 14.016 ms

6543 HEAD azjunk4 x[aeiou]{1,4}q 161 Seq Scan 13.603 ms
6554 trgm_regex azjunk4 x[aeiou]{1,4}q 161 Seq Scan 14.667 ms

6543 HEAD azjunk4 x[aeiou]{2,4}q 27 Seq Scan 13.656 ms
6554 trgm_regex azjunk4 x[aeiou]{2,4}q 27 Seq Scan 14.113 ms

6543 HEAD azjunk4 x[aeiou]{3,4}q 5 Seq Scan 13.541 ms
6554 trgm_regex azjunk4 x[aeiou]{3,4}q 5 Seq Scan 14.265 ms

6543 HEAD azjunk4 x[aeiou]{1,5}q 162 Seq Scan 13.750 ms
6554 trgm_regex azjunk4 x[aeiou]{1,5}q 162 Seq Scan 13.858 ms

6543 HEAD azjunk4 x[aeiou]{2,5}q 28 Seq Scan 13.745 ms
6554 trgm_regex azjunk4 x[aeiou]{2,5}q 28 Seq Scan 13.680 ms

6543 HEAD azjunk4 x[aeiou]{4,5}q 2 Seq Scan 13.577 ms
6554 trgm_regex azjunk4 x[aeiou]{4,5}q 2 Seq Scan 13.755 ms

6543 HEAD azjunk4 x[aeiouy]q 173 Seq Scan 12.106 ms
6554 trgm_regex azjunk4 x[aeiouy]q 173 Seq Scan 12.655 ms

6543 HEAD azjunk4 x[aeiouy]{1}q 173 Seq Scan 12.488 ms
6554 trgm_regex azjunk4 x[aeiouy]{1}q 173 Seq Scan 12.723 ms

6543 HEAD azjunk4 x[aeiouy]{1,1}q 173 Seq Scan 12.525 ms
6554 trgm_regex azjunk4 x[aeiouy]{1,1}q 173 Seq Scan 13.957 ms

6543 HEAD azjunk4 x[aeiouy]{,2}q 0 Seq Scan 14.574 ms
6554 trgm_regex azjunk4 x[aeiouy]{,2}q 0 Bitmap Heap Scan 0.295 ms

6543 HEAD azjunk4 x[aeiouy]{,10}q 0 Seq Scan 20.916 ms
6554 trgm_regex azjunk4 x[aeiouy]{,10}q 0 Bitmap Heap Scan 0.311 ms

6543 HEAD azjunk4 x[aeiouy]{1,2}q 204 Seq Scan 12.730 ms
6554 trgm_regex azjunk4 x[aeiouy]{1,2}q 204 Seq Scan 13.392 ms

6543 HEAD azjunk4 x[aeiouy]{1,3}q 215 Seq Scan 12.824 ms
6554 trgm_regex azjunk4 x[aeiouy]{1,3}q 215 Seq Scan 13.083 ms

6543 HEAD azjunk4 x[aeiouy]{1,4}q 218 Seq Scan 13.985 ms
6554 trgm_regex azjunk4 x[aeiouy]{1,4}q 218 Seq Scan 13.890 ms

6543 HEAD azjunk4 x[aeiouy]{2,4}q 46 Seq Scan 13.735 ms
6554 trgm_regex azjunk4 x[aeiouy]{2,4}q 46 Seq Scan 13.724 ms

6543 HEAD azjunk4 x[aeiouy]{3,4}q 14 Seq Scan 13.470 ms
6554 trgm_regex azjunk4 x[aeiouy]{3,4}q 14 Seq Scan 14.046 ms

6543 HEAD azjunk4 x[aeiouy]{1,5}q 219 Seq Scan 14.245 ms
6554 trgm_regex azjunk4 x[aeiouy]{1,5}q 219 Seq Scan 14.370 ms

6543 HEAD azjunk4 x[aeiouy]{2,5}q 47 Seq Scan 13.483 ms
6554 trgm_regex azjunk4 x[aeiouy]{2,5}q 47 Seq Scan 15.065 ms

6543 HEAD azjunk4 x[aeiouy]{4,5}q 4 Seq Scan 13.394 ms
6554 trgm_regex azjunk4 x[aeiouy]{4,5}q 4 Seq Scan 14.158 ms

6543 HEAD azjunk5 x[ae]q 677 Seq Scan 114.862 ms
6554 trgm_regex azjunk5 x[ae]q 677 Bitmap Heap Scan 2.213 ms

6543 HEAD azjunk5 x[ae]{1}q 677 Seq Scan 119.024 ms
6554 trgm_regex azjunk5 x[ae]{1}q 677 Bitmap Heap Scan 1.800 ms

6543 HEAD azjunk5 x[ae]{1,1}q 677 Seq Scan 118.500 ms
6554 trgm_regex azjunk5 x[ae]{1,1}q 677 Bitmap Heap Scan 1.788 ms

6543 HEAD azjunk5 x[ae]{,2}q 0 Seq Scan 138.023 ms
6554 trgm_regex azjunk5 x[ae]{,2}q 0 Bitmap Heap Scan 9.822 ms

6543 HEAD azjunk5 x[ae]{,10}q 0 Seq Scan 223.479 ms
6554 trgm_regex azjunk5 x[ae]{,10}q 0 Bitmap Heap Scan 9.141 ms

6543 HEAD azjunk5 x[ae]{1,2}q 723 Seq Scan 122.973 ms
6554 trgm_regex azjunk5 x[ae]{1,2}q 723 Bitmap Heap Scan 3.865 ms

6543 HEAD azjunk5 x[aei]q 982 Seq Scan 121.424 ms
6554 trgm_regex azjunk5 x[aei]q 982 Bitmap Heap Scan 2.639 ms

6543 HEAD azjunk5 x[aei]{1}q 982 Seq Scan 119.213 ms
6554 trgm_regex azjunk5 x[aei]{1}q 982 Bitmap Heap Scan 2.769 ms

6543 HEAD azjunk5 x[aei]{1,1}q 982 Seq Scan 121.673 ms
6554 trgm_regex azjunk5 x[aei]{1,1}q 982 Bitmap Heap Scan 2.657 ms

6543 HEAD azjunk5 x[aei]{,2}q 0 Seq Scan 142.256 ms
6554 trgm_regex azjunk5 x[aei]{,2}q 0 Bitmap Heap Scan 5.588 ms

6543 HEAD azjunk5 x[aei]{,10}q 0 Seq Scan 214.769 ms
6554 trgm_regex azjunk5 x[aei]{,10}q 0 Bitmap Heap Scan 9.007 ms

6543 HEAD azjunk5 x[aei]{1,2}q 1075 Seq Scan 128.672 ms
6554 trgm_regex azjunk5 x[aei]{1,2}q 1075 Bitmap Heap Scan 8.079 ms

6543 HEAD azjunk5 x[aei]{1,3}q 1086 Seq Scan 127.069 ms
6554 trgm_regex azjunk5 x[aei]{1,3}q 1086 Bitmap Heap Scan 27.654 ms

6543 HEAD azjunk5 x[aei]q 982 Seq Scan 121.431 ms
6554 trgm_regex azjunk5 x[aei]q 982 Bitmap Heap Scan 2.782 ms

6543 HEAD azjunk5 x[aei]{1}q 982 Seq Scan 121.270 ms
6554 trgm_regex azjunk5 x[aei]{1}q 982 Bitmap Heap Scan 2.603 ms

6543 HEAD azjunk5 x[aei]{1,1}q 982 Seq Scan 120.032 ms
6554 trgm_regex azjunk5 x[aei]{1,1}q 982 Bitmap Heap Scan 2.627 ms

6543 HEAD azjunk5 x[aei]{,2}q 0 Seq Scan 143.379 ms
6554 trgm_regex azjunk5 x[aei]{,2}q 0 Bitmap Heap Scan 4.906 ms

6543 HEAD azjunk5 x[aei]{,10}q 0 Seq Scan 196.212 ms
6554 trgm_regex azjunk5 x[aei]{,10}q 0 Bitmap Heap Scan 4.707 ms

6543 HEAD azjunk5 x[aei]{1,2}q 1075 Seq Scan 127.050 ms
6554 trgm_regex azjunk5 x[aei]{1,2}q 1075 Bitmap Heap Scan 8.474 ms

6543 HEAD azjunk5 x[aei]{1,3}q 1086 Seq Scan 127.090 ms
6554 trgm_regex azjunk5 x[aei]{1,3}q 1086 Bitmap Heap Scan 27.646 ms

6543 HEAD azjunk5 x[aeio]q 1292 Seq Scan 119.951 ms
6554 trgm_regex azjunk5 x[aeio]q 1292 Bitmap Heap Scan 3.881 ms

6543 HEAD azjunk5 x[aeio]{1}q 1292 Seq Scan 123.444 ms
6554 trgm_regex azjunk5 x[aeio]{1}q 1292 Bitmap Heap Scan 3.346 ms

6543 HEAD azjunk5 x[aeio]{1,1}q 1292 Seq Scan 124.024 ms
6554 trgm_regex azjunk5 x[aeio]{1,1}q 1292 Bitmap Heap Scan 3.681 ms

6543 HEAD azjunk5 x[aeio]{,2}q 0 Seq Scan 152.181 ms
6554 trgm_regex azjunk5 x[aeio]{,2}q 0 Bitmap Heap Scan 5.774 ms

6543 HEAD azjunk5 x[aeio]{,10}q 0 Seq Scan 214.168 ms
6554 trgm_regex azjunk5 x[aeio]{,10}q 0 Bitmap Heap Scan 12.074 ms

6543 HEAD azjunk5 x[aeio]{1,2}q 1441 Seq Scan 128.491 ms
6554 trgm_regex azjunk5 x[aeio]{1,2}q 1441 Bitmap Heap Scan 22.538 ms

6543 HEAD azjunk5 x[aeio]{1,3}q 1461 Seq Scan 132.987 ms
6554 trgm_regex azjunk5 x[aeio]{1,3}q 1461 Seq Scan 125.682 ms

6543 HEAD azjunk5 x[aeio]{1,4}q 1464 Seq Scan 132.729 ms
6554 trgm_regex azjunk5 x[aeio]{1,4}q 1464 Seq Scan 133.625 ms

6543 HEAD azjunk5 x[aeio]{2,4}q 175 Seq Scan 135.328 ms
6554 trgm_regex azjunk5 x[aeio]{2,4}q 175 Seq Scan 134.194 ms

6543 HEAD azjunk5 x[aeio]{3,4}q 23 Seq Scan 131.590 ms
6554 trgm_regex azjunk5 x[aeio]{3,4}q 23 Seq Scan 135.435 ms

6543 HEAD azjunk5 x[aeiou]q 1598 Seq Scan 124.063 ms
6554 trgm_regex azjunk5 x[aeiou]q 1598 Seq Scan 124.983 ms

6543 HEAD azjunk5 x[aeiou]{1}q 1598 Seq Scan 134.563 ms
6554 trgm_regex azjunk5 x[aeiou]{1}q 1598 Seq Scan 128.089 ms

6543 HEAD azjunk5 x[aeiou]{1,1}q 1598 Seq Scan 124.158 ms
6554 trgm_regex azjunk5 x[aeiou]{1,1}q 1598 Seq Scan 128.355 ms

6543 HEAD azjunk5 x[aeiou]{,2}q 0 Seq Scan 144.541 ms
6554 trgm_regex azjunk5 x[aeiou]{,2}q 0 Bitmap Heap Scan 2.369 ms

6543 HEAD azjunk5 x[aeiou]{,10}q 0 Seq Scan 208.091 ms
6554 trgm_regex azjunk5 x[aeiou]{,10}q 0 Bitmap Heap Scan 2.528 ms

6543 HEAD azjunk5 x[aeiou]{1,2}q 1838 Seq Scan 130.474 ms
6554 trgm_regex azjunk5 x[aeiou]{1,2}q 1838 Seq Scan 130.433 ms

6543 HEAD azjunk5 x[aeiou]{1,3}q 1886 Seq Scan 134.002 ms
6554 trgm_regex azjunk5 x[aeiou]{1,3}q 1886 Seq Scan 134.786 ms

6543 HEAD azjunk5 x[aeiou]{1,4}q 1892 Seq Scan 137.588 ms
6554 trgm_regex azjunk5 x[aeiou]{1,4}q 1892 Seq Scan 145.194 ms

6543 HEAD azjunk5 x[aeiou]{2,4}q 299 Seq Scan 136.125 ms
6554 trgm_regex azjunk5 x[aeiou]{2,4}q 299 Seq Scan 138.212 ms

6543 HEAD azjunk5 x[aeiou]{3,4}q 54 Seq Scan 135.205 ms
6554 trgm_regex azjunk5 x[aeiou]{3,4}q 54 Seq Scan 134.146 ms

6543 HEAD azjunk5 x[aeiou]{1,5}q 1895 Seq Scan 137.151 ms
6554 trgm_regex azjunk5 x[aeiou]{1,5}q 1895 Seq Scan 140.986 ms

6543 HEAD azjunk5 x[aeiou]{2,5}q 302 Seq Scan 142.189 ms
6554 trgm_regex azjunk5 x[aeiou]{2,5}q 302 Seq Scan 137.368 ms

6543 HEAD azjunk5 x[aeiou]{4,5}q 9 Seq Scan 138.165 ms
6554 trgm_regex azjunk5 x[aeiou]{4,5}q 9 Seq Scan 137.122 ms

6543 HEAD azjunk5 x[aeiouy]q 1913 Seq Scan 126.283 ms
6554 trgm_regex azjunk5 x[aeiouy]q 1913 Seq Scan 130.424 ms

6543 HEAD azjunk5 x[aeiouy]{1}q 1913 Seq Scan 125.947 ms
6554 trgm_regex azjunk5 x[aeiouy]{1}q 1913 Seq Scan 131.957 ms

6543 HEAD azjunk5 x[aeiouy]{1,1}q 1913 Seq Scan 126.529 ms
6554 trgm_regex azjunk5 x[aeiouy]{1,1}q 1913 Seq Scan 130.958 ms

6543 HEAD azjunk5 x[aeiouy]{,2}q 0 Seq Scan 147.704 ms
6554 trgm_regex azjunk5 x[aeiouy]{,2}q 0 Bitmap Heap Scan 2.331 ms

6543 HEAD azjunk5 x[aeiouy]{,10}q 0 Seq Scan 221.774 ms
6554 trgm_regex azjunk5 x[aeiouy]{,10}q 0 Bitmap Heap Scan 2.522 ms

6543 HEAD azjunk5 x[aeiouy]{1,2}q 2275 Seq Scan 134.044 ms
6554 trgm_regex azjunk5 x[aeiouy]{1,2}q 2275 Seq Scan 136.827 ms

6543 HEAD azjunk5 x[aeiouy]{1,3}q 2358 Seq Scan 135.599 ms
6554 trgm_regex azjunk5 x[aeiouy]{1,3}q 2358 Seq Scan 134.196 ms

6543 HEAD azjunk5 x[aeiouy]{1,4}q 2376 Seq Scan 138.685 ms
6554 trgm_regex azjunk5 x[aeiouy]{1,4}q 2376 Seq Scan 141.408 ms

6543 HEAD azjunk5 x[aeiouy]{2,4}q 474 Seq Scan 142.223 ms
6554 trgm_regex azjunk5 x[aeiouy]{2,4}q 474 Seq Scan 143.439 ms

6543 HEAD azjunk5 x[aeiouy]{3,4}q 103 Seq Scan 138.690 ms
6554 trgm_regex azjunk5 x[aeiouy]{3,4}q 103 Seq Scan 136.192 ms

6543 HEAD azjunk5 x[aeiouy]{1,5}q 2381 Seq Scan 140.836 ms
6554 trgm_regex azjunk5 x[aeiouy]{1,5}q 2381 Seq Scan 143.374 ms

6543 HEAD azjunk5 x[aeiouy]{2,5}q 479 Seq Scan 140.223 ms
6554 trgm_regex azjunk5 x[aeiouy]{2,5}q 479 Seq Scan 139.995 ms

6543 HEAD azjunk5 x[aeiouy]{4,5}q 23 Seq Scan 139.976 ms
6554 trgm_regex azjunk5 x[aeiouy]{4,5}q 23 Seq Scan 138.114 ms

6543 HEAD azjunk6 x[ae]q 6448 Seq Scan 1219.490 ms
6554 trgm_regex azjunk6 x[ae]q 6448 Bitmap Heap Scan 23.452 ms

6543 HEAD azjunk6 x[ae]{1}q 6448 Seq Scan 1153.371 ms
6554 trgm_regex azjunk6 x[ae]{1}q 6448 Bitmap Heap Scan 18.492 ms

6543 HEAD azjunk6 x[ae]{1,1}q 6448 Seq Scan 1189.951 ms
6554 trgm_regex azjunk6 x[ae]{1,1}q 6448 Bitmap Heap Scan 24.596 ms

6543 HEAD azjunk6 x[ae]{,2}q 0 Seq Scan 1423.474 ms
6554 trgm_regex azjunk6 x[ae]{,2}q 0 Bitmap Heap Scan 41.593 ms

6543 HEAD azjunk6 x[ae]{,10}q 0 Seq Scan 1957.142 ms
6554 trgm_regex azjunk6 x[ae]{,10}q 0 Bitmap Heap Scan 45.238 ms

6543 HEAD azjunk6 x[ae]{1,2}q 6886 Seq Scan 1253.761 ms
6554 trgm_regex azjunk6 x[ae]{1,2}q 6886 Bitmap Heap Scan 31.247 ms

6543 HEAD azjunk6 x[aei]q 9600 Seq Scan 1203.022 ms
6554 trgm_regex azjunk6 x[aei]q 9600 Bitmap Heap Scan 31.467 ms

6543 HEAD azjunk6 x[aei]{1}q 9600 Seq Scan 1213.834 ms
6554 trgm_regex azjunk6 x[aei]{1}q 9600 Bitmap Heap Scan 26.008 ms

6543 HEAD azjunk6 x[aei]{1,1}q 9600 Seq Scan 1244.158 ms
6554 trgm_regex azjunk6 x[aei]{1,1}q 9600 Bitmap Heap Scan 25.997 ms

6543 HEAD azjunk6 x[aei]{,2}q 0 Seq Scan 1432.935 ms
6554 trgm_regex azjunk6 x[aei]{,2}q 0 Bitmap Heap Scan 44.843 ms

6543 HEAD azjunk6 x[aei]{,10}q 0 Seq Scan 1940.611 ms
6554 trgm_regex azjunk6 x[aei]{,10}q 0 Bitmap Heap Scan 45.838 ms

6543 HEAD azjunk6 x[aei]{1,2}q 10604 Seq Scan 1235.913 ms
6554 trgm_regex azjunk6 x[aei]{1,2}q 10604 Bitmap Heap Scan 78.764 ms

6543 HEAD azjunk6 x[aei]{1,3}q 10704 Seq Scan 1244.960 ms
6554 trgm_regex azjunk6 x[aei]{1,3}q 10704 Bitmap Heap Scan 272.049 ms

6543 HEAD azjunk6 x[aei]q 9600 Seq Scan 1211.965 ms
6554 trgm_regex azjunk6 x[aei]q 9600 Bitmap Heap Scan 26.230 ms

6543 HEAD azjunk6 x[aei]{1}q 9600 Seq Scan 1218.431 ms
6554 trgm_regex azjunk6 x[aei]{1}q 9600 Bitmap Heap Scan 25.462 ms

6543 HEAD azjunk6 x[aei]{1,1}q 9600 Seq Scan 1250.050 ms
6554 trgm_regex azjunk6 x[aei]{1,1}q 9600 Bitmap Heap Scan 25.711 ms

6543 HEAD azjunk6 x[aei]{,2}q 0 Seq Scan 1457.725 ms
6554 trgm_regex azjunk6 x[aei]{,2}q 0 Bitmap Heap Scan 43.491 ms

6543 HEAD azjunk6 x[aei]{,10}q 0 Seq Scan 2034.895 ms
6554 trgm_regex azjunk6 x[aei]{,10}q 0 Bitmap Heap Scan 46.139 ms

6543 HEAD azjunk6 x[aei]{1,2}q 10604 Seq Scan 1250.820 ms
6554 trgm_regex azjunk6 x[aei]{1,2}q 10604 Bitmap Heap Scan 78.067 ms

6543 HEAD azjunk6 x[aei]{1,3}q 10704 Seq Scan 1265.146 ms
6554 trgm_regex azjunk6 x[aei]{1,3}q 10704 Bitmap Heap Scan 274.109 ms

6543 HEAD azjunk6 x[aeio]q 12784 Seq Scan 1235.647 ms
6554 trgm_regex azjunk6 x[aeio]q 12784 Bitmap Heap Scan 35.613 ms

6543 HEAD azjunk6 x[aeio]{1}q 12784 Seq Scan 1206.185 ms
6554 trgm_regex azjunk6 x[aeio]{1}q 12784 Bitmap Heap Scan 39.618 ms

6543 HEAD azjunk6 x[aeio]{1,1}q 12784 Seq Scan 1210.467 ms
6554 trgm_regex azjunk6 x[aeio]{1,1}q 12784 Bitmap Heap Scan 34.513 ms

6543 HEAD azjunk6 x[aeio]{,2}q 0 Seq Scan 1457.918 ms
6554 trgm_regex azjunk6 x[aeio]{,2}q 0 Bitmap Heap Scan 55.732 ms

6543 HEAD azjunk6 x[aeio]{,10}q 0 Seq Scan 2104.860 ms
6554 trgm_regex azjunk6 x[aeio]{,10}q 0 Bitmap Heap Scan 62.129 ms

6543 HEAD azjunk6 x[aeio]{1,2}q 14538 Seq Scan 1286.881 ms
6554 trgm_regex azjunk6 x[aeio]{1,2}q 14538 Bitmap Heap Scan 182.161 ms

6543 HEAD azjunk6 x[aeio]{1,3}q 14761 Seq Scan 1291.199 ms
6554 trgm_regex azjunk6 x[aeio]{1,3}q 14761 Bitmap Heap Scan 1445.593 ms

6543 HEAD azjunk6 x[aeio]{1,4}q 14791 Seq Scan 1331.960 ms
6554 trgm_regex azjunk6 x[aeio]{1,4}q 14791 Seq Scan 1340.845 ms

6543 HEAD azjunk6 x[aeio]{2,4}q 2024 Seq Scan 1337.631 ms
6554 trgm_regex azjunk6 x[aeio]{2,4}q 2024 Seq Scan 1354.844 ms

6543 HEAD azjunk6 x[aeio]{3,4}q 257 Seq Scan 1321.271 ms
6554 trgm_regex azjunk6 x[aeio]{3,4}q 257 Seq Scan 1335.737 ms

6543 HEAD azjunk6 x[aeiou]q 15976 Seq Scan 1237.313 ms
6554 trgm_regex azjunk6 x[aeiou]q 15976 Seq Scan 1268.531 ms

6543 HEAD azjunk6 x[aeiou]{1}q 15976 Seq Scan 1251.777 ms
6554 trgm_regex azjunk6 x[aeiou]{1}q 15976 Seq Scan 1268.431 ms

6543 HEAD azjunk6 x[aeiou]{1,1}q 15976 Seq Scan 1243.416 ms
6554 trgm_regex azjunk6 x[aeiou]{1,1}q 15976 Seq Scan 1263.152 ms

6543 HEAD azjunk6 x[aeiou]{,2}q 0 Seq Scan 1476.587 ms
6554 trgm_regex azjunk6 x[aeiou]{,2}q 0 Bitmap Heap Scan 19.583 ms

6543 HEAD azjunk6 x[aeiou]{,10}q 0 Seq Scan 2084.845 ms
6554 trgm_regex azjunk6 x[aeiou]{,10}q 0 Bitmap Heap Scan 21.377 ms

6543 HEAD azjunk6 x[aeiou]{1,2}q 18692 Seq Scan 1302.585 ms
6554 trgm_regex azjunk6 x[aeiou]{1,2}q 18692 Seq Scan 1330.683 ms

6543 HEAD azjunk6 x[aeiou]{1,3}q 19128 Seq Scan 1290.309 ms
6554 trgm_regex azjunk6 x[aeiou]{1,3}q 19128 Seq Scan 1317.831 ms

6543 HEAD azjunk6 x[aeiou]{1,4}q 19202 Seq Scan 1347.727 ms
6554 trgm_regex azjunk6 x[aeiou]{1,4}q 19202 Seq Scan 1361.307 ms

6543 HEAD azjunk6 x[aeiou]{2,4}q 3268 Seq Scan 1362.704 ms
6554 trgm_regex azjunk6 x[aeiou]{2,4}q 3268 Seq Scan 1372.468 ms

6543 HEAD azjunk6 x[aeiou]{3,4}q 523 Seq Scan 1321.774 ms
6554 trgm_regex azjunk6 x[aeiou]{3,4}q 523 Seq Scan 1346.200 ms

6543 HEAD azjunk6 x[aeiou]{1,5}q 19214 Seq Scan 1367.949 ms
6554 trgm_regex azjunk6 x[aeiou]{1,5}q 19214 Seq Scan 1428.444 ms

6543 HEAD azjunk6 x[aeiou]{2,5}q 3280 Seq Scan 1349.375 ms
6554 trgm_regex azjunk6 x[aeiou]{2,5}q 3280 Seq Scan 1375.887 ms

6543 HEAD azjunk6 x[aeiou]{4,5}q 88 Seq Scan 1324.008 ms
6554 trgm_regex azjunk6 x[aeiou]{4,5}q 88 Seq Scan 1394.067 ms

6543 HEAD azjunk6 x[aeiouy]q 19168 Seq Scan 1262.363 ms
6554 trgm_regex azjunk6 x[aeiouy]q 19168 Seq Scan 1248.167 ms

6543 HEAD azjunk6 x[aeiouy]{1}q 19168 Seq Scan 1257.760 ms
6554 trgm_regex azjunk6 x[aeiouy]{1}q 19168 Seq Scan 1276.502 ms

6543 HEAD azjunk6 x[aeiouy]{1,1}q 19168 Seq Scan 1282.770 ms
6554 trgm_regex azjunk6 x[aeiouy]{1,1}q 19168 Seq Scan 1284.173 ms

6543 HEAD azjunk6 x[aeiouy]{,2}q 0 Seq Scan 1483.940 ms
6554 trgm_regex azjunk6 x[aeiouy]{,2}q 0 Bitmap Heap Scan 20.634 ms

6543 HEAD azjunk6 x[aeiouy]{,10}q 0 Seq Scan 2058.701 ms
6554 trgm_regex azjunk6 x[aeiouy]{,10}q 0 Bitmap Heap Scan 21.596 ms

6543 HEAD azjunk6 x[aeiouy]{1,2}q 23069 Seq Scan 1340.593 ms
6554 trgm_regex azjunk6 x[aeiouy]{1,2}q 23069 Seq Scan 1322.919 ms

6543 HEAD azjunk6 x[aeiouy]{1,3}q 23844 Seq Scan 1321.853 ms
6554 trgm_regex azjunk6 x[aeiouy]{1,3}q 23844 Seq Scan 1333.974 ms

6543 HEAD azjunk6 x[aeiouy]{1,4}q 23993 Seq Scan 1377.787 ms
6554 trgm_regex azjunk6 x[aeiouy]{1,4}q 23993 Seq Scan 1389.073 ms

6543 HEAD azjunk6 x[aeiouy]{2,4}q 4903 Seq Scan 1392.936 ms
6554 trgm_regex azjunk6 x[aeiouy]{2,4}q 4903 Seq Scan 1399.154 ms

6543 HEAD azjunk6 x[aeiouy]{3,4}q 944 Seq Scan 1342.379 ms
6554 trgm_regex azjunk6 x[aeiouy]{3,4}q 944 Seq Scan 1375.420 ms

6543 HEAD azjunk6 x[aeiouy]{1,5}q 24028 Seq Scan 1402.588 ms
6554 trgm_regex azjunk6 x[aeiouy]{1,5}q 24028 Seq Scan 1482.936 ms

6543 HEAD azjunk6 x[aeiouy]{2,5}q 4938 Seq Scan 1378.311 ms
6554 trgm_regex azjunk6 x[aeiouy]{2,5}q 4938 Seq Scan 1402.020 ms

6543 HEAD azjunk6 x[aeiouy]{4,5}q 189 Seq Scan 1348.171 ms
6554 trgm_regex azjunk6 x[aeiouy]{4,5}q 189 Seq Scan 1392.002 ms

6543 HEAD azjunk7 x[ae]q 63781 Seq Scan 11722.978 ms
6554 trgm_regex azjunk7 x[ae]q 63781 Bitmap Heap Scan 418.407 ms

6543 HEAD azjunk7 x[ae]{1}q 63781 Seq Scan 11787.311 ms
6554 trgm_regex azjunk7 x[ae]{1}q 63781 Bitmap Heap Scan 423.027 ms

6543 HEAD azjunk7 x[ae]{1,1}q 63781 Seq Scan 11902.061 ms
6554 trgm_regex azjunk7 x[ae]{1,1}q 63781 Bitmap Heap Scan 420.819 ms

6543 HEAD azjunk7 x[ae]{,2}q 0 Seq Scan 14144.148 ms
6554 trgm_regex azjunk7 x[ae]{,2}q 0 Bitmap Heap Scan 343.806 ms

6543 HEAD azjunk7 x[ae]{,10}q 0 Seq Scan 20390.872 ms
6554 trgm_regex azjunk7 x[ae]{,10}q 0 Bitmap Heap Scan 370.856 ms

6543 HEAD azjunk7 x[ae]{1,2}q 68145 Seq Scan 12569.198 ms
6554 trgm_regex azjunk7 x[ae]{1,2}q 68145 Bitmap Heap Scan 571.570 ms

6543 HEAD azjunk7 x[aei]q 95281 Seq Scan 12027.646 ms
6554 trgm_regex azjunk7 x[aei]q 95281 Bitmap Heap Scan 579.807 ms

6543 HEAD azjunk7 x[aei]{1}q 95281 Seq Scan 12213.674 ms
6554 trgm_regex azjunk7 x[aei]{1}q 95281 Bitmap Heap Scan 581.085 ms

6543 HEAD azjunk7 x[aei]{1,1}q 95281 Seq Scan 12121.898 ms
6554 trgm_regex azjunk7 x[aei]{1,1}q 95281 Bitmap Heap Scan 587.568 ms

6543 HEAD azjunk7 x[aei]{,2}q 0 Seq Scan 14519.020 ms
6554 trgm_regex azjunk7 x[aei]{,2}q 0 Bitmap Heap Scan 440.596 ms

6543 HEAD azjunk7 x[aei]{,10}q 0 Seq Scan 20829.970 ms
6554 trgm_regex azjunk7 x[aei]{,10}q 0 Bitmap Heap Scan 443.136 ms

6543 HEAD azjunk7 x[aei]{1,2}q 105054 Seq Scan 12967.634 ms
6554 trgm_regex azjunk7 x[aei]{1,2}q 105054 Bitmap Heap Scan 1151.202 ms

6543 HEAD azjunk7 x[aei]{1,3}q 106031 Seq Scan 12601.485 ms
6554 trgm_regex azjunk7 x[aei]{1,3}q 106031 Bitmap Heap Scan 3084.092 ms

6543 HEAD azjunk7 x[aei]q 95281 Seq Scan 12251.805 ms
6554 trgm_regex azjunk7 x[aei]q 95281 Bitmap Heap Scan 579.398 ms

6543 HEAD azjunk7 x[aei]{1}q 95281 Seq Scan 12251.196 ms
6554 trgm_regex azjunk7 x[aei]{1}q 95281 Bitmap Heap Scan 579.351 ms

6543 HEAD azjunk7 x[aei]{1,1}q 95281 Seq Scan 12176.216 ms
6554 trgm_regex azjunk7 x[aei]{1,1}q 95281 Bitmap Heap Scan 577.931 ms

6543 HEAD azjunk7 x[aei]{,2}q 0 Seq Scan 14632.855 ms
6554 trgm_regex azjunk7 x[aei]{,2}q 0 Bitmap Heap Scan 434.758 ms

6543 HEAD azjunk7 x[aei]{,10}q 0 Seq Scan 20637.829 ms
6554 trgm_regex azjunk7 x[aei]{,10}q 0 Bitmap Heap Scan 440.237 ms

6543 HEAD azjunk7 x[aei]{1,2}q 105054 Seq Scan 12967.108 ms
6554 trgm_regex azjunk7 x[aei]{1,2}q 105054 Bitmap Heap Scan 1166.260 ms

6543 HEAD azjunk7 x[aei]{1,3}q 106031 Seq Scan 12820.629 ms
6554 trgm_regex azjunk7 x[aei]{1,3}q 106031 Bitmap Heap Scan 3079.662 ms

6543 HEAD azjunk7 x[aeio]q 126868 Seq Scan 12535.441 ms
6554 trgm_regex azjunk7 x[aeio]q 126868 Bitmap Heap Scan 737.634 ms

6543 HEAD azjunk7 x[aeio]{1}q 126868 Seq Scan 12338.188 ms
6554 trgm_regex azjunk7 x[aeio]{1}q 126868 Bitmap Heap Scan 749.844 ms

6543 HEAD azjunk7 x[aeio]{1,1}q 126868 Seq Scan 12579.271 ms
6554 trgm_regex azjunk7 x[aeio]{1,1}q 126868 Bitmap Heap Scan 731.667 ms

6543 HEAD azjunk7 x[aeio]{,2}q 0 Seq Scan 14806.573 ms
6554 trgm_regex azjunk7 x[aeio]{,2}q 0 Bitmap Heap Scan 555.302 ms

6543 HEAD azjunk7 x[aeio]{,10}q 0 Seq Scan 22000.135 ms
6554 trgm_regex azjunk7 x[aeio]{,10}q 0 Bitmap Heap Scan 560.526 ms

6543 HEAD azjunk7 x[aeio]{1,2}q 144180 Seq Scan 12919.840 ms
6554 trgm_regex azjunk7 x[aeio]{1,2}q 144180 Bitmap Heap Scan 2245.885 ms

6543 HEAD azjunk7 x[aeio]{1,3}q 146526 Seq Scan 12807.513 ms
6554 trgm_regex azjunk7 x[aeio]{1,3}q 146526 Bitmap Heap Scan 15261.582 ms

6543 HEAD azjunk7 x[aeio]{1,4}q 146834 Seq Scan 13179.285 ms
6554 trgm_regex azjunk7 x[aeio]{1,4}q 146834 Seq Scan 13874.164 ms

6543 HEAD azjunk7 x[aeio]{2,4}q 20220 Seq Scan 13365.779 ms
6554 trgm_regex azjunk7 x[aeio]{2,4}q 20220 Seq Scan 13544.404 ms

6543 HEAD azjunk7 x[aeio]{3,4}q 2697 Seq Scan 13224.408 ms
6554 trgm_regex azjunk7 x[aeio]{3,4}q 2697 Seq Scan 13699.898 ms

6543 HEAD azjunk7 x[aeiou]q 158778 Seq Scan 12753.739 ms
6554 trgm_regex azjunk7 x[aeiou]q 158778 Seq Scan 12736.813 ms

6543 HEAD azjunk7 x[aeiou]{1}q 158778 Seq Scan 12385.108 ms
6554 trgm_regex azjunk7 x[aeiou]{1}q 158778 Seq Scan 12852.739 ms

6543 HEAD azjunk7 x[aeiou]{1,1}q 158778 Seq Scan 12665.614 ms
6554 trgm_regex azjunk7 x[aeiou]{1,1}q 158778 Seq Scan 12482.476 ms

6543 HEAD azjunk7 x[aeiou]{,2}q 0 Seq Scan 14886.647 ms
6554 trgm_regex azjunk7 x[aeiou]{,2}q 0 Bitmap Heap Scan 197.807 ms

6543 HEAD azjunk7 x[aeiou]{,10}q 0 Seq Scan 21428.416 ms
6554 trgm_regex azjunk7 x[aeiou]{,10}q 0 Bitmap Heap Scan 210.265 ms

6543 HEAD azjunk7 x[aeiou]{1,2}q 185669 Seq Scan 12896.338 ms
6554 trgm_regex azjunk7 x[aeiou]{1,2}q 185669 Seq Scan 13354.702 ms

6543 HEAD azjunk7 x[aeiou]{1,3}q 190274 Seq Scan 12730.517 ms
6554 trgm_regex azjunk7 x[aeiou]{1,3}q 190274 Seq Scan 13026.644 ms

6543 HEAD azjunk7 x[aeiou]{1,4}q 191017 Seq Scan 13664.473 ms
6554 trgm_regex azjunk7 x[aeiou]{1,4}q 191017 Seq Scan 13743.875 ms

6543 HEAD azjunk7 x[aeiou]{2,4}q 32732 Seq Scan 13360.429 ms
6554 trgm_regex azjunk7 x[aeiou]{2,4}q 32732 Seq Scan 13804.770 ms

6543 HEAD azjunk7 x[aeiou]{3,4}q 5449 Seq Scan 13170.928 ms
6554 trgm_regex azjunk7 x[aeiou]{3,4}q 5449 Seq Scan 13394.707 ms

6543 HEAD azjunk7 x[aeiou]{1,5}q 191160 Seq Scan 13591.866 ms
6554 trgm_regex azjunk7 x[aeiou]{1,5}q 191160 Seq Scan 14158.325 ms

6543 HEAD azjunk7 x[aeiou]{2,5}q 32878 Seq Scan 13507.736 ms
6554 trgm_regex azjunk7 x[aeiou]{2,5}q 32878 Seq Scan 13687.159 ms

6543 HEAD azjunk7 x[aeiou]{4,5}q 903 Seq Scan 13329.291 ms
6554 trgm_regex azjunk7 x[aeiou]{4,5}q 903 Seq Scan 13645.331 ms

6543 HEAD azjunk7 x[aeiouy]q 190245 Seq Scan 12331.375 ms
6554 trgm_regex azjunk7 x[aeiouy]q 190245 Seq Scan 12726.390 ms

6543 HEAD azjunk7 x[aeiouy]{1}q 190245 Seq Scan 12752.629 ms
6554 trgm_regex azjunk7 x[aeiouy]{1}q 190245 Seq Scan 12712.805 ms

6543 HEAD azjunk7 x[aeiouy]{1,1}q 190245 Seq Scan 12500.269 ms
6554 trgm_regex azjunk7 x[aeiouy]{1,1}q 190245 Seq Scan 12863.557 ms

6543 HEAD azjunk7 x[aeiouy]{,2}q 0 Seq Scan 14746.988 ms
6554 trgm_regex azjunk7 x[aeiouy]{,2}q 0 Bitmap Heap Scan 194.024 ms

6543 HEAD azjunk7 x[aeiouy]{,10}q 0 Seq Scan 21648.192 ms
6554 trgm_regex azjunk7 x[aeiouy]{,10}q 0 Bitmap Heap Scan 208.955 ms

6543 HEAD azjunk7 x[aeiouy]{1,2}q 228677 Seq Scan 13359.817 ms
6554 trgm_regex azjunk7 x[aeiouy]{1,2}q 228677 Seq Scan 13358.769 ms

6543 HEAD azjunk7 x[aeiouy]{1,3}q 236512 Seq Scan 13191.587 ms
6554 trgm_regex azjunk7 x[aeiouy]{1,3}q 236512 Seq Scan 13504.745 ms

6543 HEAD azjunk7 x[aeiouy]{1,4}q 238061 Seq Scan 13756.733 ms
6554 trgm_regex azjunk7 x[aeiouy]{1,4}q 238061 Seq Scan 13929.557 ms

6543 HEAD azjunk7 x[aeiouy]{2,4}q 48681 Seq Scan 13766.984 ms
6554 trgm_regex azjunk7 x[aeiouy]{2,4}q 48681 Seq Scan 14135.231 ms

6543 HEAD azjunk7 x[aeiouy]{3,4}q 9602 Seq Scan 13429.259 ms
6554 trgm_regex azjunk7 x[aeiouy]{3,4}q 9602 Seq Scan 13539.163 ms

6543 HEAD azjunk7 x[aeiouy]{1,5}q 238407 Seq Scan 13863.009 ms
6554 trgm_regex azjunk7 x[aeiouy]{1,5}q 238407 Seq Scan 14091.161 ms

6543 HEAD azjunk7 x[aeiouy]{2,5}q 49031 Seq Scan 14006.522 ms
6554 trgm_regex azjunk7 x[aeiouy]{2,5}q 49031 Seq Scan 14105.039 ms

6543 HEAD azjunk7 x[aeiouy]{4,5}q 1935 Seq Scan 13718.130 ms
6554 trgm_regex azjunk7 x[aeiouy]{4,5}q 1935 Seq Scan 14032.751 ms

(You asked also for testing against real text, I'll probably some of that too (although I do not
expect all that many differences).

Thanks, great work!

Erik Rijkers

From:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc:	Tomas Vondra <tv(at)fuzzy(dot)cz>, Erik Rijkers <er(at)xs4all(dot)nl>, pgsql-hackers(at)postgresql(dot)org, pavel(dot)stehule(at)gmail(dot)com
Subject:	Re: WIP: index support for regexp search
Date:	2012-11-29 13:25:22
Message-ID:	50B76242.8070907@vmware.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

One thing that bothers me with this algoritm is that the overflow
mechanism is all-or-nothing. In many cases, even when there is a huge
number of states in the diagram, you could still extract at least a few
trigrams that must be present in any matching string, with little
effort. At least, it seems like that to a human :-).

For example, consider this:

explain analyze select count(*) from azjunk4 where txt ~
('^aabaacaadaaeaafaagaahaaiaajaakaalaamaanaaoaapaaqaaraasaataauaavaawaaxaayaazabaabbabcabdabeabfabgabhabiabjabkablabmabnaboabpabqabrabsabtabuabvabwabxabyabzacaacbaccacdaceacfacgachaciacjackaclacmacnacoacpacqacracsactacuacvacwacxacyaczadaadbadcaddadeadfadgadhadiadjadkadladmadnadoadpadqadradsadtaduadvadwadxadyadzaeaaebaecaedaeeaefaegaehaeiaejaekaelaemaenaeoaepaeqaeraesaetaeuaevaewaexaeyaezafaafbafcafdafeaffafgafhafiafjafkaflafmafnafoafpafqafrafsaftafuafvafwafxafyafzagaagbagcagdageagfaggaghagiagjagkaglagmagnagoagpagqagragsagtaguagvagwagxagyagzahaahbahcahdaheahfahgahhahiahjahkahlahmahnahoahpahqahrahs$');

you get a query plan like this (the long regexp string edited out):

Aggregate (cost=228148.02..228148.03 rows=1 width=0) (actual
time=131.100..131
.101 rows=1 loops=1)
-> Bitmap Heap Scan on azjunk4 (cost=228144.01..228148.02 rows=1
width=0) (
actual time=131.096..131.096 rows=0 loops=1)
Recheck Cond: (txt ~ <ridiculously long regexp>)
Rows Removed by Index Recheck: 10000
-> Bitmap Index Scan on azjunk4_trgmrgx_txt_01_idx
(cost=0.00..228144
.01 rows=1 width=0) (actual time=82.914..82.914 rows=10000 loops=1)
Index Cond: (txt ~ <ridiculously long regexp>)
Total runtime: 131.230 ms
(7 rows)

That ridiculously long string exceeds the number of states (I think,
could be number of paths or arcs too), and the algorithm gives up,
resorting to scanning the whole index as can be seen by the "Rows
Removed by Index Recheck" line. However, it's easy to see that any
matching string must contain *any* of the possible trigrams the
algorithm extracts. If it could safely return just a few of them, say
"aab" and "abz", and discard the rest, that would already be much better
than a full index scan.

Would it be safe to simply stop short the depth-first search on
overflow, and proceed with the graph that was constructed up to that point?

- Heikki

From:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc:	Tomas Vondra <tv(at)fuzzy(dot)cz>, Erik Rijkers <er(at)xs4all(dot)nl>, pgsql-hackers(at)postgresql(dot)org, pavel(dot)stehule(at)gmail(dot)com
Subject:	Re: WIP: index support for regexp search
Date:	2012-11-30 11:20:02
Message-ID:	CAPpHfdtpYFWFe8bRdmmvKSD39YM8UTeEzuMrKHi8Rr61CzzdtA@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Thu, Nov 29, 2012 at 5:25 PM, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com
> wrote:

> One thing that bothers me with this algoritm is that the overflow
> mechanism is all-or-nothing. In many cases, even when there is a huge
> number of states in the diagram, you could still extract at least a few
> trigrams that must be present in any matching string, with little effort.
> At least, it seems like that to a human :-).
>
> For example, consider this:
>
> explain analyze select count(*) from azjunk4 where txt ~ ('^**
> aabaacaadaaeaafaagaahaaiaajaak**aalaamaanaaoaapaaqaaraasaataau**
> aavaawaaxaayaazabaabbabcabdabe**abfabgabhabiabjabkablabmabnabo**
> abpabqabrabsabtabuabvabwabxaby**abzacaacbaccacdaceacfacgachaci**
> acjackaclacmacnacoacpacqacracs**actacuacvacwacxacyaczadaadbadc**
> addadeadfadgadhadiadjadkadladm**adnadoadpadqadradsadtaduadvadw**
> adxadyadzaeaaebaecaedaeeaefaeg**aehaeiaejaekaelaemaenaeoaepaeq**
> aeraesaetaeuaevaewaexaeyaezafa**afbafcafdafeaffafgafhafiafjafk**
> aflafmafnafoafpafqafrafsaftafu**afvafwafxafyafzagaagbagcagdage**
> agfaggaghagiagjagkaglagmagnago**agpagqagragsagtaguagvagwagxagy**
> agzahaahbahcahdaheahfahgahhahi**ahjahkahlahmahnahoahpahqahrahs**$');
>
> you get a query plan like this (the long regexp string edited out):
>
> Aggregate (cost=228148.02..228148.03 rows=1 width=0) (actual
> time=131.100..131
> .101 rows=1 loops=1)
> -> Bitmap Heap Scan on azjunk4 (cost=228144.01..228148.02 rows=1
> width=0) (
> actual time=131.096..131.096 rows=0 loops=1)
> Recheck Cond: (txt ~ <ridiculously long regexp>)
> Rows Removed by Index Recheck: 10000
> -> Bitmap Index Scan on azjunk4_trgmrgx_txt_01_idx
> (cost=0.00..228144
> .01 rows=1 width=0) (actual time=82.914..82.914 rows=10000 loops=1)
> Index Cond: (txt ~ <ridiculously long regexp>)
> Total runtime: 131.230 ms
> (7 rows)
>
> That ridiculously long string exceeds the number of states (I think, could
> be number of paths or arcs too), and the algorithm gives up, resorting to
> scanning the whole index as can be seen by the "Rows Removed by Index
> Recheck" line. However, it's easy to see that any matching string must
> contain *any* of the possible trigrams the algorithm extracts. If it could
> safely return just a few of them, say "aab" and "abz", and discard the
> rest, that would already be much better than a full index scan.
>
> Would it be safe to simply stop short the depth-first search on overflow,
> and proceed with the graph that was constructed up to that point?

For depth-first it's not. But your proposal naturally makes sense. I've
changed it to breadth-first search. And then it's safe to mark all
unprocessed states as final when overflow. It means that we assume every
unprocessed branch to immediately finish with matching (this can give us
more false positives but no false negatives).
For overflow of matrix collection, it's safe to do just OR between all the
trigrams.
New version of patch is attached.

------
With best regards,
Alexander Korotkov.

Attachment	Content-Type	Size
trgm-regexp-0.7.patch.gz	application/x-gzip	15.0 KB

From:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To:	er <er(at)xs4all(dot)nl>
Cc:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Tomas Vondra <tv(at)fuzzy(dot)cz>, pgsql-hackers(at)postgresql(dot)org, pavel(dot)stehule(at)gmail(dot)com
Subject:	Re: WIP: index support for regexp search
Date:	2012-11-30 11:22:05
Message-ID:	CAPpHfdviFLzvHs=C+4J2U6CygmGsqmKwWh1JWkc6P62AOQi53Q@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Hi!

On Thu, Nov 29, 2012 at 12:58 PM, er <er(at)xs4all(dot)nl> wrote:

> On Mon, November 26, 2012 20:49, Alexander Korotkov wrote:
>
> > trgm-regexp-0.6.patch.gz
>
> I ran the simple-minded tests against generated data (similar to the ones
> I did in January 2012).
> The problems of that older version seem pretty much all removed. (although
> I didn't do much work
> on it -- just reran these tests).
>

Thanks a lot for testing! Could you repeat for 0.7 version of patch which
has new overflow handling?

------
With best regards,
Alexander Korotkov.

From:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc:	Tomas Vondra <tv(at)fuzzy(dot)cz>, Erik Rijkers <er(at)xs4all(dot)nl>, pgsql-hackers(at)postgresql(dot)org, pavel(dot)stehule(at)gmail(dot)com
Subject:	Re: WIP: index support for regexp search
Date:	2012-11-30 11:28:37
Message-ID:	CAPpHfds6tFyKHoMw0uMMJzXMu6KtTgOUg7knLbOGx=M4bFL6CA@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Fri, Nov 30, 2012 at 3:20 PM, Alexander Korotkov <aekorotkov(at)gmail(dot)com>wrote:

> For depth-first it's not.
>

Oh, I didn't explained it.
In order to stop graph processing we need to be sure that we put all
outgoing arcs from state or assume that state to be final. In DFS we can be
in the final part of graph producing but still didn't add some arc (with
new trigram) from initial state directly to the final state. It obviously
leads to false negatives.

------
With best regards,
Alexander Korotkov.

From:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc:	Tomas Vondra <tv(at)fuzzy(dot)cz>, Erik Rijkers <er(at)xs4all(dot)nl>, pgsql-hackers(at)postgresql(dot)org, pavel(dot)stehule(at)gmail(dot)com
Subject:	Re: WIP: index support for regexp search
Date:	2012-11-30 14:23:59
Message-ID:	50B8C17F.6000507@vmware.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 30.11.2012 13:20, Alexander Korotkov wrote:
> On Thu, Nov 29, 2012 at 5:25 PM, Heikki Linnakangas<hlinnakangas(at)vmware(dot)com
>> wrote:
>
>> Would it be safe to simply stop short the depth-first search on overflow,
>> and proceed with the graph that was constructed up to that point?
>
> For depth-first it's not. But your proposal naturally makes sense. I've
> changed it to breadth-first search. And then it's safe to mark all
> unprocessed states as final when overflow. It means that we assume every
> unprocessed branch to immediately finish with matching (this can give us
> more false positives but no false negatives).
> For overflow of matrix collection, it's safe to do just OR between all the
> trigrams.
> New version of patch is attached.

Thanks, sounds good.

I've spent quite a long time trying to understand the transformation the
getState/addKeys/addAcrs functions do to the original CNFA graph. I
think that still needs more comments to explain the steps involved in it.

One thing that occurs to me is that it might be better to delay
expanding colors to characters. You could build and transform the graph,
and even create the paths, all while operating on colors. You would end
up with lists of "color trigrams", consisting of sequences of three
colors that must appear in the source string. Only at the last step you
would expand the color trigrams to real character trigrams. I think that
would save a lot of processing while building the graph, if you have
colors that contain many characters. It would allow us to do better than
the fixed small MAX_COLOR_CHARS limit. For example with MAX_COLOR_CHARS
= 4 in the current patch, it's a shame that we can't do anything with a
fairly simple regexp like "^a[b-g]h$"

- Heikki

From:	"Erik Rijkers" <er(at)xs4all(dot)nl>
To:	"Alexander Korotkov" <aekorotkov(at)gmail(dot)com>
Cc:	"Heikki Linnakangas" <hlinnakangas(at)vmware(dot)com>, "Tomas Vondra" <tv(at)fuzzy(dot)cz>, pgsql-hackers(at)postgresql(dot)org, pavel(dot)stehule(at)gmail(dot)com
Subject:	Re: WIP: index support for regexp search
Date:	2012-12-01 11:22:56
Message-ID:	122364368bf732c49cd1236ea72da722.squirrel@webmail.xs4all.nl
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Fri, November 30, 2012 12:22, Alexander Korotkov wrote:
> Hi!
>
> On Thu, Nov 29, 2012 at 12:58 PM, er <er(at)xs4all(dot)nl> wrote:
>
>> On Mon, November 26, 2012 20:49, Alexander Korotkov wrote:
>>
>>
>> I ran the simple-minded tests against generated data (similar to the ones
>> I did in January 2012).
>> The problems of that older version seem pretty much all removed. (although
>> I didn't do much work
>> on it -- just reran these tests).
>>
>
> Thanks a lot for testing! Could you repeat for 0.7 version of patch which
> has new overflow handling?
>

I've attached a similar test re-run that compares HEAD with patch versions 0.6, and 0.7.

Erik Rijkers

Attachment	Content-Type	Size
trgm_compare.txt.gz	application/x-gzip	6.0 KB

From:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc:	Tomas Vondra <tv(at)fuzzy(dot)cz>, Erik Rijkers <er(at)xs4all(dot)nl>, pgsql-hackers(at)postgresql(dot)org, pavel(dot)stehule(at)gmail(dot)com
Subject:	Re: WIP: index support for regexp search
Date:	2012-12-02 18:06:22
Message-ID:	CAPpHfdsTEGmx-fmBzaqFED8ghiDX+UA0duph=UL383w80A=oDw@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Fri, Nov 30, 2012 at 6:23 PM, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com
> wrote:

> On 30.11.2012 13:20, Alexander Korotkov wrote:
>
>> On Thu, Nov 29, 2012 at 5:25 PM, Heikki Linnakangas<hlinnakangas(at)**
>> vmware.com <hlinnakangas(at)vmware(dot)com>
>>
>>> wrote:
>>>
>>
>> Would it be safe to simply stop short the depth-first search on overflow,
>>> and proceed with the graph that was constructed up to that point?
>>>
>>
>> For depth-first it's not. But your proposal naturally makes sense. I've
>> changed it to breadth-first search. And then it's safe to mark all
>> unprocessed states as final when overflow. It means that we assume every
>> unprocessed branch to immediately finish with matching (this can give us
>> more false positives but no false negatives).
>> For overflow of matrix collection, it's safe to do just OR between all the
>> trigrams.
>> New version of patch is attached.
>>
>
> Thanks, sounds good.
>
> I've spent quite a long time trying to understand the transformation the
> getState/addKeys/addAcrs functions do to the original CNFA graph. I think
> that still needs more comments to explain the steps involved in it.
>
> One thing that occurs to me is that it might be better to delay expanding
> colors to characters. You could build and transform the graph, and even
> create the paths, all while operating on colors. You would end up with
> lists of "color trigrams", consisting of sequences of three colors that
> must appear in the source string. Only at the last step you would expand
> the color trigrams to real character trigrams. I think that would save a
> lot of processing while building the graph, if you have colors that contain
> many characters. It would allow us to do better than the fixed small
> MAX_COLOR_CHARS limit. For example with MAX_COLOR_CHARS = 4 in the current
> patch, it's a shame that we can't do anything with a fairly simple regexp
> like "^a[b-g]h$"
>

Nice idea to delay expanding colors to characters! Obviously, we should
delay expanding inly alphanumerical characters. Because non-alphanumberical
characters influence graph structure. Trying to implement...

------
With best regards,
Alexander Korotkov.

From:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To:	Erik Rijkers <er(at)xs4all(dot)nl>
Cc:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Tomas Vondra <tv(at)fuzzy(dot)cz>, pgsql-hackers(at)postgresql(dot)org, pavel(dot)stehule(at)gmail(dot)com
Subject:	Re: WIP: index support for regexp search
Date:	2012-12-02 18:07:48
Message-ID:	CAPpHfdu7k3jOiBAdYxX1gOgQeWguoYdyt9QoHvrM-N-88_Tnug@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Sat, Dec 1, 2012 at 3:22 PM, Erik Rijkers <er(at)xs4all(dot)nl> wrote:

> On Fri, November 30, 2012 12:22, Alexander Korotkov wrote:
> > Hi!
> >
> > On Thu, Nov 29, 2012 at 12:58 PM, er <er(at)xs4all(dot)nl> wrote:
> >
> >> On Mon, November 26, 2012 20:49, Alexander Korotkov wrote:
> >>
> >>
> >> I ran the simple-minded tests against generated data (similar to the
> ones
> >> I did in January 2012).
> >> The problems of that older version seem pretty much all removed.
> (although
> >> I didn't do much work
> >> on it -- just reran these tests).
> >>
> >
> > Thanks a lot for testing! Could you repeat for 0.7 version of patch which
> > has new overflow handling?
> >
>
> I've attached a similar test re-run that compares HEAD with patch versions
> 0.6, and 0.7.
>

Thanks! Did you write scripts for automated testing? I would be nice if you
share them.

------
With best regards,
Alexander Korotkov.

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Tomas Vondra <tv(at)fuzzy(dot)cz>, Erik Rijkers <er(at)xs4all(dot)nl>, pgsql-hackers(at)postgresql(dot)org, pavel(dot)stehule(at)gmail(dot)com
Subject:	Re: WIP: index support for regexp search
Date:	2012-12-02 18:19:31
Message-ID:	25791.1354472371@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Alexander Korotkov <aekorotkov(at)gmail(dot)com> writes:
> Nice idea to delay expanding colors to characters! Obviously, we should
> delay expanding inly alphanumerical characters. Because non-alphanumberical
> characters influence graph structure. Trying to implement...

Uh, why would that be? Colors are colors. The regexp machinery doesn't
care whether they represent alphanumerics or not. (Or to be more
precise, if there is a situation where it makes a difference, separate
colors will have been created for each set of characters that need to be
distinguished.)

regards, tom lane

From:	"Erik Rijkers" <er(at)xs4all(dot)nl>
To:	"Alexander Korotkov" <aekorotkov(at)gmail(dot)com>
Cc:	"Heikki Linnakangas" <hlinnakangas(at)vmware(dot)com>, "Tomas Vondra" <tv(at)fuzzy(dot)cz>, pgsql-hackers(at)postgresql(dot)org, pavel(dot)stehule(at)gmail(dot)com
Subject:	Re: WIP: index support for regexp search
Date:	2012-12-02 21:23:34
Message-ID:	13f88a3db309c55b8c7361a05ca3e95f.squirrel@webmail.xs4all.nl
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Sun, December 2, 2012 19:07, Alexander Korotkov wrote:
>>
>> I've attached a similar test re-run that compares HEAD with patch versions
>> 0.6, and 0.7.
>>
>
> Thanks! Did you write scripts for automated testing? I would be nice if you
> share them.
>

Sure, here they are.

The perl program does depend a bit on my particular setup (it reads the port from the
postgresql.conf of each instance, and the script knows the data_dir locations), but I suppose it's
easy enough to remove that.

(Just hardcode the %instances hash, instead of calling instances().)

If you need help to get them to run I can 'generalise' them, but for now I'll send them as they are.

Erik Rijkers

Attachment	Content-Type	Size
create_data.sh	application/x-shellscript	937 bytes
trgm_regex_test.pl	application/x-perl	7.9 KB

From:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>, Tomas Vondra <tv(at)fuzzy(dot)cz>, Erik Rijkers <er(at)xs4all(dot)nl>, pgsql-hackers(at)postgresql(dot)org, pavel(dot)stehule(at)gmail(dot)com
Subject:	Re: WIP: index support for regexp search
Date:	2012-12-03 10:05:16
Message-ID:	50BC795C.5000509@vmware.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 02.12.2012 20:19, Tom Lane wrote:
> Alexander Korotkov<aekorotkov(at)gmail(dot)com> writes:
>> Nice idea to delay expanding colors to characters! Obviously, we should
>> delay expanding inly alphanumerical characters. Because non-alphanumberical
>> characters influence graph structure. Trying to implement...
>
> Uh, why would that be? Colors are colors. The regexp machinery doesn't
> care whether they represent alphanumerics or not. (Or to be more
> precise, if there is a situation where it makes a difference, separate
> colors will have been created for each set of characters that need to be
> distinguished.)

The regexp machinery doesn't care, but the trigrams that pg_trgm
extracts only contain alphanumeric characters. So if by looking at the
CNFA graph produced by the regexp machinery you conclude that any
matching strings must contain three-letter sequences "%oo" and "#oo",
you can just club them together into " oo" trigram.

I think you can run a pre-processing step to the colors, and merge
colors that are equivalent as far as trigrams are considered. For
example, if you have a color that contains only character '%', and
another that contains character '#', you can treat them as the same
hcolor. You might then be able to simplify the CNFA. Actually, it would
be even better if you could apply the pre-processing to the regexp
before the regexp machinery turns it into a CNFA. Not sure how easy it
would be to do such pre-processing.

BTW, why create the path matrix? You could check the "check" array of
trigrams in the consistent function directly against the graph.
Consistent should return true, iff there is a path through the graph
following only arcs that contain trigrams present in the check array.
Finding a path through a complex graph could be expensive, O(|E|), but
if the path is complex, the path matrix would be large as well, and
checking against a large matrix isn't exactly free either. It would
allow you to avoid "overflows" caused by having too many paths through
the graph.

- Heikki

From:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Tomas Vondra <tv(at)fuzzy(dot)cz>, Erik Rijkers <er(at)xs4all(dot)nl>, pgsql-hackers(at)postgresql(dot)org, pavel(dot)stehule(at)gmail(dot)com
Subject:	Re: WIP: index support for regexp search
Date:	2012-12-03 12:31:31
Message-ID:	CAPpHfdshP6ituomYi0VUvEiN6pWsMcm+XztTjo7TeP1wUCXuVg@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Mon, Dec 3, 2012 at 2:05 PM, Heikki Linnakangas
<hlinnakangas(at)vmware(dot)com>wrote:

> On 02.12.2012 20:19, Tom Lane wrote:
>
>> Alexander Korotkov<aekorotkov(at)gmail(dot)com> writes:
>>
>>> Nice idea to delay expanding colors to characters! Obviously, we should
>>> delay expanding inly alphanumerical characters. Because
>>> non-alphanumberical
>>> characters influence graph structure. Trying to implement...
>>>
>>
>> Uh, why would that be? Colors are colors. The regexp machinery doesn't
>> care whether they represent alphanumerics or not. (Or to be more
>> precise, if there is a situation where it makes a difference, separate
>> colors will have been created for each set of characters that need to be
>> distinguished.)
>>
>
> The regexp machinery doesn't care, but the trigrams that pg_trgm extracts
> only contain alphanumeric characters. So if by looking at the CNFA graph
> produced by the regexp machinery you conclude that any matching strings
> must contain three-letter sequences "%oo" and "#oo", you can just club them
> together into " oo" trigram.
>
> I think you can run a pre-processing step to the colors, and merge colors
> that are equivalent as far as trigrams are considered. For example, if you
> have a color that contains only character '%', and another that contains
> character '#', you can treat them as the same hcolor. You might then be
> able to simplify the CNFA. Actually, it would be even better if you could
> apply the pre-processing to the regexp before the regexp machinery turns it
> into a CNFA. Not sure how easy it would be to do such pre-processing.
>

Treating colors as same should be possible only for colors which has no
alphanumeric characters, because colors are non-overlapping. However, this
optimization could be significant in some cases.

BTW, why create the path matrix? You could check the "check" array of
> trigrams in the consistent function directly against the graph. Consistent
> should return true, iff there is a path through the graph following only
> arcs that contain trigrams present in the check array. Finding a path
> through a complex graph could be expensive, O(|E|), but if the path is
> complex, the path matrix would be large as well, and checking against a
> large matrix isn't exactly free either. It would allow you to avoid
> "overflows" caused by having too many paths through the graph.

Actually, I generally dislike path matrix for same reasons. But:
1) Output graphs could contain trigrams which are completely useless for
search. For example, for regex /(abcdefgh)*ijk/ we need only "ijk" trigram
while graph would contain much more.Path matrix is a method to get rid of
all of them.
2) If we use color trigrams then we need some criteria for which color
trigrams to expand into trigrams. Simultaneously, we shouldn't allow path
from initial state to the final by unexpanded trigrams. It seems much
harder to do with graph than with matrix.

------
With best regards,
Alexander Korotkov.

From:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Tomas Vondra <tv(at)fuzzy(dot)cz>, Erik Rijkers <er(at)xs4all(dot)nl>, pgsql-hackers(at)postgresql(dot)org, pavel(dot)stehule(at)gmail(dot)com
Subject:	Re: WIP: index support for regexp search
Date:	2012-12-13 21:34:01
Message-ID:	CAPpHfduFOqnVBXxNtT3SRSyy0ju4WKL6OY9sYqY6HQF+W0gGag@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Mon, Dec 3, 2012 at 4:31 PM, Alexander Korotkov <aekorotkov(at)gmail(dot)com>wrote:

> Actually, I generally dislike path matrix for same reasons. But:
> 1) Output graphs could contain trigrams which are completely useless for
> search. For example, for regex /(abcdefgh)*ijk/ we need only "ijk" trigram
> while graph would contain much more.Path matrix is a method to get rid of
> all of them.
> 2) If we use color trigrams then we need some criteria for which color
> trigrams to expand into trigrams. Simultaneously, we shouldn't allow path
> from initial state to the final by unexpanded trigrams. It seems much
> harder to do with graph than with matrix.
>

Now, I have an idea about doing some not comprehensive but simple and fast
simplification of graph. I'm doing experiments now. In case of success we
could get rid of path matrix.

------
With best regards,
Alexander Korotkov.

From:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Tomas Vondra <tv(at)fuzzy(dot)cz>, Erik Rijkers <er(at)xs4all(dot)nl>, pgsql-hackers(at)postgresql(dot)org, pavel(dot)stehule(at)gmail(dot)com
Subject:	Re: WIP: index support for regexp search
Date:	2012-12-16 21:25:06
Message-ID:	CAPpHfdsi5jUP_fHhfhtPt4VGagT+oSRnwxbpeb7tFLPj+ew_1Q@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Fri, Dec 14, 2012 at 1:34 AM, Alexander Korotkov <aekorotkov(at)gmail(dot)com>wrote:

> On Mon, Dec 3, 2012 at 4:31 PM, Alexander Korotkov <aekorotkov(at)gmail(dot)com>wrote:
>
>> Actually, I generally dislike path matrix for same reasons. But:
>> 1) Output graphs could contain trigrams which are completely useless for
>> search. For example, for regex /(abcdefgh)*ijk/ we need only "ijk" trigram
>> while graph would contain much more.Path matrix is a method to get rid of
>> all of them.
>> 2) If we use color trigrams then we need some criteria for which color
>> trigrams to expand into trigrams. Simultaneously, we shouldn't allow path
>> from initial state to the final by unexpanded trigrams. It seems much
>> harder to do with graph than with matrix.
>>
>
> Now, I have an idea about doing some not comprehensive but simple and fast
> simplification of graph. I'm doing experiments now. In case of success we
> could get rid of path matrix.
>

Attached patch have following changes:
1) Postphone expansion of colors. Graph are building on color trigrams.
2) Selective expansion of color trigrams into simple trigrams. All
non-expanded color trigrams are removed. Such removal leads to union of all
states pairs connected with corresponding arcs. Surely, this must no lead
to union of initial and final states: that could do all previous work
senseless.

------
With best regards,
Alexander Korotkov.

Attachment	Content-Type	Size
trgm-regexp-0.8.patch.gz	application/x-gzip	16.2 KB

From:	"Erik Rijkers" <er(at)xs4all(dot)nl>
To:	"Alexander Korotkov" <aekorotkov(at)gmail(dot)com>
Cc:	"Heikki Linnakangas" <hlinnakangas(at)vmware(dot)com>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Tomas Vondra" <tv(at)fuzzy(dot)cz>, pgsql-hackers(at)postgresql(dot)org, pavel(dot)stehule(at)gmail(dot)com
Subject:	Re: WIP: index support for regexp search
Date:	2012-12-17 08:54:59
Message-ID:	0722c2960fc45b455cd5ecf30069215e.squirrel@webmail.xs4all.nl
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Sun, December 16, 2012 22:25, Alexander Korotkov wrote:

> trgm-regexp-0.8.patch.gz 22 k

Hi Alexander,

I gave this a quick try; the patch works when compiled for DEBUG, but crashes as a
'speed'-compiled binary:

Compile for speed:

$ pg_config --configure
'--prefix=/home/aardvark/pg_stuff/pg_installations/pgsql.trgm_regex8' '--with-pgport=6556'
'--enable-depend' '--with-openssl' '--with-perl' '--with-libxml'

$ psql
psql (9.3devel-trgm_regex8-20121216_2336-c299477229559d4ee7db68720d86d3fb391db761)
Type "help" for help.

testdb=# explain analyze select txt from azjunk5 where txt ~ 'x[aeiouy]{2,5}q';
The connection to the server was lost. Attempting reset: Failed.
!> \q

log after such:
-----------------
2012-12-17 09:31:02.337 CET 15801 LOG: server process (PID 16903) was terminated by signal 11:
Segmentation fault
2012-12-17 09:31:02.337 CET 15801 DETAIL: Failed process was running: explain analyze select txt
from azjunk5 where txt ~ 'x[aeiouy]{2,5}q';
2012-12-17 09:31:02.347 CET 15801 LOG: terminating any other active server processes
2012-12-17 09:31:02.348 CET 17049 FATAL: the database system is in recovery mode
2012-12-17 09:31:02.722 CET 15801 LOG: all server processes terminated; reinitializing
2012-12-17 09:31:03.506 CET 17052 LOG: database system was interrupted; last known up at
2012-12-17 09:26:00 CET
2012-12-17 09:31:03.693 CET 17052 LOG: database system was not properly shut down; automatic
recovery in progress
2012-12-17 09:31:04.493 CET 17052 LOG: record with zero length at 2/7E3C7588
2012-12-17 09:31:04.494 CET 17052 LOG: redo is not required
2012-12-17 09:31:06.940 CET 15801 LOG: database system is ready to accept connections
----------------

A debug-compile with below options runs OK (so far):

Compile for debug:

$ pg_config --configure
'--prefix=/home/aardvark/pg_stuff/pg_installations/pgsql.trgm_regex8b' '--with-pgport=6560'
'--enable-depend' '--enable-cassert' '--enable-debug' '--with-openssl' '--with-perl'
'--with-libxml'

which does show some speed gain, I think. When I have time I'll post comparisons between HEAD,
versions 6, 7, 8. (BTW, is v6 still interesting?)

Thanks,

Erik Rijkers

From:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To:	Erik Rijkers <er(at)xs4all(dot)nl>
Cc:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Tomas Vondra <tv(at)fuzzy(dot)cz>, pgsql-hackers(at)postgresql(dot)org, pavel(dot)stehule(at)gmail(dot)com
Subject:	Re: WIP: index support for regexp search
Date:	2012-12-17 09:16:42
Message-ID:	CAPpHfdsmkUkrXX5akux6W+CxYD4BLaWgbgkXvc=_L+X0-jZLHQ@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Hi!

On Mon, Dec 17, 2012 at 12:54 PM, Erik Rijkers <er(at)xs4all(dot)nl> wrote:

> On Sun, December 16, 2012 22:25, Alexander Korotkov wrote:
>
> > trgm-regexp-0.8.patch.gz 22 k
>
> Hi Alexander,
>
> I gave this a quick try; the patch works when compiled for DEBUG, but
> crashes as a
> 'speed'-compiled binary:
>
> Compile for speed:
>
> $ pg_config --configure
> '--prefix=/home/aardvark/pg_stuff/pg_installations/pgsql.trgm_regex8'
> '--with-pgport=6556'
> '--enable-depend' '--with-openssl' '--with-perl' '--with-libxml'
>
> $ psql
> psql
> (9.3devel-trgm_regex8-20121216_2336-c299477229559d4ee7db68720d86d3fb391db761)
> Type "help" for help.
>
> testdb=# explain analyze select txt from azjunk5 where txt ~
> 'x[aeiouy]{2,5}q';
> The connection to the server was lost. Attempting reset: Failed.
> !> \q
>

Didn't reproduce it yet. Can you retry it with this line uncommented:
#define TRGM_REGEXP_DEBUG
Then we can see which stage it fails.

------
With best regards,
Alexander Korotkov.

From:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To:	Erik Rijkers <er(at)xs4all(dot)nl>
Cc:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Tomas Vondra <tv(at)fuzzy(dot)cz>, pgsql-hackers(at)postgresql(dot)org, pavel(dot)stehule(at)gmail(dot)com
Subject:	Re: WIP: index support for regexp search
Date:	2012-12-18 07:04:49
Message-ID:	CAPpHfdvkBnszcO1vXCYn6J0Za1OxNBbaLS5Mps+sRwt_AkKYhg@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Mon, Dec 17, 2012 at 1:16 PM, Alexander Korotkov <aekorotkov(at)gmail(dot)com>wrote:

> Didn't reproduce it yet. Can you retry it with this line uncommented:
> #define TRGM_REGEXP_DEBUG
> Then we can see which stage it fails.
>

Bug is found and fixed in attached patch.

------
With best regards,
Alexander Korotkov.

Attachment	Content-Type	Size
trgm-regexp-0.9.patch.gz	application/x-gzip	16.2 KB

From:	"Erik Rijkers" <er(at)xs4all(dot)nl>
To:	"Alexander Korotkov" <aekorotkov(at)gmail(dot)com>
Cc:	"Heikki Linnakangas" <hlinnakangas(at)vmware(dot)com>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Tomas Vondra" <tv(at)fuzzy(dot)cz>, pgsql-hackers(at)postgresql(dot)org, pavel(dot)stehule(at)gmail(dot)com
Subject:	Re: WIP: index support for regexp search
Date:	2012-12-18 07:45:32
Message-ID:	03adf596c63967f0b5c229084a8c4d42.squirrel@webmail.xs4all.nl
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Tue, December 18, 2012 08:04, Alexander Korotkov wrote:

> trgm-regexp-0.9.patch.gz 22 k

Hi.

I ran the same test again: HEAD versus trgm_regex v6, 7 and 9. In v9 there is some gain but also
some regression.

It remains a difficult problem...

If I get some time in the holidays I'll try to diversify the test program; it is now too simple.

Thanks,

Erik Rijkers

Attachment	Content-Type	Size
trgmre_6_7_9.txt.gz	application/x-gzip	15.1 KB

From:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To:	Erik Rijkers <er(at)xs4all(dot)nl>
Cc:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Tomas Vondra <tv(at)fuzzy(dot)cz>, pgsql-hackers(at)postgresql(dot)org, pavel(dot)stehule(at)gmail(dot)com
Subject:	Re: WIP: index support for regexp search
Date:	2012-12-18 08:45:27
Message-ID:	CAPpHfdsLXeLT5bt+mLf3gDwmjojSk9U0fepLr8dPrsLPKOwuzA@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Tue, Dec 18, 2012 at 11:45 AM, Erik Rijkers <er(at)xs4all(dot)nl> wrote:

> On Tue, December 18, 2012 08:04, Alexander Korotkov wrote:
> I ran the same test again: HEAD versus trgm_regex v6, 7 and 9. In v9
> there is some gain but also
> some regression.
>
> It remains a difficult problem...
>
> If I get some time in the holidays I'll try to diversify the test program;
> it is now too simple.
>

Note, that regexes which contains {,n} are likely not what do you expect.

test=# select 'xq' ~ 'x[aeiou]{,2}q';
?column?
----------
f
(1 row)

test=# select 'xa{,2}q' ~ 'x[aeiou]{,2}q';
?column?
----------
t
(1 row)

You should use {0,n} to express from 0 to n occurences.

------
With best regards,
Alexander Korotkov.

From:	"Erik Rijkers" <er(at)xs4all(dot)nl>
To:	"Alexander Korotkov" <aekorotkov(at)gmail(dot)com>
Cc:	"Heikki Linnakangas" <hlinnakangas(at)vmware(dot)com>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Tomas Vondra" <tv(at)fuzzy(dot)cz>, pgsql-hackers(at)postgresql(dot)org, pavel(dot)stehule(at)gmail(dot)com
Subject:	Re: WIP: index support for regexp search
Date:	2012-12-18 08:51:45
Message-ID:	e490ecadc15af75e0edc6a1d02db4d80.squirrel@webmail.xs4all.nl
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Tue, December 18, 2012 09:45, Alexander Korotkov wrote:
>
> You should use {0,n} to express from 0 to n occurences.
>

Thanks, but I know that of course. It's a testing program; and in the end robustness with
unexpected or even wrong input is as important as performance. (to put it bluntly, I am also
trying to get your patch to fall over ;-))

Thanks,

Erik Rijkers

From:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To:	Erik Rijkers <er(at)xs4all(dot)nl>
Cc:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Tomas Vondra <tv(at)fuzzy(dot)cz>, pgsql-hackers(at)postgresql(dot)org, pavel(dot)stehule(at)gmail(dot)com
Subject:	Re: WIP: index support for regexp search
Date:	2012-12-18 09:10:00
Message-ID:	CAPpHfdswF+FHrNtBnCgbwf-hxLUcMssf=L7cX_CYAq5ncUsrPA@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Tue, Dec 18, 2012 at 12:51 PM, Erik Rijkers <er(at)xs4all(dot)nl> wrote:

> On Tue, December 18, 2012 09:45, Alexander Korotkov wrote:
> >
> > You should use {0,n} to express from 0 to n occurences.
> >
>
>
> Thanks, but I know that of course. It's a testing program; and in the end
> robustness with
> unexpected or even wrong input is as important as performance. (to put it
> bluntly, I am also
> trying to get your patch to fall over ;-))
>

I found most of regressions in 0.9 version to be in {,n} cases. New version
of patch use more of trigrams than previous versions.
For example for regex 'x[aeiou]{,2}q'.
In 0.7 version we use trigrams '__2', '_2_' and '__q'.
In 0.9 version we use trigrams 'xa_', 'xe_', 'xi_', 'xo_', 'xu_', '__2',
'_2_' and '__q'.

But, actually trigram '__2' or '_2_' never occurs. It enough to have one of
them, all others are just causing a slowdown. Simultaneously, we can't
decide reasonably which trigrams to use without knowing their frequencies.
For example, if trigrams 'xa_', 'xe_', 'xi_', 'xo_', 'xu_' were altogether
more rare than '__2', newer version of patch would be faster.

------
With best regards,
Alexander Korotkov.

From:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc:	Erik Rijkers <er(at)xs4all(dot)nl>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Tomas Vondra <tv(at)fuzzy(dot)cz>, pgsql-hackers(at)postgresql(dot)org, pavel(dot)stehule(at)gmail(dot)com
Subject:	Re: WIP: index support for regexp search
Date:	2013-01-22 19:37:38
Message-ID:	50FEEA82.3010308@vmware.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 18.12.2012 09:04, Alexander Korotkov wrote:
> Bug is found and fixed in attached patch.

I finally got around to look at this. I like this new version, without
the path matrix, much better.

I spend quite some time honing the code and comments, trying to organize
it so that it's easier to understand. In particular, I divided the
processing more clearly into four separate stages, and added comments
indicating which functions and which fields in the structs are needed in
which state.

I understand the other stages fairly well now, but the transformation
from the source CNFA form into the transformed graph is still a black
box to me. The addKeys/addArcs functions still need more explanation.
Can you come up with some extra comments or refactoring to clarify those?

I'd like to see a few more regression test cases, to cover the various
overflow cases. In particular, I built with --enable-coverage and ran
"make installcheck", and it looks like the state merging code isn't
exercised at all. Report attached.

To visualize the graphs, I rewrote the debugging print* functions to
write the graphs in graphviz .dot format. That helped a lot. See
attached graphs, generated from the regexp '^(abc|def)(ghi|jk[lmn])$'.

There's still a lot of cleanup to do, I'm going to continue working on
this tomorrow, but wanted to shared what I have this far.

- Heikki

Attachment	Content-Type	Size
trgm-regexp-0.9-heikki-1.patch	text/x-diff	61.8 KB
trgm-regexp-lcov-report.tar.gz	application/x-gzip	68.2 KB
packed.png	image/png	77.9 KB
	image/png	51.1 KB
	image/png	44.1 KB

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>, Erik Rijkers <er(at)xs4all(dot)nl>, Tomas Vondra <tv(at)fuzzy(dot)cz>, pgsql-hackers(at)postgresql(dot)org, pavel(dot)stehule(at)gmail(dot)com
Subject:	Re: WIP: index support for regexp search
Date:	2013-01-23 02:08:58
Message-ID:	2459.1358906938@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Heikki Linnakangas <hlinnakangas(at)vmware(dot)com> writes:
> I finally got around to look at this. I like this new version, without
> the path matrix, much better.

I looked through this version too. I have some notes/issues:

The biggest problem is that I really don't care for the idea of
contrib/pg_trgm being this cozy with the innards of regex_t. Sooner
or later we are going to want to split the regex code off again as a
standalone library, and we *must* have a cleaner division of labor if
that is ever to happen. Not sure what a suitable API would look like
though.

I think the assumption that all MB characters fit in 4 bytes is
unacceptable; someday we'll want to support wider Unicode characters
than we do now, and this code seems utterly unable to handle it. It's
especially bad that the code isn't even bothering to defend itself
against the possibility of wider characters.

Can't just modify pg_trgm--1.0.sql in place, must create a "1.1" version
and an upgrade script.

Comments and documentation still need a lot of copy-editing, also I
think a lot of the comment blocks will not survive pg_indent. It'd be
a good idea to run trgm_regexp.c through pg_indent as soon as you have
that fixed.

New file trgm_regexp.c lacks a copyright notice

Calling RE_compile_and_cache with DEFAULT_COLLATION_OID is not good
enough; need to pass through the actual collation for the regex
operator.

How deep can the recursion in activateState() go? Is this a practical
production approach at all? Do we need a stack depth check there?
addKeys is also recursive, same questions. (But on the other hand, the
check_stack_depth in scanColorMap seems useless, since its recursion
depth is fixed.)

Not too happy with convertPgWchar: aside from hard-wired, unchecked
assumption about maximum length of pg_wchar2mb_with_len result, why is
it that this is doing a lowercase conversion? Surely the regex stuff
dealt with that already?

I'm suspicious of the code in addKeys() that is special-casing bos[1]
and eos[1] but not the other cases (BOL/EOL). My recollection from
working on the fixed-prefix stuff is that it's not always obvious which
of those states gets used.

strnlen() used in fillTrgm() is probably not portable, and is definitely
not used anywhere else in Postgres.

This isn't a complete review, just some stuff I happened to notice in
one quick pass over the code.

regards, tom lane

From:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Erik Rijkers <er(at)xs4all(dot)nl>, Tomas Vondra <tv(at)fuzzy(dot)cz>, pgsql-hackers(at)postgresql(dot)org, pavel(dot)stehule(at)gmail(dot)com
Subject:	Re: WIP: index support for regexp search
Date:	2013-01-23 07:36:32
Message-ID:	CAPpHfdtB1mvdb8QTv4RZKWWuWEHHeNyntkXQGVWasJ7YqwwfRw@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Hi!

Some quick answers to the part of notes/issues. I will provide rest of
answers soon.

On Wed, Jan 23, 2013 at 6:08 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> The biggest problem is that I really don't care for the idea of
> contrib/pg_trgm being this cozy with the innards of regex_t. Sooner
> or later we are going to want to split the regex code off again as a
> standalone library, and we *must* have a cleaner division of labor if
> that is ever to happen. Not sure what a suitable API would look like
> though.
>

The only option I see now is to provide a method like "export_cnfa" which
would export corresponding CNFA in fixed format.

> I think the assumption that all MB characters fit in 4 bytes is
> unacceptable; someday we'll want to support wider Unicode characters
> than we do now, and this code seems utterly unable to handle it. It's
> especially bad that the code isn't even bothering to defend itself
> against the possibility of wider characters.
>

In attached patch I introduce MAX_MULTIBYTE_CHARACTER_LENGTH macro and use
it in type definition. Is this way ok?

> Can't just modify pg_trgm--1.0.sql in place, must create a "1.1" version
> and an upgrade script.
>

Fixed.

> Comments and documentation still need a lot of copy-editing, also I
> think a lot of the comment blocks will not survive pg_indent. It'd be
> a good idea to run trgm_regexp.c through pg_indent as soon as you have
> that fixed.
>

Fixed.

> New file trgm_regexp.c lacks a copyright notice
>

Fixed.

Calling RE_compile_and_cache with DEFAULT_COLLATION_OID is not good
> enough; need to pass through the actual collation for the regex
> operator.
>

We have collation passed to gin_extract_query in ginscan.c. I noticed
that gincost_pattern don't pass collation to gin_extract_query. Is it a
bug? Anyway this is not collation we need. We need collation used in
operator clause. In attached patch I introduce additional argument to
gin_extract_query which represent collation of operator clause. Do you
think it is reasonable change in GIN interface? If so, I will provide it as
separate patch.

> Not too happy with convertPgWchar: aside from hard-wired, unchecked
> assumption about maximum length of pg_wchar2mb_with_len result, why is
> it that this is doing a lowercase conversion? Surely the regex stuff
> dealt with that already?
>

Trigrams are already lowercased. We can simplify our calculations by
excluding uppercased characters from consideration.

------
With best regards,
Alexander Korotkov.

Attachment	Content-Type	Size
trgm-regexp-0.10.patch.gz	application/x-gzip	20.0 KB

From:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Erik Rijkers <er(at)xs4all(dot)nl>, Tomas Vondra <tv(at)fuzzy(dot)cz>, pgsql-hackers(at)postgresql(dot)org, pavel(dot)stehule(at)gmail(dot)com
Subject:	Re: WIP: index support for regexp search
Date:	2013-01-23 07:59:58
Message-ID:	50FF987E.507@vmware.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 23.01.2013 09:36, Alexander Korotkov wrote:
> Hi!
>
> Some quick answers to the part of notes/issues. I will provide rest of
> answers soon.
>
> On Wed, Jan 23, 2013 at 6:08 AM, Tom Lane<tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>
>> The biggest problem is that I really don't care for the idea of
>> contrib/pg_trgm being this cozy with the innards of regex_t. Sooner
>> or later we are going to want to split the regex code off again as a
>> standalone library, and we *must* have a cleaner division of labor if
>> that is ever to happen. Not sure what a suitable API would look like
>> though.
>
> The only option I see now is to provide a method like "export_cnfa" which
> would export corresponding CNFA in fixed format.

Yeah, I think that makes sense. The transformation code in trgm_regexp.c
would probably be more readable too, if it didn't have to deal with the
regex guts representation of the CNFA. Also, once you have intermediate
representation of the original CNFA, you could do some of the
transformation work on that representation, before building the
"tranformed graph" containing trigrams. You could eliminate any
non-alphanumeric characters, joining states connected by arcs with
non-alphanumeric characters, for example.

- Heikki

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>, Erik Rijkers <er(at)xs4all(dot)nl>, Tomas Vondra <tv(at)fuzzy(dot)cz>, pgsql-hackers(at)postgresql(dot)org, pavel(dot)stehule(at)gmail(dot)com
Subject:	Re: WIP: index support for regexp search
Date:	2013-01-23 15:29:56
Message-ID:	19081.1358954996@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Heikki Linnakangas <hlinnakangas(at)vmware(dot)com> writes:
> On 23.01.2013 09:36, Alexander Korotkov wrote:
>> On Wed, Jan 23, 2013 at 6:08 AM, Tom Lane<tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>>> The biggest problem is that I really don't care for the idea of
>>> contrib/pg_trgm being this cozy with the innards of regex_t.

>> The only option I see now is to provide a method like "export_cnfa" which
>> would export corresponding CNFA in fixed format.

> Yeah, I think that makes sense. The transformation code in trgm_regexp.c
> would probably be more readable too, if it didn't have to deal with the
> regex guts representation of the CNFA. Also, once you have intermediate
> representation of the original CNFA, you could do some of the
> transformation work on that representation, before building the
> "tranformed graph" containing trigrams. You could eliminate any
> non-alphanumeric characters, joining states connected by arcs with
> non-alphanumeric characters, for example.

It's not just the CNFA though; the other big API problem is with mapping
colors back to characters. Right now, that not only knows way too much
about a part of the regex internals we have ambitions to change soon,
but it also requires pg_wchar2mb_with_len() and lowerstr(), neither of
which should be known to the regex library IMO. So I'm not sure how we
divvy that up sanely. To be clear: I'm not going to insist that we have
to have a clean API factorization before we commit this at all. But it
worries me if we don't even know how we could get to that, because we
are going to need it eventually.

Anyway, I had another thought in the shower this morning, which is that
solving this problem in terms of color trigrams is really the Wrong
Thing. ISTM it'd be better to think of the CNFA traversal logic as
looking for required sequences of colors of unspecified length, which
we'd then chop into trigrams after the fact. This might produce
slightly better (more complete) trigram sets, but the real reason I'm
suggesting it is that I think the CNFA traversal code might be subject
to Polya's Inventor's Paradox: "the more general problem may be easier
to solve". It seems like casting the goal of that code as being to
find variable-length sequences, rather than exactly trigrams, might
lead to simpler data structures and more readable algorithms. The
still-to-be-designed regex API for this also seems like it would be
better off if decoupled from the notion of trigrams.

It's quite possible that this idea is all wet and no meaningful
improvement can be gotten this way. But I offer it for your
consideration.

regards, tom lane

From:	"Erik Rijkers" <er(at)xs4all(dot)nl>
To:	"Alexander Korotkov" <aekorotkov(at)gmail(dot)com>
Cc:	"Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Heikki Linnakangas" <hlinnakangas(at)vmware(dot)com>, "Tomas Vondra" <tv(at)fuzzy(dot)cz>, pgsql-hackers(at)postgresql(dot)org, pavel(dot)stehule(at)gmail(dot)com
Subject:	Re: WIP: index support for regexp search
Date:	2013-01-25 07:47:08
Message-ID:	b4bce89ef38fe5bd4b191e677aa6bd1a.squirrel@webmail.xs4all.nl
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Wed, January 23, 2013 08:36, Alexander Korotkov wrote:
> Hi!
>
> Some quick answers to the part of notes/issues. I will provide rest of
> answers soon.
>
[...]
> trgm-regexp-0.10.patch.gz 27 k

Trying to build this I get, after 'make install' in contrib/ :

/usr/bin/install: cannot stat `./pg_trgm--1.1.sql': No such file or directory
/usr/bin/install: cannot stat `./pg_trgm--1.0--1.1.sql': No such file or directory
make[1]: *** [install] Error 1
make: *** [install-pg_trgm-recurse] Error 2

From:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To:	Erik Rijkers <er(at)xs4all(dot)nl>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Tomas Vondra <tv(at)fuzzy(dot)cz>, pgsql-hackers(at)postgresql(dot)org, pavel(dot)stehule(at)gmail(dot)com
Subject:	Re: WIP: index support for regexp search
Date:	2013-01-27 18:40:01
Message-ID:	CAPpHfduygeo1XrKthxUq1VASJdp26V8X=BgPKSvN70bxGe8QeQ@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Fri, Jan 25, 2013 at 11:47 AM, Erik Rijkers <er(at)xs4all(dot)nl> wrote:

> On Wed, January 23, 2013 08:36, Alexander Korotkov wrote:
> > Hi!
> >
> > Some quick answers to the part of notes/issues. I will provide rest of
> > answers soon.
> >
> [...]
> > trgm-regexp-0.10.patch.gz 27 k
>
> Trying to build this I get, after 'make install' in contrib/ :
>
> /usr/bin/install: cannot stat `./pg_trgm--1.1.sql': No such file or
> directory
> /usr/bin/install: cannot stat `./pg_trgm--1.0--1.1.sql': No such file or
> directory
> make[1]: *** [install] Error 1
> make: *** [install-pg_trgm-recurse] Error 2
>

Forgot to include these files into patch.

Another changes in new version of patch:
1) Get rid of recursion in trigramsMatchGraph and addKeys.
2) Both bos[0] and eos[0] are also included into checks.
3) Get rid of strnlen.
4) I found the way to fix work with collation in previous version of patch
to be wrong. Collation of operator must match collation of indexed column
for index scan. Only thing to fix is passing collation in gincost_pattern.
5) Some tests were added which improves the coverage.

Now I'm working on additional comments.

------
With best regards,
Alexander Korotkov.

Attachment	Content-Type	Size
trgm-regexp-0.11.patch.gz	application/x-gzip	20.2 KB

From:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Erik Rijkers <er(at)xs4all(dot)nl>, Tomas Vondra <tv(at)fuzzy(dot)cz>, pgsql-hackers(at)postgresql(dot)org, pavel(dot)stehule(at)gmail(dot)com
Subject:	Re: WIP: index support for regexp search
Date:	2013-01-28 17:37:50
Message-ID:	CAPpHfdtX_+vwYaJ89dCTQQJwajaqTLiEuaMCRKLFS9=88u1Rmw@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Sun, Jan 27, 2013 at 10:40 PM, Alexander Korotkov
<aekorotkov(at)gmail(dot)com>wrote:

> Now I'm working on additional comments.

Some comments were added for addKey and addArc(s). I hope they clarify
something.

------
With best regards,
Alexander Korotkov.

Attachment	Content-Type	Size
trgm-regexp-0.12.patch.gz	application/x-gzip	20.8 KB

From:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Erik Rijkers <er(at)xs4all(dot)nl>, Tomas Vondra <tv(at)fuzzy(dot)cz>, pgsql-hackers(at)postgresql(dot)org, pavel(dot)stehule(at)gmail(dot)com
Subject:	Re: WIP: index support for regexp search
Date:	2013-03-06 09:06:26
Message-ID:	CAPpHfdtkPgtDANjAXMnyAycpsahgGedQZ7VU+KfW6Y_5Jx1O=g@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Wed, Jan 23, 2013 at 7:29 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> Heikki Linnakangas <hlinnakangas(at)vmware(dot)com> writes:
> > On 23.01.2013 09:36, Alexander Korotkov wrote:
> >> On Wed, Jan 23, 2013 at 6:08 AM, Tom Lane<tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> >>> The biggest problem is that I really don't care for the idea of
> >>> contrib/pg_trgm being this cozy with the innards of regex_t.
>
> >> The only option I see now is to provide a method like "export_cnfa"
> which
> >> would export corresponding CNFA in fixed format.
>
> > Yeah, I think that makes sense. The transformation code in trgm_regexp.c
> > would probably be more readable too, if it didn't have to deal with the
> > regex guts representation of the CNFA. Also, once you have intermediate
> > representation of the original CNFA, you could do some of the
> > transformation work on that representation, before building the
> > "tranformed graph" containing trigrams. You could eliminate any
> > non-alphanumeric characters, joining states connected by arcs with
> > non-alphanumeric characters, for example.
>
> It's not just the CNFA though; the other big API problem is with mapping
> colors back to characters. Right now, that not only knows way too much
> about a part of the regex internals we have ambitions to change soon,
> but it also requires pg_wchar2mb_with_len() and lowerstr(), neither of
> which should be known to the regex library IMO. So I'm not sure how we
> divvy that up sanely. To be clear: I'm not going to insist that we have
> to have a clean API factorization before we commit this at all. But it
> worries me if we don't even know how we could get to that, because we
> are going to need it eventually.
>

Now, we probably don't have enough of time before 9.3 to solve an API
problem :(. It's likely we have to choose either commit to 9.3 without
clean API factorization or postpone it to 9.4.

------
With best regards,
Alexander Korotkov.

From:	Stephen Frost <sfrost(at)snowman(dot)net>
To:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Erik Rijkers <er(at)xs4all(dot)nl>, Tomas Vondra <tv(at)fuzzy(dot)cz>, pgsql-hackers(at)postgresql(dot)org, pavel(dot)stehule(at)gmail(dot)com
Subject:	Re: WIP: index support for regexp search
Date:	2013-03-06 13:17:03
Message-ID:	20130306131703.GW16142@tamriel.snowman.net
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

* Alexander Korotkov (aekorotkov(at)gmail(dot)com) wrote:
> Now, we probably don't have enough of time before 9.3 to solve an API
> problem :(. It's likely we have to choose either commit to 9.3 without
> clean API factorization or postpone it to 9.4.

As much as I'd like this to get in, I don't think there's a question
here- it should be punted to 9.4. These don't look like small issues to
me.

Thanks,

Stephen

From:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc:	Erik Rijkers <er(at)xs4all(dot)nl>, Tomas Vondra <tv(at)fuzzy(dot)cz>, pgsql-hackers(at)postgresql(dot)org, pavel(dot)stehule(at)gmail(dot)com
Subject:	Re: WIP: index support for regexp search
Date:	2013-03-14 17:40:06
Message-ID:	CAPpHfdsuW_87rN9BxZ70aP_TeK2esovFxS0b4ep--aoLP06rmw@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Wed, Jan 23, 2013 at 7:29 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

Now I have following idea about API.
Put code of stage 2 (transform the original CNFA into an automaton-like
graph) into regex engine. It would use API which describes what exactly are
we going to extract from CNFA. This API could look like this.

typedef char *Prefix;
typedef char *ArcLabel;

typedef struct
{
Prefix newPrefix;
ArcLabel label;
} ArcInfo;

typedef struct
{
Prefix (*getInitialPrefix) ();
bool (*prefixContains) (Prefix prefix1, Prefix prefix2);
Prefix * (*getPrefixes) (Prefix prefix, color c, int *n);
ArcInfo * (*getArcs) (Prefix prefix, color c, int *n);
void (*freePrefix) (Prefix prefix);
void (*freeArcLabel) (ArcLabel arcLabel);
} CFNATransformAPI;

getInitialPrefix returns initial prefix value like now this code does:
> initkey.prefix.colors[0] = UNKNOWN_COLOR;
> initkey.prefix.colors[1] = UNKNOWN_COLOR;
prefixContains are exactly same as function with this name.
getPrefixes and getArcs cycle step work of addKeys an addArcs.
freePrefix and freeArcLabel frees used memory of Prefix and ArcLabel
strutures.

Additionally regex engine should provide correct way to examine colormap.
int getColorCharsCount(colormap *cm, color c);
pg_wchar *getColorChars(colormap *cm, color c);
getColorCharsCount would return -1 if this color should be considered as
unexpandable.

Any thoughts?

------
With best regards,
Alexander Korotkov.

From:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc:	Erik Rijkers <er(at)xs4all(dot)nl>, Tomas Vondra <tv(at)fuzzy(dot)cz>, pgsql-hackers(at)postgresql(dot)org, pavel(dot)stehule(at)gmail(dot)com
Subject:	Re: WIP: index support for regexp search
Date:	2013-03-21 15:07:02
Message-ID:	CAPpHfdtTxBiwGX5CuwQn2NKQJTYLsCghQDAcUR8D-BF=_hoSMw@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Thu, Mar 14, 2013 at 9:40 PM, Alexander Korotkov <aekorotkov(at)gmail(dot)com>wrote:

> On Wed, Jan 23, 2013 at 7:29 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>
>> Heikki Linnakangas <hlinnakangas(at)vmware(dot)com> writes:
>> > On 23.01.2013 09:36, Alexander Korotkov wrote:
>> >> On Wed, Jan 23, 2013 at 6:08 AM, Tom Lane<tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> >>> The biggest problem is that I really don't care for the idea of
>> >>> contrib/pg_trgm being this cozy with the innards of regex_t.
>>
>> >> The only option I see now is to provide a method like "export_cnfa"
>> which
>> >> would export corresponding CNFA in fixed format.
>>
>> > Yeah, I think that makes sense. The transformation code in trgm_regexp.c
>> > would probably be more readable too, if it didn't have to deal with the
>> > regex guts representation of the CNFA. Also, once you have intermediate
>> > representation of the original CNFA, you could do some of the
>> > transformation work on that representation, before building the
>> > "tranformed graph" containing trigrams. You could eliminate any
>> > non-alphanumeric characters, joining states connected by arcs with
>> > non-alphanumeric characters, for example.
>>
>> It's not just the CNFA though; the other big API problem is with mapping
>> colors back to characters. Right now, that not only knows way too much
>> about a part of the regex internals we have ambitions to change soon,
>> but it also requires pg_wchar2mb_with_len() and lowerstr(), neither of
>> which should be known to the regex library IMO. So I'm not sure how we
>> divvy that up sanely. To be clear: I'm not going to insist that we have
>> to have a clean API factorization before we commit this at all. But it
>> worries me if we don't even know how we could get to that, because we
>> are going to need it eventually.
>>
>
> Now I have following idea about API.
> Put code of stage 2 (transform the original CNFA into an automaton-like
> graph) into regex engine. It would use API which describes what exactly are
> we going to extract from CNFA. This API could look like this.
>
> typedef char *Prefix;
> typedef char *ArcLabel;
>
> typedef struct
> {
> Prefix newPrefix;
> ArcLabel label;
> } ArcInfo;
>
> typedef struct
> {
> Prefix (*getInitialPrefix) ();
> bool (*prefixContains) (Prefix prefix1, Prefix prefix2);
> Prefix * (*getPrefixes) (Prefix prefix, color c, int *n);
> ArcInfo * (*getArcs) (Prefix prefix, color c, int *n);
> void (*freePrefix) (Prefix prefix);
> void (*freeArcLabel) (ArcLabel arcLabel);
> } CFNATransformAPI;
>
> getInitialPrefix returns initial prefix value like now this code does:
> > initkey.prefix.colors[0] = UNKNOWN_COLOR;
> > initkey.prefix.colors[1] = UNKNOWN_COLOR;
> prefixContains are exactly same as function with this name.
> getPrefixes and getArcs cycle step work of addKeys an addArcs.
> freePrefix and freeArcLabel frees used memory of Prefix and ArcLabel
> strutures.
>
> Additionally regex engine should provide correct way to examine colormap.
> int getColorCharsCount(colormap *cm, color c);
> pg_wchar *getColorChars(colormap *cm, color c);
> getColorCharsCount would return -1 if this color should be considered as
> unexpandable.
>

Now I have working implemetation of this API. Comments still need rework.
Could you give me any feedback?

------
With best regards,
Alexander Korotkov.

Attachment	Content-Type	Size
trgm-regexp-0.13.patch.gz	application/x-gzip	23.2 KB

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Erik Rijkers <er(at)xs4all(dot)nl>, Tomas Vondra <tv(at)fuzzy(dot)cz>, pgsql-hackers(at)postgresql(dot)org, pavel(dot)stehule(at)gmail(dot)com
Subject:	Re: WIP: index support for regexp search
Date:	2013-03-24 21:50:44
Message-ID:	5280.1364161844@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Alexander Korotkov <aekorotkov(at)gmail(dot)com> writes:
> Now I have working implemetation of this API. Comments still need rework.
> Could you give me any feedback?

I looked at this a little bit, but it's not very far along at all
towards resolving my API worries. The basic point that I'm concerned
about is that we would like to split off the regex library (ie,
backend/regex/) as a standalone project someday. There are already
some problems to be resolved to make that possible, and I don't want
this patch to introduce new ones. From that standpoint, the proposed
regextract.c file is a disaster, because it depends on a boatload of
Postgres-specific stuff (at least StringInfo, List, HTAB, and pg_wchar;
to say nothing of palloc). We can't consider that to be on the regex
side of the fence.

Similarly, pushing PG-specific declarations like RE_compile_and_cache()
into regex/regex.h is completely not the right thing for preserving a
clear library boundary (even positing that we want to expose that
function outside adt/regexp.c, which I'd rather we didn't).

Given that you've already got a notion of callbacks provided by
contrib/pg_trgm, perhaps this can be fixed by pushing more of the work
into those callbacks, so that the heavy-duty data structures like the
hash table live over there and the API exposed by backend/regex/ is at
a much simpler level than what you have here. But right now I don't
see any usable library API here.

Perhaps we could avoid these issues by defining a library API that
provides accessors like these for the opaque regex_t struct:

* get the number of states in the CNFA

* get the numbers of the initial and final states

* get the number of out-arcs for the N'th state

* get the out-arcs for the N'th state into a caller-provided array
(sized using the previous function), where each out-arc is represented
by a color and an end-state

* get the number of character codes represented by color C

* get the wchar codes for color C into a caller-provided array

(The reason for letting the caller allocate the result arrays is so we
can use palloc for that; if we allocate it in backend/regex/ we must
use malloc, which will greatly increase the risk of leakages. Also,
as far as the color API goes, the above lets the caller decide how
many characters is "too many" to bother with.)

regards, tom lane

From:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Erik Rijkers <er(at)xs4all(dot)nl>, Tomas Vondra <tv(at)fuzzy(dot)cz>, pgsql-hackers(at)postgresql(dot)org, pavel(dot)stehule(at)gmail(dot)com
Subject:	Re: WIP: index support for regexp search
Date:	2013-03-24 22:24:33
Message-ID:	CAPpHfdsRdG1M9HU1=71WM1eXYHGDXSooyuAKPXYZQmrfmnKcGA@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Mon, Mar 25, 2013 at 1:50 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> Alexander Korotkov <aekorotkov(at)gmail(dot)com> writes:
> > Now I have working implemetation of this API. Comments still need rework.
> > Could you give me any feedback?
>
> I looked at this a little bit, but it's not very far along at all
> towards resolving my API worries. The basic point that I'm concerned
> about is that we would like to split off the regex library (ie,
> backend/regex/) as a standalone project someday. There are already
> some problems to be resolved to make that possible, and I don't want
> this patch to introduce new ones. From that standpoint, the proposed
> regextract.c file is a disaster, because it depends on a boatload of
> Postgres-specific stuff (at least StringInfo, List, HTAB, and pg_wchar;
> to say nothing of palloc). We can't consider that to be on the regex
> side of the fence.
>

Now I can see your position about API much more clearly. Previously I
thought only about algorithmic side of things.

Similarly, pushing PG-specific declarations like RE_compile_and_cache()
> into regex/regex.h is completely not the right thing for preserving a
> clear library boundary (even positing that we want to expose that
> function outside adt/regexp.c, which I'd rather we didn't).
>
> Given that you've already got a notion of callbacks provided by
> contrib/pg_trgm, perhaps this can be fixed by pushing more of the work
> into those callbacks, so that the heavy-duty data structures like the
> hash table live over there and the API exposed by backend/regex/ is at
> a much simpler level than what you have here. But right now I don't
> see any usable library API here.
>
> Perhaps we could avoid these issues by defining a library API that
> provides accessors like these for the opaque regex_t struct:
>
> * get the number of states in the CNFA
>
> * get the numbers of the initial and final states
>
> * get the number of out-arcs for the N'th state
>
> * get the out-arcs for the N'th state into a caller-provided array
> (sized using the previous function), where each out-arc is represented
> by a color and an end-state
>
> * get the number of character codes represented by color C
>
> * get the wchar codes for color C into a caller-provided array
>
> (The reason for letting the caller allocate the result arrays is so we
> can use palloc for that; if we allocate it in backend/regex/ we must
> use malloc, which will greatly increase the risk of leakages. Also,
> as far as the color API goes, the above lets the caller decide how
> many characters is "too many" to bother with.)
>

I like the this idea. Seems like clear and not over-engineered API. I can
implement it. Could you propose something particular to do with
RE_compile_and_cache in this patch? With this API we still need a way to
get regex_t or equivalent from string.

------
With best regards,
Alexander Korotkov.

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Erik Rijkers <er(at)xs4all(dot)nl>, Tomas Vondra <tv(at)fuzzy(dot)cz>, pgsql-hackers(at)postgresql(dot)org, pavel(dot)stehule(at)gmail(dot)com
Subject:	Re: WIP: index support for regexp search
Date:	2013-03-24 22:38:47
Message-ID:	6275.1364164727@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Alexander Korotkov <aekorotkov(at)gmail(dot)com> writes:
> On Mon, Mar 25, 2013 at 1:50 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> Similarly, pushing PG-specific declarations like RE_compile_and_cache()
>> into regex/regex.h is completely not the right thing for preserving a
>> clear library boundary (even positing that we want to expose that
>> function outside adt/regexp.c, which I'd rather we didn't).
>>
>> Perhaps we could avoid these issues by defining a library API that
>> provides accessors like these for the opaque regex_t struct:
>>
>> * get the number of states in the CNFA
>>
>> * get the numbers of the initial and final states
>>
>> * get the number of out-arcs for the N'th state
>>
>> * get the out-arcs for the N'th state into a caller-provided array
>> (sized using the previous function), where each out-arc is represented
>> by a color and an end-state
>>
>> * get the number of character codes represented by color C
>>
>> * get the wchar codes for color C into a caller-provided array
>>
>> (The reason for letting the caller allocate the result arrays is so we
>> can use palloc for that; if we allocate it in backend/regex/ we must
>> use malloc, which will greatly increase the risk of leakages. Also,
>> as far as the color API goes, the above lets the caller decide how
>> many characters is "too many" to bother with.)

> I like the this idea. Seems like clear and not over-engineered API. I can
> implement it. Could you propose something particular to do with
> RE_compile_and_cache in this patch? With this API we still need a way to
> get regex_t or equivalent from string.

Well, the brute force answer is for pg_trgm to not go through
RE_compile_and_cache at all, but just call the regex library for itself.
That would lose the ability to cache regex compilations, but I'm not
sure we care. The motivation for RE_compile_and_cache is mainly to
prevent having to compile the regex again for *every row*, which is what
would otherwise happen in a simple SELECT using a regex function.
Unless I'm misunderstanding something, pg_trgm would only need to
compile the regex once per indexscan, which is probably tolerable.

Another approach we could take is for adt/regexp.c to expose a function
that collects the deconstructed CNFA data into a palloc'd structure
using the above-proposed functions. This'd be a reasonable way to go if
we decide the caching behavior is actually needed for this purpose; but
otherwise it seems like it's just involving one more module than
necessary.

BTW, one reason I don't like exposing RE_compile_and_cache directly is
that then you have the question of how long is the regex_t it hands back
good for. The caller doesn't own that struct, and at some point
regexp.c is likely to recycle it, so it's a bit risky to just assume
it'll stay around "long enough" without any further interaction with
regexp.c.

regards, tom lane

From:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc:	Erik Rijkers <er(at)xs4all(dot)nl>, Tomas Vondra <tv(at)fuzzy(dot)cz>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Pavel Stěhule <pavel(dot)stehule(at)gmail(dot)com>
Subject:	Re: WIP: index support for regexp search
Date:	2013-04-01 21:15:27
Message-ID:	CAPpHfduK1u+ykcwKmjOTe30e6JmStMg6LUCmNttGinvYSQQ0RQ@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Mon, Mar 25, 2013 at 2:38 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> Alexander Korotkov <aekorotkov(at)gmail(dot)com> writes:
> > On Mon, Mar 25, 2013 at 1:50 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> >> Similarly, pushing PG-specific declarations like RE_compile_and_cache()
> >> into regex/regex.h is completely not the right thing for preserving a
> >> clear library boundary (even positing that we want to expose that
> >> function outside adt/regexp.c, which I'd rather we didn't).
> >>
> >> Perhaps we could avoid these issues by defining a library API that
> >> provides accessors like these for the opaque regex_t struct:
> >>
> >> * get the number of states in the CNFA
> >>
> >> * get the numbers of the initial and final states
> >>
> >> * get the number of out-arcs for the N'th state
> >>
> >> * get the out-arcs for the N'th state into a caller-provided array
> >> (sized using the previous function), where each out-arc is represented
> >> by a color and an end-state
> >>
> >> * get the number of character codes represented by color C
> >>
> >> * get the wchar codes for color C into a caller-provided array
> >>
> >> (The reason for letting the caller allocate the result arrays is so we
> >> can use palloc for that; if we allocate it in backend/regex/ we must
> >> use malloc, which will greatly increase the risk of leakages. Also,
> >> as far as the color API goes, the above lets the caller decide how
> >> many characters is "too many" to bother with.)
>
> > I like the this idea. Seems like clear and not over-engineered API. I can
> > implement it. Could you propose something particular to do with
> > RE_compile_and_cache in this patch? With this API we still need a way to
> > get regex_t or equivalent from string.
>
> Well, the brute force answer is for pg_trgm to not go through
> RE_compile_and_cache at all, but just call the regex library for itself.
> That would lose the ability to cache regex compilations, but I'm not
> sure we care. The motivation for RE_compile_and_cache is mainly to
> prevent having to compile the regex again for *every row*, which is what
> would otherwise happen in a simple SELECT using a regex function.
> Unless I'm misunderstanding something, pg_trgm would only need to
> compile the regex once per indexscan, which is probably tolerable.
>

For me it makes sense.
Patch with proposed API implementation is attached.

------
With best regards,
Alexander Korotkov.

Attachment	Content-Type	Size
trgm-regexp-0.14.patch.gz	application/x-gzip	23.1 KB

From:	"Erikjan Rijkers" <er(at)xs4all(dot)nl>
To:	"Alexander Korotkov" <aekorotkov(at)gmail(dot)com>
Cc:	"Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Heikki Linnakangas" <hlinnakangas(at)vmware(dot)com>, "Tomas Vondra" <tv(at)fuzzy(dot)cz>, "pgsql-hackers" <pgsql-hackers(at)postgresql(dot)org>, Pavel Stìhule <pavel(dot)stehule(at)gmail(dot)com>
Subject:	Re: WIP: index support for regexp search
Date:	2013-04-02 20:36:22
Message-ID:	b1711972ccf9c13d4c5a5c64e729cc92.squirrel@webmail.xs4all.nl
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Mon, April 1, 2013 23:15, Alexander Korotkov wrote:

[trgm-regexp-0.14.patch.gz]

Hi Alexander,

Something went wrong in this version of the patch: many (most) queries that were earlier
spectacularly fast have become slow, often slower than a seqscan or only marginally faster. See
the attached numbers; it compares head(seqscan) with trgm-regex patch versions 13 and 14.

I did not even complete the test-run because version 14 is so clearly inferior to 13 (and earlier,
as far as I can remember).

(let me know if you want the whole result, I can run it overnight)

Thanks,

Erik Rijkers

Attachment	Content-Type	Size
re-head-13-14-20130402-2219.txt.bz2	application/x-bzip2	10.5 KB

From:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To:	Erikjan Rijkers <er(at)xs4all(dot)nl>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Tomas Vondra <tv(at)fuzzy(dot)cz>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Pavel Stìhule <pavel(dot)stehule(at)gmail(dot)com>
Subject:	Re: WIP: index support for regexp search
Date:	2013-04-02 21:54:16
Message-ID:	CAPpHfdue_xaTx6q+a6x5Zks9Y6WyHPYAsk8ko810ANmt--DOcg@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Wed, Apr 3, 2013 at 12:36 AM, Erikjan Rijkers <er(at)xs4all(dot)nl> wrote:

> On Mon, April 1, 2013 23:15, Alexander Korotkov wrote:
>
> [trgm-regexp-0.14.patch.gz]
>
> Hi Alexander,
>

Hi Erik!

> Something went wrong in this version of the patch: many (most) queries
> that were earlier
> spectacularly fast have become slow, often slower than a seqscan or only
> marginally faster. See
> the attached numbers; it compares head(seqscan) with trgm-regex patch
> versions 13 and 14.
>
> I did not even complete the test-run because version 14 is so clearly
> inferior to 13 (and earlier,
> as far as I can remember).
>
> (let me know if you want the whole result, I can run it overnight)
>

Yes, there was bug in new color map scan implementation. It was fixed in
attached version. Could you re-run tests please?
Attached version of patch also addresses problem that we now using API it's
not correct to use special values of color number. It introduces additional
structure ExtendedColor which contain ExtendedColorClass field for handling
additional "virtual" colors introduced by analysis.

------
With best regards,
Alexander Korotkov.

Attachment	Content-Type	Size
trgm-regexp-0.15.patch.gz	application/x-gzip	23.5 KB

From:	"Erikjan Rijkers" <er(at)xs4all(dot)nl>
To:	"Alexander Korotkov" <aekorotkov(at)gmail(dot)com>
Cc:	"Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Heikki Linnakangas" <hlinnakangas(at)vmware(dot)com>, "Tomas Vondra" <tv(at)fuzzy(dot)cz>, "pgsql-hackers" <pgsql-hackers(at)postgresql(dot)org>, Pavel Stìhule <pavel(dot)stehule(at)gmail(dot)com>
Subject:	Re: WIP: index support for regexp search
Date:	2013-04-03 07:10:08
Message-ID:	a21ae23463c1e89178169c38e8192d26.squirrel@webmail.xs4all.nl
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Tue, April 2, 2013 23:54, Alexander Korotkov wrote:

> [trgm-regexp-0.15.patch.gz]

Yes, it does look good now; Attached a list of measurements. Most of the searches that I put in
that test-program are now speeded up very much.

There still are a few regressions, for example:

HEAD azjunk6 x[aeiou]{4,5}q 83 Seq Scan 1393.465 ms
trgm_regex15 azjunk6 x[aeiou]{4,5}q 83 Bitmap Heap Scan 1728.319 ms

HEAD azjunk7 x[aeiou]{1,3}q 190031 Seq Scan 16819.555 ms
trgm_regex15 azjunk7 x[aeiou]{1,3}q 190031 Bitmap Heap Scan 21286.804 ms

Not exactly negligible, and ideally those regressions would be removed but with the huge
advantages for other cases I'd say it's worth it.

hth,

Erik Rijkers

Attachment	Content-Type	Size
re-head-13-15-20130403-0708.txt.bz2	application/x-bzip2	9.4 KB

From:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To:	Erikjan Rijkers <er(at)xs4all(dot)nl>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Tomas Vondra <tv(at)fuzzy(dot)cz>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Pavel Stìhule <pavel(dot)stehule(at)gmail(dot)com>
Subject:	Re: WIP: index support for regexp search
Date:	2013-04-03 09:18:27
Message-ID:	CAPpHfdsitdJZNyQk5UCK0sAh1F08147pP7DkDH_Gh_Men8ofxw@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Wed, Apr 3, 2013 at 11:10 AM, Erikjan Rijkers <er(at)xs4all(dot)nl> wrote:

> On Tue, April 2, 2013 23:54, Alexander Korotkov wrote:
>
> > [trgm-regexp-0.15.patch.gz]
>
> Yes, it does look good now; Attached a list of measurements. Most of the
> searches that I put in
> that test-program are now speeded up very much.
>
> There still are a few regressions, for example:
>
> HEAD azjunk6 x[aeiou]{4,5}q 83 Seq Scan
> 1393.465 ms
> trgm_regex15 azjunk6 x[aeiou]{4,5}q 83 Bitmap Heap Scan
> 1728.319 ms
>
> HEAD azjunk7 x[aeiou]{1,3}q 190031 Seq Scan
> 16819.555 ms
> trgm_regex15 azjunk7 x[aeiou]{1,3}q 190031 Bitmap Heap Scan
> 21286.804 ms
>
> Not exactly negligible, and ideally those regressions would be removed but
> with the huge
> advantages for other cases I'd say it's worth it.
>

Thank you for testing!
Exploring results more detail I found version 13 to be buggy. This version
is a dead end, we have quite different API now. Could you use v12 instead
of v13 in comparison, please?
Sometimes we have regression in comparison with head in two reasons:
1) We select index scan in both cases but with patch we spent more time for
analysis. It's inevitable disadvantage of any index. We can only take care
of analysis doesn't take too long. Current testing results don't show this
reason to be significant.
2) Sometimes we select index scan while sequential scan would be faster.
It's also inevitable disadvantage until we have a relevant statistics. We
now have similar situation, for example, with in-core geometrical search
and LIKE/ILIKE search in pg_trgm. However, probably, situation could be
improved somehow even without such statistics. But I think we can do such
conclusion based on synthetical testing, because improvements for
synthetical cases could appear to be an worsening for real-life cases.

------
With best regards,
Alexander Korotkov.

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc:	Erikjan Rijkers <er(at)xs4all(dot)nl>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Tomas Vondra <tv(at)fuzzy(dot)cz>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Pavel Stìhule <pavel(dot)stehule(at)gmail(dot)com>
Subject:	Re: WIP: index support for regexp search
Date:	2013-04-08 05:28:39
Message-ID:	15279.1365398919@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Alexander Korotkov <aekorotkov(at)gmail(dot)com> writes:
> [ trgm-regexp-0.15.patch.gz ]

I spent the weekend hacking on this, making a number of bug fixes and a
whole lot of cosmetic changes. I think there are large parts of this
that are in committable shape now, but I still find the actual graph
transformation logic to be mostly unintelligible. I think what's most
obscure is the distinction between the arcs list and the keys list of
each state in the expanded graph. I get the impression that the
general idea is for the arcs to represent exactly-known transitions
while the keys represent imprecisely-known transitions ... but there
seems to be at least some leakage between those categories. Could
you write down a specification for what's supposed to be happening
there?

regards, tom lane

Attachment	Content-Type	Size
trgm-regexp-0.16.patch.gz	application/octet-stream	27.3 KB

From:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Erikjan Rijkers <er(at)xs4all(dot)nl>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Tomas Vondra <tv(at)fuzzy(dot)cz>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Pavel Stìhule <pavel(dot)stehule(at)gmail(dot)com>
Subject:	Re: WIP: index support for regexp search
Date:	2013-04-08 12:56:32
Message-ID:	CAPpHfdvdcOaVx1z0RiQFgzhcSwd7Pkd9Wg3H-C3EbM=McFw8Tg@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Mon, Apr 8, 2013 at 9:28 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> Alexander Korotkov <aekorotkov(at)gmail(dot)com> writes:
> > [ trgm-regexp-0.15.patch.gz ]
>
> I spent the weekend hacking on this, making a number of bug fixes and a
> whole lot of cosmetic changes. I think there are large parts of this
> that are in committable shape now, but I still find the actual graph
> transformation logic to be mostly unintelligible. I think what's most
> obscure is the distinction between the arcs list and the keys list of
> each state in the expanded graph. I get the impression that the
> general idea is for the arcs to represent exactly-known transitions
> while the keys represent imprecisely-known transitions ... but there
> seems to be at least some leakage between those categories. Could
> you write down a specification for what's supposed to be happening
> there?
>

Here is my try to specify it.
At first some notions. I know, they are already in the comments, but just
in order to put it together.

Extended color - any color of source CNFA or one of two special values:
1) Unknown color - may represent any character either alphanumeric or
non-alphanumeric.
2) Blank color - may represent any non-alphanumeric character
Prefix is extended colors of last two characters read by CNFA.
Key is pair of CNFA state and prefix. So, key is a extended state which is
containing additional information which can influence further trigrams.

So, if you are in some key and traverse some CNFA arc then you moves info
another key. But there are two possible cases (or, sometimes, both of them):
1) Your move from one key into another necessary means read of some
trigram. Then you create new arc labeled with that trigram.
2) You move into another key, but you doesn't necessary read an useful
trigram. For example, arc of source CNFA is labeled by "unextractable"
color. Then you add new key into "keys" array. And outgoing arcs from this
key will also be processed similarly to source key. Therefore "keys" array
is a set of keys which are achievable from "stateKey" without reading of
useful trigram.
We could get rid of "keys" array and produce some "empty arcs" in the
second case. You can imagine that "keys" array is set of keys which are
achivable by "empty arcs" from "stateKey". States connected with "empty
arcs" could be merged on the next stage.
However, the reason of having separated addKeys stage is optimization. In
addKey function more generals keys absorb less general ones. In many cases
resulting graph becomes much simplier because of this. For example, if your
regex is not prefix, then engine puts self-referencing arcs of all possible
colors to initial state. Straight-forward processing of this could produce
enormous output graph. I had similar situation in early version of patch
where keys didn't absorb earch other.

------
With best regards,
Alexander Korotkov.

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc:	Erikjan Rijkers <er(at)xs4all(dot)nl>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Tomas Vondra <tv(at)fuzzy(dot)cz>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Pavel Stìhule <pavel(dot)stehule(at)gmail(dot)com>
Subject:	Re: WIP: index support for regexp search
Date:	2013-04-09 05:15:06
Message-ID:	6219.1365484506@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Alexander Korotkov <aekorotkov(at)gmail(dot)com> writes:
> On Mon, Apr 8, 2013 at 9:28 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> I spent the weekend hacking on this, making a number of bug fixes and a
>> whole lot of cosmetic changes. I think there are large parts of this
>> that are in committable shape now, but I still find the actual graph
>> transformation logic to be mostly unintelligible. I think what's most
>> obscure is the distinction between the arcs list and the keys list of
>> each state in the expanded graph. I get the impression that the
>> general idea is for the arcs to represent exactly-known transitions
>> while the keys represent imprecisely-known transitions ... but there
>> seems to be at least some leakage between those categories. Could
>> you write down a specification for what's supposed to be happening
>> there?

> Here is my try to specify it.

Thanks. I hacked on this some more and committed it. I found a number
of bugs along the way with respect to handling of word boundaries
(partially-blank transition trigrams) and EOL-color ($) handling.
I think it's all fixed now but it could definitely use some more
study and testing.

One issue that bothered me is that the regression tests really don't
provide much visibility into what the code is doing. Some of the bugs
had to do with failing to generate expected trigrams, for instance
col ~ 'foo bar' only generating trigram "foo" and not "bar". This still
led to getting the right answer, so the error was invisible as far as the
tests were concerned. Is it worth thinking of a way to expose what the
extract function did at SQL level, so we could test more carefully?

regards, tom lane

From:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Erikjan Rijkers <er(at)xs4all(dot)nl>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Tomas Vondra <tv(at)fuzzy(dot)cz>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Pavel Stìhule <pavel(dot)stehule(at)gmail(dot)com>
Subject:	Re: WIP: index support for regexp search
Date:	2013-04-09 08:58:21
Message-ID:	CAPpHfdtWpbVoXGU+damLwb+_-1wcRRwuzm2XSuy8Qr1aDdnDzw@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Tue, Apr 9, 2013 at 9:15 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> Alexander Korotkov <aekorotkov(at)gmail(dot)com> writes:
> > On Mon, Apr 8, 2013 at 9:28 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> >> I spent the weekend hacking on this, making a number of bug fixes and a
> >> whole lot of cosmetic changes. I think there are large parts of this
> >> that are in committable shape now, but I still find the actual graph
> >> transformation logic to be mostly unintelligible. I think what's most
> >> obscure is the distinction between the arcs list and the keys list of
> >> each state in the expanded graph. I get the impression that the
> >> general idea is for the arcs to represent exactly-known transitions
> >> while the keys represent imprecisely-known transitions ... but there
> >> seems to be at least some leakage between those categories. Could
> >> you write down a specification for what's supposed to be happening
> >> there?
>
> > Here is my try to specify it.
>
> Thanks. I hacked on this some more and committed it. I found a number
> of bugs along the way with respect to handling of word boundaries
> (partially-blank transition trigrams) and EOL-color ($) handling.
> I think it's all fixed now but it could definitely use some more
> study and testing.
>

Great, thanks! I also will do some more testing.

One issue that bothered me is that the regression tests really don't
> provide much visibility into what the code is doing. Some of the bugs
> had to do with failing to generate expected trigrams, for instance
> col ~ 'foo bar' only generating trigram "foo" and not "bar". This still
> led to getting the right answer, so the error was invisible as far as the
> tests were concerned. Is it worth thinking of a way to expose what the
> extract function did at SQL level, so we could test more carefully?
>

Yes, I also had similar idea. But, I think we need some relatively stable
representation of resulting graph in order to expose it. There could be a
lot of equivalent graphs. Some changes in implementation could lead to
change from one equivalent graph to another. It would be better to not
rewrite tests in this case. Ideally, we should expose some representation
which is the same for all equivalent graphs. However, it doesn't seem to be
realistic. But, I think we could at least make it stable to order sequence
of states and color trigrams. Another option I see is to expose just set of
trigrams. It doesn't have completeness of information, but it is quite
stable.

------
With best regards,
Alexander Korotkov.

From:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Erikjan Rijkers <er(at)xs4all(dot)nl>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Tomas Vondra <tv(at)fuzzy(dot)cz>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Pavel Stìhule <pavel(dot)stehule(at)gmail(dot)com>
Subject:	Re: WIP: index support for regexp search
Date:	2013-04-15 13:53:41
Message-ID:	CAPpHfdtTwv_X0ev7H1MtCXSD9MOHFeRkc8-4wzvGyJa_ManHeA@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

I found you committed GiST index implementation. That's cool.
I found an easy way to optimize it. We can also use trigramsMatchGraph for
signatures. Attached patch contains implementation.
Simple example in order to demonstrate it:

Before the patch:

test=# explain (analyze, buffers) select * from words where s ~ '[abc]def';
QUERY PLAN

--------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on words (cost=4.36..40.24 rows=10 width=9) (actual
time=17.189..17.193 rows=3 loops=1)
Recheck Cond: (s ~ '[abc]def'::text)
Buffers: shared hit=858
-> Bitmap Index Scan on words_trgm_idx (cost=0.00..4.36 rows=10
width=0) (actual time=17.172..17.172 rows=3 loops=1)
Index Cond: (s ~ '[abc]def'::text)
Buffers: shared hit=*857*
Total runtime: 17.224 ms
(7 rows)

After the patch:

test=# explain (analyze, buffers) select * from words where s ~ '[abc]def';
QUERY PLAN

--------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on words (cost=4.36..40.24 rows=10 width=9) (actual
time=13.718..13.721 rows=3 loops=1)
Recheck Cond: (s ~ '[abc]def'::text)
Buffers: shared hit=498
-> Bitmap Index Scan on words_trgm_idx (cost=0.00..4.36 rows=10
width=0) (actual time=13.701..13.701 rows=3 loops=1)
Index Cond: (s ~ '[abc]def'::text)
Buffers: shared hit=*497*
Total runtime: 13.786 ms
(7 rows)

------
With best regards,
Alexander Korotkov.

Attachment	Content-Type	Size
trgm-regexp-gist-optimize.patch	application/octet-stream	1.4 KB

From:	David Fetter <david(at)fetter(dot)org>
To:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Erikjan Rijkers <er(at)xs4all(dot)nl>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Tomas Vondra <tv(at)fuzzy(dot)cz>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Pavel Stìhule <pavel(dot)stehule(at)gmail(dot)com>
Subject:	Re: WIP: index support for regexp search
Date:	2013-04-15 15:27:37
Message-ID:	20130415152737.GI5337@fetter.org
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Mon, Apr 15, 2013 at 05:53:41PM +0400, Alexander Korotkov wrote:
> I found you committed GiST index implementation. That's cool.
> I found an easy way to optimize it. We can also use trigramsMatchGraph for
> signatures. Attached patch contains implementation.
> Simple example in order to demonstrate it:
>
> Before the patch:
> Buffers: shared hit=*857*
> After the patch:
> Buffers: shared hit=*497*

Neato!

Inside the patch, s/monotonous/monotone/, I think.

Cheers,
David.
--
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter
Skype: davidfetter XMPP: david(dot)fetter(at)gmail(dot)com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc:	Erikjan Rijkers <er(at)xs4all(dot)nl>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Tomas Vondra <tv(at)fuzzy(dot)cz>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Pavel Stìhule <pavel(dot)stehule(at)gmail(dot)com>
Subject:	Re: WIP: index support for regexp search
Date:	2013-04-15 16:50:26
Message-ID:	3055.1366044626@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Alexander Korotkov <aekorotkov(at)gmail(dot)com> writes:
> I found you committed GiST index implementation. That's cool.
> I found an easy way to optimize it. We can also use trigramsMatchGraph for
> signatures. Attached patch contains implementation.

Good idea, committed.

regards, tom lane