Re: Index on regexes

From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Index on regexes
Date: 2013-06-25 12:03:12
Message-ID: 51C98700.2020202@vmware.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 13.06.2013 23:19, Alexander Korotkov wrote:
> Hackers,
>
> Attached patch contains opclass which demonstrates advantages of GIN
> additional information storing itself without other GIN improvements. It
> implements inversed task of regex indexing. It works so: you create index
> on regexes and search for regexes matched query string. It introduce two
> additional operators |~ and *~ for case-sensetive and case-insensetive
> regex to string matching, and gin_regexp_trgm_ops opclass.

Ok, I've read the patch and I think I understand it now. Clever approach.

The idea is to first build an NFA of trigrams from each indexed regular
expression, just like we do in the reverse case where you index strings,
and the query is a single regexp. The extracted trigrams are inserted
into the gin index.

That already helps significantly in certain kinds of applications. In
fact, I wonder if we should do just that, and leave out the information
stored in the additional info. There are pros and cons - you get a
smaller index, but need to recheck more rows.

But here's the clever part: in addition to just indexing all the
trigrams in each regexp, fragments of the trigram graph are stored along
with each item in the posting lists. In the graph, each arc is labeled
with a trigram, and a string can match the regexp only if there is a
path through the graph following only arcs that match trigrams that are
present in the graph. In the posting list of trigram abc, you store all
the arcs in the graph labelled with 'abc'. In the consistent function,
you have access to all the arcs labelled with trigrams that are present
in the string. The function assembles a graph from those arcs, and then
checks if there is a path from initial to final state.

> Let's consider some example.
>
> At first, generate some regexes.
>
> CREATE OR REPLACE FUNCTION generate_string(int, int) RETURNS text AS $$
> SELECT array_to_string(ARRAY(SELECT chr((97 + random() * 10) :: integer)
> FROM generate_series(1,($1 + random()*$2)::int)), '');
> $$
> LANGUAGE sql;
>
> CREATE TABLE test AS select '(' || generate_string(1,4) || '|' ||
> generate_string(1,4) || '|' || generate_string(1,4) || ')' ||
> generate_string(1,2) || '(' || generate_string(1,4) || '|' ||
> generate_string(1,4) || '|' || generate_string(1,4) || ')' AS s FROM
> generate_series(1,1000000);
>
> I use only 10 characters on alphabet in order to have better chance of
> matching. It generate some regexes like so:
>
> postgres=# SELECT * FROM test LIMIT 10;
> s
> ------------------------------------
> (g|cij|ah)jg(iei|hfc|eef)
> (gbfdb|ehbg|akf)ge(bc|jgee|jidd)
> (jedc|kgc|c)bc(ii|bji|iebc)
> (aa|eie|bgdb)f(fc|he|f)
> (b|ijc|ae)ae(eccb|ie|kjf)
> (bib|igf|kdibf)fij(gcbh|efi|fidj)
> (bkejf|jfdhg|gbfe)bhb(bedj|hh|ggg)
> (kfb|egccd|iefce)jf(kj|jbef|kbc)
> (bhh|c|cd)cb(h|ed|jg)
> (id|j|geg)gc(djif|ai|cjjjc)
> (10 rows)
>
> Without index search takes about 10 seconds.
>
> postgres=# explain analyze select * from test where s |~ 'abcdefghijkl';
> QUERY PLAN
> --------------------------------------------------------------------------------------------------------------
> Seq Scan on test (cost=0.00..19929.00 rows=5000 width=28) (actual
> time=172.990..97357.312 rows=438 loops=1)
> Filter: (s |~ 'abcdefghijkl'::text)
> Rows Removed by Filter: 999562
> Total runtime: 97357.490 ms
> (4 rows)

That's almost 100 seconds.

> And with index it takes only 110 milliseconds.
>
> postgres=# explain analyze select * from test where s |~ 'abcdefghijkl';
> QUERY PLAN
> --------------------------------------------------------------------------------------------------------------------------
> Bitmap Heap Scan on test (cost=182.75..7245.94 rows=5000 width=28)
> (actual time=68.143..110.663 rows=438 loops=1)
> Recheck Cond: (s |~ 'abcdefghijkl'::text)
> -> Bitmap Index Scan on test_idx (cost=0.00..181.50 rows=5000 width=0)
> (actual time=67.929..67.929 rows=438 loops=1)
> Index Cond: (s |~ 'abcdefghijkl'::text)
> Total runtime: 110.870 ms
> (5 rows)

That's an impressive result. Now, the next question is, is this a
realistic use of this feature?

Some more numbers from the above test:

* The index in above example is about 180MB in size, while the table is
60MB. Without the graph fragments stored in the "additional info", the
index is about 120MB

* Building the index takes about 370 seconds, or 0.37 ms per row.

I wonder what kind of data sets this feature is likely to be used in
practice? I googled around and found one data set: rules from the Snort
intrusion detection system. I downloaded the community rules from
http://www.snort.org/snort-rules/. They look very different from the
kind of regexps you tested with; the rules contain a lot of strange
characters, so the trigram extraction is not as effective. I had to
modify some of the rules because PostgreSQL regexp engine only supports
up to 255 repetitions with {XXX} syntax, and I also removed all the
modifiers governing case-sensitivity and some other things. I ended up
with 674 regular expressions. Here's a comparison of query speed:

postgres=# explain analyze select * from t where s |~ 'abcdefghijk';
QUERY PLAN

-------------------------------------------------------------------------------
---------------------
Seq Scan on t (cost=0.00..22.43 rows=3 width=123) (actual
time=7534.185..7534
.185 rows=0 loops=1)
Filter: (s |~ 'abcdefghijk'::text)
Rows Removed by Filter: 674
Total runtime: 7534.210 ms
(4 rows)

Time: 7534,620 ms

postgres=# explain analyze select * from t where s |~ 'abcdefghijk';
QUERY PLAN

-------------------------------------------------------------------------------
-----------------------------------------
Bitmap Heap Scan on t (cost=108.03..115.90 rows=3 width=123) (actual
time=705
9.735..7059.735 rows=0 loops=1)
Recheck Cond: (s |~ 'abcdefghijk'::text)
Rows Removed by Index Recheck: 343
-> Bitmap Index Scan on t_gin_idx (cost=0.00..108.03 rows=3
width=0) (actu
al time=37.675..37.675 rows=343 loops=1)
Index Cond: (s |~ 'abcdefghijk'::text)
Total runtime: 7059.788 ms
(6 rows)

Time: 7060,306 ms

So, the index doesn't help much in this case. It's eliminating about
half of the rows, leaving 343 rows to be rechecked, but the total
runtime is about the same.

However, sometimes I get a segfault with that query:

#0 0x00007fede3715988 in checkConnectivity (partGraphInfo=0x4157da0)
at trgm_regexp.c:2262
#1 0x00007fede371286f in gin_regexp_trgm_consistent (fcinfo=0x7fff2a00c230)
at trgm_gin.c:402
#2 0x000000000083386e in FunctionCall10Coll (flinfo=0x36fd5f0,
collation=100, arg1=36110032, arg2=5, arg3=37437664, arg4=13,
arg5=36110288, arg6=36109673, arg7=36110144, arg8=36110000,
arg9=36112304, arg10=36112448) at fmgr.c:1630
#3 0x0000000000472045 in callConsistentFn (ginstate=0x36fc158,
key=0x226fd10)
at ginget.c:55
#4 0x0000000000473f8b in keyGetItem (ginstate=0x36fc158,
tempCtx=0x2b24a60,
key=0x226fd10) at ginget.c:969
#5 0x000000000047416c in scanGetItem (scan=0x4157600,
advancePast=0x7fff2a00c760, item=0x7fff2a00c760,
recheck=0x7fff2a00c75f "\001\377\377\377\377\377\377") at ginget.c:1051
#6 0x0000000000475537 in gingetbitmap (fcinfo=0x7fff2a00c7b0)
at ginget.c:1651

Attached is a dump of the test table I used above. Please note that the
snort-rules are GPLv2-licensed, so they cannot be included in PostgreSQL
e.g in regression tests.

- Heikki

Attachment Content-Type Size
snort-rules text/plain 94.3 KB

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Heikki Linnakangas 2013-06-25 12:21:32 Re: Index on regexes
Previous Message Dean Rasheed 2013-06-25 11:58:04 Re: FILTER for aggregates [was Re: Department of Redundancy Department: makeNode(FuncCall) division]