Re: trgm regex index peculiarity

From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Erik Rijkers <er(at)xs4all(dot)nl>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: trgm regex index peculiarity
Date: 2013-06-21 13:11:01
Message-ID: CAPpHfdu1+Y6mvrg2PU5raEE5efziVKztbR30Y1nR=Fz_bL6eyg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, Jun 21, 2013 at 2:40 PM, Erik Rijkers <er(at)xs4all(dot)nl> wrote:

> On Fri, June 21, 2013 05:25, Tom Lane wrote:
> > "Erik Rijkers" <er(at)xs4all(dot)nl> writes:
> >> In a 112 MB test table (containing random generated text) with a trgm
> index (gin_trgm_ops), I consistently get these
> >> timings:
> >> select txt from azjunk6 where txt ~ '^abcd';
> >> 130 ms
> >> select txt from azjunk6
> >> where txt ~ 'abcd' and substr(txt,1,4) = 'abcd';
> >> 3 ms
> >
> > Hm, could you provide a self-contained test case?
> >
>
> yes, sorry. I tested on a 1M row table:
>
> #!/bin/sh
>
> # create table:
> for power in 6;
> do
> table=azjunk${power}
> index=${table}_trgm_re_idx
> perl -E'
> sub ss{ join"",@_[ map{rand @_} 1 .. shift ] };
> say(ss(80,"a".."g"," ","h".."m"," ","n".."s"," ","t".."z")) for 1 ..
> 1e'"${power};" \
> | psql -aqXc "
> drop table if exists $table;
> create table $table(txt text);
> copy $table from stdin;";
> echo "set session maintenance_work_mem='1GB';
> create index $index on $table using gin (txt gin_trgm_ops);
> analyze $table;" | psql -qtAX;
> done
>
> # test:
> echo "
> \\timing on
> explain analyze select txt from azjunk6 where txt ~ '^abcd'; -- slow (140
> ms)
> explain analyze select txt from azjunk6 where txt ~ 'abcd' and
> substr(txt,1,4) = 'abcd'; -- fast (5 ms)
> " | psql -Xqa

Regex '^abcd' will be expanded into trigrams '__a', '_ab', 'abc' and 'bcd'.
However trigrams '__a' is much more frequent than '_ab' which in turn is
much more frequent than 'abc' and 'bcd'. Ommiting of ^ leads to ommiting of
'__a' and '_ab' and that gives so significant speedup.
The problem is that trigram regex engine doesn't know that '__a' and '_ab'
is more frequent than another trigrams because we don't know their
distribution. However, simple assumption that blank character (in pg_trgm
meaning) is much more frequent than others seems to be true for most cases.
Attached patch implementing this assumption. It introduces BLANK_COLOR_SIZE
macro which is used in selectColorTrigrams like count of characters in
COLOR_BLANK. Another change in this patch is split of MAX_TRGM_COUNT into
WISH_TRGM_SIZE and MAX_TRGM_SIZE. The idea is to keep trying to remove
color trigrams from graph even when it fits into MAX_TRGM_SIZE, because we
are intending to scan less posting lists/trees.
Comments is not fixed yet, coming soon.

------
With best regards,
Alexander Korotkov.

Attachment Content-Type Size
trgm_regex_optimize.1.patch application/octet-stream 5.3 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2013-06-21 13:21:07 Re: Request for Patch Feedback: Lag & Lead Window Functions Can Ignore Nulls
Previous Message Michael Paquier 2013-06-21 11:54:34 Re: Support for REINDEX CONCURRENTLY