Quick Links

Re: trgm regex index peculiarity

Lists:	pgsql-hackers

From:	"Erik Rijkers" <er(at)xs4all(dot)nl>
To:	pgsql-hackers(at)postgresql(dot)org
Subject:	trgm regex index peculiarity
Date:	2013-06-21 01:38:24
Message-ID:	786ef170d3cb6b991425a8b364e77c57.squirrel@webmail.xs4all.nl
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

9.4devel (but same in 9.3)

In a 112 MB test table (containing random generated text) with a trgm index (gin_trgm_ops), I consistently get these timings:

select txt from azjunk6 where txt ~ '^abcd';
130 ms

select txt from azjunk6
where txt ~ 'abcd' and substr(txt,1,4) = 'abcd';
3 ms

(a similar performance difference occurs when using a regex, i.e. 'abc[de]' )

This difference is so large that I wonder if there is not something wrong in the first case. (The returned results are
correct though)

Here are the two explains:

explain analyze select txt from azjunk6 where txt ~ '^abcd';
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on azjunk6 (cost=108.78..484.93 rows=100 width=81) (actual time=129.557..129.742 rows=1 loops=1)
Recheck Cond: (txt ~ '^abcd'::text)
Rows Removed by Index Recheck: 17
-> Bitmap Index Scan on azjunk6_trgm_re_idx (cost=0.00..108.75 rows=100 width=0) (actual time=129.503..129.503 rows=18
loops=1)
Index Cond: (txt ~ '^abcd'::text)
Total runtime: 130.008 ms
(6 rows)

explain analyze select txt from azjunk6 where txt ~ 'abcd' and substr(txt,1,4) = 'abcd';
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on azjunk6 (cost=56.75..433.40 rows=1 width=81) (actual time=2.064..3.379 rows=1 loops=1)
Recheck Cond: (txt ~ 'abcd'::text)
Rows Removed by Index Recheck: 14
Filter: (substr(txt, 1, 4) = 'abcd'::text)
Rows Removed by Filter: 112
-> Bitmap Index Scan on azjunk6_trgm_re_idx (cost=0.00..56.75 rows=100 width=0) (actual time=1.911..1.911 rows=127
loops=1)
Index Cond: (txt ~ 'abcd'::text)
Total runtime: 3.409 ms
(8 rows)

The results in both cases are correct, but does this difference not almost amount to a bug?

( Interestingly, the variant WHERE txt ~ 'abcd$'
is as fast as the non-anchored variant )

Thanks,

Erik Rijkers

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	"Erik Rijkers" <er(at)xs4all(dot)nl>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: trgm regex index peculiarity
Date:	2013-06-21 03:25:58
Message-ID:	3920.1371785158@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

"Erik Rijkers" <er(at)xs4all(dot)nl> writes:
> In a 112 MB test table (containing random generated text) with a trgm index (gin_trgm_ops), I consistently get these timings:
> select txt from azjunk6 where txt ~ '^abcd';
> 130 ms
> select txt from azjunk6
> where txt ~ 'abcd' and substr(txt,1,4) = 'abcd';
> 3 ms

Hm, could you provide a self-contained test case?

regards, tom lane

From:	"Erik Rijkers" <er(at)xs4all(dot)nl>
To:	"Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: trgm regex index peculiarity
Date:	2013-06-21 10:40:38
Message-ID:	dafad644f268ce1503e1b8b682aae38a.squirrel@webmail.xs4all.nl
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Fri, June 21, 2013 05:25, Tom Lane wrote:
> "Erik Rijkers" <er(at)xs4all(dot)nl> writes:
>> In a 112 MB test table (containing random generated text) with a trgm index (gin_trgm_ops), I consistently get these
>> timings:
>> select txt from azjunk6 where txt ~ '^abcd';
>> 130 ms
>> select txt from azjunk6
>> where txt ~ 'abcd' and substr(txt,1,4) = 'abcd';
>> 3 ms
>
> Hm, could you provide a self-contained test case?
>

yes, sorry. I tested on a 1M row table:

#!/bin/sh

# create table:
for power in 6;
do
table=azjunk${power}
index=${table}_trgm_re_idx
perl -E'
sub ss{ join"",@_[ map{rand @_} 1 .. shift ] };
say(ss(80,"a".."g"," ","h".."m"," ","n".."s"," ","t".."z")) for 1 .. 1e'"${power};" \
| psql -aqXc "
drop table if exists $table;
create table $table(txt text);
copy $table from stdin;";
echo "set session maintenance_work_mem='1GB';
create index $index on $table using gin (txt gin_trgm_ops);
analyze $table;" | psql -qtAX;
done

# test:
echo "
\\timing on
explain analyze select txt from azjunk6 where txt ~ '^abcd'; -- slow (140 ms)
explain analyze select txt from azjunk6 where txt ~ 'abcd' and substr(txt,1,4) = 'abcd'; -- fast (5 ms)
" | psql -Xqa

From:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To:	Erik Rijkers <er(at)xs4all(dot)nl>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: trgm regex index peculiarity
Date:	2013-06-21 13:11:01
Message-ID:	CAPpHfdu1+Y6mvrg2PU5raEE5efziVKztbR30Y1nR=Fz_bL6eyg@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Fri, Jun 21, 2013 at 2:40 PM, Erik Rijkers <er(at)xs4all(dot)nl> wrote:

> On Fri, June 21, 2013 05:25, Tom Lane wrote:
> > "Erik Rijkers" <er(at)xs4all(dot)nl> writes:
> >> In a 112 MB test table (containing random generated text) with a trgm
> index (gin_trgm_ops), I consistently get these
> >> timings:
> >> select txt from azjunk6 where txt ~ '^abcd';
> >> 130 ms
> >> select txt from azjunk6
> >> where txt ~ 'abcd' and substr(txt,1,4) = 'abcd';
> >> 3 ms
> >
> > Hm, could you provide a self-contained test case?
> >
>
> yes, sorry. I tested on a 1M row table:
>
> #!/bin/sh
>
> # create table:
> for power in 6;
> do
> table=azjunk${power}
> index=${table}_trgm_re_idx
> perl -E'
> sub ss{ join"",@_[ map{rand @_} 1 .. shift ] };
> say(ss(80,"a".."g"," ","h".."m"," ","n".."s"," ","t".."z")) for 1 ..
> 1e'"${power};" \
> | psql -aqXc "
> drop table if exists $table;
> create table $table(txt text);
> copy $table from stdin;";
> echo "set session maintenance_work_mem='1GB';
> create index $index on $table using gin (txt gin_trgm_ops);
> analyze $table;" | psql -qtAX;
> done
>
> # test:
> echo "
> \\timing on
> explain analyze select txt from azjunk6 where txt ~ '^abcd'; -- slow (140
> ms)
> explain analyze select txt from azjunk6 where txt ~ 'abcd' and
> substr(txt,1,4) = 'abcd'; -- fast (5 ms)
> " | psql -Xqa

Regex '^abcd' will be expanded into trigrams '__a', '_ab', 'abc' and 'bcd'.
However trigrams '__a' is much more frequent than '_ab' which in turn is
much more frequent than 'abc' and 'bcd'. Ommiting of ^ leads to ommiting of
'__a' and '_ab' and that gives so significant speedup.
The problem is that trigram regex engine doesn't know that '__a' and '_ab'
is more frequent than another trigrams because we don't know their
distribution. However, simple assumption that blank character (in pg_trgm
meaning) is much more frequent than others seems to be true for most cases.
Attached patch implementing this assumption. It introduces BLANK_COLOR_SIZE
macro which is used in selectColorTrigrams like count of characters in
COLOR_BLANK. Another change in this patch is split of MAX_TRGM_COUNT into
WISH_TRGM_SIZE and MAX_TRGM_SIZE. The idea is to keep trying to remove
color trigrams from graph even when it fits into MAX_TRGM_SIZE, because we
are intending to scan less posting lists/trees.
Comments is not fixed yet, coming soon.

------
With best regards,
Alexander Korotkov.

Attachment	Content-Type	Size
trgm_regex_optimize.1.patch	application/octet-stream	5.3 KB

From:	"Erik Rijkers" <er(at)xs4all(dot)nl>
To:	"Alexander Korotkov" <aekorotkov(at)gmail(dot)com>
Cc:	"Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "pgsql-hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: trgm regex index peculiarity
Date:	2013-06-21 13:39:50
Message-ID:	657ec9f92df7fc15c8a586b1fe355c46.squirrel@webmail.xs4all.nl
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Fri, June 21, 2013 15:11, Alexander Korotkov wrote:
> On Fri, Jun 21, 2013 at 2:40 PM, Erik Rijkers <er(at)xs4all(dot)nl> wrote:
>
>> On Fri, June 21, 2013 05:25, Tom Lane wrote:
>> > "Erik Rijkers" <er(at)xs4all(dot)nl> writes:
>> >> In a 112 MB test table (containing random generated text) with a trgm
>> index (gin_trgm_ops), I consistently get these
>> >> timings:
>> >> select txt from azjunk6 where txt ~ '^abcd';
>> >> 130 ms
>> >> select txt from azjunk6
>> >> where txt ~ 'abcd' and substr(txt,1,4) = 'abcd';
>> >> 3 ms
>> >
>
> Regex '^abcd' will be expanded into trigrams '__a', '_ab', 'abc' and 'bcd'.
> However trigrams '__a' is much more frequent than '_ab' which in turn is
> much more frequent than 'abc' and 'bcd'. Ommiting of ^ leads to ommiting of
> '__a' and '_ab' and that gives so significant speedup.

> [trgm_regex_optimize.1.patch ]

Yes, that fixes the problem, thanks.

Erik Rijkers

From:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To:	Erik Rijkers <er(at)xs4all(dot)nl>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: trgm regex index peculiarity
Date:	2014-01-15 13:35:56
Message-ID:	CAPpHfduREkHNcsrs0cXGrFRj=LQ8-N7DOrJYo+3BNFcZaWcLLw@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Fri, Jun 21, 2013 at 5:39 PM, Erik Rijkers <er(at)xs4all(dot)nl> wrote:

> On Fri, June 21, 2013 15:11, Alexander Korotkov wrote:
> > On Fri, Jun 21, 2013 at 2:40 PM, Erik Rijkers <er(at)xs4all(dot)nl> wrote:
> >
> >> On Fri, June 21, 2013 05:25, Tom Lane wrote:
> >> > "Erik Rijkers" <er(at)xs4all(dot)nl> writes:
> >> >> In a 112 MB test table (containing random generated text) with a trgm
> >> index (gin_trgm_ops), I consistently get these
> >> >> timings:
> >> >> select txt from azjunk6 where txt ~ '^abcd';
> >> >> 130 ms
> >> >> select txt from azjunk6
> >> >> where txt ~ 'abcd' and substr(txt,1,4) = 'abcd';
> >> >> 3 ms
> >> >
> >
> > Regex '^abcd' will be expanded into trigrams '__a', '_ab', 'abc' and
> 'bcd'.
> > However trigrams '__a' is much more frequent than '_ab' which in turn is
> > much more frequent than 'abc' and 'bcd'. Ommiting of ^ leads to ommiting
> of
> > '__a' and '_ab' and that gives so significant speedup.
>
> > [trgm_regex_optimize.1.patch ]
>
> Yes, that fixes the problem, thanks.
>

Revised version of patch with necessary comments.

------
With best regards,
Alexander Korotkov.

Attachment	Content-Type	Size
trgm-regex-optimize.2.patch	application/octet-stream	9.0 KB

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc:	Erik Rijkers <er(at)xs4all(dot)nl>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: trgm regex index peculiarity
Date:	2014-01-15 23:34:16
Message-ID:	4963.1389828856@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Alexander Korotkov <aekorotkov(at)gmail(dot)com> writes:
> Revised version of patch with necessary comments.

I looked at this patch a bit. It seems like this:

+ * BLANK_COLOR_SIZE - How much blank character is more frequent than
+ * other character in average
+ #define BLANK_COLOR_SIZE 32

is a drastic overestimate. Using the program Erik provided to generate
some sample data, I find that blanks in trigrams are maybe about three
times more common than other characters. Possibly Erik's data isn't
very representative of real text, but if that's the case we'd better
get a more representative sample case before we start optimizing ...

Now maybe the above number is okay for this purpose anyway, but if so
it needs some different justification; maybe call it a "penalty"?
But nonetheless it seems like a darn large penalty, particularly for " x"
type trigrams where the value would effectively be squared. Compared to
the proposed values of MAX_TRGM_SIZE and WISH_TRGM_SIZE, it seems like
you might as well forget the "sizing" approach and just flat out reject
trigrams containing COLOR_BLANK, because they'll never get through the
size filter. I'm inclined to think you need a larger MAX_TRGM_SIZE;
and WISH_TRGM_SIZE seems mighty small as well, compared to what the
code would have done before.

Also, surely this bit:

! trgmNFA->totalTrgmCount = (int) totalTrgmSize;

is flat out wrong? expandColorTrigrams() expects that
trgmNFA->totalTrgmCount is the truth, not some penalty-bloated
abstraction. The fact that the patch hasn't failed on you completely
demonstrates that you've not tested any cases where blank-containing
trigrams got through the filter, reinforcing my feeling that it's
probably too strict now.

I wonder if it would be more appropriate to continue to measure the limit
MAX_TRGM_COUNT in terms of actual trigram counts, and just use the
penalized "size" as the sort key while determining which color trigrams
to discard first.

Another thought is that it's not clear that you should apply the same
penalty to blanks in all three positions. Because of the padding
settings, any one word will normally lead to padded strings " a",
" ab", "yz ", so that spaces in the first position are about twice
as common as spaces in the other positions. (It's a little less
than that because single-letter words produce only " a" and " a ";
but I'd think that such words aren't very common in text that trigrams
are effective for.) So I'm thinking the penalty ought to take that
into account.

I'm also inclined to think that it might be worth adding a size
field to ColorTrgmInfo rather than repeatedly calculating the
size estimate. We only allow a thousand and some ColorTrgmInfos
at most, so the extra space wouldn't be that large, and I'm concerned
by the expense the current patch adds to the sort comparator.

Another point is that this comment:

* Note: per-color-trigram counts cannot overflow an int so long as
* COLOR_COUNT_LIMIT is not more than the cube root of INT_MAX, ie about
* 1290. However, the grand total totalTrgmCount might conceivably
* overflow an int, so we use int64 for that within this routine.

is no longer valid, or at least it fails to demonstrate that the result
of getColorTrigramSize() can't overflow an int; indeed I rather fear it can.
The patch failed to even update the variable name in this comment, let
alone address the substantive question.

There are some other cosmetic things I didn't like, for instance
colorTrgmInfoCountCmp() is no longer comparing counts but you didn't
rename it. That's about it for substantive comments though.

regards, tom lane

From:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Erik Rijkers <er(at)xs4all(dot)nl>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: trgm regex index peculiarity
Date:	2014-02-09 19:28:54
Message-ID:	CAPpHfdtca-Jh5FX=HGG7D0Q9nf0SBK2FSfGr4mo1vKfSyGGgCw@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Thu, Jan 16, 2014 at 3:34 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> Alexander Korotkov <aekorotkov(at)gmail(dot)com> writes:
> > Revised version of patch with necessary comments.
>
> I looked at this patch a bit. It seems like this:
>
> + * BLANK_COLOR_SIZE - How much blank character is more frequent than
> + * other character in average
> + #define BLANK_COLOR_SIZE 32
>
> is a drastic overestimate. Using the program Erik provided to generate
> some sample data, I find that blanks in trigrams are maybe about three
> times more common than other characters. Possibly Erik's data isn't
> very representative of real text, but if that's the case we'd better
> get a more representative sample case before we start optimizing ...
>
> Now maybe the above number is okay for this purpose anyway, but if so
> it needs some different justification; maybe call it a "penalty"?
> But nonetheless it seems like a darn large penalty, particularly for " x"
> type trigrams where the value would effectively be squared. Compared to
> the proposed values of MAX_TRGM_SIZE and WISH_TRGM_SIZE, it seems like
> you might as well forget the "sizing" approach and just flat out reject
> trigrams containing COLOR_BLANK, because they'll never get through the
> size filter.

It seems to be right decision to me when we have other trigrams can reject
them. But there are cases when we can't.

> I'm inclined to think you need a larger MAX_TRGM_SIZE;
> and WISH_TRGM_SIZE seems mighty small as well, compared to what the
> code would have done before.
>

OK, I would like to make more reasoning for penalty.
Let's consider word of length n.
It contains n+1 trigrams including:
1 __x trigram
1 _xx trigram
1 xx_ trigram
n - 2 xxx trigrams

Assume alphabet of size m those trigrams have following average frequencies:
__x 1/((n+1)*m)
_xx 1/((n+1)*m^2)
xx_ 1/((n+1)*m^2)
xxx (n-2)/((n+1)*m^3)

The ratio of this frequencies with blanks to frequency of xxx is:
__x m^2/(n-2)
_xx and xx_ m/(n-2)

In order to estimate n I've analyzed some classics:
Ernest Hemingway "A farewell to arms" - 3.8
Leo Tolstoy "War and Peace" - 5.0

In english with alphabet size = 26 we have

__x m^2/(n-2) = 375
_xx and xx_ m/(n-2) = 14.4

In russian with alphabet size = 33 we have

__x m^2/(n-2) = 363
_xx and xx_ m/(n-2) = 11

Assuming these calculations is approximate, we can safely use same values
for both languages.
Does such reasoning make sense?

Also, surely this bit:
>
> ! trgmNFA->totalTrgmCount = (int) totalTrgmSize;
>
> is flat out wrong? expandColorTrigrams() expects that
> trgmNFA->totalTrgmCount is the truth, not some penalty-bloated
> abstraction. The fact that the patch hasn't failed on you completely
> demonstrates that you've not tested any cases where blank-containing
> trigrams got through the filter, reinforcing my feeling that it's
> probably too strict now.
>

Oh, I wonder how can I leave such weird bug in patch :-(

> I wonder if it would be more appropriate to continue to measure the limit
> MAX_TRGM_COUNT in terms of actual trigram counts, and just use the
> penalized "size" as the sort key while determining which color trigrams
> to discard first.
>

Agree. But I would like to save some equivalent of WISH_TRGM_SIZE.

Another thought is that it's not clear that you should apply the same
> penalty to blanks in all three positions. Because of the padding
> settings, any one word will normally lead to padded strings " a",
> " ab", "yz ", so that spaces in the first position are about twice
> as common as spaces in the other positions. (It's a little less
> than that because single-letter words produce only " a" and " a ";
> but I'd think that such words aren't very common in text that trigrams
> are effective for.) So I'm thinking the penalty ought to take that
> into account.
>
> I'm also inclined to think that it might be worth adding a size
> field to ColorTrgmInfo rather than repeatedly calculating the
> size estimate. We only allow a thousand and some ColorTrgmInfos
> at most, so the extra space wouldn't be that large, and I'm concerned
> by the expense the current patch adds to the sort comparator.
>

Agree.

>
> Another point is that this comment:
>
> * Note: per-color-trigram counts cannot overflow an int so long as
> * COLOR_COUNT_LIMIT is not more than the cube root of INT_MAX, ie
> about
> * 1290. However, the grand total totalTrgmCount might conceivably
> * overflow an int, so we use int64 for that within this routine.
>
> is no longer valid, or at least it fails to demonstrate that the result
> of getColorTrigramSize() can't overflow an int; indeed I rather fear it
> can.
> The patch failed to even update the variable name in this comment, let
> alone address the substantive question.
>
> There are some other cosmetic things I didn't like, for instance
> colorTrgmInfoCountCmp() is no longer comparing counts but you didn't
> rename it. That's about it for substantive comments though.
>

Thanks, will be fixed.

------
With best regards,
Alexander Korotkov.

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc:	Erik Rijkers <er(at)xs4all(dot)nl>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: trgm regex index peculiarity
Date:	2014-02-09 21:01:59
Message-ID:	27899.1391979719@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Alexander Korotkov <aekorotkov(at)gmail(dot)com> writes:
> On Thu, Jan 16, 2014 at 3:34 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> I looked at this patch a bit. It seems like this:
>> + * BLANK_COLOR_SIZE - How much blank character is more frequent than
>> + * other character in average
>> + #define BLANK_COLOR_SIZE 32
>> is a drastic overestimate.

> OK, I would like to make more reasoning for penalty.
> Let's consider word of length n.
> It contains n+1 trigrams including:
> 1 __x trigram
> 1 _xx trigram
> 1 xx_ trigram
> n - 2 xxx trigrams

> Assume alphabet of size m those trigrams have following average frequencies:
> __x 1/((n+1)*m)
> _xx 1/((n+1)*m^2)
> xx_ 1/((n+1)*m^2)
> xxx (n-2)/((n+1)*m^3)

Hmm, but you're assuming that the m letters are all equally frequent,
which is surely not true in normal text.

I didn't have a machine-readable copy of Hemingway or Tolstoy at hand,
but I do have a text file with the collected works of H.P. Lovecraft,
so I tried analyzing the trigrams in that. I find that
* The average word length is 4.78 characters.
* There are 5652 distinct trigrams (discounting some containing digits),
the most common one (' t') occurring 81222 times; the average
occurrence count is 500.0.
* Considering only trigrams not containing any blanks, there are
4937 of them, the most common one ('the') occurring 45832 times,
with an average count of 266.9.
* There are (unsurprisingly) exactly 26 trigrams of the form ' x',
with an average count of 19598.5.
* There are 689 trigrams of the form ' xx' or 'xx ', the most common one
(' th') occurring 58200 times, with an average count of 1450.1.

So, we've rediscovered the fact that 'the' is the most common word in
English text ;-). But I think the significant conclusion for our purposes
here is that single-space trigrams are on average about 1450.1/266.9 = 5.4
times more common than the average space-free trigram, and two-space
trigrams about 19598.5/266.9 = 73.4 times more common.

So this leads me to the conclusion that we should be using a
BLANK_COLOR_SIZE value around 5 or 6 (with maybe something other than
a square-law rule for two-space trigrams). Even using your numbers,
it shouldn't be 32.

regards, tom lane

From:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Erik Rijkers <er(at)xs4all(dot)nl>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: trgm regex index peculiarity
Date:	2014-03-01 16:41:09
Message-ID:	CAPpHfdvjEVC8mKXXxB-VDVCm-Zhj3OLMG7zNECO7QT2Xnz_38Q@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Mon, Feb 10, 2014 at 1:01 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> Alexander Korotkov <aekorotkov(at)gmail(dot)com> writes:
> > On Thu, Jan 16, 2014 at 3:34 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> >> I looked at this patch a bit. It seems like this:
> >> + * BLANK_COLOR_SIZE - How much blank character is more frequent
> than
> >> + * other character in average
> >> + #define BLANK_COLOR_SIZE 32
> >> is a drastic overestimate.
>
> > OK, I would like to make more reasoning for penalty.
> > Let's consider word of length n.
> > It contains n+1 trigrams including:
> > 1 __x trigram
> > 1 _xx trigram
> > 1 xx_ trigram
> > n - 2 xxx trigrams
>
> > Assume alphabet of size m those trigrams have following average
> frequencies:
> > __x 1/((n+1)*m)
> > _xx 1/((n+1)*m^2)
> > xx_ 1/((n+1)*m^2)
> > xxx (n-2)/((n+1)*m^3)
>
> Hmm, but you're assuming that the m letters are all equally frequent,
> which is surely not true in normal text.
>
> I didn't have a machine-readable copy of Hemingway or Tolstoy at hand,
> but I do have a text file with the collected works of H.P. Lovecraft,
> so I tried analyzing the trigrams in that. I find that
> * The average word length is 4.78 characters.
> * There are 5652 distinct trigrams (discounting some containing digits),
> the most common one (' t') occurring 81222 times; the average
> occurrence count is 500.0.
> * Considering only trigrams not containing any blanks, there are
> 4937 of them, the most common one ('the') occurring 45832 times,
> with an average count of 266.9.
> * There are (unsurprisingly) exactly 26 trigrams of the form ' x',
> with an average count of 19598.5.
> * There are 689 trigrams of the form ' xx' or 'xx ', the most common one
> (' th') occurring 58200 times, with an average count of 1450.1.
>
> So, we've rediscovered the fact that 'the' is the most common word in
> English text ;-). But I think the significant conclusion for our purposes
> here is that single-space trigrams are on average about 1450.1/266.9 = 5.4
> times more common than the average space-free trigram, and two-space
> trigrams about 19598.5/266.9 = 73.4 times more common.
>
> So this leads me to the conclusion that we should be using a
> BLANK_COLOR_SIZE value around 5 or 6 (with maybe something other than
> a square-law rule for two-space trigrams). Even using your numbers,
> it shouldn't be 32.
>

Next revision of patch is attached. Changes are so:
1) Notion "penalty" is used instead of "size".
2) We try to reduce total penalty to WISH_TRGM_PENALTY, but restriction is
MAX_TRGM_COUNT total trigrams count.
3) Penalties are assigned to particular color trigram classes. I.e.
separate penalties for __a, _aa, _a_, aa_. It's based on analysis of
trigram frequencies in Oscar Wilde writings. We can end up with different
numbers, but I don't think they will be dramatically different.

------
With best regards,
Alexander Korotkov.

Attachment	Content-Type	Size
trgm-regex-optimize.3.patch	application/octet-stream	9.5 KB

From:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Erik Rijkers <er(at)xs4all(dot)nl>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: trgm regex index peculiarity
Date:	2014-03-28 08:31:15
Message-ID:	53353353.6080408@vmware.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

I went back and tried Erik's original test
(http://www.postgresql.org/message-id/dafad644f268ce1503e1b8b682aae38a.squirrel@webmail.xs4all.nl).
With a fresh checkout from master, the difference between the slow and
fast queries is much less dramatic than Erik reported. The reason is
that Alexander's GIN "fast scan" patch is very effective on that query.
Erik reported that the '^abcd' query took 140ms, vs 5ms for 'abcd'. On
my laptop, the numbers with a fresh checkout are about 2.5 ms vs. 1 ms,
and with fast scan disabled (by modifying the source code), 40ms vs 1ms.

So thanks to the fast scan patch, I don't think this patch is worth
pursuing anymore. Unless there are some other test case where this patch
helps, but the fast scan patch doesn't.

- Heikki

From:	"Erik Rijkers" <er(at)xs4all(dot)nl>
To:	"Heikki Linnakangas" <hlinnakangas(at)vmware(dot)com>
Cc:	"Alexander Korotkov" <aekorotkov(at)gmail(dot)com>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "pgsql-hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: trgm regex index peculiarity
Date:	2014-03-28 22:28:06
Message-ID:	32d687d2a55963c74281327fffcf7abb.squirrel@webmail.xs4all.nl
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Fri, March 28, 2014 09:31, Heikki Linnakangas wrote:
> I went back and tried Erik's original test
> (http://www.postgresql.org/message-id/dafad644f268ce1503e1b8b682aae38a.squirrel@webmail.xs4all.nl).
> With a fresh checkout from master, the difference between the slow and
> fast queries is much less dramatic than Erik reported. The reason is
> that Alexander's GIN "fast scan" patch is very effective on that query.
> Erik reported that the '^abcd' query took 140ms, vs 5ms for 'abcd'. On
> my laptop, the numbers with a fresh checkout are about 2.5 ms vs. 1 ms,
> and with fast scan disabled (by modifying the source code), 40ms vs 1ms.
>
> So thanks to the fast scan patch, I don't think this patch is worth
> pursuing anymore. Unless there are some other test case where this patch
> helps, but the fast scan patch doesn't.
>

for the same 2 statements of my original test:
explain (analyze,buffers) select txt from azjunk6 where txt ~ '^abcd'; -- slow (140 ms)
explain (analyze,buffers) select txt from azjunk6 where txt ~ 'abcd' and substr(txt,1,4) = 'abcd'; -- fast (5 ms)

You mention (from HEAD, I suppose?) a difference of 2.5 ms vs. 1 ms.

FWIW, for me the difference (from HEAD) remains quite a bit larger:

for n in `seq 1 10`; do ./trgm_peculiarity.sh ; done | grep runtime

Total runtime: 16.167 ms
Total runtime: 2.188 ms
Total runtime: 16.902 ms
Total runtime: 2.203 ms
Total runtime: 17.486 ms
Total runtime: 2.201 ms
Total runtime: 17.663 ms
Total runtime: 2.441 ms
Total runtime: 13.555 ms
Total runtime: 2.204 ms
Total runtime: 16.862 ms
Total runtime: 2.225 ms
Total runtime: 13.207 ms
Total runtime: 2.550 ms
Total runtime: 16.768 ms
Total runtime: 2.172 ms
Total runtime: 19.259 ms
Total runtime: 2.180 ms
Total runtime: 12.934 ms
Total runtime: 2.198 ms

That's a lot better than the original 140ms vs 5ms but your laptop's 2.5 ms vs. 1 ms is perhaps not representative.

(for the full plans see below)

Erik Rijkers

----------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on azjunk6 (cost=56.77..432.93 rows=100 width=81) (actual time=15.898..15.925 rows=2 loops=1)
Recheck Cond: (txt ~ '^abcd'::text)
Rows Removed by Index Recheck: 11
Heap Blocks: exact=13
Buffers: shared hit=105
-> Bitmap Index Scan on azjunk6_trgm_re_idx (cost=0.00..56.75 rows=100 width=0) (actual time=15.834..15.834 rows=13
loops=1)
Index Cond: (txt ~ '^abcd'::text)
Buffers: shared hit=92
Planning time: 3.304 ms
Total runtime: 16.179 ms
(10 rows)

Time: 21.103 ms
explain (analyze,buffers) select txt from azjunk6 where txt ~ 'abcd' and substr(txt,1,4) = 'abcd'; -- fast (5 ms)
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on azjunk6 (cost=28.75..405.40 rows=1 width=81) (actual time=1.681..2.164 rows=2 loops=1)
Recheck Cond: (txt ~ 'abcd'::text)
Rows Removed by Index Recheck: 11
Filter: (substr(txt, 1, 4) = 'abcd'::text)
Rows Removed by Filter: 101
Heap Blocks: exact=113
Buffers: shared hit=120
-> Bitmap Index Scan on azjunk6_trgm_re_idx (cost=0.00..28.75 rows=100 width=0) (actual time=1.171..1.171 rows=114
loops=1)
Index Cond: (txt ~ 'abcd'::text)
Buffers: shared hit=7
Planning time: 0.516 ms
Total runtime: 2.183 ms
(12 rows)

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	"Erik Rijkers" <er(at)xs4all(dot)nl>
Cc:	"Heikki Linnakangas" <hlinnakangas(at)vmware(dot)com>, "Alexander Korotkov" <aekorotkov(at)gmail(dot)com>, "pgsql-hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: trgm regex index peculiarity
Date:	2014-04-05 18:23:29
Message-ID:	17221.1396722209@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

"Erik Rijkers" <er(at)xs4all(dot)nl> writes:
> On Fri, March 28, 2014 09:31, Heikki Linnakangas wrote:
>> So thanks to the fast scan patch, I don't think this patch is worth
>> pursuing anymore. Unless there are some other test case where this patch
>> helps, but the fast scan patch doesn't.

> FWIW, for me the difference (from HEAD) remains quite a bit larger:
> ...
> That's a lot better than the original 140ms vs 5ms but your laptop's 2.5 ms vs. 1 ms is perhaps not representative.

I also feel that Alexander's patch is still worth pursuing. What I see
(in an assert-disabled build of HEAD) is that Erik's "slow" query takes
about 2 ms to plan and 5.5 ms to execute, while the "fast" query is about
0.7 ms to plan and 1.3 ms to execute. With the patch, the "slow" query is
still about 2 ms to plan but only 0.3 ms to execute, while the "fast"
query doesn't change much. There's a lot of noise in these numbers
(successive executions seem to be bouncing around more than usual today),
but still it looks like the runtime gain is impressive percentagewise.

One thing that's interesting is that the planning time is so much more for
the "slow" query, even though the "fast" one has an additional WHERE
clause that the planner has to think about. I haven't tried to do perf
measurements to be sure, but my guess is that this has nothing to do with
GIN or pg_trgm directly, but represents the planner's efforts to get a
selectivity estimate for the ~ operator --- that code works much
differently for patterns that define fixed prefixes vs those that don't.

Anyway, the important point here is that I think the planning time
is helping to mask the fact that there's a pretty useful runtime
improvement.

regards, tom lane

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc:	Erik Rijkers <er(at)xs4all(dot)nl>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: trgm regex index peculiarity
Date:	2014-04-06 00:52:09
Message-ID:	29409.1396745529@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Alexander Korotkov <aekorotkov(at)gmail(dot)com> writes:
> Next revision of patch is attached. Changes are so:
> 1) Notion "penalty" is used instead of "size".
> 2) We try to reduce total penalty to WISH_TRGM_PENALTY, but restriction is
> MAX_TRGM_COUNT total trigrams count.
> 3) Penalties are assigned to particular color trigram classes. I.e.
> separate penalties for __a, _aa, _a_, aa_. It's based on analysis of
> trigram frequencies in Oscar Wilde writings. We can end up with different
> numbers, but I don't think they will be dramatically different.

Committed with cosmetic improvements (adjusting the comments mostly).

The new whitespace penalties look reasonably sane to me. I wonder though
if WISH_TRGM_PENALTY is too small --- it seems like this code will tend to
select many fewer trigrams than the old code did. What testing did you do
that led you to select the specific value of 16?

regards, tom lane