Re: Patch: pg_trgm: gin index scan performance for similarity search

Lists: pgsql-hackers
From: Fornaroli Christophe <cfornaro(at)gmail(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Patch: pg_trgm: gin index scan performance for similarity search
Date: 2015-12-24 15:28:14
Message-ID: CAEnM=mmtpunGtWQTA20wkeNzMrdGViTWMJ33=xmsyzQJPD5VYQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi,

I think that we can improve the gin index scan performance for similarity
search defined in the pg_trgm extension. The similarity function is (for
the default case where DIVUNION is defined in the code):

count / (len1 + len2 - count) >= trgm_limit

where
len1 is the number of unique trigrams for the first string,
len2 is the same number for the second string,
count is the number of common trigrams between both strings,
trgm_limit is a user specfied limit in [0, 1].

The code used to determine if a tuple may match the query string is:

case SimilarityStrategyNumber:
/* Count the matches */
ntrue = 0;
for (i = 0; i < nkeys; i++)
{
if (check[i] != GIN_FALSE)
ntrue++;
}
#ifdef DIVUNION
res = (nkeys == ntrue) ? GIN_MAYBE : (((((float4) ntrue) /
((float4) (nkeys - ntrue))) >= trgm_limit) ? GIN_MAYBE : GIN_FALSE);
#else
res = (nkeys == 0) ? GIN_FALSE : (((((float4) ntrue) /
((float4) nkeys)) >= trgm_limit) ? GIN_MAYBE : GIN_FALSE);
#endif

where
ntrue is the number of common trigrams in both strings,
nkeys is the number of trigrams in the search string.

This code uses this upper bound for the similarity: ntrue / (nkeys -
ntrue). But if there is ntrue trigrams in common, we know that the indexed
string is at least ntrue trigrams long. We can then use a more aggressive
upper bound: ntrue / (ntrue + nkeys - ntrue) or ntrue / nkeys. Attached is
a patch that changes this.

Here are some performance gains with this test case:

create table foo as select
substring(md5(random()::text) for random() * 5) || '123' as bar
from generate_series(1,1000000);

create index on foo using gin (bar gin_trgm_ops);

patched:

test=# explain analyze select count(*) from foo where bar % 'abc123';
QUERY PLAN

---------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=2511.14..2511.15 rows=1 width=0) (actual
time=807.434..807.435 rows=1 loops=1)
-> Bitmap Heap Scan on foo (cost=99.75..2508.64 rows=1000 width=0)
(actual time=109.893..787.261 rows=54746 loops=1)
Recheck Cond: (bar % 'abc123'::text)
Rows Removed by Index Recheck: 55125
Heap Blocks: exact=4514
-> Bitmap Index Scan on foo_bar_idx (cost=0.00..99.50 rows=1000
width=0) (actual time=108.456..108.456 rows=109871 loops=1)
Index Cond: (bar % 'abc123'::text)
Planning time: 0.353 ms
Execution time: 807.593 ms
(9 rows)

test=# explain analyze select count(*) from foo where bar % 'abcdef';
QUERY PLAN

--------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=2511.14..2511.15 rows=1 width=0) (actual
time=4.829..4.830 rows=1 loops=1)
-> Bitmap Heap Scan on foo (cost=99.75..2508.64 rows=1000 width=0)
(actual time=3.512..4.794 rows=5 loops=1)
Recheck Cond: (bar % 'abcdef'::text)
Rows Removed by Index Recheck: 137
Heap Blocks: exact=139
-> Bitmap Index Scan on foo_bar_idx (cost=0.00..99.50 rows=1000
width=0) (actual time=3.355..3.355 rows=142 loops=1)
Index Cond: (bar % 'abcdef'::text)
Planning time: 0.363 ms
Execution time: 5.061 ms
(9 rows)

master:

test=# explain analyze select count(*) from foo where bar % 'abc123';
QUERY PLAN

---------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=2511.14..2511.15 rows=1 width=0) (actual
time=6416.554..6416.554 rows=1 loops=1)
-> Bitmap Heap Scan on foo (cost=99.75..2508.64 rows=1000 width=0)
(actual time=484.359..6389.819 rows=54746 loops=1)
Recheck Cond: (bar % 'abc123'::text)
Rows Removed by Index Recheck: 945250
Heap Blocks: exact=4514
-> Bitmap Index Scan on foo_bar_idx (cost=0.00..99.50 rows=1000
width=0) (actual time=482.677..482.677 rows=999996 loops=1)
Index Cond: (bar % 'abc123'::text)
Planning time: 0.359 ms
Execution time: 6416.945 ms
(9 rows)

test=# explain analyze select count(*) from foo where bar % 'abcdef';
QUERY PLAN

---------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=2511.14..2511.15 rows=1 width=0) (actual
time=30.678..30.679 rows=1 loops=1)
-> Bitmap Heap Scan on foo (cost=99.75..2508.64 rows=1000 width=0)
(actual time=9.020..30.643 rows=5 loops=1)
Recheck Cond: (bar % 'abcdef'::text)
Rows Removed by Index Recheck: 2789
Heap Blocks: exact=2110
-> Bitmap Index Scan on foo_bar_idx (cost=0.00..99.50 rows=1000
width=0) (actual time=7.696..7.696 rows=2794 loops=1)
Index Cond: (bar % 'abcdef'::text)
Planning time: 0.254 ms
Execution time: 30.809 ms
(9 rows)

Cheers,

Christophe

Attachment Content-Type Size
pg_trgm-gin-similarity-v1.patch text/x-patch 1.1 KB

From: Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>
To: Fornaroli Christophe <cfornaro(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Patch: pg_trgm: gin index scan performance for similarity search
Date: 2015-12-24 18:06:09
Message-ID: CAPpHfduvmuQRzmKUWG-i0EgAw=NhDH=3PfDQ6jdnpsxcSx0GvA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi, Christophe!

On Thu, Dec 24, 2015 at 6:28 PM, Fornaroli Christophe <cfornaro(at)gmail(dot)com>
wrote:

> This code uses this upper bound for the similarity: ntrue / (nkeys -
> ntrue). But if there is ntrue trigrams in common, we know that the indexed
> string is at least ntrue trigrams long. We can then use a more aggressive
> upper bound: ntrue / (ntrue + nkeys - ntrue) or ntrue / nkeys. Attached is
> a patch that changes this.
>

​Good catch, thank you! The estimate in pg_trgm was not optimal.
I think it would be good to add comment which would explicitly state why do
we use this upper bound.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


From: Teodor Sigaev <teodor(at)sigaev(dot)ru>
To: Fornaroli Christophe <cfornaro(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Patch: pg_trgm: gin index scan performance for similarity search
Date: 2015-12-25 10:10:44
Message-ID: 567D1624.6090702@sigaev.ru
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Good catch, committed.

Fornaroli Christophe wrote:
> Hi,
>
> I think that we can improve the gin index scan performance for similarity search
> defined in the pg_trgm extension. The similarity function is (for the default
> case where DIVUNION is defined in the code):
>
> count / (len1 + len2 - count) >= trgm_limit
>
> where
> len1 is the number of unique trigrams for the first string,
> len2 is the same number for the second string,
> count is the number of common trigrams between both strings,
> trgm_limit is a user specfied limit in [0, 1].
>
> The code used to determine if a tuple may match the query string is:
>
> case SimilarityStrategyNumber:
> /* Count the matches */
> ntrue = 0;
> for (i = 0; i < nkeys; i++)
> {
> if (check[i] != GIN_FALSE)
> ntrue++;
> }
> #ifdef DIVUNION
> res = (nkeys == ntrue) ? GIN_MAYBE : (((((float4) ntrue) /
> ((float4) (nkeys - ntrue))) >= trgm_limit) ? GIN_MAYBE : GIN_FALSE);
> #else
> res = (nkeys == 0) ? GIN_FALSE : (((((float4) ntrue) / ((float4)
> nkeys)) >= trgm_limit) ? GIN_MAYBE : GIN_FALSE);
> #endif
>
> where
> ntrue is the number of common trigrams in both strings,
> nkeys is the number of trigrams in the search string.
>
> This code uses this upper bound for the similarity: ntrue / (nkeys - ntrue). But
> if there is ntrue trigrams in common, we know that the indexed string is at
> least ntrue trigrams long. We can then use a more aggressive upper bound: ntrue
> / (ntrue + nkeys - ntrue) or ntrue / nkeys. Attached is a patch that changes this.
>
> Here are some performance gains with this test case:
>
> create table foo as select
> substring(md5(random()::text) for random() * 5) || '123' as bar
> from generate_series(1,1000000);
>
> create index on foo using gin (bar gin_trgm_ops);
>
> patched:
>
> test=# explain analyze select count(*) from foo where bar % 'abc123';
> QUERY PLAN
> ---------------------------------------------------------------------------------------------------------------------------------------
> Aggregate (cost=2511.14..2511.15 rows=1 width=0) (actual
> time=807.434..807.435 rows=1 loops=1)
> -> Bitmap Heap Scan on foo (cost=99.75..2508.64 rows=1000 width=0) (actual
> time=109.893..787.261 rows=54746 loops=1)
> Recheck Cond: (bar % 'abc123'::text)
> Rows Removed by Index Recheck: 55125
> Heap Blocks: exact=4514
> -> Bitmap Index Scan on foo_bar_idx (cost=0.00..99.50 rows=1000
> width=0) (actual time=108.456..108.456 rows=109871 loops=1)
> Index Cond: (bar % 'abc123'::text)
> Planning time: 0.353 ms
> Execution time: 807.593 ms
> (9 rows)
>
> test=# explain analyze select count(*) from foo where bar % 'abcdef';
> QUERY PLAN
> --------------------------------------------------------------------------------------------------------------------------------
> Aggregate (cost=2511.14..2511.15 rows=1 width=0) (actual time=4.829..4.830
> rows=1 loops=1)
> -> Bitmap Heap Scan on foo (cost=99.75..2508.64 rows=1000 width=0) (actual
> time=3.512..4.794 rows=5 loops=1)
> Recheck Cond: (bar % 'abcdef'::text)
> Rows Removed by Index Recheck: 137
> Heap Blocks: exact=139
> -> Bitmap Index Scan on foo_bar_idx (cost=0.00..99.50 rows=1000
> width=0) (actual time=3.355..3.355 rows=142 loops=1)
> Index Cond: (bar % 'abcdef'::text)
> Planning time: 0.363 ms
> Execution time: 5.061 ms
> (9 rows)
>
>
> master:
>
> test=# explain analyze select count(*) from foo where bar % 'abc123';
> QUERY PLAN
> ---------------------------------------------------------------------------------------------------------------------------------------
> Aggregate (cost=2511.14..2511.15 rows=1 width=0) (actual
> time=6416.554..6416.554 rows=1 loops=1)
> -> Bitmap Heap Scan on foo (cost=99.75..2508.64 rows=1000 width=0) (actual
> time=484.359..6389.819 rows=54746 loops=1)
> Recheck Cond: (bar % 'abc123'::text)
> Rows Removed by Index Recheck: 945250
> Heap Blocks: exact=4514
> -> Bitmap Index Scan on foo_bar_idx (cost=0.00..99.50 rows=1000
> width=0) (actual time=482.677..482.677 rows=999996 loops=1)
> Index Cond: (bar % 'abc123'::text)
> Planning time: 0.359 ms
> Execution time: 6416.945 ms
> (9 rows)
>
> test=# explain analyze select count(*) from foo where bar % 'abcdef';
> QUERY PLAN
> ---------------------------------------------------------------------------------------------------------------------------------
> Aggregate (cost=2511.14..2511.15 rows=1 width=0) (actual time=30.678..30.679
> rows=1 loops=1)
> -> Bitmap Heap Scan on foo (cost=99.75..2508.64 rows=1000 width=0) (actual
> time=9.020..30.643 rows=5 loops=1)
> Recheck Cond: (bar % 'abcdef'::text)
> Rows Removed by Index Recheck: 2789
> Heap Blocks: exact=2110
> -> Bitmap Index Scan on foo_bar_idx (cost=0.00..99.50 rows=1000
> width=0) (actual time=7.696..7.696 rows=2794 loops=1)
> Index Cond: (bar % 'abcdef'::text)
> Planning time: 0.254 ms
> Execution time: 30.809 ms
> (9 rows)
>
>
> Cheers,
>
> Christophe
>
>
>

--
Teodor Sigaev E-mail: teodor(at)sigaev(dot)ru
WWW: http://www.sigaev.ru/


From: Fornaroli Christophe <cfornaro(at)gmail(dot)com>
To: Teodor Sigaev <teodor(at)sigaev(dot)ru>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Patch: pg_trgm: gin index scan performance for similarity search
Date: 2015-12-25 10:19:48
Message-ID: CAEnM=mk-5ZiLS+yPQucY9geLtS6oW2hkuWQiyCWmg0k_2ExhYQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Dec 25, 2015 at 11:10 AM, Teodor Sigaev <teodor(at)sigaev(dot)ru> wrote:

> Good catch, committed.
>

Thank you!