Re: Wildcard search support for pg_trgm

Lists: pgsql-hackers
From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Wildcard search support for pg_trgm
Date: 2010-12-11 21:07:32
Message-ID: AANLkTi=2iL7_ZdQgnm2d1qiNC2rJM-faCVtFu2K5A8Lz@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hackers,

Here is first version of patch, which enable index support of wildcard
search in pg_trgm contrib module. The idea of the patch is to extract from
wildcard trigrams which should occurs in wildcard matching string. For
example, for '%sector%' wildcard such trigrams would be: 'sec', 'ect',
'tor'.

create table words (word text);
copy words from '/usr/share/dict/american-english';

test=# explain analyze select * from words where word ilike '%independ%';
QUERY PLAN

------------------------------------------------------------------------------------------------------
Seq Scan on words (cost=0.00..1703.11 rows=10 width=9) (actual
time=18.818..174.146 rows=7 loops=1)
Filter: (word ~~* '%independ%'::text)
Total runtime: 174.200 ms
(3 rows)

CREATE INDEX trgm_idx ON words USING gist (word gist_trgm_ops);

test=# explain analyze select * from words where word ilike '%independ%';
QUERY PLAN

------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on words (cost=4.36..40.11 rows=10 width=9) (actual
time=2.445..2.529 rows=7 loops=1)
Recheck Cond: (word ~~* '%independ%'::text)
-> Bitmap Index Scan on trgm_idx (cost=0.00..4.35 rows=10 width=0)
(actual time=2.406..2.406 rows=7 loops=1)
Index Cond: (word ~~* '%independ%'::text)
Total runtime: 2.612 ms
(5 rows)

CREATE INDEX trgm_idx ON words USING gin (word gin_trgm_ops);

test=# explain analyze select * from words where word ilike '%independ%';
QUERY PLAN

-------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on words (cost=76.08..111.83 rows=10 width=9) (actual
time=2.675..2.755 rows=7 loops=1)
Recheck Cond: (word ~~* '%independ%'::text)
-> Bitmap Index Scan on trgm_idx (cost=0.00..76.07 rows=10 width=0)
(actual time=2.642..2.642 rows=7 loops=1)
Index Cond: (word ~~* '%independ%'::text)
Total runtime: 2.839 ms
(5 rows)

I've encountered with following problems:
1) Indexing support for ilike is possible only with case-insensetive
wildcards, e.g. when IGNORECASE macro is enabled. But I can't use this macro
in pg_trgm.sql.in, where list of operators is defined. Probably, is it
enuogh to put comment near IGNORECASE, which tells that if one disable this
macro he should also remove oparators from pg_trgm.sql.in?
2) I found gist index not very useful with default SIGLENINT = 3. I've
changed this value to 15 and I found gist index performs very good on
dictionary. But on longer strings greater values of SIGLENINT may be
required (probably even SIGLENINT > 122 will give benefit in some cases in
spite of TOAST).

----
With best regards,
Alexander Korotkov.

Attachment Content-Type Size
trgm_wildcard-0.1.patch text/x-patch 13.6 KB

From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Wildcard search support for pg_trgm
Date: 2010-12-12 11:27:25
Message-ID: AANLkTin3Nz_4tCxq-xhYGCxpwVhLHAZYYCR0=Eayv2Y9@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

I found another problem. GIN index suffers from "GIN indexes do not support
whole-index scans" when no trigram can be extracted from pattern.

----
With best regards,
Alexander Korotkov.


From: Dimitri Fontaine <dimitri(at)2ndQuadrant(dot)fr>
To: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Wildcard search support for pg_trgm
Date: 2010-12-12 21:14:05
Message-ID: m28vzuwv6q.fsf@2ndQuadrant.fr
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Alexander Korotkov <aekorotkov(at)gmail(dot)com> writes:
> Here is first version of patch, which enable index support of wildcard
> search in pg_trgm contrib module.

How different (and better) is it from wildspeed?

http://www.sai.msu.su/~megera/wiki/wildspeed

Regards,
--
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support


From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Dimitri Fontaine <dimitri(at)2ndquadrant(dot)fr>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Wildcard search support for pg_trgm
Date: 2010-12-12 21:53:26
Message-ID: AANLkTikPF5SC9aWHkmG8s3fL7mC5k87rhTMGZdu6c9Xs@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Dec 13, 2010 at 12:14 AM, Dimitri Fontaine
<dimitri(at)2ndquadrant(dot)fr>wrote:

> How different (and better) is it from wildspeed?
>

The general advantage is possibility of usage wildcard search and trigram
similarity search using the same index.
I expect that GIN trigram index is slightly less space demanding, but
slightly slower on search than wildspeed. Also, I expect GiST trigram index
to be slower on search, but faster on updates. While I didn't check these
assumptions in details.
I've lack of test datasets for sufficient testing, and I would like to ask
community to help me with it. Because testing on dictionaries is good,
but obviously
not enough.

----
With best regards,
Alexander Korotkov.


From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Wildcard search support for pg_trgm
Date: 2011-01-08 22:37:06
Message-ID: AANLkTi=_Cu5SMi3kdRS32BheoX4mtLPW=4dzuVNpsGQD@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

I updated my patch to make it use full index scan in GIN index which is
possible thanks to recent Tom Lane patch. Now wildcard, where no trigram can
be extracted from, invokes full index scan, which is slow but correct.

test=# explain (analyze, buffers) select * from words where word ilike
'%in%';
QUERY PLAN

------------------------------------------------------------------------------------------------------------
Seq Scan on words (cost=0.00..1703.11 rows=15930 width=9) (actual
time=0.333..225.817 rows=16558 loops=1)
Filter: (word ~~* '%in%'::text)
Buffers: shared read=471
Total runtime: 248.207 ms
(4 rows)

test=# set enable_seqscan = off;
SET
test=# explain (analyze, buffers) select * from words where word ilike
'%in%';
QUERY PLAN

--------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on words (cost=2287.46..2957.59 rows=15930 width=9)
(actual time=122.239..331.993
rows=16558 loops=1)
Recheck Cond: (word ~~* '%in%'::text)
Buffers: shared hit=472 read=1185
-> Bitmap Index Scan on trgm_idx (cost=0.00..2283.48 rows=15930
width=0) (actual time=122.022..122.022 rows=98569 loops=1)
Index Cond: (word ~~* '%in%'::text)
Buffers: shared hit=1 read=1185
Total runtime: 354.409 ms
(7 rows)

As an alternative solution I can propose to extract null item from every
string and ivoke scan on that item instead of full index scan. It requires
to store additional item per each string but it makes full scan fast.

Also I found a segfault when execute the query above and
switch enable_seqscan few times on line "*searchMode =
GIN_SEARCH_MODE_ALL;". Is it a bug in GIN or I'm missing something?

Here goes backtrace from gdb:
#0 0xb4ead070 in gin_extract_query_trgm (fcinfo=0xbfcd8da8) at
trgm_gin.c:112
#1 0x08323a84 in OidFunctionCall5 (functionId=32802, arg1=161269768,
arg2=3217920208, arg3=4,
arg4=3217920204, arg5=3217920200) at fmgr.c:1687
#2 0x082c5654 in gincostestimate (fcinfo=0xbfcd9148) at selfuncs.c:6466
#3 0x083235d8 in OidFunctionCall9 (functionId=2741, arg1=161270176,
arg2=161271296,
arg3=161824624, arg4=0, arg5=0, arg6=3217921064, arg7=3217921056,
arg8=3217921048,
arg9=3217921040) at fmgr.c:1840
#4 0x081f3397 in cost_index (path=0x9a55050, root=0x99cc9a0,
index=0x99cce00,
indexQuals=0x9a53f70, indexOrderBys=0x0, outer_rel=0x0) at
costsize.c:268
#5 0x08216b66 in create_index_path (root=0x99cc9a0, index=0x99cce00,
clause_groups=0x9a53f88,
indexorderbys=0x0, pathkeys=0x0, indexscandir=NoMovementScanDirection,
outer_rel=0x0)
at pathnode.c:511
#6 0x081f7ef5 in find_usable_indexes (root=<value optimized out>,
rel=<value optimized out>,
clauses=<value optimized out>, outer_clauses=0x0, istoplevel=1 '\001',
outer_rel=0x0,
saop_control=SAOP_FORBID, scantype=ST_ANYSCAN) at indxpath.c:422
#7 0x081f8e38 in create_index_paths (root=0x99cc9a0, rel=0x99ccc30) at
indxpath.c:176
#8 0x081eec22 in set_plain_rel_pathlist (root=<value optimized out>,
rel=<value optimized out>,
rti=<value optimized out>, rte=0x99cc650) at allpaths.c:262
#9 set_rel_pathlist (root=<value optimized out>, rel=<value optimized
out>,
rti=<value optimized out>, rte=0x99cc650) at allpaths.c:202
#10 0x081efa55 in set_base_rel_pathlists (root=0x99cc9a0,
joinlist=0x99ccde8) at allpaths.c:158
#11 make_one_rel (root=0x99cc9a0, joinlist=0x99ccde8) at allpaths.c:94
#12 0x08203ef7 in query_planner (root=0x99cc9a0, tlist=0x99ccb00,
tuple_fraction=0,
limit_tuples=-1, cheapest_path=0xbfcd98cc, sorted_path=0xbfcd98c8,
num_groups=0xbfcd98c0)
at planmain.c:271
#13 0x08205b86 in grouping_planner (root=0x99cc9a0, tuple_fraction=0) at
planner.c:1182
#14 0x08207609 in subquery_planner (glob=0x99cc910, parse=0x99cc5a0,
parent_root=0x0,
hasRecursion=0 '\000', tuple_fraction=0, subroot=0xbfcd9a7c) at
planner.c:536
#15 0x08207ca6 in standard_planner (parse=0x99cc5a0, cursorOptions=0,
boundParams=0x0)
at planner.c:201
#16 0x0825db11 in pg_plan_query (querytree=0x99cc5a0, cursorOptions=0,
boundParams=0x0)
at postgres.c:764
#17 0x0815a824 in ExplainOneQuery (stmt=0x9a258e0,
queryString=0x9a24c60 "explain (analyze, buffers) select * from words
where word ilike '%in%';",---Type <return> to continue, or q <return> to
quit---
params=0x0, dest=0x9a32330) at explain.c:300
#18 ExplainQuery (stmt=0x9a258e0,
queryString=0x9a24c60 "explain (analyze, buffers) select * from words
where word ilike '%in%';", params=0x0, dest=0x9a32330) at explain.c:209
#19 0x08261266 in PortalRunUtility (portal=0x9a4d6a8,
utilityStmt=0x9a258e0,
isTopLevel=<value optimized out>, dest=0x9a32330,
completionTag=0xbfcd9bcc "") at pquery.c:1191
#20 0x082622a4 in FillPortalStore (portal=0x9a4d6a8, isTopLevel=32 ' ') at
pquery.c:1065
#21 0x0826281a in PortalRun (portal=0x9a4d6a8, count=2147483647,
isTopLevel=-56 '\310',
dest=0x9a25ee8, altdest=0x9a25ee8, completionTag=0xbfcd9d9c "") at
pquery.c:791
#22 0x0825ec78 in exec_simple_query (query_string=<value optimized out>) at
postgres.c:1059
#23 0x0825fe01 in PostgresMain (argc=2, argv=0x99aeb68, username=0x99aead8
"smagen")
at postgres.c:3939
#24 0x08229250 in BackendRun () at postmaster.c:3577
#25 BackendStartup () at postmaster.c:3262
#26 ServerLoop () at postmaster.c:1448
#27 0x0822be12 in PostmasterMain (argc=3, argv=0x99acc58) at
postmaster.c:1109
#28 0x081ce3b7 in main (argc=3, argv=0x99acc58) at main.c:199

----
With best regards,
Alexander Korotkov.

Attachment Content-Type Size
trgm_wildcard-0.2.patch text/x-patch 16.9 KB

From: Jan Urbański <wulczer(at)wulczer(dot)org>
To: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Wildcard search support for pg_trgm
Date: 2011-01-14 20:29:39
Message-ID: 4D30B233.6000909@wulczer.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 08/01/11 23:37, Alexander Korotkov wrote:
> I updated my patch to make it use full index scan in GIN index which is
> possible thanks to recent Tom Lane patch. Now wildcard, where no trigram can
> be extracted from, invokes full index scan, which is slow but correct.

Hi,

unfortunately, this change made the patch not apply:
http://git.postgresql.org/gitweb?p=postgresql.git;a=commit;h=be0c3ea2d30ba225f0249ae88d6b0bdf3b753162

I'm getting rejects in trgm_gin.c. Could you update the patch please?

Cheers,
Jan


From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Jan Urbański <wulczer(at)wulczer(dot)org>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Wildcard search support for pg_trgm
Date: 2011-01-17 09:03:57
Message-ID: AANLkTikVxx6_AjZB52OnA7MBzijCPDqSZoMCD-3u1ATq@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi,

Here is updated version of this patch.

----
With best regards,
Alexander Korotkov.

Attachment Content-Type Size
trgm_wildcard-0.3.patch text/x-patch 16.5 KB