pg_trgm partial-match

Lists: pgsql-hackers
From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: pg_trgm partial-match
Date: 2012-11-15 19:39:21
Message-ID: CAHGQGwFJshvV2nGME19wdTW9teFw_w7h2ns4E+YYsjkB9WdWDQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi,

I'd like to propose to extend pg_trgm so that it can compare a partial-match
query key to a GIN index. IOW, I'm thinking to implement the 'comparePartial'
GIN method for pg_trgm.

Currently, when the query key is less than three characters, we cannot use
a GIN index (+ pg_trgm) efficiently, because pg_trgm doesn't support a
partial-match method. In this case, seq scan or index full scan would be
executed, and its response time would be very slow. I'd like to alleviate this
problem.

Note that we cannot do a partial-match if KEEPONLYALNUM is disabled,
i.e., if query key contains multibyte characters. In this case, byte length of
the trigram string might be larger than three, and its CRC is used as a
trigram key instead of the trigram string itself. Because of using CRC, we
cannot do a partial-match. Attached patch extends pg_trgm so that it
compares a partial-match query key only when KEEPONLYALNUM is
enabled.

Attached patch is WIP yet. What I should do next is:

* version up pg_trgm from 1.0 to 1.1, i.e., create pg_trgm--1.1.sql, etc.
* write the regression test

Comments? Review? Objection?

Regards,

--
Fujii Masao

Attachment Content-Type Size
trgm_compare_partial_v0.patch application/octet-stream 5.4 KB

From: Tomas Vondra <tv(at)fuzzy(dot)cz>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: pg_trgm partial-match
Date: 2012-11-19 01:56:29
Message-ID: 50A991CD.5090901@fuzzy.cz
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 15.11.2012 20:39, Fujii Masao wrote:
> Hi,
>
> I'd like to propose to extend pg_trgm so that it can compare a partial-match
> query key to a GIN index. IOW, I'm thinking to implement the 'comparePartial'
> GIN method for pg_trgm.
>
> Currently, when the query key is less than three characters, we cannot use
> a GIN index (+ pg_trgm) efficiently, because pg_trgm doesn't support a
> partial-match method. In this case, seq scan or index full scan would be
> executed, and its response time would be very slow. I'd like to alleviate this
> problem.
>
> Note that we cannot do a partial-match if KEEPONLYALNUM is disabled,
> i.e., if query key contains multibyte characters. In this case, byte length of
> the trigram string might be larger than three, and its CRC is used as a
> trigram key instead of the trigram string itself. Because of using CRC, we
> cannot do a partial-match. Attached patch extends pg_trgm so that it
> compares a partial-match query key only when KEEPONLYALNUM is
> enabled.
>
> Attached patch is WIP yet. What I should do next is:
>
> * version up pg_trgm from 1.0 to 1.1, i.e., create pg_trgm--1.1.sql, etc.
> * write the regression test
>
> Comments? Review? Objection?

Hi,

I've done a quick review of the current patch:

(1) It applies cleanly on the current master and builds fine.

(2) In pg_trgm--1.0.sql the gin_trgm_compare_partial is indented
differently (using tabs instead of spaces).

(3) In trgm_gin.c, function gin_extract_value_trgm contains #ifdef
KEEPONLYALNUM, although trgm_pmatch is not used at all.

(4) The patch removes some commented-out variables, but there still
remain various commented-out variables. Will this be cleaned too?

(5) I've done basic functionality of the patch, it really seems to work:

CREATE EXTENSION pg_trgm ;
CREATE TABLE TEST (val TEXT);
INSERT INTO test
SELECT md5(i::text) FROM generate_series(1,1000000) s(i);
CREATE INDEX trgm_idx ON test USING gin (val gin_trgm_ops);
ANALYZE test;

EXPLAIN SELECT * FROM test WHERE val LIKE '%aa%';
QUERY PLAN
------------------------------------------------------------------------
Bitmap Heap Scan on test (cost=1655.96..11757.63 rows=141414 width=33)
Recheck Cond: (val ~~ '%aa%'::text)
-> Bitmap Index Scan on trgm_idx (cost=0.00..1620.61 rows=141414
width=0)
Index Cond: (val ~~ '%aa%'::text)
(4 rows)

Without the patch, this gives a seq scan (as expected).

Do you expect to update the docs too? IMHO it's worth mentioning that
the pg_trgm can handle even patterns shorter than 2 chars ...

regards
Tomas Vondra


From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: pg_trgm partial-match
Date: 2012-11-19 06:05:12
Message-ID: CAPpHfdvuQPDgckJZWcgq=ggQGO1whvPQBKp+pEdzyW=4jVyK=Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi!

On Thu, Nov 15, 2012 at 11:39 PM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:

> Note that we cannot do a partial-match if KEEPONLYALNUM is disabled,
> i.e., if query key contains multibyte characters. In this case, byte
> length of
> the trigram string might be larger than three, and its CRC is used as a
> trigram key instead of the trigram string itself. Because of using CRC, we
> cannot do a partial-match. Attached patch extends pg_trgm so that it
> compares a partial-match query key only when KEEPONLYALNUM is
> enabled.
>

Didn't get this point. How does KEEPONLYALNUM guarantee that each trigram
character is singlebyte?

CREATE TABLE test (val TEXT);
INSERT INTO test VALUES ('aa'), ('aaa'), ('шaaш');
CREATE INDEX trgm_idx ON test USING gin (val gin_trgm_ops);
ANALYZE test;
test=# SELECT * FROM test WHERE val LIKE '%aa%';
val
------
aa
aaa
шaaш
(3 rows)
test=# set enable_seqscan = off;
SET
test=# SELECT * FROM test WHERE val LIKE '%aa%';
val
-----
aa
aaa
(2 rows)

I think we can use partial match only for singlebyte encodings. Or, at
most, in cases when all alpha-numeric characters are singlebyte (have no
idea how to check this).

------
With best regards,
Alexander Korotkov.


From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: pg_trgm partial-match
Date: 2012-11-19 10:55:08
Message-ID: CAPpHfdtTc2UqLXu98LMxPNdOhuxnSZrPPv2xv8i10SY+CGnaFg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Nov 19, 2012 at 10:05 AM, Alexander Korotkov
<aekorotkov(at)gmail(dot)com>wrote:

> On Thu, Nov 15, 2012 at 11:39 PM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>wrote:
>
>> Note that we cannot do a partial-match if KEEPONLYALNUM is disabled,
>> i.e., if query key contains multibyte characters. In this case, byte
>> length of
>> the trigram string might be larger than three, and its CRC is used as a
>> trigram key instead of the trigram string itself. Because of using CRC, we
>> cannot do a partial-match. Attached patch extends pg_trgm so that it
>> compares a partial-match query key only when KEEPONLYALNUM is
>> enabled.
>>
>
> Didn't get this point. How does KEEPONLYALNUM guarantee that each trigram
> character is singlebyte?
>
> CREATE TABLE test (val TEXT);
> INSERT INTO test VALUES ('aa'), ('aaa'), ('шaaш');
> CREATE INDEX trgm_idx ON test USING gin (val gin_trgm_ops);
> ANALYZE test;
> test=# SELECT * FROM test WHERE val LIKE '%aa%';
> val
> ------
> aa
> aaa
> шaaш
> (3 rows)
> test=# set enable_seqscan = off;
> SET
> test=# SELECT * FROM test WHERE val LIKE '%aa%';
> val
> -----
> aa
> aaa
> (2 rows)
>
> I think we can use partial match only for singlebyte encodings. Or, at
> most, in cases when all alpha-numeric characters are singlebyte (have no
> idea how to check this).
>

Actually, I also was fiddling around idea of partial match on trigrams when
I was working on initial LIKE patch. But, I concluded that we would need a
separate opclass which always keeps full trigram in entry.

------
With best regards,
Alexander Korotkov.


From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: pg_trgm partial-match
Date: 2012-11-22 17:07:58
Message-ID: CAHGQGwGS84ccD+gwEmc34_guV8qbFmC9+ORmZZZG=Qdu-o=TPw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Nov 19, 2012 at 7:55 PM, Alexander Korotkov
<aekorotkov(at)gmail(dot)com> wrote:
> On Mon, Nov 19, 2012 at 10:05 AM, Alexander Korotkov <aekorotkov(at)gmail(dot)com>
> wrote:
>>
>> On Thu, Nov 15, 2012 at 11:39 PM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
>> wrote:
>>>
>>> Note that we cannot do a partial-match if KEEPONLYALNUM is disabled,
>>> i.e., if query key contains multibyte characters. In this case, byte
>>> length of
>>> the trigram string might be larger than three, and its CRC is used as a
>>> trigram key instead of the trigram string itself. Because of using CRC,
>>> we
>>> cannot do a partial-match. Attached patch extends pg_trgm so that it
>>> compares a partial-match query key only when KEEPONLYALNUM is
>>> enabled.
>>
>>
>> Didn't get this point. How does KEEPONLYALNUM guarantee that each trigram
>> character is singlebyte?
>>
>> CREATE TABLE test (val TEXT);
>> INSERT INTO test VALUES ('aa'), ('aaa'), ('шaaш');
>> CREATE INDEX trgm_idx ON test USING gin (val gin_trgm_ops);
>> ANALYZE test;
>> test=# SELECT * FROM test WHERE val LIKE '%aa%';
>> val
>> ------
>> aa
>> aaa
>> шaaш
>> (3 rows)
>> test=# set enable_seqscan = off;
>> SET
>> test=# SELECT * FROM test WHERE val LIKE '%aa%';
>> val
>> -----
>> aa
>> aaa
>> (2 rows)
>>
>> I think we can use partial match only for singlebyte encodings. Or, at
>> most, in cases when all alpha-numeric characters are singlebyte (have no
>> idea how to check this).

Good catch! You're right.

> Actually, I also was fiddling around idea of partial match on trigrams when
> I was working on initial LIKE patch. But, I concluded that we would need a
> separate opclass which always keeps full trigram in entry.

Agreed.

My goal is to enable pg_trgm's full-text search using the keyword which
consists of one or two Japanese characters to use an index efficiently.
I first implemented the partial-match patch, and was planning to introduce
new opclass next CommitFest to store the multibyte characters to
the text data instead of int4. But obviously the order of development
should have been the opposite. I will work on the latter development first,
and add new opclass like gin_trgm_mb_ops (mb means multibyte) which
ignores KEEPONLYALNUM and stores the GIN index key as text value.

Regards,

--
Fujii Masao


From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Tomas Vondra <tv(at)fuzzy(dot)cz>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: pg_trgm partial-match
Date: 2012-11-22 17:11:37
Message-ID: CAHGQGwGLasCsWqPM7C7oTy6rQ8hyRQuLSHUM2kGg+ayACWBVkg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Nov 19, 2012 at 10:56 AM, Tomas Vondra <tv(at)fuzzy(dot)cz> wrote:
> I've done a quick review of the current patch:

Thanks for the commit!

As Alexander pointed out upthread, another infrastructure patch is required
before applying this patch. So I will implement the infra patch first.

Regards,

--
Fujii Masao


From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Tomas Vondra <tv(at)fuzzy(dot)cz>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: pg_trgm partial-match
Date: 2013-01-16 16:13:16
Message-ID: CAHGQGwFULhFCWoGUmV6vFp2vn-qYMuNDiY+uuLd-6qZM40PcMg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Nov 23, 2012 at 2:11 AM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
> On Mon, Nov 19, 2012 at 10:56 AM, Tomas Vondra <tv(at)fuzzy(dot)cz> wrote:
>> I've done a quick review of the current patch:
>
> Thanks for the commit!
>
> As Alexander pointed out upthread, another infrastructure patch is required
> before applying this patch. So I will implement the infra patch first.

I marked this patch as "Returned with Feedback" because unfortunately
I don't have enough time to revise the patch. I will retry this maybe in 9.4dev
phase.

Regards,

--
Fujii Masao