Re: pg_trgm partial-match

From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: pg_trgm partial-match
Date: 2012-11-22 17:07:58
Message-ID: CAHGQGwGS84ccD+gwEmc34_guV8qbFmC9+ORmZZZG=Qdu-o=TPw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Nov 19, 2012 at 7:55 PM, Alexander Korotkov
<aekorotkov(at)gmail(dot)com> wrote:
> On Mon, Nov 19, 2012 at 10:05 AM, Alexander Korotkov <aekorotkov(at)gmail(dot)com>
> wrote:
>>
>> On Thu, Nov 15, 2012 at 11:39 PM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
>> wrote:
>>>
>>> Note that we cannot do a partial-match if KEEPONLYALNUM is disabled,
>>> i.e., if query key contains multibyte characters. In this case, byte
>>> length of
>>> the trigram string might be larger than three, and its CRC is used as a
>>> trigram key instead of the trigram string itself. Because of using CRC,
>>> we
>>> cannot do a partial-match. Attached patch extends pg_trgm so that it
>>> compares a partial-match query key only when KEEPONLYALNUM is
>>> enabled.
>>
>>
>> Didn't get this point. How does KEEPONLYALNUM guarantee that each trigram
>> character is singlebyte?
>>
>> CREATE TABLE test (val TEXT);
>> INSERT INTO test VALUES ('aa'), ('aaa'), ('шaaш');
>> CREATE INDEX trgm_idx ON test USING gin (val gin_trgm_ops);
>> ANALYZE test;
>> test=# SELECT * FROM test WHERE val LIKE '%aa%';
>> val
>> ------
>> aa
>> aaa
>> шaaш
>> (3 rows)
>> test=# set enable_seqscan = off;
>> SET
>> test=# SELECT * FROM test WHERE val LIKE '%aa%';
>> val
>> -----
>> aa
>> aaa
>> (2 rows)
>>
>> I think we can use partial match only for singlebyte encodings. Or, at
>> most, in cases when all alpha-numeric characters are singlebyte (have no
>> idea how to check this).

Good catch! You're right.

> Actually, I also was fiddling around idea of partial match on trigrams when
> I was working on initial LIKE patch. But, I concluded that we would need a
> separate opclass which always keeps full trigram in entry.

Agreed.

My goal is to enable pg_trgm's full-text search using the keyword which
consists of one or two Japanese characters to use an index efficiently.
I first implemented the partial-match patch, and was planning to introduce
new opclass next CommitFest to store the multibyte characters to
the text data instead of int4. But obviously the order of development
should have been the opposite. I will work on the latter development first,
and add new opclass like gin_trgm_mb_ops (mb means multibyte) which
ignores KEEPONLYALNUM and stores the GIN index key as text value.

Regards,

--
Fujii Masao

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Fujii Masao 2012-11-22 17:11:37 Re: pg_trgm partial-match
Previous Message Boszormenyi Zoltan 2012-11-22 16:57:01 Re: [PATCH] Make pg_basebackup configure and start standby [Review]