Re: Can pg_trgm handle non-alphanumeric characters?

Lists: pgsql-hackers
From: "MauMau" <maumau307(at)gmail(dot)com>
To: <pgsql-hackers(at)postgresql(dot)org>
Subject: Can pg_trgm handle non-alphanumeric characters?
Date: 2012-05-09 12:10:07
Message-ID: 4906C797079C43C0BD2DA41D438975C6@maumau
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hello,

This question may be appropriate for pgsql-general, but let me ask here
because the only relevant discussion seems to have been done on
pgsql-hackers:

http://archives.postgresql.org/pgsql-hackers/2011-09/msg00169.php

Can pg_trgm support non-alphanumeric characters by undefining KEEPONLYALNUM
in contrib/pg_trgm/trgm.h? If no, what kind of problems would happen?

Regards
MauMau


From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: MauMau <maumau307(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Can pg_trgm handle non-alphanumeric characters?
Date: 2012-05-09 17:12:29
Message-ID: CAHGQGwGSKRn18ug54aAWYEd_ECsgW76aXL5pETz-MY-fYoWeeQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, May 9, 2012 at 9:10 PM, MauMau <maumau307(at)gmail(dot)com> wrote:
> Hello,
>
> This question may be appropriate for pgsql-general, but let me ask here
> because the only relevant discussion seems to have been done on
> pgsql-hackers:
>
> http://archives.postgresql.org/pgsql-hackers/2011-09/msg00169.php
>
> Can pg_trgm support non-alphanumeric characters by undefining KEEPONLYALNUM
> in contrib/pg_trgm/trgm.h?

Yes unless I'm missing something.

Regards,

--
Fujii Masao


From: "MauMau" <maumau307(at)gmail(dot)com>
To: "Fujii Masao" <masao(dot)fujii(at)gmail(dot)com>
Cc: <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Can pg_trgm handle non-alphanumeric characters?
Date: 2012-05-09 22:17:35
Message-ID: EDD9E3244A6F43A68304F76CE48937E1@maumau
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

From: "Fujii Masao" <masao(dot)fujii(at)gmail(dot)com>
> On Wed, May 9, 2012 at 9:10 PM, MauMau <maumau307(at)gmail(dot)com> wrote:
>> This question may be appropriate for pgsql-general, but let me ask here
>> because the only relevant discussion seems to have been done on
>> pgsql-hackers:
>>
>> http://archives.postgresql.org/pgsql-hackers/2011-09/msg00169.php
>>
>> Can pg_trgm support non-alphanumeric characters by undefining
>> KEEPONLYALNUM
>> in contrib/pg_trgm/trgm.h?
>
> Yes unless I'm missing something.

Then, does it make sense to remove "#define KEEPONLYALNUM" in 9.1.4? Would
it cause any problems? If no, I wish that, because it eliminates the need to
do the removal every time the users applies minor releases.

Regards
MauMau


From: Euler Taveira <euler(at)timbira(dot)com>
To: MauMau <maumau307(at)gmail(dot)com>
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Can pg_trgm handle non-alphanumeric characters?
Date: 2012-05-09 23:18:54
Message-ID: 4FAAFB5E.2030602@timbira.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 09-05-2012 19:17, MauMau wrote:
> Then, does it make sense to remove "#define KEEPONLYALNUM" in 9.1.4? Would it
> cause any problems? If no, I wish that, because it eliminates the need to do
> the removal every time the users applies minor releases.
>
If you do so, you'll break minor versions. IMHO the default is the desirable
behavior for almost all use cases (you are the first one that complain about
it). Maybe in the future, we should be able to flip this flag without
rebuilding binaries.

--
Euler Taveira de Oliveira - Timbira http://www.timbira.com.br/
PostgreSQL: Consultoria, Desenvolvimento, Suporte 24x7 e Treinamento


From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Euler Taveira <euler(at)timbira(dot)com>
Cc: MauMau <maumau307(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Can pg_trgm handle non-alphanumeric characters?
Date: 2012-05-09 23:52:04
Message-ID: CAHGQGwHFnSsq4zupvYYu1ca6o7xMoZ_fTnEJ=9BnvMxWkM2GQA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, May 10, 2012 at 8:18 AM, Euler Taveira <euler(at)timbira(dot)com> wrote:
> On 09-05-2012 19:17, MauMau wrote:
>> Then, does it make sense to remove "#define KEEPONLYALNUM" in 9.1.4? Would it
>> cause any problems? If no, I wish that, because it eliminates the need to do
>> the removal every time the users applies minor releases.
>>
> If you do so, you'll break minor versions.

Right. And removing KEEPONLYALNUM is a feature change rather than bug fix,
so that should be proposed during major version development cycle.

> IMHO the default is the desirable
> behavior for almost all use cases (you are the first one that complain about
> it).

Really? I was thinking non-English users (including me) basicaly would not be
satisfied with the default because they cannot use pg_trgm for N-gram full text
search of non-English text. Though I agree some users would prefer the default.

> Maybe in the future, we should be able to flip this flag without
> rebuilding binaries.

Agreed.

Regards,

--
Fujii Masao


From: "MauMau" <maumau307(at)gmail(dot)com>
To: "Fujii Masao" <masao(dot)fujii(at)gmail(dot)com>, "Euler Taveira" <euler(at)timbira(dot)com>
Cc: <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Can pg_trgm handle non-alphanumeric characters?
Date: 2012-05-10 11:45:31
Message-ID: 6A03E28C25F24EA1AD941E37DDFD90D5@maumau
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

From: "Fujii Masao" <masao(dot)fujii(at)gmail(dot)com>
> On Thu, May 10, 2012 at 8:18 AM, Euler Taveira <euler(at)timbira(dot)com> wrote:
>> On 09-05-2012 19:17, MauMau wrote:
>>> Then, does it make sense to remove "#define KEEPONLYALNUM" in 9.1.4?
>>> Would it
>>> cause any problems? If no, I wish that, because it eliminates the need
>>> to do
>>> the removal every time the users applies minor releases.
>>>
>> If you do so, you'll break minor versions.
>
> Right. And removing KEEPONLYALNUM is a feature change rather than bug fix,
> so that should be proposed during major version development cycle.

For information, what kind of breakage would occur? Is it performance
degradation, extra index storage consumption, or undesirable query results?
I imagined removing KEEPONLYALNUM would just accept non-alphanumeric
characters and cause no harm to those who use only alphanumeric characters.

Regards
MauMau


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "Fujii Masao" <masao(dot)fujii(at)gmail(dot)com>, "MauMau" <maumau307(at)gmail(dot)com>,"Euler Taveira" <euler(at)timbira(dot)com>
Cc: <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Can pg_trgm handle non-alphanumeric characters?
Date: 2012-05-10 14:43:08
Message-ID: 4FAB8DAC0200002500047ACD@gw.wicourts.gov
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

"MauMau" <maumau307(at)gmail(dot)com> wrote:

>>> On 09-05-2012 19:17, MauMau wrote:
>>>> Then, does it make sense to remove "#define KEEPONLYALNUM" in
>>>> 9.1.4? Would it cause any problems?

Yes, it will cause problems.

> For information, what kind of breakage would occur?

> I imagined removing KEEPONLYALNUM would just accept
> non-alphanumeric characters and cause no harm to those who use
> only alphanumeric characters.

This would break our current usages because of the handling of
trigrams at the "edges" of groups of qualifying characters. It
would make similarity (and distance) values less useful for our
current name searches using it. To simulate the effect, I used an
'8' in place of a comma instead of recompiling with the suggested
change.

test=# select show_trgm('smith,john');
show_trgm
-----------------------------------------------------------
{" j"," s"," jo"," sm","hn ",ith,joh,mit,ohn,smi,"th "}
(1 row)

test=# select show_trgm('smith8john');
show_trgm
-----------------------------------------------------
{" s"," sm",8jo,h8j,"hn ",ith,joh,mit,ohn,smi,th8}
(1 row)

test=# select similarity('smith,john', 'jon smith');
similarity
------------
0.615385
(1 row)

test=# select similarity('smith8john', 'jon smith');
similarity
------------
0.3125
(1 row)

So making the proposed change unconditionally could indeed hurt
current users of the technique. On the other hand, if there was
fine-grained control of this, it might make trigrams useful for
searching statute cites (using all characters) as well as names
(using the current character set); so I wouldn't want it to just be
controlled by a global GUC.

-Kevin


From: "MauMau" <maumau307(at)gmail(dot)com>
To: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>, "Fujii Masao" <masao(dot)fujii(at)gmail(dot)com>, "Euler Taveira" <euler(at)timbira(dot)com>
Cc: <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Can pg_trgm handle non-alphanumeric characters?
Date: 2012-05-10 15:07:59
Message-ID: DD0DD117F67E48E5961C326E5C050A3E@maumau
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
> "MauMau" <maumau307(at)gmail(dot)com> wrote:
>> For information, what kind of breakage would occur?
>
>> I imagined removing KEEPONLYALNUM would just accept
>> non-alphanumeric characters and cause no harm to those who use
>> only alphanumeric characters.
>
> This would break our current usages because of the handling of
> trigrams at the "edges" of groups of qualifying characters. It
> would make similarity (and distance) values less useful for our
> current name searches using it. To simulate the effect, I used an
> '8' in place of a comma instead of recompiling with the suggested
> change.
>
> test=# select show_trgm('smith,john');
> show_trgm
> -----------------------------------------------------------
> {" j"," s"," jo"," sm","hn ",ith,joh,mit,ohn,smi,"th "}
> (1 row)
>
> test=# select show_trgm('smith8john');
> show_trgm
> -----------------------------------------------------
> {" s"," sm",8jo,h8j,"hn ",ith,joh,mit,ohn,smi,th8}
> (1 row)
>
> test=# select similarity('smith,john', 'jon smith');
> similarity
> ------------
> 0.615385
> (1 row)
>
> test=# select similarity('smith8john', 'jon smith');
> similarity
> ------------
> 0.3125
> (1 row)
>
> So making the proposed change unconditionally could indeed hurt
> current users of the technique. On the other hand, if there was
> fine-grained control of this, it might make trigrams useful for
> searching statute cites (using all characters) as well as names
> (using the current character set); so I wouldn't want it to just be
> controlled by a global GUC.

Thanks for your explanation. Although I haven't understood it well yet, I'll
consider what you taught. And I'll consider if the tentative measure of
removing KEEPONLYALNUM is correct for someone who wants to use pg_trgm
against Japanese text.

Regards
MauMau


From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: MauMau <maumau307(at)gmail(dot)com>
Cc: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, Euler Taveira <euler(at)timbira(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Can pg_trgm handle non-alphanumeric characters?
Date: 2012-05-10 16:59:24
Message-ID: CAHGQGwFDzxn2z3VPMi39pWUf+Cd3ZF-s7tMyPiiKYvZPftiF5Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, May 11, 2012 at 12:07 AM, MauMau <maumau307(at)gmail(dot)com> wrote:
> Thanks for your explanation. Although I haven't understood it well yet, I'll
> consider what you taught. And I'll consider if the tentative measure of
> removing KEEPONLYALNUM is correct for someone who wants to use pg_trgm
> against Japanese text.

In Japanese, it's common to do a text search with two characters keyword.
But since pg_trgm is 3-gram, you basically would not be able to use index
for such text search. So you might need something like pg_bigm or pg_unigm
for Japanese text search.

Regards,

--
Fujii Masao


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: MauMau <maumau307(at)gmail(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, Euler Taveira <euler(at)timbira(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Can pg_trgm handle non-alphanumeric characters?
Date: 2012-05-10 19:11:57
Message-ID: 22869.1336677117@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Fujii Masao <masao(dot)fujii(at)gmail(dot)com> writes:
> On Fri, May 11, 2012 at 12:07 AM, MauMau <maumau307(at)gmail(dot)com> wrote:
>> Thanks for your explanation. Although I haven't understood it well yet, I'll
>> consider what you taught. And I'll consider if the tentative measure of
>> removing KEEPONLYALNUM is correct for someone who wants to use pg_trgm
>> against Japanese text.

> In Japanese, it's common to do a text search with two characters keyword.
> But since pg_trgm is 3-gram, you basically would not be able to use index
> for such text search. So you might need something like pg_bigm or pg_unigm
> for Japanese text search.

I believe the trigrams are three *bytes* not three characters. So a
couple of kanji should work just fine for this.

regards, tom lane


From: "MauMau" <maumau307(at)gmail(dot)com>
To: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>, "Fujii Masao" <masao(dot)fujii(at)gmail(dot)com>, "Euler Taveira" <euler(at)timbira(dot)com>
Cc: <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Can pg_trgm handle non-alphanumeric characters?
Date: 2012-05-11 13:11:21
Message-ID: 96F7F5F69579425F9B42EAAA8B19CD14@maumau
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
> "MauMau" <maumau307(at)gmail(dot)com> wrote:
>> For information, what kind of breakage would occur?
>
>> I imagined removing KEEPONLYALNUM would just accept
>> non-alphanumeric characters and cause no harm to those who use
>> only alphanumeric characters.
>
> This would break our current usages because of the handling of
> trigrams at the "edges" of groups of qualifying characters. It
> would make similarity (and distance) values less useful for our
> current name searches using it. To simulate the effect, I used an
> '8' in place of a comma instead of recompiling with the suggested
> change.
>
> test=# select show_trgm('smith,john');
> show_trgm
> -----------------------------------------------------------
> {" j"," s"," jo"," sm","hn ",ith,joh,mit,ohn,smi,"th "}
> (1 row)
>
> test=# select show_trgm('smith8john');
> show_trgm
> -----------------------------------------------------
> {" s"," sm",8jo,h8j,"hn ",ith,joh,mit,ohn,smi,th8}
> (1 row)
>
> test=# select similarity('smith,john', 'jon smith');
> similarity
> ------------
> 0.615385
> (1 row)
>
> test=# select similarity('smith8john', 'jon smith');
> similarity
> ------------
> 0.3125
> (1 row)
>
> So making the proposed change unconditionally could indeed hurt
> current users of the technique. On the other hand, if there was
> fine-grained control of this, it might make trigrams useful for
> searching statute cites (using all characters) as well as names
> (using the current character set); so I wouldn't want it to just be
> controlled by a global GUC.

Thank you for your concise examples. I probably got it.

From your examples, I thought KEEPONLYALNUM controls whether
non-alphanumeric characters are included in trigrams, though I haven't read
the code of pg_trgm. So, removing KEEPONLYALNUM definition produces trigrams
unnecessary for users who handle only alphanumeric text. That would lead to
undesirable query results.

Then, I wonder what would be the ideal specification...to add
alphanumeric/non-alphanumeric boolean switch to similarity() function, add
non-alphanumeric version of operators (ex. %* and <->*) and non-alphanumeric
version of operator classes (ex. gin_allchars_trgm_ops)? At least, I
understood the fix is not appropriate for minor releases.

Regards
MauMau


From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: MauMau <maumau307(at)gmail(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, Euler Taveira <euler(at)timbira(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Can pg_trgm handle non-alphanumeric characters?
Date: 2012-05-11 15:53:57
Message-ID: CAHGQGwHMru9oYhcPSHr39tU_cnggw7+kX8BJjh6yT4o4_DB2GQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, May 11, 2012 at 4:11 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Fujii Masao <masao(dot)fujii(at)gmail(dot)com> writes:
>> On Fri, May 11, 2012 at 12:07 AM, MauMau <maumau307(at)gmail(dot)com> wrote:
>>> Thanks for your explanation. Although I haven't understood it well yet, I'll
>>> consider what you taught. And I'll consider if the tentative measure of
>>> removing KEEPONLYALNUM is correct for someone who wants to use pg_trgm
>>> against Japanese text.
>
>> In Japanese, it's common to do a text search with two characters keyword.
>> But since pg_trgm is 3-gram, you basically would not be able to use index
>> for such text search. So you might need something like pg_bigm or pg_unigm
>> for Japanese text search.

Even if an index can be used for two characters text search, bitmap index scan
picks up all rows, so it's too slow.

> I believe the trigrams are three *bytes* not three characters.  So a
> couple of kanji should work just fine for this.

Really? As far as I read the code of pg_trgm, the trigram is three characters
and its CRC32 is used as an index key if its size is more than three bytes.

Regards,

--
Fujii Masao