Re: pg_trgm: unicode string not working

Lists: pgsql-hackers
From: Sushant Sinha <sushant354(at)gmail(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: pg_trgm: unicode string not working
Date: 2011-06-12 11:33:40
Message-ID: 1307878420.1880.4.camel@dragflick
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

I am using pg_trgm for spelling correction as prescribed in the
documentation. But I see that it does not work for unicode sring. The
database was initialized with utf8 encoding and the C locale.

Here is the table:
\d words
Table "public.words"
Column | Type | Modifiers
--------+---------+-----------
word | text |
ndoc | integer |
nentry | integer |
Indexes:
"words_idx" gin (word gin_trgm_ops)

Query: select word from words where word % 'कतद';

I get an error:

ERROR: GIN indexes do not support whole-index scans

Any idea what is wrong?

-Sushant.


From: Florian Pflug <fgp(at)phlo(dot)org>
To: sushant354(at)gmail(dot)com
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: pg_trgm: unicode string not working
Date: 2011-06-12 12:40:06
Message-ID: 0A0622C3-C778-4C6A-9345-0A6D81929BCF@phlo.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi

Next time, please post questions regarding the usage of postgres
to the -general list, not to -hackers. The purpose of -hackers is
to discuss the development of postgres proper, not the development
of applications using postgres.

On Jun12, 2011, at 13:33 , Sushant Sinha wrote:
> I am using pg_trgm for spelling correction as prescribed in the
> documentation. But I see that it does not work for unicode sring. The
> database was initialized with utf8 encoding and the C locale.

I think you need to use a locale (more precisely, a CTYPE) in which
'क', 'त', 'द' are considered to be alphanumeric.

You can specify the CTYPE when creating the database with
CREATE DATABASE ... LC_CTYPE = ...

> Here is the table:
> \d words
> Table "public.words"
> Column | Type | Modifiers
> --------+---------+-----------
> word | text |
> ndoc | integer |
> nentry | integer |
> Indexes:
> "words_idx" gin (word gin_trgm_ops)
>
> Query: select word from words where word % 'कतद';
>
> I get an error:
>
> ERROR: GIN indexes do not support whole-index scans

pg_trgm probably ignores non-alphanumeric characters during
comparison, so you end up with an empty search string, which
translates to a whole-index scan. Postgres up to 9.0 does
not support such scans for GIN indices.

Note that this restriction was removed in postgres 9.1 which
is currently in beta. However, GIT indices must be re-created
with REINDEX after upgrading from 9.0 to leverage that
improvement.

best regards.
Florian Pflug


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Florian Pflug <fgp(at)phlo(dot)org>
Cc: sushant354(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: pg_trgm: unicode string not working
Date: 2011-06-13 02:58:00
Message-ID: BANLkTinMxhj6DwMWU8tboWSe99hJ_SNraA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sun, Jun 12, 2011 at 8:40 AM, Florian Pflug <fgp(at)phlo(dot)org> wrote:
> Note that this restriction was removed in postgres 9.1 which
> is currently in beta. However, GIT indices must be re-created
> with REINDEX after upgrading from 9.0 to leverage that
> improvement.

Does pg_upgrade know about this?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Florian Pflug <fgp(at)phlo(dot)org>, sushant354(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: pg_trgm: unicode string not working
Date: 2011-06-13 23:47:31
Message-ID: 201106132347.p5DNlVp03226@momjian.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Robert Haas wrote:
> On Sun, Jun 12, 2011 at 8:40 AM, Florian Pflug <fgp(at)phlo(dot)org> wrote:
> > Note that this restriction was removed in postgres 9.1 which
> > is currently in beta. However, GIT indices must be re-created
> > with REINDEX after upgrading from 9.0 to leverage that
> > improvement.
>
> Does pg_upgrade know about this?

No, it does not. Under what circumstances should I issue a suggestion
to reindex, and what should the text be?

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: Florian Pflug <fgp(at)phlo(dot)org>, sushant354(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: pg_trgm: unicode string not working
Date: 2011-06-14 01:58:57
Message-ID: BANLkTi=4YX6UsGQCR9s-88Qejtnskbsi6g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Jun 13, 2011 at 7:47 PM, Bruce Momjian <bruce(at)momjian(dot)us> wrote:
> Robert Haas wrote:
>> On Sun, Jun 12, 2011 at 8:40 AM, Florian Pflug <fgp(at)phlo(dot)org> wrote:
>> > Note that this restriction was removed in postgres 9.1 which
>> > is currently in beta. However, GIT indices must be re-created
>> > with REINDEX after upgrading from 9.0 to leverage that
>> > improvement.
>>
>> Does pg_upgrade know about this?
>
> No, it does not.  Under what circumstances should I issue a suggestion
> to reindex, and what should the text be?

It sounds like GIN indexes need to be reindexed after upgrading from <
9.1 to >= 9.1.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Florian Pflug <fgp(at)phlo(dot)org>, sushant354(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: pg_trgm: unicode string not working
Date: 2011-06-14 02:11:16
Message-ID: 201106140211.p5E2BGc29320@momjian.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Robert Haas wrote:
> On Mon, Jun 13, 2011 at 7:47 PM, Bruce Momjian <bruce(at)momjian(dot)us> wrote:
> > Robert Haas wrote:
> >> On Sun, Jun 12, 2011 at 8:40 AM, Florian Pflug <fgp(at)phlo(dot)org> wrote:
> >> > Note that this restriction was removed in postgres 9.1 which
> >> > is currently in beta. However, GIT indices must be re-created
> >> > with REINDEX after upgrading from 9.0 to leverage that
> >> > improvement.
> >>
> >> Does pg_upgrade know about this?
> >
> > No, it does not. ?Under what circumstances should I issue a suggestion
> > to reindex, and what should the text be?
>
> It sounds like GIN indexes need to be reindexed after upgrading from <
> 9.1 to >= 9.1.

I already have some GIN tests I used for 8.3 to 8.4 so that is easy, but
is the reindex required or just suggested for features?

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, Florian Pflug <fgp(at)phlo(dot)org>, sushant354(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: pg_trgm: unicode string not working
Date: 2011-06-14 05:15:20
Message-ID: 17138.1308028520@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> On Mon, Jun 13, 2011 at 7:47 PM, Bruce Momjian <bruce(at)momjian(dot)us> wrote:
>> No, it does not. Under what circumstances should I issue a suggestion
>> to reindex, and what should the text be?

> It sounds like GIN indexes need to be reindexed after upgrading from <
> 9.1 to >= 9.1.

Only if you care whether they work for corner cases such as empty
arrays ... corner cases which didn't work before 9.1, so very likely
you don't care.

I'm not sure that pg_upgrade is a good vehicle for dispensing such
advice, anyway. At least in the Red Hat packaging, end users will never
read what it prints, unless maybe it fails outright and they're trying
to debug why.

regards, tom lane


From: Florian Pflug <fgp(at)phlo(dot)org>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, sushant354(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: pg_trgm: unicode string not working
Date: 2011-06-14 09:02:37
Message-ID: 3363CE45-F058-4EC3-A952-78E1C52B0826@phlo.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Jun14, 2011, at 07:15 , Tom Lane wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> On Mon, Jun 13, 2011 at 7:47 PM, Bruce Momjian <bruce(at)momjian(dot)us> wrote:
>>> No, it does not. Under what circumstances should I issue a suggestion
>>> to reindex, and what should the text be?
>
>> It sounds like GIN indexes need to be reindexed after upgrading from <
>> 9.1 to >= 9.1.
>
> Only if you care whether they work for corner cases such as empty
> arrays ... corner cases which didn't work before 9.1, so very likely
> you don't care.

We also already say "To fix this, do REINDEX INDEX ... " in the errhint
of "old GIN indexes do not support whole-index scans nor searches for nulls".

best regards,
Florian Pflug


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, Florian Pflug <fgp(at)phlo(dot)org>, sushant354(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: pg_trgm: unicode string not working
Date: 2011-06-14 12:08:22
Message-ID: BANLkTi=-OFD6SGzjifLSkUMJMSKO--8GLw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Jun 14, 2011 at 1:15 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> I'm not sure that pg_upgrade is a good vehicle for dispensing such
> advice, anyway.  At least in the Red Hat packaging, end users will never
> read what it prints, unless maybe it fails outright and they're trying
> to debug why.

In my experience to date, that happens 100% of the time. :-(

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company