Re: Case-Insensitve Text Comparison

Lists: pgsql-hackers
From: "David E(dot) Wheeler" <david(at)kineticode(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Case-Insensitve Text Comparison
Date: 2008-06-02 03:53:25
Message-ID: D45A8C1C-1387-4B97-B329-67C80A3F1095@kineticode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Howdy,

I'm sure I'm just showing off my ignorance here, but here goes…

I really need case-insensitive string comparison in my database.
Ideally there'd be a nice ITEXT data type (and friends, ichar,
ivarchar, etc.). But of course there isn't, and for years I've just
used LOWER() on indexes and queries to get the same result.

Only it turns out that I'm of course not getting the same result. This
script:

#!/usr/local/bin/perl -w

use strict;
use warnings;
use utf8;
binmode STDOUT, ':utf8';
use DBI;

my $dbh = DBI->connect('dbi:Pg:dbname=try', 'postgres', '',
{ pg_enable_utf8 => 1 });
for my $char qw( À Á Â Ã Ä Å Ç Ć Č Ĉ Ċ Ď Đ A B C D ) {
print "$char: ", $dbh->selectrow_array('SELECT LOWER(?)', undef,
$char ), $/;
}

Yields this output:

À: À
Á: Á
Â: Â
Ã: Ã
Ä: Ä
Å: Å
Ç: Ç
Ć: Ć
Č: Č
Ĉ: Ĉ
Ċ: Ċ
Ď: Ď
Đ: Đ
A: a
B: b
C: c
D: d

So it doesn't really work on anything other than ASCII, it looks like.
So I have two questions:

1. Does the use of the tolower() C function in the citext data type on
pgfoundry basically give me the same results as using lower() in my
SQL has for all these years? IOW, does it convert letters to lowercase
in the same way that the LOWER() SQL function does? If so, I think I
might start to use it for my case-insensitive columns and simplify my
SQL a bit.

http://pgfoundry.org/projects/citext/

2. Isn't the ICU library distributed with PostgreSQL? And if so, could
it not be used to create proper case conversions in LOWER() and
friends and, ultimately, to create a case-insensitive text type in
core? I'm seeing that it has a constant named U_COMPARE_IGNORE_CASE
that can be used with its unorm_compare() function:

http://www.icu-project.org/apiref/icu4c/unorm_8h.html#6cc4c8b0d5be7ce1ac4b600ace7817f5
http://www.icu-project.org/apiref/icu4c/unorm_8h.html#991e0fe6f0d062dd6e8e924517f3f437

I don't really know C, but if that's stuff there, can't we take
advantage of it for proper case-insensitive comparisons (and
conversions)?

Thanks,

David


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "David E(dot) Wheeler" <david(at)kineticode(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Case-Insensitve Text Comparison
Date: 2008-06-02 04:08:35
Message-ID: 2335.1212379715@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

"David E. Wheeler" <david(at)kineticode(dot)com> writes:
> I really need case-insensitive string comparison in my database.

Okay ... according to whose locale?

> Ideally there'd be a nice ITEXT data type (and friends, ichar,
> ivarchar, etc.). But of course there isn't, and for years I've just
> used LOWER() on indexes and queries to get the same result.

> Only it turns out that I'm of course not getting the same result.

I think that means you're not using the right locale.

> 1. Does the use of the tolower() C function in the citext data type on
> pgfoundry basically give me the same results as using lower() in my
> SQL has for all these years?

[ broken record... ] Kinda depends on your locale. However, tolower()
is 100% guaranteed not to work for multibyte encodings, so citext is
quite useless if you're using UTF8. This is fixable, no doubt, but
it's not fixed in the project as it stands.

> 2. Isn't the ICU library distributed with PostgreSQL?

Nope, it is not, and we have already pretty much determined that we
do not want to make Postgres depend on ICU. See the archives.

regards, tom lane


From: Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM>
To: "David E(dot) Wheeler" <david(at)kineticode(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Case-Insensitve Text Comparison
Date: 2008-06-02 04:13:58
Message-ID: 48437386.5080604@sun.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

David E. Wheeler napsal(a):
> Howdy,
>
> I'm sure I'm just showing off my ignorance here, but here goes…
>
> I really need case-insensitive string comparison in my database.

Collation per database level should be help you. It is now under development and
I hope it will be part of 8.4. You can see
http://archives.postgresql.org/pgsql-hackers/2008-05/msg00857.php

Zdenek


From: "David E(dot) Wheeler" <david(at)kineticode(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Case-Insensitve Text Comparison
Date: 2008-06-02 05:13:07
Message-ID: BC402551-7E1B-4561-8F64-E098C6D58241@kineticode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Jun 1, 2008, at 21:08, Tom Lane wrote:

> "David E. Wheeler" <david(at)kineticode(dot)com> writes:
>> I really need case-insensitive string comparison in my database.
>
> Okay ... according to whose locale?

I'm using C. Of course you're correct that it depends on the locale, I
always forget that. But does not the Unicode standard offer up some
sort locale-independent case-insensitivity, so that it gets it right
some large percentage of the time?

>> Ideally there'd be a nice ITEXT data type (and friends, ichar,
>> ivarchar, etc.). But of course there isn't, and for years I've just
>> used LOWER() on indexes and queries to get the same result.
>
>> Only it turns out that I'm of course not getting the same result.
>
> I think that means you're not using the right locale.

What locale is right? If I have a Web app, there could be data in many
different languages in a single table/column.

>> 1. Does the use of the tolower() C function in the citext data type
>> on
>> pgfoundry basically give me the same results as using lower() in my
>> SQL has for all these years?
>
> [ broken record... ] Kinda depends on your locale. However,
> tolower()
> is 100% guaranteed not to work for multibyte encodings, so citext is
> quite useless if you're using UTF8. This is fixable, no doubt, but
> it's not fixed in the project as it stands.

Right, okay; thanks. I'm thinking about using it for email addresses
and domain names, however, so it might be adequate for those
applications.

>> 2. Isn't the ICU library distributed with PostgreSQL?
>
> Nope, it is not, and we have already pretty much determined that we
> do not want to make Postgres depend on ICU. See the archives.

Damn. Okay, thanks.

David


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "David E(dot) Wheeler" <david(at)kineticode(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Case-Insensitve Text Comparison
Date: 2008-06-02 05:18:55
Message-ID: 3319.1212383935@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

"David E. Wheeler" <david(at)kineticode(dot)com> writes:
> On Jun 1, 2008, at 21:08, Tom Lane wrote:
>> Okay ... according to whose locale?

> I'm using C. Of course you're correct that it depends on the locale, I
> always forget that. But does not the Unicode standard offer up some
> sort locale-independent case-insensitivity, so that it gets it right
> some large percentage of the time?

Not really, and in any case the C locale completely disables any
knowledge of Unicode. C locale knows about 7-bit ASCII and nothing
more.

regards, tom lane


From: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
To: "David E(dot) Wheeler" <david(at)kineticode(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Case-Insensitve Text Comparison
Date: 2008-06-02 05:21:47
Message-ID: Pine.LNX.4.64.0806020919181.21547@sn.sai.msu.ru
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

David,

we wrote contrib module (mchar) for one customer, which ports its
application from mssql to postgres. It does case-insensitive comparison
for new data type 'mchar' and linked with ICU for system independent locale.

Oleg

On Sun, 1 Jun 2008, David E. Wheeler wrote:

> Howdy,
>
> I'm sure I'm just showing off my ignorance here, but here goes
>
> I really need case-insensitive string comparison in my database. Ideally
> there'd be a nice ITEXT data type (and friends, ichar, ivarchar, etc.). But
> of course there isn't, and for years I've just used LOWER() on indexes and
> queries to get the same result.
>
> Only it turns out that I'm of course not getting the same result. This
> script:
>
> #!/usr/local/bin/perl -w
>
> use strict;
> use warnings;
> use utf8;
> binmode STDOUT, ':utf8';
> use DBI;
>
> my $dbh = DBI->connect('dbi:Pg:dbname=try', 'postgres', '', { pg_enable_utf8
> => 1 });
> for my $char qw( A B C D ) {
> print "$char: ", $dbh->selectrow_array('SELECT LOWER(?)', undef, $char ),
> $/;
> }
>
> Yields this output:
>
> :
> :
> :
> :
> :
> :
> :
> :
> :
> :
> :
> :
> :
> A: a
> B: b
> C: c
> D: d
>
> So it doesn't really work on anything other than ASCII, it looks like. So I
> have two questions:
>
> 1. Does the use of the tolower() C function in the citext data type on
> pgfoundry basically give me the same results as using lower() in my SQL has
> for all these years? IOW, does it convert letters to lowercase in the same
> way that the LOWER() SQL function does? If so, I think I might start to use
> it for my case-insensitive columns and simplify my SQL a bit.
>
> http://pgfoundry.org/projects/citext/
>
> 2. Isn't the ICU library distributed with PostgreSQL? And if so, could it not
> be used to create proper case conversions in LOWER() and friends and,
> ultimately, to create a case-insensitive text type in core? I'm seeing that
> it has a constant named U_COMPARE_IGNORE_CASE that can be used with its
> unorm_compare() function:
>
> http://www.icu-project.org/apiref/icu4c/unorm_8h.html#6cc4c8b0d5be7ce1ac4b600ace7817f5
> http://www.icu-project.org/apiref/icu4c/unorm_8h.html#991e0fe6f0d062dd6e8e924517f3f437
>
> I don't really know C, but if that's stuff there, can't we take advantage of
> it for proper case-insensitive comparisons (and conversions)?
>
> Thanks,
>
> David
>
>
>

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg(at)sai(dot)msu(dot)su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83


From: "David E(dot) Wheeler" <david(at)kineticode(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Case-Insensitve Text Comparison
Date: 2008-06-02 05:36:48
Message-ID: 3DEE0501-D904-427D-8F5C-3CF9231AEE0E@kineticode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Jun 1, 2008, at 22:18, Tom Lane wrote:

>> I'm using C. Of course you're correct that it depends on the
>> locale, I
>> always forget that. But does not the Unicode standard offer up some
>> sort locale-independent case-insensitivity, so that it gets it right
>> some large percentage of the time?
>
> Not really, and in any case the C locale completely disables any
> knowledge of Unicode. C locale knows about 7-bit ASCII and nothing
> more.

And the locale can only be set by initdb?

I don't suppose that there are any collations that sort and index case-
insensitively, are there? I don't see anything suggestive in `locale -
a`…

Thanks,

David


From: "David E(dot) Wheeler" <david(at)kineticode(dot)com>
To: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Case-Insensitve Text Comparison
Date: 2008-06-02 05:37:26
Message-ID: E0BE2426-D217-42A6-83DB-0BB47A4A4D3B@kineticode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Jun 1, 2008, at 22:21, Oleg Bartunov wrote:

> David,
>
> we wrote contrib module (mchar) for one customer, which ports its
> application from mssql to postgres. It does case-insensitive
> comparison for new data type 'mchar' and linked with ICU for system
> independent locale.

That sounds promising. I don't suppose that it has been released, has
it?

Thanks,

David


From: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
To: "David E(dot) Wheeler" <david(at)kineticode(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Case-Insensitve Text Comparison
Date: 2008-06-02 05:48:36
Message-ID: Pine.LNX.4.64.0806020946400.21547@sn.sai.msu.ru
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sun, 1 Jun 2008, David E. Wheeler wrote:

> On Jun 1, 2008, at 22:21, Oleg Bartunov wrote:
>
>> David,
>>
>> we wrote contrib module (mchar) for one customer, which ports its
>> application from mssql to postgres. It does case-insensitive comparison for
>> new data type 'mchar' and linked with ICU for system independent locale.
>
> That sounds promising. I don't suppose that it has been released, has it?

It's available as a part of patch, see (use google translate)
http://v8.1c.ru/overview/postgres_patches_notes.htm

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg(at)sai(dot)msu(dot)su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83


From: Andrew Sullivan <ajs(at)commandprompt(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Case-Insensitve Text Comparison
Date: 2008-06-02 13:51:39
Message-ID: 20080602135138.GA57859@commandprompt.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sun, Jun 01, 2008 at 10:13:07PM -0700, David E. Wheeler wrote:

> What locale is right? If I have a Web app, there could be data in many
> different languages in a single table/column.

I think the above amounts to a need for per-session locale settings or
something, no?

A

--
Andrew Sullivan
ajs(at)commandprompt(dot)com
+1 503 667 4564 x104
http://www.commandprompt.com/


From: "David E(dot) Wheeler" <david(at)kineticode(dot)com>
To: Andrew Sullivan <ajs(at)commandprompt(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Case-Insensitve Text Comparison
Date: 2008-06-02 15:39:53
Message-ID: BE3FB7B9-0D60-454A-9A64-4DCFB5340B88@kineticode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Jun 2, 2008, at 06:51, Andrew Sullivan wrote:

> On Sun, Jun 01, 2008 at 10:13:07PM -0700, David E. Wheeler wrote:
>
>> What locale is right? If I have a Web app, there could be data in
>> many
>> different languages in a single table/column.
>
> I think the above amounts to a need for per-session locale settings or
> something, no?

Yes, that's what I was getting at.

Thanks,

David


From: "David E(dot) Wheeler" <david(at)kineticode(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Case-Insensitve Text Comparison
Date: 2008-06-02 16:13:55
Message-ID: EC2BF19A-C839-485F-8324-996A26788914@kineticode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Jun 1, 2008, at 21:08, Tom Lane wrote:

>> 1. Does the use of the tolower() C function in the citext data type
>> on
>> pgfoundry basically give me the same results as using lower() in my
>> SQL has for all these years?
>
> [ broken record... ] Kinda depends on your locale. However,
> tolower()
> is 100% guaranteed not to work for multibyte encodings, so citext is
> quite useless if you're using UTF8. This is fixable, no doubt, but
> it's not fixed in the project as it stands.

Would the use of str_tolower() in formatting.c fix that?

Thanks,

David


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "David E(dot) Wheeler" <david(at)kineticode(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Case-Insensitve Text Comparison
Date: 2008-06-02 16:33:30
Message-ID: 19957.1212424410@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

"David E. Wheeler" <david(at)kineticode(dot)com> writes:
> On Jun 1, 2008, at 21:08, Tom Lane wrote:
>> [ broken record... ] Kinda depends on your locale. However,
>> tolower()
>> is 100% guaranteed not to work for multibyte encodings, so citext is
>> quite useless if you're using UTF8. This is fixable, no doubt, but
>> it's not fixed in the project as it stands.

> Would the use of str_tolower() in formatting.c fix that?

Yeah, you need something equivalent to that. I think that whole area
is due for refactoring, though --- we've got kind of a weird collection
of upper/lower/initcap APIs spread through a couple of different files.

regards, tom lane


From: "David E(dot) Wheeler" <david(at)kineticode(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Case-Insensitve Text Comparison
Date: 2008-06-02 16:45:27
Message-ID: 172A4089-7D72-4AA3-8F3A-59BED6F92119@kineticode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Jun 2, 2008, at 09:33, Tom Lane wrote:

>> Would the use of str_tolower() in formatting.c fix that?
>
> Yeah, you need something equivalent to that. I think that whole area
> is due for refactoring, though --- we've got kind of a weird
> collection
> of upper/lower/initcap APIs spread through a couple of different
> files.

And I just ran into this on 8.3 when trying to install citext:

psql:citext.sql:350: ERROR: there is no built-in function named
"oid_text"

I'm assuming that this is because a lot of automatic casts were
removed in 8.3 or 8.2; There are a bunch of these:

CREATE FUNCTION citext(oid) RETURNS citext AS 'oid_text' LANGUAGE
'internal' IMMUTABLE STRICT;
GRANT EXECUTE ON FUNCTION citext(oid) TO PUBLIC;
COMMENT ON FUNCTION citext(oid) IS 'convert oid to citext';
CREATE FUNCTION oid(citext) RETURNS oid AS 'text_oid' LANGUAGE
'internal' IMMUTABLE STRICT;
GRANT EXECUTE ON FUNCTION oid(citext) TO PUBLIC;
COMMENT ON FUNCTION oid(citext) IS 'convert citext to oid';
CREATE CAST (citext AS oid) WITH FUNCTION oid(citext);
CREATE CAST (oid AS citext) WITH FUNCTION citext(oid);

CREATE FUNCTION citext(int2) RETURNS citext AS 'int2_text' LANGUAGE
'internal' IMMUTABLE STRICT;
GRANT EXECUTE ON FUNCTION citext(int2) TO PUBLIC;
COMMENT ON FUNCTION citext(int2) IS 'convert int2 to citext';
CREATE FUNCTION int2(citext) RETURNS int2 AS 'text_int2' LANGUAGE
'internal' IMMUTABLE STRICT;
GRANT EXECUTE ON FUNCTION int2(citext) TO PUBLIC;
COMMENT ON FUNCTION int2(citext) IS 'convert citext to int2';
CREATE CAST (citext AS int2) WITH FUNCTION int2(citext);
CREATE CAST (int2 AS citext) WITH FUNCTION citext(int2);

And on and on. Clearly this module needs updating for newer
PostgreSQLs. I tried removing them all in order to get the data type
and tried it out with this script:

my $dbh = DBI->connect('dbi:Pg:dbname=try', 'postgres', '',
{ pg_enable_utf8 => 1 });
for my $char qw( À Á Â Ã Ä Å Ç Ć Č Ĉ Ċ Ď Đ A B C D ) {
print "$char: ", $dbh->selectrow_array('SELECT LOWER(?::citext)',
undef, $char ), $/;
}

Naturally it didn't work:

À: Ã
Á: á
Â: â
Ã: ã
Ä: ä
Å: Ã¥
Ç: ç
Ć: ć
Č: č
Ĉ: ĉ
Ċ: ċ
Ď: ď
Đ: đ
A: a
B: b
C: c
D: d

BTW, I rebuilt my cluster with --locale en_US.UTF-8 and the script
works on a text type, so having a locale is key.

Thanks,

David


From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Andrew Sullivan <ajs(at)commandprompt(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Case-Insensitve Text Comparison
Date: 2008-06-02 17:29:30
Message-ID: 1212427770.27538.20.camel@dogma.ljc.laika.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, 2008-06-02 at 09:51 -0400, Andrew Sullivan wrote:
> On Sun, Jun 01, 2008 at 10:13:07PM -0700, David E. Wheeler wrote:
>
> > What locale is right? If I have a Web app, there could be data in many
> > different languages in a single table/column.
>
> I think the above amounts to a need for per-session locale settings or
> something, no?
>

What if you had a CHECK constraint that was locale-sensitive? Would the
constraint only be non-false (true or null) for records inserted under
the same locale? That's not very useful.

I think if you want some special treatment of text for some users, it
should be explicit. Text in one locale is really a different type from
text in another locale, and so changing the locale of some text variable
is really a typecast. I don't think GUCs are the correct mechanism for
this.

Regards,
Jeff Davis


From: Martijn van Oosterhout <kleptog(at)svana(dot)org>
To: Jeff Davis <pgsql(at)j-davis(dot)com>
Cc: Andrew Sullivan <ajs(at)commandprompt(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Case-Insensitve Text Comparison
Date: 2008-06-02 17:55:16
Message-ID: 20080602175516.GF27752@svana.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Jun 02, 2008 at 10:29:30AM -0700, Jeff Davis wrote:
> I think if you want some special treatment of text for some users, it
> should be explicit. Text in one locale is really a different type from
> text in another locale, and so changing the locale of some text variable
> is really a typecast. I don't think GUCs are the correct mechanism for
> this.

The SQL COLLATE syntax handles this just fine. Either the original
COLLATE patch or the new one will let people tags strings with any
collation they like.

Have a nice day,
--
Martijn van Oosterhout <kleptog(at)svana(dot)org> http://svana.org/kleptog/
> Please line up in a tree and maintain the heap invariant while
> boarding. Thank you for flying nlogn airlines.


From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: "David E(dot) Wheeler" <david(at)kineticode(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Case-Insensitve Text Comparison
Date: 2008-06-02 17:56:32
Message-ID: 1212429392.27538.42.camel@dogma.ljc.laika.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sun, 2008-06-01 at 22:13 -0700, David E. Wheeler wrote:
> What locale is right? If I have a Web app, there could be data in many
> different languages in a single table/column.

I think the values should be explicitly treated differently.

It would be nice if you could just typecast, like:
"lower(somevalue::text(fr_CA))"

which would then lowercase according to the fr_CA locale, regardless of
the locale of "somevalue".

Using typmod for localization was brought up here:
http://archives.postgresql.org/pgsql-hackers/2007-06/msg00635.php

Has it been discussed further? I happen to like the idea of the TEXT
type taking a locale as a typmod. No typmod would, of course, fall back
to the cluster setting. And it would throw an exception if the encoding
couldn't represent that locale.

Regards,
Jeff Davis


From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Martijn van Oosterhout <kleptog(at)svana(dot)org>
Cc: Andrew Sullivan <ajs(at)commandprompt(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Case-Insensitve Text Comparison
Date: 2008-06-02 18:08:55
Message-ID: 1212430135.27538.47.camel@dogma.ljc.laika.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, 2008-06-02 at 19:55 +0200, Martijn van Oosterhout wrote:
> The SQL COLLATE syntax handles this just fine. Either the original
> COLLATE patch or the new one will let people tags strings with any
> collation they like.

http://wiki.postgresql.org/wiki/Todo:Collate

The last reference I see on that page is from 2005. Is there any updated
information? Are there any major obstacles holding this up aside from
the platform issues mentioned on that page?

Regards,
Jeff Davis


From: Andrew Sullivan <ajs(at)commandprompt(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Case-Insensitve Text Comparison
Date: 2008-06-02 18:24:12
Message-ID: 20080602182412.GD67993@commandprompt.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Jun 02, 2008 at 10:29:30AM -0700, Jeff Davis wrote:

> What if you had a CHECK constraint that was locale-sensitive? Would the
> constraint only be non-false (true or null) for records inserted under
> the same locale? That's not very useful.

It would seem that this is one of the important cases that needs to be
worked out. I wasn't suggesting that per-session locale (or whatever
we want to call it) is _easy_ or, for that matter, even possible; just
that it would solve a large number of the problems that people
complain about.

In fact, I suspect that what we really need is something a little more
like "in-database locale" or something.

> I think if you want some special treatment of text for some users, it
> should be explicit.

Yes. Also, not just text. Think of currency, numeric separators, &c.

A

--
Andrew Sullivan
ajs(at)commandprompt(dot)com
+1 503 667 4564 x104
http://www.commandprompt.com/


From: Tino Wildenhain <tino(at)wildenhain(dot)de>
To: Andrew Sullivan <ajs(at)commandprompt(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Case-Insensitve Text Comparison
Date: 2008-06-02 18:38:30
Message-ID: 48443E26.7060303@wildenhain.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Andrew Sullivan wrote:
...
>> I think if you want some special treatment of text for some users, it
>> should be explicit.
>
> Yes. Also, not just text. Think of currency, numeric separators, &c.

Which imho, should not really be the business of the type interface
but instead something to_char() and to_{type} handles.

Tino


From: Shane Ambler <pgsql(at)Sheeky(dot)Biz>
To: Martijn van Oosterhout <kleptog(at)svana(dot)org>
Cc: Jeff Davis <pgsql(at)j-davis(dot)com>, Andrew Sullivan <ajs(at)commandprompt(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Case-Insensitve Text Comparison
Date: 2008-06-02 19:50:32
Message-ID: 48444F08.4050505@Sheeky.Biz
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

To diverge a little -

Bit of a nood question along these lines -

Does LIKE and ILIKE take into consideration the locale allowing
insensitive searches in any locale setting?

I know that LIKE can use an index if you don't start the match with a
wild card. ILIKE doesn't seem to. Is or would it be possible to get
ILIKE to use a properly configured index as well?

--

Shane Ambler
pgSQL (at) Sheeky (dot) Biz

Get Sheeky @ http://Sheeky.Biz


From: Martijn van Oosterhout <kleptog(at)svana(dot)org>
To: Jeff Davis <pgsql(at)j-davis(dot)com>
Cc: Andrew Sullivan <ajs(at)commandprompt(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Case-Insensitve Text Comparison
Date: 2008-06-02 21:28:18
Message-ID: 20080602212818.GG27752@svana.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Jun 02, 2008 at 11:08:55AM -0700, Jeff Davis wrote:
> http://wiki.postgresql.org/wiki/Todo:Collate
>
> The last reference I see on that page is from 2005. Is there any updated
> information? Are there any major obstacles holding this up aside from
> the platform issues mentioned on that page?

Well, a review of the patch and a bit of work in the optimiser.
However, I think the patch will have bitrotted beyond any use by now.
It touched many of the areas the operator families stuff touched, for
example.

I beleive it is being reimplemented as a GSoc project, that's probably
a better approach. Should probably just delete the page from the wiki
altogether.

Have a nice day,
--
Martijn van Oosterhout <kleptog(at)svana(dot)org> http://svana.org/kleptog/
> Please line up in a tree and maintain the heap invariant while
> boarding. Thank you for flying nlogn airlines.


From: Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM>
To: Martijn van Oosterhout <kleptog(at)svana(dot)org>
Cc: Jeff Davis <pgsql(at)j-davis(dot)com>, Andrew Sullivan <ajs(at)commandprompt(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Case-Insensitve Text Comparison
Date: 2008-06-03 09:27:29
Message-ID: 48450E81.9040800@sun.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Martijn van Oosterhout napsal(a):
> On Mon, Jun 02, 2008 at 11:08:55AM -0700, Jeff Davis wrote:
>> http://wiki.postgresql.org/wiki/Todo:Collate
>>
>> The last reference I see on that page is from 2005. Is there any updated
>> information? Are there any major obstacles holding this up aside from
>> the platform issues mentioned on that page?
>
> Well, a review of the patch and a bit of work in the optimiser.
> However, I think the patch will have bitrotted beyond any use by now.
> It touched many of the areas the operator families stuff touched, for
> example.
>
> I beleive it is being reimplemented as a GSoc project, that's probably
> a better approach. Should probably just delete the page from the wiki
> altogether.

The proposal of GSoc is there:
http://archives.postgresql.org/pgsql-hackers/2008-05/msg00857.php

It should create basic framework for full SQL COLLATION support. All comments
are welcome.

Zdenek


From: "David E(dot) Wheeler" <david(at)kineticode(dot)com>
To: Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Case-Insensitve Text Comparison
Date: 2008-06-03 16:36:45
Message-ID: D3E49BEF-81C4-4336-9BFC-468E6FBAEC21@kineticode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Jun 3, 2008, at 02:27, Zdenek Kotala wrote:

> The proposal of GSoc is there:
> http://archives.postgresql.org/pgsql-hackers/2008-05/msg00857.php
>
> It should create basic framework for full SQL COLLATION support. All
> comments are welcome.

That looks great, Zdenek. I'm very excited to have improved SQL
COLLATION support in core. But if I could ask a dumb question, how
would I specify a case-insensitive collation? Or maybe the Unicode
Collation Algorithm is case-insensitive or has case-insensitive support?

Thanks,

David


From: Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM>
To: "David E(dot) Wheeler" <david(at)kineticode(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Case-Insensitve Text Comparison
Date: 2008-06-03 19:06:48
Message-ID: 48459648.3060706@sun.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

David E. Wheeler napsal(a):
> On Jun 3, 2008, at 02:27, Zdenek Kotala wrote:
>
>> The proposal of GSoc is there:
>> http://archives.postgresql.org/pgsql-hackers/2008-05/msg00857.php
>>
>> It should create basic framework for full SQL COLLATION support. All
>> comments are welcome.
>
> That looks great, Zdenek. I'm very excited to have improved SQL
> COLLATION support in core. But if I could ask a dumb question, how would
> I specify a case-insensitive collation? Or maybe the Unicode Collation
> Algorithm is case-insensitive or has case-insensitive support?

It is simple. SQL standard does not specify notation for that (chapter 11.34).
But there is proposed notation:

CREATE COLLATION <collation name> FOR <character set specification> FROM
<existing collation name> [ <pad characteristic> ] [ <case sensitive> ] [
<accent sensitive> ] [ LC_COLLATE <lc_collate> ] [ LC_CTYPE <lc_ctype> ]

<pad characteristic> := NO PAD | PAD SPACE
<case sensitive> := CASE SENSITIVE | CASE INSENSITIVE
<accent sensitive> := ACCENT SENSITIVE | ACCENT INSENSITIVE

You can specify for each collation if it is case sensitive or not and collation
function should be responsible to correctly handle this flag.

Zdenek


From: "David E(dot) Wheeler" <david(at)kineticode(dot)com>
To: Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Case-Insensitve Text Comparison
Date: 2008-06-03 19:18:48
Message-ID: 95B7698E-DAAE-4D03-9C33-C9FEAA3E966C@kineticode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Jun 3, 2008, at 12:06, Zdenek Kotala wrote:

> It is simple. SQL standard does not specify notation for that
> (chapter 11.34). But there is proposed notation:
>
> CREATE COLLATION <collation name> FOR <character set specification>
> FROM <existing collation name> [ <pad characteristic> ] [ <case
> sensitive> ] [ <accent sensitive> ] [ LC_COLLATE <lc_collate> ]
> [ LC_CTYPE <lc_ctype> ]
>
> <pad characteristic> := NO PAD | PAD SPACE
> <case sensitive> := CASE SENSITIVE | CASE INSENSITIVE
> <accent sensitive> := ACCENT SENSITIVE | ACCENT INSENSITIVE
>
>
> You can specify for each collation if it is case sensitive or not
> and collation function should be responsible to correctly handle
> this flag.

Wooo! Now if only i could apply that on a per-column basis. Still,
it'll be great to have this for a whole database.

Thanks, looking forward to it.

David


From: Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM>
To: "David E(dot) Wheeler" <david(at)kineticode(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Case-Insensitve Text Comparison
Date: 2008-06-03 19:26:06
Message-ID: 48459ACE.4060304@sun.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

David E. Wheeler napsal(a):
> On Jun 3, 2008, at 12:06, Zdenek Kotala wrote:
>
>> It is simple. SQL standard does not specify notation for that (chapter
>> 11.34). But there is proposed notation:
>>
>> CREATE COLLATION <collation name> FOR <character set specification>
>> FROM <existing collation name> [ <pad characteristic> ] [ <case
>> sensitive> ] [ <accent sensitive> ] [ LC_COLLATE <lc_collate> ] [
>> LC_CTYPE <lc_ctype> ]
>>
>> <pad characteristic> := NO PAD | PAD SPACE
>> <case sensitive> := CASE SENSITIVE | CASE INSENSITIVE
>> <accent sensitive> := ACCENT SENSITIVE | ACCENT INSENSITIVE
>>
>>
>> You can specify for each collation if it is case sensitive or not and
>> collation function should be responsible to correctly handle this flag.
>
> Wooo! Now if only i could apply that on a per-column basis. Still, it'll
> be great to have this for a whole database.

The first step is per database, because it is relative easy. Collation
per-column is very difficult. It requires a lot of changes (parser, planer,
executor...) in whole source code, because you need to keep collation
information together with text data.

It is reason why this task is split to severals part.

Zdenek


From: "Pavel Stehule" <pavel(dot)stehule(at)gmail(dot)com>
To: "Zdenek Kotala" <Zdenek(dot)Kotala(at)sun(dot)com>
Cc: "David E(dot) Wheeler" <david(at)kineticode(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Case-Insensitve Text Comparison
Date: 2008-06-03 19:45:12
Message-ID: 162867790806031245p56258e24t81fa23e6d1a49266@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

2008/6/3 Zdenek Kotala <Zdenek(dot)Kotala(at)sun(dot)com>:
> David E. Wheeler napsal(a):
>>
>> On Jun 3, 2008, at 02:27, Zdenek Kotala wrote:
>>
>>> The proposal of GSoc is there:
>>> http://archives.postgresql.org/pgsql-hackers/2008-05/msg00857.php
>>>
>>> It should create basic framework for full SQL COLLATION support. All
>>> comments are welcome.
>>
>> That looks great, Zdenek. I'm very excited to have improved SQL COLLATION
>> support in core. But if I could ask a dumb question, how would I specify a
>> case-insensitive collation? Or maybe the Unicode Collation Algorithm is
>> case-insensitive or has case-insensitive support?
>
> It is simple. SQL standard does not specify notation for that (chapter
> 11.34). But there is proposed notation:
>
> CREATE COLLATION <collation name> FOR <character set specification> FROM
> <existing collation name> [ <pad characteristic> ] [ <case sensitive> ] [
> <accent sensitive> ] [ LC_COLLATE <lc_collate> ] [ LC_CTYPE <lc_ctype> ]
>
> <pad characteristic> := NO PAD | PAD SPACE
> <case sensitive> := CASE SENSITIVE | CASE INSENSITIVE
> <accent sensitive> := ACCENT SENSITIVE | ACCENT INSENSITIVE
>

it is in conformance with others databases? Or what is syntax used in others db?

regards
Pavel

>
> You can specify for each collation if it is case sensitive or not and
> collation function should be responsible to correctly handle this flag.
>
>
> Zdenek
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers(at)postgresql(dot)org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers
>


From: Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM>
To: Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
Cc: "David E(dot) Wheeler" <david(at)kineticode(dot)com>, pgsql-hackers(at)postgresql(dot)org, Radek Strnad <radek(dot)strnad(at)gmail(dot)com>
Subject: Re: Case-Insensitve Text Comparison
Date: 2008-06-03 20:18:32
Message-ID: 4845A718.9010008@sun.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Pavel Stehule napsal(a):
> 2008/6/3 Zdenek Kotala <Zdenek(dot)Kotala(at)sun(dot)com>:
>> David E. Wheeler napsal(a):
>>> On Jun 3, 2008, at 02:27, Zdenek Kotala wrote:
>>>
>>>> The proposal of GSoc is there:
>>>> http://archives.postgresql.org/pgsql-hackers/2008-05/msg00857.php
>>>>
>>>> It should create basic framework for full SQL COLLATION support. All
>>>> comments are welcome.
>>> That looks great, Zdenek. I'm very excited to have improved SQL COLLATION
>>> support in core. But if I could ask a dumb question, how would I specify a
>>> case-insensitive collation? Or maybe the Unicode Collation Algorithm is
>>> case-insensitive or has case-insensitive support?
>> It is simple. SQL standard does not specify notation for that (chapter
>> 11.34). But there is proposed notation:
>>
>> CREATE COLLATION <collation name> FOR <character set specification> FROM
>> <existing collation name> [ <pad characteristic> ] [ <case sensitive> ] [
>> <accent sensitive> ] [ LC_COLLATE <lc_collate> ] [ LC_CTYPE <lc_ctype> ]
>>
>> <pad characteristic> := NO PAD | PAD SPACE
>> <case sensitive> := CASE SENSITIVE | CASE INSENSITIVE
>> <accent sensitive> := ACCENT SENSITIVE | ACCENT INSENSITIVE
>>
>
> it is in conformance with others databases? Or what is syntax used in others db?
>
It seems to me, that CREATE COLLATION command supports only firebird other
databases like MySQL, MS SQL have hardcoded list of collations.

Zdenek


From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM>
Cc: "David E(dot) Wheeler" <david(at)kineticode(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Case-Insensitve Text Comparison
Date: 2008-06-03 20:53:56
Message-ID: 1212526436.31688.18.camel@dogma.ljc.laika.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, 2008-06-03 at 21:26 +0200, Zdenek Kotala wrote:
> The first step is per database, because it is relative easy. Collation
> per-column is very difficult. It requires a lot of changes (parser, planer,
> executor...) in whole source code, because you need to keep collation
> information together with text data.

Right now typmod is already passed to all those layers, right? Would it
be a useful intermediate step to use typmod to hold this information for
the text type?

I suppose that would only work for text and not varchar, because varchar
already has a use for typmod.

Regards,
Jeff Davis


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Jeff Davis <pgsql(at)j-davis(dot)com>
Cc: Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM>, "David E(dot) Wheeler" <david(at)kineticode(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Case-Insensitve Text Comparison
Date: 2008-06-04 04:06:17
Message-ID: 16429.1212552377@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Jeff Davis <pgsql(at)j-davis(dot)com> writes:
> Right now typmod is already passed to all those layers, right? Would it
> be a useful intermediate step to use typmod to hold this information for
> the text type?

No, it would not, because typmod doesn't propagate through functions.

regards, tom lane


From: Martijn van Oosterhout <kleptog(at)svana(dot)org>
To: Jeff Davis <pgsql(at)j-davis(dot)com>
Cc: Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM>, "David E(dot) Wheeler" <david(at)kineticode(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Case-Insensitve Text Comparison
Date: 2008-06-04 04:35:40
Message-ID: 20080604043540.GA1858@svana.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Jun 03, 2008 at 01:53:56PM -0700, Jeff Davis wrote:
> On Tue, 2008-06-03 at 21:26 +0200, Zdenek Kotala wrote:
> > The first step is per database, because it is relative easy. Collation
> > per-column is very difficult. It requires a lot of changes (parser, planer,
> > executor...) in whole source code, because you need to keep collation
> > information together with text data.
>
> Right now typmod is already passed to all those layers, right? Would it
> be a useful intermediate step to use typmod to hold this information for
> the text type?

In SQL the collation is associated with a node in the parse tree and
not with the values at all. It's a sort of extra parameter to functions
(at least, that's how I implemented it). So you can say things like:

SELECT text COLLATE case_insensetive;

Here the collate clause does nothing, though if you had a SELECT INTO
it would control the default collation for that column. The standard
has rules on how to determine what the collation at any point (explicit
overrides implicit overrides default). If two columns have conflicting
collations, when comparing them you are required to disambiguate or
it's an (parse-time) error.

Check the archives for details on how it works precisely, but it's far
nicer than merely adding an typmod, since that would cause you to throw
errors at runtime if there's a problem.

Have a nice day,
--
Martijn van Oosterhout <kleptog(at)svana(dot)org> http://svana.org/kleptog/
> Please line up in a tree and maintain the heap invariant while
> boarding. Thank you for flying nlogn airlines.


From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Martijn van Oosterhout <kleptog(at)svana(dot)org>
Cc: Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM>, "David E(dot) Wheeler" <david(at)kineticode(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Case-Insensitve Text Comparison
Date: 2008-06-05 16:38:07
Message-ID: 1212683887.7900.0.camel@dogma.ljc.laika.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, 2008-06-04 at 06:35 +0200, Martijn van Oosterhout wrote:
> Check the archives for details on how it works precisely, but it's far
> nicer than merely adding an typmod, since that would cause you to throw
> errors at runtime if there's a problem.

Ok, that makes sense. I agree that any type mismatches should cause a
compile-time error. Thanks for the explanation.

Regards,
Jeff Davis