Re: Patch for collation using ICU

Lists:	pgsql-hackers

From:	Palle Girgensohn <girgen(at)pingpong(dot)net>
To:	pgsql-hackers(at)postgresql(dot)org
Subject:	Patch for collation using ICU
Date:	2005-03-24 23:40:04
Message-ID:	D4499599155880AAEA4319AE@palle.girgensohn.se

Hi!

I've put together a patch for using IBM's ICU package for collation.

If your OS does not have full support for collation or uppercase/lowercase
in multibyte locales, this might be useful. If you are using a multibyte
character encoding in your database and want collation, i.e. order by, and
also lower(), upper() and initcap() to work properly, this patch will do
just that.

This patch is needed for FreeBSD, since this OS has no support for
collation of for example unicode locales (that is, wcscoll(3) does not do
what you expect if you set LC_ALL=sv_SE.UTF-8, for example). AFAIK the
patch is *not* necessary for Linux, although IBM claims ICU collation to be
about twice as fast as glibc for simple western locales.

It adds a configure switch, `--with-icu', which will set up the code to use
ICU instead of wchar_t and wcscoll.
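
For illustration, here is a minimal C sketch of the kind of libc call the patch routes around: strcoll(3) compares two strings under the current LC_COLLATE setting. The helper name compare_in_locale is hypothetical and does not come from the patch. Whether a locale such as sv_SE.UTF-8 is installed, and whether it collates sensibly, depends on the OS.

```c
#include <locale.h>
#include <string.h>

/*
 * Hypothetical helper, not taken from the patch: compare two strings
 * with strcoll(3) under the named LC_COLLATE locale.
 * Returns 0 and stores the comparison result in *result on success,
 * or -1 if the locale is not available on this system.
 */
static int
compare_in_locale(const char *locale, const char *a, const char *b,
                  int *result)
{
    if (setlocale(LC_COLLATE, locale) == NULL)
        return -1;              /* locale not installed */

    *result = strcoll(a, b);    /* <0, 0 or >0, like strcmp */
    return 0;
}
```

Calling it with "C" gives plain byte order; calling it with a UTF-8 locale shows what the platform's own collation support does, if anything.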

This has been tested only on FreeBSD-4.11 & FreeBSD-5-stable, where it
seems to run well. I've not had the time to do any comparative performance
tests yet, but it seems it is at least not slower than using LATIN1 with
sv_SE.ISO8859-1 locale, perhaps even faster.

I'd be delighted if some more experienced postgresql hackers would review
this stuff. The patch is pretty compact, so it's fast reading :) I'm
planning to add this patch as an option (tagged "experimental") to
FreeBSD's postgresql port. Any ideas about whether this is a good idea or
not?

Any thoughts or ideas are welcome!

Cheers,
Palle

Patch at:
<http://people.freebsd.org/~girgen/postgresql-icu/pg-801-icu-2005-03-14.diff>

ICU at sourceforge: <http://icu.sf.net/>

From:	Palle Girgensohn <girgen(at)pingpong(dot)net>
To:	pgsql-hackers(at)postgresql(dot)org
Cc:	John Hansen <john(at)geeknet(dot)com(dot)au>, Andrew Dunstan <andrew(at)dunslane(dot)net>
Subject:	Re: Patch for collation using ICU
Date:	2005-03-26 02:09:53
Message-ID:	55C6D914B6055CD5721BEC40@palle.girgensohn.se

Hi!

There's a new patch to fix some reported problems.

<http://people.freebsd.org/~girgen/postgresql-icu/pg-801-icu-2005-03-26.diff>

This version uses the DatabaseEncoding and sets the ICU encoding at the
same time. I had to create a conversion table from PostgreSQL's own,
somewhat odd and non-standard, names of encodings, into the preferred IANA
names. One or two of the more odd ones might be slightly incorrect,
hopefully not too far off anyway?
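
A conversion table of this kind could look roughly like the C sketch below. The function name pg_encoding_to_iana and the handful of entries are assumptions made for illustration only; the patch's actual table is not reproduced in this thread.

```c
#include <stddef.h>
#include <string.h>

/*
 * Illustrative subset only. The entries are assumptions for this
 * sketch, not a copy of the table in the patch.
 */
static const struct
{
    const char *pg_name;        /* name as PostgreSQL spells it */
    const char *iana_name;      /* IANA charset name */
} encoding_map[] = {
    {"UNICODE", "UTF-8"},
    {"LATIN1",  "ISO-8859-1"},
    {"EUC_JP",  "EUC-JP"},
    {"KOI8",    "KOI8-R"},
};

/* Hypothetical lookup: returns the IANA name, or NULL if unmapped. */
static const char *
pg_encoding_to_iana(const char *pg_name)
{
    size_t i;

    for (i = 0; i < sizeof(encoding_map) / sizeof(encoding_map[0]); i++)
    {
        if (strcmp(encoding_map[i].pg_name, pg_name) == 0)
            return encoding_map[i].iana_name;
    }
    return NULL;
}
```

Returning NULL for an unknown name lets the caller decide whether to fall back to the non-ICU code path or report an error.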

I've noticed a couple of things about using the ICU patch vs. pristine
pg-8.0.1:

- ORDER BY is case insensitive when using ICU. This might break the SQL
standard (?), but sure is nice :)

- When the database is initialized using the C locale, upper() and lower()
normally do not work at all for non-ASCII characters even if the
database's encoding is say LATIN1 or UNICODE. (does not work for me anyway,
on FreeBSD, and this is probably correct since the locale is still `C', I
believe?). The ICU patch changes nothing for the LATIN1 case, since it does
not act on single byte encodings, but for the UNICODE representation, it
works and does what I expect it to, namely upper() and lower() neatly
upper- or lowercase diacritical characters, i.e. lower('ÅÄÖ') -> 'åäö'.
This is a good thing, although I'm surprised that upper/lower is dragged
along with the LC_COLLATE fixation at initdb. I never run initdb in the C
locale, but only now do I realize how broken that really is if you need to
store anything else than English :-)
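
The C-locale behaviour described above can be seen with plain libc, independent of PostgreSQL. In this small sketch the helper name upper_in_c_locale is hypothetical; it only shows that toupper(3) in the "C" locale leaves bytes outside a-z untouched, such as 0xE5, which is 'å' in LATIN1.

```c
#include <ctype.h>
#include <locale.h>

/*
 * Hypothetical helper: upper-case a single byte under the "C" locale.
 * In that locale only the ASCII letters a-z are mapped.
 */
static int
upper_in_c_locale(unsigned char ch)
{
    setlocale(LC_CTYPE, "C");
    return toupper(ch);
}
```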

I'd be delighted to get more feedback about this stuff.

Thanks,
Palle

From:	Stephan Szabo <sszabo(at)megazone(dot)bigpanda(dot)com>
To:	Palle Girgensohn <girgen(at)pingpong(dot)net>
Cc:	pgsql-hackers(at)postgresql(dot)org, John Hansen <john(at)geeknet(dot)com(dot)au>, Andrew Dunstan <andrew(at)dunslane(dot)net>
Subject:	Re: Patch for collation using ICU
Date:	2005-03-26 16:16:01
Message-ID:	20050326080458.E63597@megazone.bigpanda.com

On Sat, 26 Mar 2005, Palle Girgensohn wrote:
> I've noticed a couple of things about using the ICU patch vs. pristine
> pg-8.0.1:
>
> - ORDER BY is case insensitive when using ICU. This might break the SQL
> standard (?), but sure is nice :)

Err, I think if your system implements strcoll correctly 8.0.1 can do this
if the chosen collation is set up that way (or at least naive tests I've
done seem to imply that). Or are you speaking about C locale?

From:	Palle Girgensohn <girgen(at)pingpong(dot)net>
To:	Stephan Szabo <sszabo(at)megazone(dot)bigpanda(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org, John Hansen <john(at)geeknet(dot)com(dot)au>, Andrew Dunstan <andrew(at)dunslane(dot)net>
Subject:	Re: Patch for collation using ICU
Date:	2005-03-27 00:50:03
Message-ID:	EFD91E88714036D5A5EBA996@palle.girgensohn.se

--On Saturday, March 26, 2005 08.16.01 -0800 Stephan Szabo
<sszabo(at)megazone(dot)bigpanda(dot)com> wrote:

> On Sat, 26 Mar 2005, Palle Girgensohn wrote:
>> I've noticed a couple of things about using the ICU patch vs. pristine
>> pg-8.0.1:
>>
>> - ORDER BY is case insensitive when using ICU. This might break the SQL
>> standard (?), but sure is nice :)
>
> Err, I think if your system implements strcoll correctly 8.0.1 can do this
> if the chosen collation is set up that way (or at least naive tests I've
> done seem to imply that). Or are you speaking about C locale?

No, I doubt this.

Example: set up a cluster:
$ initdb -E LATIN1 --locale=sv_SE.ISO8859-1
$ createdb foo
CREATE DATABASE
$ psql foo
foo=# create table bar (val text);
CREATE TABLE
foo=# insert into bar values ('aaa');
INSERT 18354409 1
foo=# insert into bar values ('BBB');
INSERT 18354412 1
foo=# select val from bar order by val;
val
-----
BBB
aaa
(2 rows)

Order by is not case insensitive. It shouldn't be for any system, AFAIK. As
John Hansen noted, this might be a bad thing. I'm not sure about that,
though...
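
As a rough illustration of the two behaviours being compared here, the C sketch below sorts the same values once in byte order and once ignoring case. It is a simplification: strcasecmp(3) stands in for a case-insensitive collation, whereas a real locale or ICU collator does considerably more than fold case. The function names are hypothetical.

```c
#include <stdlib.h>
#include <string.h>
#include <strings.h>            /* strcasecmp (POSIX) */

/* Byte-order comparison, as in the "C" locale. */
static int
cmp_bytewise(const void *a, const void *b)
{
    return strcmp(*(const char *const *) a, *(const char *const *) b);
}

/* Case-insensitive comparison; a crude stand-in for a real collation. */
static int
cmp_ignore_case(const void *a, const void *b)
{
    return strcasecmp(*(const char *const *) a, *(const char *const *) b);
}

/* Sort n strings in place with one of the two comparators. */
static void
sort_vals(const char **vals, size_t n, int ignore_case)
{
    qsort(vals, n, sizeof(vals[0]),
          ignore_case ? cmp_ignore_case : cmp_bytewise);
}
```

With the values 'aaa' and 'BBB', the byte-order comparator puts BBB first because uppercase ASCII letters have lower byte values, while the case-insensitive one puts aaa first.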

As for general collation of unicode, the reason for me to use ICU is that
my system does not support strcoll correctly for multibyte locales, as I
mentioned earlier. I also noted that even for systems that do handle
strcoll correctly for unicode, IBM claims ICU to be about twice as fast,
so this patch might be useful for other systems (read Linux) as
well. See previous emails for details.

Regards,
Palle

From:	Hannu Krosing <hannu(at)tm(dot)ee>
To:	Palle Girgensohn <girgen(at)pingpong(dot)net>
Cc:	pgsql-hackers(at)postgresql(dot)org, John Hansen <john(at)geeknet(dot)com(dot)au>, Andrew Dunstan <andrew(at)dunslane(dot)net>
Subject:	Re: Patch for collation using ICU
Date:	2005-03-27 01:34:03
Message-ID:	1111887244.5524.2.camel@fuji.krosing.net

On Sat, 2005-03-26 at 03:09 +0100, Palle Girgensohn wrote:

> Hi!
>
...
> I've noticed a couple of things about using the ICU patch vs. pristine
> pg-8.0.1:
>
> - ORDER BY is case insensitive when using ICU. This might break the SQL
> standard (?), but sure is nice :)

How does your patch interact with the ability to use indexes for
anchored LIKE or regex (i.e. can "name LIKE 'start%'" still use index) ?

--
Hannu Krosing <hannu(at)tm(dot)ee>

From:	Stephan Szabo <sszabo(at)megazone(dot)bigpanda(dot)com>
To:	Palle Girgensohn <girgen(at)pingpong(dot)net>
Cc:	pgsql-hackers(at)postgresql(dot)org, John Hansen <john(at)geeknet(dot)com(dot)au>, Andrew Dunstan <andrew(at)dunslane(dot)net>
Subject:	Re: Patch for collation using ICU
Date:	2005-03-27 01:40:01
Message-ID:	20050326173421.F87420@megazone.bigpanda.com

On Sun, 27 Mar 2005, Palle Girgensohn wrote:

>
>
> --On Saturday, March 26, 2005 08.16.01 -0800 Stephan Szabo
> <sszabo(at)megazone(dot)bigpanda(dot)com> wrote:
>
> > On Sat, 26 Mar 2005, Palle Girgensohn wrote:
> >> I've noticed a couple of things about using the ICU patch vs. pristine
> >> pg-8.0.1:
> >>
> >> - ORDER BY is case insensitive when using ICU. This might break the SQL
> >> standard (?), but sure is nice :)
> >
> > Err, I think if your system implements strcoll correctly 8.0.1 can do this
> > if the chosen collation is set up that way (or at least naive tests I've
> > done seem to imply that). Or are you speaking about C locale?
>
> No, I doubt this.
>
> Example: set up a cluster:
> $ initdb -E LATIN1 --locale=sv_SE.ISO8859-1
> $ createdb foo
> CREATE DATABASE
> $ psql foo
> foo=# create table bar (val text);
> CREATE TABLE
> foo=# insert into bar values ('aaa');
> INSERT 18354409 1
> foo=# insert into bar values ('BBB');
> INSERT 18354412 1
> foo=# select val from bar order by val;
> val
> -----
> BBB
> aaa
> (2 rows)
>
>
> Order by is not case insensitive. It shouldn't be for any system, AFAIK. As

It is on my machine... for the same test:

foo=# select val from bar order by val;
val
-----
aaa
BBB
(2 rows)

I think this just implies even greater breakage of either the collation or
strcoll on the system you're trying on. ;) Which, of course, is a fairly
reasonable reason to offer an alternative. Especially if it's generically
useful.

From:	Palle Girgensohn <girgen(at)pingpong(dot)net>
To:	Stephan Szabo <sszabo(at)megazone(dot)bigpanda(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org, John Hansen <john(at)geeknet(dot)com(dot)au>, Andrew Dunstan <andrew(at)dunslane(dot)net>
Subject:	Re: Patch for collation using ICU
Date:	2005-03-27 03:12:19
Message-ID:	B3322977A55B6FBA7897A9EA@palle.girgensohn.se

--On Saturday, March 26, 2005 17.40.01 -0800 Stephan Szabo
<sszabo(at)megazone(dot)bigpanda(dot)com> wrote:

>
> On Sun, 27 Mar 2005, Palle Girgensohn wrote:
>
>>
>>
>> --On Saturday, March 26, 2005 08.16.01 -0800 Stephan Szabo
>> <sszabo(at)megazone(dot)bigpanda(dot)com> wrote:
>>
>> > On Sat, 26 Mar 2005, Palle Girgensohn wrote:
>> >> I've noticed a couple of things about using the ICU patch vs. pristine
>> >> pg-8.0.1:
>> >>
>> >> - ORDER BY is case insensitive when using ICU. This might break the
>> >> SQL standard (?), but sure is nice :)
>> >
>> > Err, I think if your system implements strcoll correctly 8.0.1 can do
>> > this if the chosen collation is set up that way (or at least naive
>> > tests I've done seem to imply that). Or are you speaking about C
>> > locale?
>>
>> No, I doubt this.
>>
>> Example: set up a cluster:
>> $ initdb -E LATIN1 --locale=sv_SE.ISO8859-1
>> $ createdb foo
>> CREATE DATABASE
>> $ psql foo
>> foo=# create table bar (val text);
>> CREATE TABLE
>> foo=# insert into bar values ('aaa');
>> INSERT 18354409 1
>> foo=# insert into bar values ('BBB');
>> INSERT 18354412 1
>> foo=# select val from bar order by val;
>> val
>> -----
>> BBB
>> aaa
>> (2 rows)
>>
>>
>> Order by is not case insensitive. It shouldn't be for any system, AFAIK.
>> As
>
> It is on my machine... for the same test:
>
> foo=# select val from bar order by val;
> val
> -----
> aaa
> BBB
> (2 rows)
>
> I think this just implies even greater breakage of either the collation or
> strcoll on the system you're trying on. ;) Which, of course, is a fairly
> reasonable reason to offer an alternative. Especially if it's generically
> useful.

Interesting! Indeed, just tried on an old Linux Redhat system... BTW,
that's pretty odd for a unix system. "ls -l" sorts aaa before BBB, I've
never seen the likes of it! Call me old-fashioned if you like ;-)

Still, as you say, FreeBSD does it capital letters first, and does not
handle unicode locales' collation, so I need an alternative. Perhaps the
best way would be to inject ICU into BSD instead :-)

/Palle

From:	Palle Girgensohn <girgen(at)pingpong(dot)net>
To:	Hannu Krosing <hannu(at)tm(dot)ee>
Cc:	pgsql-hackers(at)postgresql(dot)org, John Hansen <john(at)geeknet(dot)com(dot)au>, Andrew Dunstan <andrew(at)dunslane(dot)net>
Subject:	Re: Patch for collation using ICU
Date:	2005-03-29 22:14:53
Message-ID:	191F88947BE3DF929975A3E3@palle.girgensohn.se

--On Sunday, March 27, 2005 04.34.03 +0300 Hannu Krosing <hannu(at)tm(dot)ee> wrote:

> On Sat, 2005-03-26 at 03:09 +0100, Palle Girgensohn wrote:
>
>> Hi!
>>
> ...
>> I've noticed a couple of things about using the ICU patch vs. pristine
>> pg-8.0.1:
>>
>> - ORDER BY is case insensitive when using ICU. This might break the SQL
>> standard (?), but sure is nice :)

Just a comment: ORDER BY *is* already case insensitive on Linux, since its
strcoll ignores case. I doubt very much it violates SQL standards.

> How does your patch interact with the ability to use indexes for
> anchored LIKE or regex (i.e. can "name LIKE 'start%'" still use index) ?

I don't think it matters. You still need to use the special non-locale
index operator classes (e.g. text_pattern_ops) described in the
documentation to get anchored LIKE queries to use indexes. My ICU patch
does not alter this. ICU is only "injected" where strings are compared, and
only if the database cluster was initialized with a multibyte character
encoding.

The problem, AFAIK, has to do with the nature of (some) locales, not with a
specific implementation of collation.
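
To make the point about locales concrete: an anchored pattern such as 'start%' can be answered by an index range scan only if every string with that prefix sorts between two bounds. Under byte-wise ordering that holds, as the C sketch below shows; under a locale-aware collation it is not guaranteed, whichever library implements the collation. The function prefix_range_match is a hypothetical illustration and not PostgreSQL code.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical sketch: decide "val LIKE 'prefix%'" as the range test
 * prefix <= val < upper, where upper is the prefix with its last byte
 * incremented. This is valid only for byte-wise (strcmp) ordering.
 */
static bool
prefix_range_match(const char *val, const char *prefix)
{
    char        upper[64];
    size_t      len = strlen(prefix);

    if (len == 0 || len >= sizeof(upper))
        return false;           /* keep the sketch simple */
    if ((unsigned char) prefix[len - 1] == 0xFF)
        return false;           /* last byte cannot be incremented */

    memcpy(upper, prefix, len + 1);
    upper[len - 1] = (char) ((unsigned char) upper[len - 1] + 1);

    return strcmp(val, prefix) >= 0 && strcmp(val, upper) < 0;
}
```

For the prefix "start" the upper bound becomes "staru", so "started" falls inside the range while "stars" and "Start" fall outside it.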

Regards,
Palle

From:	Peter Eisentraut <peter_e(at)gmx(dot)net>
To:	Palle Girgensohn <girgen(at)pingpong(dot)net>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Patch for collation using ICU
Date:	2005-03-30 19:43:23
Message-ID:	200503302143.24154.peter_e@gmx.net

Palle Girgensohn wrote:
> Just a comment: ORDER BY *is* already case insensitive on Linux, since
> its strcoll ignores case. I doubt very much it violates SQL
> standards.

The behavior of collation sequences is implementation-defined. So as
long as you can put the behavior in words, it should be OK.

It would seem, however, that the behavior of a certain locale name
should be the same with or without ICU, so perhaps some locale renaming
might be needed, but that is speculation on my part.

> > How does your patch interact with the ability to use indexes for
> > anchored LIKE or regex (i.e. can "name LIKE 'start%'" still use
> > index) ?

> The problem, AFAIK, has to do with the nature of (some) locales, not
> with a specific implementation of collation.

Yeah, pretty much the whole point of that code is to avoid collating
stuff.

--
Peter Eisentraut
http://developer.postgresql.org/~petere/

From:	Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To:	Palle Girgensohn <girgen(at)pingpong(dot)net>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Patch for collation using ICU
Date:	2005-05-07 02:57:59
Message-ID:	200505070257.j472vxg01285@candle.pha.pa.us

Is this patch ready for application?

http://people.freebsd.org/~girgen/postgresql-icu/pg-802-icu-2005-05-06.diff.gz

The web site is:

http://people.freebsd.org/~girgen/postgresql-icu/readme.html

I do have a few questions:

Why don't you use the lc_ctype_is_c() part of this test?

if (pg_database_encoding_max_length() > 1 && !lc_ctype_is_c())

Why is so much code added, for example, in lower()? The existing
multibyte code is much smaller, and lots of code is added in other
places too.

Why do you need to add a mapping of encoding names from iana to our
names?

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc:	Palle Girgensohn <girgen(at)pingpong(dot)net>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Patch for collation using ICU
Date:	2005-05-07 03:31:20
Message-ID:	28423.1115436680@sss.pgh.pa.us

Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> writes:
> Is this patch ready for application?

Not until ICU is released under a BSD license ...

regards, tom lane

From:	Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Palle Girgensohn <girgen(at)pingpong(dot)net>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Patch for collation using ICU
Date:	2005-05-07 03:40:08
Message-ID:	200505070340.j473e8p14124@candle.pha.pa.us

Tom Lane wrote:
> Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> writes:
> > Is this patch ready for application?
>
> Not until ICU is released under a BSD license ...

Well, readline isn't BSD either, but we use it. Is it any different?

From:	Andrew - Supernews <andrew+nonews(at)supernews(dot)com>
To:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Patch for collation using ICU
Date:	2005-05-07 03:54:17
Message-ID:	slrnd7oev9.2ep3.andrew+nonews@trinity.supernews.net

On 2005-05-07, Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> wrote:
> Tom Lane wrote:
>> Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> writes:
>> > Is this patch ready for application?
>>
>> Not until ICU is released under a BSD license ...
>
> Well, readline isn't BSD either, but we use it. Is it any different?

ICU appears to be under the X license, which is no more restrictive than
BSD-with-no-advertising-clause.

--
Andrew, Supernews
http://www.supernews.com - individual and corporate NNTP services

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc:	Palle Girgensohn <girgen(at)pingpong(dot)net>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Patch for collation using ICU
Date:	2005-05-07 05:17:14
Message-ID:	29261.1115443034@sss.pgh.pa.us

Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> writes:
> Tom Lane wrote:
>> Not until ICU is released under a BSD license ...

> Well, readline isn't BSD either, but we use it. Is it any different?

Did you read the license? Some of the more troubling bits:

: It is the understanding of INTERNATIONAL BUSINESS MACHINES CORPORATION
: that the purpose for which its publications are being reproduced is
: accurate and true as stated in your attached request.

(er, which attached request would that be?)

: Permission to quote from or reprint IBM publications is limited to the
: purpose and quantities originally requested and must not be construed as
: a blanket license to use the material for other purposes or to reprint
: other IBM copyrighted material.

: IBM reserves the right to withdraw permission to reproduce copyrighted
: material whenever, in its discretion, it feels that the privilege of
: reproducing its material is being used in a way detrimental to its
: interest or the above instructions are not being followed properly to
: protect its copyright.

: IBM may have patents or pending patent applications covering subject
: matter in this document. The furnishing of this document does not give
: you any license to these patents. You can send license inquiries, in
: writing, to:

: For license inquiries regarding double-byte (DBCS) information, contact
: the IBM Intellectual Property Department in your country or send
: inquiries, in writing, to:

regards, tom lane

From:	Palle Girgensohn <girgen(at)pingpong(dot)net>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Patch for collation using ICU
Date:	2005-05-07 11:20:53
Message-ID:	22847A332D562C83ECDFA009@palle.girgensohn.se

--On Friday, May 06, 2005 23.31.20 -0400 Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> writes:
>> Is this patch ready for application?
>
> Not until ICU is released under a BSD license ...

It's not GPL anyway. Seems pretty much like the BSD license, at least more
BSD-ish than GPL-ish.

<http://dev.icu-project.org/cgi-bin/viewcvs.cgi/*checkout*/icu/license.html>

/Palle

From:	Palle Girgensohn <girgen(at)pingpong(dot)net>
To:	Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Patch for collation using ICU
Date:	2005-05-07 12:07:34
Message-ID:	966F3D72826AD115FCE40984@palle.girgensohn.se

--On Friday, May 06, 2005 22.57.59 -0400 Bruce Momjian
<pgman(at)candle(dot)pha(dot)pa(dot)us> wrote:

>
> Is this patch ready for application?
>
> http://people.freebsd.org/~girgen/postgresql-icu/pg-802-icu-2005-05-06.diff.gz
>
> The web site is:
>
> http://people.freebsd.org/~girgen/postgresql-icu/readme.html

I don't think so, not quite. I have not had any positive reports from Linux
users; this is only tested in a FreeBSD environment. I'd say it needs some
more testing.

Also, apparently, ICU is installed by default in many Linux distributions,
and usually it is version 2.8. Some Linux users have asked me if there are
plans for a patch that works with ICU 2.8. That's probably a good idea. IBM
and the ICU folks seem to consider 3.2 to be the stable version, and older
versions are hard to find on their sites, but most Linux distributors seem
to consider it too bleeding edge, even Gentoo. I don't know why they don't
agree.

> I do have a few questions:
>
> Why don't you use the lc_ctype_is_c() part of this test?
>
> if (pg_database_encoding_max_length() > 1 && !lc_ctype_is_c())

Um, well, I didn't think about that. :) What would be the locale in this
case? c_C.UTF-8? ;) Hmm, it is possible to have CTYPE=C and use a wide
encoding, indeed. Then the strings will be handled like byte-wide chars.
Yeah, it's a bug. I'll fix it! Thanks.
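
A standalone approximation of the test Bruce quotes, using only libc, might look like the C sketch below. Both function names are hypothetical stand-ins: ctype_is_c() imitates what lc_ctype_is_c() is assumed to check, and the max_char_len parameter stands in for pg_database_encoding_max_length().

```c
#include <locale.h>
#include <stdbool.h>
#include <string.h>

/*
 * Stand-in for lc_ctype_is_c(): true if the current LC_CTYPE locale
 * is "C" or "POSIX". Passing NULL to setlocale() only queries it.
 */
static bool
ctype_is_c(void)
{
    const char *ctype = setlocale(LC_CTYPE, NULL);

    return ctype != NULL &&
        (strcmp(ctype, "C") == 0 || strcmp(ctype, "POSIX") == 0);
}

/*
 * Take the multibyte (wide or ICU) code path only when the encoding
 * is multibyte AND the locale is something other than C/POSIX.
 */
static bool
use_multibyte_path(int max_char_len)
{
    return max_char_len > 1 && !ctype_is_c();
}
```

With CTYPE=C and a wide encoding such as UTF-8, the second condition is false, so strings keep being handled byte by byte.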

> Why is so much code added, for example, in lower()? The existing
> multibyte code is much smaller, and lots of code is added in other
> places too.

ICU uses UTF-16 internally, so all strings must be converted from the
database encoding to UTF-16. Since that means the strings need to be
copied, I took the same approach as in varlena.c:varstr_cmp(), where small
strings use a buffer on the stack and only larger strings use a palloc.
Comments in varstr_cmp about performance made me use that approach.
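
The buffer strategy referred to here can be sketched in plain C as follows: short inputs are copied into a fixed-size local buffer, and only longer ones trigger an allocation. This is an assumption-laden illustration, not the patch. malloc/free stand in for palloc/pfree, the STACKBUFLEN value and the function name counted_str_cmp are hypothetical, and the copies are NUL-terminated char buffers rather than the UTF-16 buffers the ICU code would need.

```c
#include <stdlib.h>
#include <string.h>

#define STACKBUFLEN 64          /* hypothetical threshold for this sketch */

/*
 * Compare two length-counted (not NUL-terminated) strings with
 * strcoll(3). Each string is copied into a local buffer when it fits,
 * and into allocated memory only when it does not.
 */
static int
counted_str_cmp(const char *a, size_t alen, const char *b, size_t blen)
{
    char        abuf[STACKBUFLEN];
    char        bbuf[STACKBUFLEN];
    char       *ap = (alen >= STACKBUFLEN) ? malloc(alen + 1) : abuf;
    char       *bp = (blen >= STACKBUFLEN) ? malloc(blen + 1) : bbuf;
    int         result;

    if (ap == NULL || bp == NULL)
        abort();                /* palloc would raise an error instead */

    memcpy(ap, a, alen);
    ap[alen] = '\0';
    memcpy(bp, b, blen);
    bp[blen] = '\0';

    result = strcoll(ap, bp);

    /* Release only what was actually allocated. */
    if (ap != abuf)
        free(ap);
    if (bp != bbuf)
        free(bp);

    return result;
}
```

Factoring this copy-and-release logic into one helper keeps each caller short, which is the kind of code-size saving discussed in this thread.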

Also, in the latest patch, I added checks and logging for *every*
status returned from ICU. I hope this will help debugging on Debian, where
the previous version didn't work. That excessive status checking will hardly
be necessary once the stuff is better tested.

I think the string copying and stack/palloc choices account for most of the
code bloat, together with the excessive status checking and logging.

> Why do you need to add a mapping of encoding names from iana to our
> names?

This was already answered by John Hansen... There's an old thread here
about the choice of the name "UNICODE" to describe an encoding, which it
doesn't. There's half a dozen unicode based encodings... UTF-8 is used by
postgresql, that would have been a better name... Similarly for most other
encodings, really. ICU expects a setlocale(3) string (i.e. IANA). PostgreSQL
can't provide it, so a mapping table is required.

I use this patch in production on one FreeBSD 4.10 server at the moment.
With the latest version, I've had no problems. Logging is switched on for
now, and it shows no signs of ICU complaining. I'd like more reports on
Linux, though.

/Palle


From:	Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To:	Palle Girgensohn <girgen(at)pingpong(dot)net>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Patch for collation using ICU
Date:	2005-05-07 12:37:05
Message-ID:	200505071237.j47Cb6212530@candle.pha.pa.us

Palle Girgensohn wrote:
> >
> > Is this patch ready for application?
>
> I don't think so, not quite. I have not had any positive reports from linux
> users, this is only tested in a FreeBSD environment. I'd say it needs some
> more testing.

OK.

> Also, apparently, ICU is installed by default in many linux distributions,
> and usually it is version 2.8. Some linux users have asked me if there are
> plans for a patch that works with ICU 2.8. That's probably a good idea. IBM
> and the ICU folks seem to consider 3.2 to be the stable version, older
> versions are hard to find on their sites, but most linux distributors seem
> to consider it too bleeding edge, even gentoo. I don't know why they don't
> agree.

Good point. Why would linux folks need ICU? Doesn't their OS support
encodings natively? I am particularly excited about this for OSs that
don't have such encodings, like UTF8 support for Win32.

Because ICU will not be used unless enabled by configure, it seems we
are fine with only supporting the newest version. Do Linux users need
to use ICU for any reason?

> > I do have a few questions:
> >
> > Why don't you use the lc_ctype_is_c() part of this test?
> >
> > if (pg_database_encoding_max_length() > 1 && !lc_ctype_is_c())
>
> Um, well, I didn't think about that. :) What would be the locale in this
> case? c_C.UTF-8? ;) Hmm, it is possible to have CTYPE=C and use a wide
> encoding, indeed. Then the strings will be handled like byte-wide chars.
> Yeah, it's a bug. I'll fix it! Thanks.

The additional test is more of an optimization, and it fixes a problem
with some OSs that have processing problems with UTF8 when the locale is
supposed to be turned off, like in "C". I realize ICU might be fine
with it but the optimization still is an issue.
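
The guard discussed above can be read as a two-way dispatch. A minimal sketch in C, assuming two plain parameters stand in for pg_database_encoding_max_length() and lc_ctype_is_c(); the CasePath enum and choose_case_path() are hypothetical names used only for illustration and are not part of the patch or of PostgreSQL:

```c
#include <stdbool.h>

/* Hypothetical labels for the two code paths. */
typedef enum
{
    CASE_PATH_SINGLE_BYTE,   /* plain byte-wise handling */
    CASE_PATH_WIDE_CHAR      /* convert to wide characters (or UTF-16 for ICU) */
} CasePath;

/*
 * max_encoding_length stands in for pg_database_encoding_max_length(),
 * ctype_is_c stands in for lc_ctype_is_c().  Both are passed in here so
 * the sketch compiles outside the backend.
 */
static CasePath
choose_case_path(int max_encoding_length, bool ctype_is_c)
{
    /* Take the wide path only for a multibyte encoding with a real locale. */
    if (max_encoding_length > 1 && !ctype_is_c)
        return CASE_PATH_WIDE_CHAR;
    return CASE_PATH_SINGLE_BYTE;
}
```

With a multibyte encoding and LC_CTYPE=C, the second condition sends the string down the byte-wise path, which is the case the review comment is about.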

> > Why is so much code added, for example, in lower()? The existing
> > multibyte code is much smaller, and lots of code is added in other
> > places too.
>
> ICU uses UTF-16 internally, so all strings must be converted from the
> database encoding to UTF-16. Since that means the strings need to be
> copied, I took the same approach as in varlena.c:varstr_cmp(), where small
> strings use a stack buffer and only larger strings use a palloc. Comments in
> varstr_cmp about performance made me use that approach.

Oh, interesting. I think you need to create new functions that
factor out that common code so the patch is smaller and easier to
maintain.

> Also, in the latest patch, I also added checks and logging for *every*
> status returned from ICU. I hope this will help debugging on debian, where
> previous version didn't work. That excessive status checking will hardly be
> necessary once the stuff is better tested.
>
> I think the string copying and stack/palloc choices account for most of the
> code bloat, together with the excessive status checking and logging.

OK, move that into some common functions and I think it will be better.

> > Why do you need to add a mapping of encoding names from iana to our
> > names?
>
> This was already answered by John Hansen... There's an old thread here
> about the choice of the name "UNICODE" to describe an encoding, which it
> doesn't. There's half a dozen unicode based encodings... UTF-8 is used by
> postgresql, that would have been a better name... Similarly for most other
> encodings, really. ICU expect a setlocale(3) string (i.e. IANA). PostgreSQL
> can't provide it, so a mapping table is required.

We have deprecated UNICODE in 8.1 in favor of UTF8 (no dash). Does that
help?

> I use this patch in production on one FreeBSD 4.10 server at the moment.
> With the latest version, I've had no problems. Logging is switched on for
> now, and it shows no signs of ICU complaining. I'd like more reports on
> Linux, though.

OK, I certainly would like this all done for 8.1 which should have
feature freeze on July 1.

From:	Palle Girgensohn <girgen(at)pingpong(dot)net>
To:	Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Patch for collation using ICU
Date:	2005-05-07 13:17:36
Message-ID:	9CC5F10245B7612103CC03F6@palle.girgensohn.se
Lists:	pgsql-hackers

--On Saturday, May 07, 2005 08.37.05 -0400 Bruce Momjian
<pgman(at)candle(dot)pha(dot)pa(dot)us> wrote:

> Palle Girgensohn wrote:
>> >
>> > Is this patch ready for application?
>>
>> I don't think so, not quite. I have not had any positive reports from
>> linux users, this is only tested in a FreeBSD environment. I'd say it
>> needs some more testing.
>
> OK.

John Hansen just reported that it does work on Linux. Fine!

>> Also, apparently, ICU is installed by default in many linux
>> distributions, and usually it is version 2.8. Some linux users have
>> asked me if there are plans for a patch that works with ICU 2.8. That's
>> probably a good idea. IBM and the ICU folks seem to consider 3.2 to be
>> the stable version, older versions are hard to find on their sites, but
>> most linux distributers seem to consider it too bleeding edge, even
>> gentoo. I don't know why they don't agree.
>
> Good point. Why would linux folks need ICU? Doesn't their OS support
> encodings natively? I am particularly excited about this for OSs that
> don't have such encodings, like UTF8 support for Win32.
>
> Because ICU will not be used unless enabled by configure, it seems we
> are fine with only supporting the newest version. Do Linux users need
> to use ICU for any reason?

There are corner cases where it is impossible to upper/lowercase one
character at a time. For example:

-- without ICU
select upper('Eßer');
upper
-------
EßER
(1 row)

-- with ICU
select upper('Eßer');
upper
-------
ESSER
(1 row)

This is because in the standard postgres implementation, upper/lower is
done one character at a time. A proper upper/lower cannot do it that way.
Another known example is Turkish, where an Ì (?) should look different
depending on whether it is an initial letter or not. This fails in standard
postgresql on all platforms.
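
The length problem behind the 'Eßer' example can be shown without ICU. A minimal sketch in C: upper_per_char() maps each wide character in place with towupper(), so the output can never be longer than the input, while upper_with_expansion() is a toy stand-in for a full case mapping that special-cases only U+00DF. Both function names and the SHARP_S macro are made up for this illustration; neither function is the patch's code nor ICU's API.

```c
#include <stddef.h>
#include <wchar.h>
#include <wctype.h>

#define SHARP_S ((wchar_t) 0x00DF)   /* LATIN SMALL LETTER SHARP S */

/* Per-character mapping in place: the string length cannot change. */
static void
upper_per_char(wchar_t *s)
{
    for (size_t i = 0; s[i] != L'\0'; i++)
        s[i] = (wchar_t) towupper((wint_t) s[i]);
}

/*
 * Toy mapping that may grow the string: sharp s becomes "SS", every other
 * character goes through towupper().  dst must have room for
 * 2 * wcslen(src) + 1 wide characters.  Returns the output length.
 */
static size_t
upper_with_expansion(const wchar_t *src, wchar_t *dst)
{
    size_t n = 0;

    for (size_t i = 0; src[i] != L'\0'; i++)
    {
        if (src[i] == SHARP_S)
        {
            dst[n++] = L'S';
            dst[n++] = L'S';
        }
        else
            dst[n++] = (wchar_t) towupper((wint_t) src[i]);
    }
    dst[n] = L'\0';
    return n;
}
```

The in-place loop has nowhere to put the second 'S', which is why a one-to-one function such as towupper() leaves the sharp s untouched.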

>> > I do have a few questions:
>> >
>> > Why don't you use the lc_ctype_is_c() part of this test?
>> >
>> > if (pg_database_encoding_max_length() > 1 && !lc_ctype_is_c())
>>
>> Um, well, I didn't think about that. :) What would be the locale in
>> this case? c_C.UTF-8? ;) Hmm, it is possible to have CTYPE=C and use a
>> wide encoding, indeed. Then the strings will be handled like byte-wide
>> chars. Yeah, it's a bug. I'll fix it! Thanks.
>
> The additional test is more of an optimization, and it fixes a problem
> with some OSs that have processing problems with UTF8 when the locale is
> supposed to be turned off, like in "C". I realize ICU might be fine
> with it but the optimization still is an issue.

Well, the results are quite different, depending on whether ICU is used or
not. See separate mail.

>> > Why is so much code added, for example, in lower()? The existing
>> > multibyte code is much smaller, and lots of code is added in other
>> > places too.
>>
>> ICU uses UTF-16 internally, so all strings must be converted from the
>> database encoding to UTF-16. Since that means the strings need to be
>> copied, I took the same approach as in varlena.c:varstr_cmp(), where
>> small strings use a stack buffer and only larger strings use a palloc.
>> Comments in varstr_cmp about performance made me use that approach.
>
> Oh, interesting. I think you need to create new functions that
> factor out that common code so the patch is smaller and easier to
> maintain.

Hmm, yes, perhaps it can be refactored a bit. It has occurred to me...

>> Also, in the latest patch, I also added checks and logging for *every*
>> status returned from ICU. I hope this will help debugging on debian,
>> where previous version didn't work. That excessive status checking will
>> hardly be necessary once the stuff is better tested.
>>
>> I think the string copying and stack/palloc choices account for most of
>> the code bloat, together with the excessive status checking and logging.
>
> OK, move that into some common functions and I think it will be better.

Best way for upper/lower/initcap is probably to use a function pointer...
uhh...

>> > Why do you need to add a mapping of encoding names from iana to our
>> > names?
>>
>> This was already answered by John Hansen... There's an old thread here
>> about the choice of the name "UNICODE" to describe an encoding, which it
>> doesn't. There's half a dozen unicode based encodings... UTF-8 is used
>> by postgresql, that would have been a better name... Similarly for most
>> other encodings, really. ICU expect a setlocale(3) string (i.e. IANA).
>> PostgreSQL can't provide it, so a mapping table is required.
>
> We have deprecated UNICODE in 8.1 in favor of UTF8 (no dash). Does that
> help?

I'm aware of that. It might help for unicode, but there are a bunch of
other encodings. IANA has decided that utf-8 has *no* aliases, hence only
utf-8 (with dash, but case insensitive) is accepted. Perhaps ICU is
forgiving, I don't remember/know, but I think we need the mappings,
unfortunately.

>> I use this patch in production on one FreeBSD 4.10 server at the moment.
>> With the latest version, I've had no problems. Logging is switched on for
>> now, and it shows no signs of ICU complaining. I'd like more reports on
>> Linux, though.
>
> OK, I certainly would like this all done for 8.1 which should have
> feature freeze on July 1.

That shouldn't be a problem.

/Palle

>
> --
> Bruce Momjian | http://candle.pha.pa.us
> pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
> + If your life is a hard drive, | 13 Roberts Road
> + Christ can be your backup. | Newtown Square, Pennsylvania
> 19073

From:	Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To:	Palle Girgensohn <girgen(at)pingpong(dot)net>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Patch for collation using ICU
Date:	2005-05-07 13:52:59
Message-ID:	200505071352.j47DqxK28575@candle.pha.pa.us
Lists:	pgsql-hackers

Palle Girgensohn wrote:
> >> Also, apparently, ICU is installed by default in many linux
> >> distributions, and usually it is version 2.8. Some linux users have
> >> asked me if there are plans for a patch that works with ICU 2.8. That's
> >> probably a good idea. IBM and the ICU folks seem to consider 3.2 to be
> >> the stable version, older versions are hard to find on their sites, but
> >> most linux distributers seem to consider it too bleeding edge, even
> >> gentoo. I don't know why they don't agree.
> >
> > Good point. Why would linux folks need ICU? Doesn't their OS support
> > encodings natively? I am particularly excited about this for OSs that
> > don't have such encodings, like UTF8 support for Win32.
> >
> > Because ICU will not be used unless enabled by configure, it seems we
> > are fine with only supporting the newest version. Do Linux users need
> > to use ICU for any reason?
>
>
> There are corner cases where it is impossible to upper/lowercase one
> character at a time. For example:
>
> -- without ICU
> select upper('Eßer');
> upper
> -------
> EßER
> (1 row)
>
> -- with ICU
> select upper('Eßer');
> upper
> -------
> ESSER
> (1 row)
>
> This is because in the standard postgres implementation, upper/lower is
> done one character at a time. A proper upper/lower cannot do it that way.
> Another known example is Turkish, where an Ì (?) should look different
> depending on whether it is an initial letter or not. This fails in standard
> postgresql on all platforms.

Uh, where do you see that? Our code has:

    workspace = texttowcs(string);

    for (i = 0; workspace[i] != 0; i++)
        workspace[i] = towupper(workspace[i]);

    result = wcstotext(workspace, i);

> >> Also, in the latest patch, I also added checks and logging for *every*
> >> status returned from ICU. I hope this will help debugging on debian,
> >> where previous version didn't work. That excessive status checking will
> >> hardly be necessary once the stuff is better tested.
> >>
> >> I think the string copying and stack/palloc choices account for most of
> >> the code bloat, together with the excessive status checking and logging.
> >
> > OK, move that into some common functions and I think it will be better.
>
> Best way for upper/lower/initcap is probably to use a function pointer...
> uhh...

Uh, I don't think so. Just send pointers to the function and let
the function allocate the memory, and another function to free them, or
something like that. I can probably do it if you want.
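
One way to read the suggestion above is a pair of helpers that hide the small-buffer choice from upper(), lower() and initcap(). A minimal sketch in C, assuming malloc()/free() as stand-ins for the backend's palloc()/pfree(); ScratchBuf, scratch_acquire(), scratch_release() and the STACK_BUF_LEN threshold are hypothetical names and values, not taken from the patch or from varlena.c:

```c
#include <stdlib.h>

#define STACK_BUF_LEN 1024       /* hypothetical small-string threshold */

typedef struct
{
    char  stack[STACK_BUF_LEN];  /* used when the request fits */
    char *data;                  /* points at stack[] or at an allocated block */
} ScratchBuf;

/*
 * Hand back a buffer of at least 'needed' bytes.  Small requests reuse the
 * caller's stack storage; larger ones are allocated.  Returns NULL only if
 * the allocation fails.  The ScratchBuf must not be copied while in use,
 * because data may point into the struct itself.
 */
static char *
scratch_acquire(ScratchBuf *buf, size_t needed)
{
    if (needed <= sizeof(buf->stack))
        buf->data = buf->stack;
    else
        buf->data = malloc(needed);      /* palloc() inside the backend */
    return buf->data;
}

/* Free the buffer only if it was allocated, then reset the pointer. */
static void
scratch_release(ScratchBuf *buf)
{
    if (buf->data != NULL && buf->data != buf->stack)
        free(buf->data);                 /* pfree() inside the backend */
    buf->data = NULL;
}
```

A caller would declare a ScratchBuf as a local variable, call scratch_acquire() with the size needed for the converted string, and call scratch_release() before returning, so the size test lives in one place.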

> >> > Why do you need to add a mapping of encoding names from iana to our
> >> > names?
> >>
> >> This was already answered by John Hansen... There's an old thread here
> >> about the choice of the name "UNICODE" to describe an encoding, which it
> >> doesn't. There's half a dozen unicode based encodings... UTF-8 is used
> >> by postgresql, that would have been a better name... Similarly for most
> >> other encodings, really. ICU expect a setlocale(3) string (i.e. IANA).
> >> PostgreSQL can't provide it, so a mapping table is required.
> >
> > We have deprecated UNICODE in 8.1 in favor of UTF8 (no dash). Does that
> > help?
>
> I'm aware of that. It might help for unicode, but there are a bunch of
> other encodings. IANA has decided that utf-8 has *no* aliases, hence only
> utf-8 (with dash, but case insensitive) is accepted. Perhaps ICU is
> forgiving, I don't remember/know, but I think we need the mappings,
> unfortunately.

OK. I guess I am just confused why the native implementations are OK.

From:	Palle Girgensohn <girgen(at)pingpong(dot)net>
To:	Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Patch for collation using ICU
Date:	2005-05-07 14:10:30
Message-ID:	0B537F6953FA3B724B5761A7@palle.girgensohn.se
Lists:	pgsql-hackers

--On Saturday, May 07, 2005 09.52.59 -0400 Bruce Momjian
<pgman(at)candle(dot)pha(dot)pa(dot)us> wrote:

> Palle Girgensohn wrote:
>> >> Also, apparently, ICU is installed by default in many linux
>> >> distributions, and usually it is version 2.8. Some linux users have
>> >> asked me if there are plans for a patch that works with ICU 2.8.
>> >> That's probably a good idea. IBM and the ICU folks seem to consider
>> >> 3.2 to be the stable version, older versions are hard to find on
>> >> their sites, but most linux distributers seem to consider it too
>> >> bleeding edge, even gentoo. I don't know why they don't agree.
>> >
>> > Good point. Why would linux folks need ICU? Doesn't their OS support
>> > encodings natively? I am particularly excited about this for OSs that
>> > don't have such encodings, like UTF8 support for Win32.
>> >
>> > Because ICU will not be used unless enabled by configure, it seems we
>> > are fine with only supporting the newest version. Do Linux users need
>> > to use ICU for any reason?
>>
>>
>> There are corner cases where it is impossible to upper/lowercase one
>> character at a time. For example:
>>
>> -- without ICU
>> select upper('Eßer');
>> upper
>> -------
>> EßER
>> (1 row)
>>
>> -- with ICU
>> select upper('Eßer');
>> upper
>> -------
>> ESSER
>> (1 row)
>>
>> This is because in the standard postgres implementation, upper/lower is
>> done one character at a time. A proper upper/lower cannot do it that
>> way. Another known example is Turkish, where an Ì (?) should look
>> different depending on whether it is an initial letter or not. This
>> fails in standard postgresql on all platforms.
>
> Uh, where do you see that? Our code has:
>
> workspace = texttowcs(string);
>
> for (i = 0; workspace[i] != 0; i++)
> workspace[i] = towupper(workspace[i]);

As you see, the loop runs towupper for one character at a time. It cannot
consider whether the letter is the initial, as required in Turkish, and it
cannot really convert one character into two ('ß' -> 'SS').

>
> result = wcstotext(workspace, i);
>
>
>> >> Also, in the latest patch, I also added checks and logging for *every*
>> >> status returned from ICU. I hope this will help debugging on debian,
>> >> where previous version didn't work. That excessive status checking will
>> >> hardly be necessary once the stuff is better tested.
>> >>
>> >> I think the string copying and stack/palloc choices account for most of
>> >> the code bloat, together with the excessive status checking and
>> >> logging.
>> >
>> > OK, move that into some common functions and I think it will be better.
>>
>> Best way for upper/lower/initcap is probably to use a function
>> pointer... uhh...
>
> Uh, I don't think so. Just send pointers to the function and let
> the function allocate the memory, and another function to free them, or
> something like that. I can probably do it if you want.

I'll check it out, it seems simple enough.

>> > We have deprecated UNICODE in 8.1 in favor of UTF8 (no dash). Does
>> > that help?
>>
>> I'm aware of that. It might help for unicode, but there are a bunch of
>> other encodings. IANA has decided that utf-8 has *no* aliases, hence
>> only utf-8 (with dash, but case insensitive) is accepted. Perhaps ICU is
>> forgiving, I don't remember/know, but I think we need the mappings,
>> unfortunately.
>
> OK. I guess I am just confused why the native implementations are OK.

They're OK since they understand that UNICODE (or UTF8) is really utf-8.
Problem is the strings used to describe them are not understood by ICU.

BTW, the pg_enc2iananame_tbl is only used *from* internal representation
*to* IANA, not the other way around. Maybe that fact lowers the rate of
confusion? ;-)
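
A one-way lookup of the kind described above could look like the following. This is only a sketch in C: the entries are a small illustrative subset, and EncodingAlias, alias_tbl and pg_name_to_iana() are hypothetical names, not the contents or interface of the patch's pg_enc2iananame_tbl.

```c
#include <stddef.h>
#include <string.h>

typedef struct
{
    const char *pg_name;     /* name as PostgreSQL spells it */
    const char *iana_name;   /* name to hand to the converter */
} EncodingAlias;

/* Illustrative subset only; a real table would cover every encoding. */
static const EncodingAlias alias_tbl[] = {
    {"UNICODE", "UTF-8"},
    {"UTF8",    "UTF-8"},
    {"LATIN1",  "ISO-8859-1"},
    {"LATIN2",  "ISO-8859-2"},
    {"EUC_JP",  "EUC-JP"},
    {"EUC_KR",  "EUC-KR"},
};

/* Map a PostgreSQL encoding name to an IANA name; NULL if not listed. */
static const char *
pg_name_to_iana(const char *pg_name)
{
    for (size_t i = 0; i < sizeof(alias_tbl) / sizeof(alias_tbl[0]); i++)
    {
        if (strcmp(alias_tbl[i].pg_name, pg_name) == 0)
            return alias_tbl[i].iana_name;
    }
    return NULL;
}
```

Because the lookup only ever goes from the internal name to the IANA name, two internal spellings such as UNICODE and UTF8 can share one target without ambiguity.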

/Palle

From:	Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To:	Palle Girgensohn <girgen(at)pingpong(dot)net>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Patch for collation using ICU
Date:	2005-05-07 14:14:41
Message-ID:	200505071414.j47EEfZ02040@candle.pha.pa.us
Lists:	pgsql-hackers

Palle Girgensohn wrote:
> >> This is because in the standard postgres implementation, upper/lower is
> >> done one character at a time. A proper upper/lower cannot do it that
> >> way. Another known example is Turkish, where an Ì (?) should look
> >> different depending on whether it is an initial letter or not. This
> >> fails in standard postgresql on all platforms.
> >
> > Uh, where do you see that? Our code has:
> >
> > workspace = texttowcs(string);
> >
> > for (i = 0; workspace[i] != 0; i++)
> > workspace[i] = towupper(workspace[i]);
>
> As you see, the loop runs towupper for one character at a time. It cannot
> consider whether the letter is the initial, as required in Turkish, and it
> cannot really convert one character into two ('ß' -> 'SS').

Oh, OK. I thought texttowcs() would expand the string to allow such
conversions.

> >> > We have deprecated UNICODE in 8.1 in favor of UTF8 (no dash). Does
> >> > that help?
> >>
> >> I'm aware of that. It might help for unicode, but there are a bunch of
> >> other encodings. IANA has decided that utf-8 has *no* aliases, hence
> >> only utf-8 (with dash, but case insensitive) is accepted. Perhaps ICU is
> >> forgiving, I don't remember/know, but I think we need the mappings,
> >> unfortunately.
> >
> > OK. I guess I am just confused why the native implementations are OK.
>
> They're OK since they understand that UNICODE (or UTF8) is really utf-8.
> Problem is the strings used to describe them are not understood by ICU.
>
> BTW, the pg_enc2iananame_tbl is only used *from* internal representation
> *to* IANA, not the other way around. Maybe that fact lowers the rate of
> confusion? ;-)

OK, got it. I am still a little confused why every native
implementation understands our existing names but ICU does not.