Re: Patch for collation using ICU

Lists: pgsql-hackers
From: "John Hansen" <john(at)geeknet(dot)com(dot)au>
To: "Tatsuo Ishii" <t-ishii(at)sra(dot)co(dot)jp>
Cc: <alvherre(at)dcc(dot)uchile(dot)cl>, <pgman(at)candle(dot)pha(dot)pa(dot)us>, <girgen(at)pingpong(dot)net>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Patch for collation using ICU
Date: 2005-05-08 08:47:25
Message-ID: 5066E5A966339E42AA04BA10BA706AE50A930F@rodrick.geeknet.com.au

Tatsuo Ishii
> Sent: Sunday, May 08, 2005 3:41 PM
> To: John Hansen
> Cc: alvherre(at)dcc(dot)uchile(dot)cl; pgman(at)candle(dot)pha(dot)pa(dot)us;
> girgen(at)pingpong(dot)net; pgsql-hackers(at)postgresql(dot)org
> Subject: Re: [HACKERS] Patch for collation using ICU
>
> > Alvaro Herrera wrote:
> > > Sent: Sunday, May 08, 2005 2:49 PM
> > > To: John Hansen
> > > Cc: Tatsuo Ishii; pgman(at)candle(dot)pha(dot)pa(dot)us; girgen(at)pingpong(dot)net;
> > > pgsql-hackers(at)postgresql(dot)org
> > > Subject: Re: [HACKERS] Patch for collation using ICU
> > >
> > > On Sun, May 08, 2005 at 02:07:29PM +1000, John Hansen wrote:
> > > > Tatsuo Ishii wrote:
> > >
> > > > > So Japanese(including ASCII)/UNICODE behavior is
> > > perfectly correct
> > > > > at this moment.
> > > >
> > > > Right, so you _never_ use accented ascii characters in
> Japanese?
> > > > (like è for example, whose uppercase is È)
> > >
> > > That isn't ASCII. It's latin1 or some other ASCII extension.
> >
> > Point taken...
> > But...
> >
> > If you want EUC_JP (Japanese + ASCII) then use that as your
> backend encoding, not UTF-8 (unicode).
> > UTF-8 encoded databases are very useful for representing multiple
> > languages in the same database, but this usefulness
> vanishes if functions like upper/lower doesn't work correctly.
>
> I'm just curious if Germany/French/Spanish mixed text can be
> sorted correctly. I think these languages need their own
> locales even with UNICODE/ICU.

No, they will not sort correctly; for that you still need the locale.

>
> > So optimizing for 3 languages breaks more than a hundred,
> that's doesn't seem fair!

That is a compromise I'd be willing to agree on. :)
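
To make the upper/lower concern quoted above concrete, here is a minimal Python sketch. It assumes the failure mode is a case-mapping routine that only knows the ASCII letters a-z; the helper names are hypothetical and this is not PostgreSQL code.

```python
# Illustration only: both helpers are hypothetical, not PostgreSQL functions.

def ascii_only_upper(text: str) -> str:
    """Uppercase a-z and leave every other character untouched."""
    return "".join(
        chr(ord(ch) - 32) if "a" <= ch <= "z" else ch
        for ch in text
    )

def unicode_upper(text: str) -> str:
    """Uppercase using Python's built-in Unicode case mapping."""
    return text.upper()

word = "cr\u00e8me"                     # "crème", with a precomposed è
ascii_result = ascii_only_upper(word)   # the accented letter is not converted
unicode_result = unicode_upper(word)    # every letter is converted
print(ascii_result, unicode_result)
```

The ASCII-only routine leaves è as it is, while the Unicode-aware mapping produces È, which is the behaviour a UTF-8 database holding several languages would need.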

> Why don't you add a GUC variable or some such to control the
> upper/lower behavior?
> --
> Tatsuo Ishii
>
>


From: Tatsuo Ishii <t-ishii(at)sra(dot)co(dot)jp>
To: john(at)geeknet(dot)com(dot)au
Cc: alvherre(at)dcc(dot)uchile(dot)cl, pgman(at)candle(dot)pha(dot)pa(dot)us, girgen(at)pingpong(dot)net, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Patch for collation using ICU
Date: 2005-05-08 13:19:25
Message-ID: 20050508.221925.78726559.t-ishii@sra.co.jp
Lists: pgsql-hackers

> > > > On Sun, May 08, 2005 at 02:07:29PM +1000, John Hansen wrote:
> > > > > Tatsuo Ishii wrote:
> > > >
> > > > > > So Japanese(including ASCII)/UNICODE behavior is
> > > > perfectly correct
> > > > > > at this moment.
> > > > >
> > > > > Right, so you _never_ use accented ascii characters in
> > Japanese?
> > > > > (like è for example, whose uppercase is È)
> > > >
> > > > That isn't ASCII. It's latin1 or some other ASCII extension.
> > >
> > > Point taken...
> > > But...
> > >
> > > If you want EUC_JP (Japanese + ASCII) then use that as your
> > backend encoding, not UTF-8 (unicode).
> > > UTF-8 encoded databases are very useful for representing multiple
> > > languages in the same database, but this usefulness
> > vanishes if functions like upper/lower doesn't work correctly.
> >
> > I'm just curious if Germany/French/Spanish mixed text can be
> > sorted correctly. I think these languages need their own
> > locales even with UNICODE/ICU.
>
> No, they will not sort correctly, for that you still need the locale.

I'm confused. I thought the ICU patch was intended for use on
platforms with broken locale support?
--
Tatsuo Ishii


From: Palle Girgensohn <girgen(at)pingpong(dot)net>
To: Tatsuo Ishii <t-ishii(at)sra(dot)co(dot)jp>, john(at)geeknet(dot)com(dot)au
Cc: alvherre(at)dcc(dot)uchile(dot)cl, pgman(at)candle(dot)pha(dot)pa(dot)us, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Patch for collation using ICU
Date: 2005-05-08 13:33:59
Message-ID: B59D1248203A8B40C12D7B98@palle.girgensohn.se
Lists: pgsql-hackers

--On Sunday, May 08, 2005 22.19.25 +0900 Tatsuo Ishii <t-ishii(at)sra(dot)co(dot)jp>
wrote:

>> > > > On Sun, May 08, 2005 at 02:07:29PM +1000, John Hansen wrote:
>> > > > > Tatsuo Ishii wrote:
>> > > >
>> > > > > > So Japanese(including ASCII)/UNICODE behavior is
>> > > > perfectly correct
>> > > > > > at this moment.
>> > > > >
>> > > > > Right, so you _never_ use accented ascii characters in
>> > Japanese?
>> > > > > (like è for example, whose uppercase is È)
>> > > >
>> > > > That isn't ASCII. It's latin1 or some other ASCII extension.
>> > >
>> > > Point taken...
>> > > But...
>> > >
>> > > If you want EUC_JP (Japanese + ASCII) then use that as your
>> > backend encoding, not UTF-8 (unicode).
>> > > UTF-8 encoded databases are very useful for representing multiple
>> > > languages in the same database, but this usefulness
>> > vanishes if functions like upper/lower doesn't work correctly.
>> >
>> > I'm just curious if Germany/French/Spanish mixed text can be
>> > sorted correctly. I think these languages need their own
>> > locales even with UNICODE/ICU.
>>
>> No, they will not sort correctly, for that you still need the locale.
>
> I'm confused. I thought the ICU patches is intended for using on
> broken locale platforms?

It will sort correctly in *one* locale, using ICU. You still cannot mix
different locales in the same database cluster; unfortunately, the
collation locale is still fixed at initdb time.
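
A rough illustration of why one fixed collation locale cannot serve mixed-language text: the same three words sort differently under German-style and Swedish-style rules. The two key functions below are toy approximations written for this example (Swedish is used only because its contrast with German is easy to show); they are not ICU calls and ignore most real collation rules.

```python
# Toy collation keys; real locale rules (and ICU's) are far more detailed.

# German dictionary-style order: ä, ö, ü sort with their base vowel.
GERMAN_FOLD = str.maketrans("\u00e4\u00f6\u00fc", "aou")

# Swedish-style order: å, ä, ö are separate letters placed after z.
SWEDISH_ALPHABET = "abcdefghijklmnopqrstuvwxyz\u00e5\u00e4\u00f6"

def german_key(word: str) -> str:
    """Sort key that folds umlauts onto their base vowels."""
    return word.lower().translate(GERMAN_FOLD)

def swedish_key(word: str) -> list:
    """Sort key built from positions in the Swedish alphabet.

    Raises ValueError for characters outside that alphabet, which is
    acceptable for this small example.
    """
    return [SWEDISH_ALPHABET.index(ch) for ch in word.lower()]

words = ["Zebra", "\u00c4rger", "Apfel"]         # "Ärger" has a precomposed Ä
german_order = sorted(words, key=german_key)     # Apfel, Ärger, Zebra
swedish_order = sorted(words, key=swedish_key)   # Apfel, Zebra, Ärger
print(german_order)
print(swedish_order)
```

Neither ordering is wrong; each is correct for its own locale, which is why a single locale chosen at initdb time cannot order text from several languages the way each expects.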

/Palle


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Palle Girgensohn <girgen(at)pingpong(dot)net>
Cc: Tatsuo Ishii <t-ishii(at)sra(dot)co(dot)jp>, john(at)geeknet(dot)com(dot)au, alvherre(at)dcc(dot)uchile(dot)cl, pgman(at)candle(dot)pha(dot)pa(dot)us, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Patch for collation using ICU
Date: 2005-05-08 16:46:47
Message-ID: 12273.1115570807@sss.pgh.pa.us
Lists: pgsql-hackers

Palle Girgensohn <girgen(at)pingpong(dot)net> writes:
>> I'm confused. I thought the ICU patches is intended for using on
>> broken locale platforms?

> It will sort correctly in *one* locale, using ICU. You still cannot mix
> different locales in the same database cluster, the collation locale is
> still fixed at initdb time, unfortunately.

I thought the point of using ICU was to be able to dig out from under
that restriction? It's a bit of a large pill to swallow if we will
still have to throw it away someday to become SQL spec compliant.

regards, tom lane


From: "Palle Girgensohn" <girgen(at)pingpong(dot)net>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Palle Girgensohn" <girgen(at)pingpong(dot)net>, "Tatsuo Ishii" <t-ishii(at)sra(dot)co(dot)jp>, john(at)geeknet(dot)com(dot)au, alvherre(at)dcc(dot)uchile(dot)cl, pgman(at)candle(dot)pha(dot)pa(dot)us, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Patch for collation using ICU
Date: 2005-05-08 22:31:26
Message-ID: 47900.62.148.39.163.1115591486.squirrel@62.148.39.163
Lists: pgsql-hackers

> Palle Girgensohn <girgen(at)pingpong(dot)net> writes:
>>> I'm confused. I thought the ICU patches is intended for using on
>>> broken locale platforms?
>
>> It will sort correctly in *one* locale, using ICU. You still cannot mix
>> different locales in the same database cluster, the collation locale is
>> still fixed at initdb time, unfortunately.
>
> I thought the point of using ICU was to be able to dig out from under
> that restriction?

I think it might be quite possible to mix several locales using ICU. It's
just that this is not what the patch does at the moment. It just finds out
the locale set at initdb and uses it for collation with ICU.

Handling mixed locales for collation has a few hard problems, AFAIK.
First, isn't the main obstacle to mixing collations that indices require
a single well-defined locale? I assume that locale-dependent comparison
(collation) is used when indexing tuples, right? As long as a specific
locale's collation is used for indexing text fields, I believe we cannot
easily mix different locales, right? Second, how do we tell the backend
which locale to use? Is there some SQL spec for this?
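
The index concern can be sketched in a few lines of Python. An index is modelled here as a list kept sorted under one comparison rule and searched by binary search; the names are hypothetical and this is not how PostgreSQL's index code is written. The two "collations" are deliberately simple stand-ins: plain code point order and case-insensitive order.

```python
import bisect

def codepoint_key(s: str) -> str:
    """Stand-in collation A: raw code point order."""
    return s

def caseless_key(s: str) -> str:
    """Stand-in collation B: case-insensitive order."""
    return s.lower()

def index_contains(index: list, key_func, target: str) -> bool:
    """Binary-search a sorted list, comparing entries with key_func.

    The answer is only reliable if the list was sorted with the same
    key_func that is used for the lookup.
    """
    keys = [key_func(entry) for entry in index]
    wanted = key_func(target)
    pos = bisect.bisect_left(keys, wanted)
    return pos < len(keys) and keys[pos] == wanted

# Build the "index" under collation B.
index = sorted(["cherry", "Banana", "apple"], key=caseless_key)

found_same_rule = index_contains(index, caseless_key, "Banana")    # True
found_other_rule = index_contains(index, codepoint_key, "Banana")  # False

print(found_same_rule, found_other_rule)
```

The entry is present in both cases, but the search under the other ordering walks the wrong way and misses it. That is the sense in which an index ties its text column to one well-defined collation.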

> It's a bit of a large pill to swallow if we will still
> have to throw it away someday to become SQL spec compliant.

What do we need to be SQL spec compliant in this respect?

/Palle