Re: locale

From: Dennis Bjorklund <db(at)zigo(dot)dhs(dot)org>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Tatsuo Ishii <t-ishii(at)sra(dot)co(dot)jp>, <pgman(at)candle(dot)pha(dot)pa(dot)us>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: locale
Date: 2004-04-08 15:31:59
Message-ID: Pine.LNX.4.44.0404081729510.4551-100000@zigo.dhs.org
Lists: pgsql-hackers

On Thu, 8 Apr 2004, Tom Lane wrote:

> No, the ordering *will* be the same as it was before, because strcoll()
> is still functioning the same. You'd get the same answer from a sort
> operation since it depends on the same operators.
>
> It interprets them according to LC_CTYPE, which does not change.

I'm afraid that I don't understand you yet, and would like to have
it explained in more detail if possible. I feel a bit stupid for not
understanding what you are stating, but I'm sure there are more than me who
feel like that :-)

Maybe we can look at an example. Let us take some utf-8 strings correctly
ordered in Swedish:

Åke
Ära

now, since these are utf-8 they are encoded as

c3 85 6b 65 (Åke)
c3 84 72 61 (Ära)

and that is the order they have in the index.

Now, this index is copied into a new database where
the encoding is Latin1. In that table we now want to
look up the string that in Latin1 is represented as

c3 84 72 61

So we look in the index and see that the first row in the index is
not the same. But now, when we compare these strings as Latin1 strings,
it's no longer the case that c3 84 72 61 > c3 85 6b 65. As Latin1 strings
we compare each character: c3 = c3, and then 84 < 85 (in Latin1, 84
and 85 are control characters). So we will not find this string
in the index, since we think it should have been before the first entry.

We might even insert a new copy of this string in another
position in the index.
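
The scenario above can be sketched in Python. This is only an illustration under stated assumptions: a plain list stands in for the index, an ordered linear scan stands in for the index lookup, and byte-by-byte comparison stands in for comparing the strings as Latin1 (a real strcoll() may behave differently). The helper names hex_bytes and scan are hypothetical and are not PostgreSQL code.

```python
# Index entries stored as UTF-8 bytes, in Swedish order (Å sorts before Ä).
index = ["Åke".encode("utf-8"), "Ära".encode("utf-8")]


def hex_bytes(data):
    """Format bytes the way they are written in the message, e.g. 'c3 85 6b 65'."""
    return " ".join(f"{b:02x}" for b in data)


def scan(index, key):
    """Ordered scan (hypothetical stand-in for an index lookup).

    Returns (found, position where the scan stopped). The scan gives up
    as soon as it meets an entry that compares greater than the key,
    because in a correctly sorted index the key could not come later.
    """
    for pos, entry in enumerate(index):
        if entry == key:
            return True, pos
        if entry > key:  # byte-by-byte comparison, as assumed for Latin1
            return False, pos
    return False, len(index)


for entry in index:
    print(hex_bytes(entry))

# The string we look up, given by its bytes.
key = bytes.fromhex("c3847261")

found, pos = scan(index, key)
print("found:", found, "stopped at position:", pos)

# Inserting at the position the scan suggests creates a second copy.
broken_index = list(index)
broken_index.insert(pos, key)
print("copies of key after insert:", broken_index.count(key))
```

The key is present in the list, yet the scan stops at the first entry because c3 85 already compares greater than c3 84, which is the failure described above.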

So, my questions are:

a) What have we gained by copying this table into the Latin1 database?
It looks broken to me. As far as I understand, we have to rebuild
the index to get something that works at least a little.

b) Maybe one should not just reindex but also reencode. In some cases that
works and produces a good result, for example from Latin1 to utf-8.

c) If we are going to reindex anyway, then why not do that and solve the
per-database locale as well? This point is independent of a) and b);
I still want to understand the first two points even if we don't
talk about per-database locale.
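
As a small illustration of point b), here is a sketch of reencoding in Python. The reencode helper is hypothetical, and the euro sign is an example chosen here, not one from this thread. It shows the Latin1 to utf-8 direction succeeding, and the reverse direction failing for a character that Latin1 cannot represent.

```python
def reencode(data, src, dst):
    """Decode bytes from one encoding and encode them in another (hypothetical helper)."""
    return data.decode(src).encode(dst)


# "Åke" as a Latin1 database would store it: one byte per character.
latin1_bytes = "Åke".encode("latin-1")

# Latin1 -> utf-8 works for every Latin1 string.
utf8_bytes = reencode(latin1_bytes, "latin-1", "utf-8")
print(" ".join(f"{b:02x}" for b in latin1_bytes))
print(" ".join(f"{b:02x}" for b in utf8_bytes))

# utf-8 -> Latin1 fails when a character has no Latin1 code point.
try:
    reencode("€".encode("utf-8"), "utf-8", "latin-1")
    reverse_ok = True
except UnicodeEncodeError:
    reverse_ok = False
print("utf-8 -> Latin1 for the euro sign worked:", reverse_ok)
```

After reencoding, the index would still have to be rebuilt, since the stored bytes and therefore their comparison order have changed.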

--
/Dennis Björklund
