Re: locale

Lists: pgsql-hackers
From: Dennis Bjorklund <db(at)zigo(dot)dhs(dot)org>
To: pgsql-hackers(at)postgresql(dot)org
Subject: locale
Date: 2004-04-07 11:04:09
Message-ID: Pine.LNX.4.44.0404071257590.4551-100000@zigo.dhs.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Is anyone working to make the locale support in pg better? Running initdb
to set the locale is a bit heavy. It would be nice to at least be able to
set it per database.

--
/Dennis Björklund


From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: Dennis Bjorklund <db(at)zigo(dot)dhs(dot)org>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: locale
Date: 2004-04-07 17:17:21
Message-ID: 200404071717.i37HHLj11633@candle.pha.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Dennis Bjorklund wrote:
> Is anyone working to make the locale support in pg better? Running initdb
> to set the locale is a bit heavy. It would be nice to at least be able to
> set it per database.

Uh, createdb and CREATE DATABASE both have encoding options. initdb
only sets the encoding for template1, and the default for future
databases, but you can override it.

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073


From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: Dennis Bjorklund <db(at)zigo(dot)dhs(dot)org>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: locale
Date: 2004-04-07 17:28:45
Message-ID: 200404071728.i37HSjI13944@candle.pha.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Dennis Bjorklund wrote:
> Is anyone working to make the locale support in pg better? Running initdb
> to set the locale is a bit heavy. It would be nice to at least be able to
> set it per database.

Oops, I confused locale and multibyte. Yes, I don't see a way to change
locale for new databases, but I don't see why we can't.

The initdb manual page says:

initdb initializes the database cluster's default locale
and character set encoding. Some locale categories are
fixed for the lifetime of the cluster, so it is important
to make the right choice when running initdb. Other
locale categories can be changed later when the server is
started. initdb will write those locale settings into the
postgresql.conf configuration file so they are the
default, but they can be changed by editing that file. To
set the locale that initdb uses, see the description of
the --locale option. The character set encoding can be set
separately for each database as it is created. initdb
determines the encoding for the template1 database, which
will serve as the default for all other databases. To
alter the default encoding use the --encoding option.

and

--locale=locale
Sets the default locale for the database cluster.
If this option is not specified, the locale is
inherited from the environment that initdb runs in.

--lc-collate=locale

--lc-ctype=locale

--lc-messages=locale

--lc-monetary=locale

--lc-numeric=locale

--lc-time=locale
Like --locale, but only sets the locale in the
specified category.

My only guess is that you can use ALTER DATABASE SET to set some of the
values when someone connects to the database.

Looking at guc.c I see:

{
{"lc_collate", PGC_INTERNAL, CLIENT_CONN_LOCALE,
gettext_noop("Shows the collation order locale."),
NULL,
GUC_NOT_IN_SAMPLE | GUC_DISALLOW_IN_FILE
},
&locale_collate,
"C", NULL, NULL
},

{
{"lc_ctype", PGC_INTERNAL, CLIENT_CONN_LOCALE,
gettext_noop("Shows the character classification and case conversion locale."),
NULL,
GUC_NOT_IN_SAMPLE | GUC_DISALLOW_IN_FILE
},
&locale_ctype,
"C", NULL, NULL
},

{
{"lc_messages", PGC_SUSET, CLIENT_CONN_LOCALE,
gettext_noop("Sets the language in which messages are displayed."),
NULL
},
&locale_messages,
"", locale_messages_assign, NULL
},

{
{"lc_monetary", PGC_USERSET, CLIENT_CONN_LOCALE,
gettext_noop("Sets the locale for formatting monetary amounts."),
NULL
},
&locale_monetary,
"C", locale_monetary_assign, NULL
},

{
{"lc_numeric", PGC_USERSET, CLIENT_CONN_LOCALE,
gettext_noop("Sets the locale for formatting numbers."),
NULL
},
&locale_numeric,
"C", locale_numeric_assign, NULL
},

{
{"lc_time", PGC_USERSET, CLIENT_CONN_LOCALE,
gettext_noop("Sets the locale for formatting date and time values."),
NULL
},
&locale_time,
"C", locale_time_assign, NULL
},

You can't change the internal ones, but you can modify some of the others.

Anyone know why we don't allow locale to be set per database?

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073


From: Peter Eisentraut <peter_e(at)gmx(dot)net>
To: Dennis Bjorklund <db(at)zigo(dot)dhs(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: locale
Date: 2004-04-07 17:31:58
Message-ID: 200404071931.58121.peter_e@gmx.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Dennis Bjorklund wrote:
> Is anyone working to make the locale support in pg better? Running
> initdb to set the locale is a bit heavy. It would be nice to at least
> be able to set it per database.

I was supposed to do that but I got distracted. I send out a longish
implementation and transition plan some time ago, if you're interested.
Setting the locale per database is quite doable actually, you just need
a plan to prevent corruption of the shared system catalogs and you need
to deal with modifications of the template database(s). There was some
discussion about that as well. See the thread "Translations in the
distributions" around 2004-01-09. I can help out if you want to do
what was discussed there.


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: locale
Date: 2004-04-07 17:52:57
Message-ID: 40743FF9.1090708@dunslane.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Bruce Momjian wrote:

>Dennis Bjorklund wrote:
>
>
>>Is anyone working to make the locale support in pg better? Running initdb
>>to set the locale is a bit heavy. It would be nice to at least be able to
>>set it per database.
>>
>>
>
>Uh, createdb and CREATE DATABASE both have encoding options. initdb
>only sets the encoding for template1, and the default for future
>databases, but you can override it.
>
>
>

That is true for encoding, but not true for LC_CTYPE and LC_COLLATE
locale settings, which only initdb can set.

cheers

andrew


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc: Dennis Bjorklund <db(at)zigo(dot)dhs(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: locale
Date: 2004-04-07 18:29:16
Message-ID: 25045.1081362556@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> writes:
> Anyone know why we don't allow locale to be set per database?

Changing it on the fly would corrupt index sort ordering. See also
Peter's response nearby.

regards, tom lane


From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Dennis Bjorklund <db(at)zigo(dot)dhs(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: locale
Date: 2004-04-07 18:34:37
Message-ID: 200404071834.i37IYb124771@candle.pha.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Tom Lane wrote:
> Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> writes:
> > Anyone know why we don't allow locale to be set per database?
>
> Changing it on the fly would corrupt index sort ordering. See also
> Peter's response nearby.

I was asking why we can't set it to a new static value when we create
the database. I don't think the poster was asking for the ability to
change it after the database was created.

Added to TODO:

* Allow locale to be set at database creation

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc: Dennis Bjorklund <db(at)zigo(dot)dhs(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: locale
Date: 2004-04-07 18:40:41
Message-ID: 25877.1081363241@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> writes:
> I was asking why we can't set it to a new static value when we create
> the database.

Because that would corrupt indexes on shared tables. (It might be
possible to finesse that, but it's not a no-brainer.)

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>, Dennis Bjorklund <db(at)zigo(dot)dhs(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: locale
Date: 2004-04-07 18:42:33
Message-ID: 26192.1081363353@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

I said:
> Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> writes:
>> I was asking why we can't set it to a new static value when we create
>> the database.

> Because that would corrupt indexes on shared tables. (It might be
> possible to finesse that, but it's not a no-brainer.)

And even more to the point, it would corrupt non-shared indexes
inherited from template1. This could not be finessed --- AFAICS you'd
need to do the equivalent of a REINDEX in the new database to make it
work.

regards, tom lane


From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Dennis Bjorklund <db(at)zigo(dot)dhs(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: locale
Date: 2004-04-07 18:42:40
Message-ID: 200404071842.i37IgeG00434@candle.pha.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Tom Lane wrote:
> Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> writes:
> > I was asking why we can't set it to a new static value when we create
> > the database.
>
> Because that would corrupt indexes on shared tables. (It might be
> possible to finesse that, but it's not a no-brainer.)

Oh, I hadn't thought of that. The problem isn't encoding, because we
handle that already, but differen representations of time and stuff?

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc: Dennis Bjorklund <db(at)zigo(dot)dhs(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: locale
Date: 2004-04-07 18:57:53
Message-ID: 26413.1081364273@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> writes:
> Tom Lane wrote:
>> Because that would corrupt indexes on shared tables. (It might be
>> possible to finesse that, but it's not a no-brainer.)

> Oh, I hadn't thought of that. The problem isn't encoding, because we
> handle that already, but differen representations of time and stuff?

No, the problem is sort ordering of indexes on textual columns.

regards, tom lane


From: Dennis Bjorklund <db(at)zigo(dot)dhs(dot)org>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: locale
Date: 2004-04-07 18:58:52
Message-ID: Pine.LNX.4.44.0404072049480.4551-100000@zigo.dhs.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, 7 Apr 2004, Tom Lane wrote:

> And even more to the point, it would corrupt non-shared indexes
> inherited from template1. This could not be finessed --- AFAICS you'd
> need to do the equivalent of a REINDEX in the new database to make it
> work.

From what I can tell there is only 3 tables we talk about:

pg_database
pg_shadow
pg_group

and in each case there is the name column that is indexed (that matters to
us, int columns are the same no matter what locale).

These name columns all use the special name datatype, maybe one could
simply treat name differently, like comparing these strings bytewise.

For my small databases I don't even need an index on any of these. But I
can imaging someone having a couple of thousand users.

--
/Dennis Björklund


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc: Dennis Bjorklund <db(at)zigo(dot)dhs(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: locale
Date: 2004-04-07 19:00:57
Message-ID: 26478.1081364457@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> writes:
> Added to TODO:
> * Allow locale to be set at database creation

BTW, that is redundant with the locale todo items already present.

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Dennis Bjorklund <db(at)zigo(dot)dhs(dot)org>
Cc: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: locale
Date: 2004-04-07 19:07:03
Message-ID: 26549.1081364823@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Dennis Bjorklund <db(at)zigo(dot)dhs(dot)org> writes:
> From what I can tell there is only 3 tables we talk about:
> pg_database
> pg_shadow
> pg_group

If that were so, we'd not have a problem. The reason we have to tread
very carefully is that we do not know what tables/indexes users might
have added to template1. If we copy a text index into a new database
and claim that it is sorted by some new locale, we'd be breaking things.

In any case, the whole idea is substantially inferior to the correct
solution, which is per-column locale settings within databases. That
does what we want, is required functionality per SQL spec, and avoids
problems during CREATE DATABASE. It's just a tad harder to do :-(

regards, tom lane


From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Dennis Bjorklund <db(at)zigo(dot)dhs(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: locale
Date: 2004-04-07 19:14:43
Message-ID: 200404071914.i37JEhG06512@candle.pha.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Tom Lane wrote:
> Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> writes:
> > Added to TODO:
> > * Allow locale to be set at database creation
>
> BTW, that is redundant with the locale todo items already present.

I see:

* Allow locale to be set at database creation
* Allow locale on a per-column basis, default to ASCII

The first seems easier than the second.

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073


From: Dennis Bjorklund <db(at)zigo(dot)dhs(dot)org>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: locale
Date: 2004-04-07 19:14:57
Message-ID: Pine.LNX.4.44.0404072107590.4551-100000@zigo.dhs.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, 7 Apr 2004, Tom Lane wrote:

> If that were so, we'd not have a problem. The reason we have to tread
> very carefully is that we do not know what tables/indexes users might
> have added to template1.

Aah, now I see the real problem!

> If we copy a text index into a new database and claim that it is sorted
> by some new locale, we'd be breaking things.

How is this handled for encodings? You can very well have something in
template1 in an encoding that is not compatible with the encoding you use
to create a new database.

Right now I can't imagine how that was solved.

> In any case, the whole idea is substantially inferior to the correct
> solution, which is per-column locale settings within databases.

Of course, but that solution might be many years ahead. Had it been fairly
easy to create a database with a different locale it would have been
worth it (and still is if one could come up with some solution).

I have a number of different data directories with different locales, and
add to that a number of different versions of pg and you can imagine
what it looks like when I run ps :-)

--
/Dennis Björklund


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: locale
Date: 2004-04-07 19:36:44
Message-ID: 4074584C.8020007@dunslane.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Tom Lane wrote:

>Dennis Bjorklund <db(at)zigo(dot)dhs(dot)org> writes:
>
>
>>From what I can tell there is only 3 tables we talk about:
>> pg_database
>> pg_shadow
>> pg_group
>>
>>
>
>If that were so, we'd not have a problem. The reason we have to tread
>very carefully is that we do not know what tables/indexes users might
>have added to template1. If we copy a text index into a new database
>and claim that it is sorted by some new locale, we'd be breaking things.
>
>

Wouldn't reindex correct that? If so, it could be forced with a flag on
"create database" maybe, or else some test to compare the two locale
settings and force it if necessary?

>In any case, the whole idea is substantially inferior to the correct
>solution, which is per-column locale settings within databases. That
>does what we want, is required functionality per SQL spec, and avoids
>problems during CREATE DATABASE. It's just a tad harder to do :-(
>
>

Yeah. But everything higher than the table level can surely be finessed
with differrent locations / databases. Not having this right (i.e. at
the column level) is a great pity, to say the least.

cheers

andrew


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Dennis Bjorklund <db(at)zigo(dot)dhs(dot)org>
Cc: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: locale
Date: 2004-04-07 19:40:57
Message-ID: 26889.1081366857@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Dennis Bjorklund <db(at)zigo(dot)dhs(dot)org> writes:
> On Wed, 7 Apr 2004, Tom Lane wrote:
>> If we copy a text index into a new database and claim that it is sorted
>> by some new locale, we'd be breaking things.

> How is this handled for encodings? You can very well have something in
> template1 in an encoding that is not compatible with the encoding you use
> to create a new database.

This is likely broken; but that's no excuse for creating similar
breakage for locale settings. Note that Peter's planned project would
hopefully clean up both of these issues.

In practice, we know that we have seen index failures from altering the
locale settings (back before we installed the code that locks down
LC_COLLATE/LC_CTYPE at initdb time). I do not recall having heard any
reports of index failures that could be traced to changing encoding.
This may be because strcoll() derives its assumptions about encoding
from the LC_CTYPE setting and doesn't actually know what PG thinks the
encoding is. So you might have a stored string that is illegal per the
current encoding, but nonetheless it will sort the same as it did in the
mother database.

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Dennis Bjorklund <db(at)zigo(dot)dhs(dot)org>
Cc: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: locale
Date: 2004-04-07 20:03:03
Message-ID: 27116.1081368183@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Dennis Bjorklund <db(at)zigo(dot)dhs(dot)org> writes:
> On Wed, 7 Apr 2004, Tom Lane wrote:
>> In any case, the whole idea is substantially inferior to the correct
>> solution, which is per-column locale settings within databases.

> Of course, but that solution might be many years ahead.

Peter E. seems to think that it's not an infeasible amount of work.
(See previous discussion that he mentioned earlier in this thread.)

Basically, I'd rather see us tackle that than expend effort on
kluging CREATE DATABASE to allow per-database locales.

regards, tom lane


From: Dennis Bjorklund <db(at)zigo(dot)dhs(dot)org>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: locale
Date: 2004-04-07 20:55:24
Message-ID: Pine.LNX.4.44.0404072246320.4551-100000@zigo.dhs.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, 7 Apr 2004, Tom Lane wrote:

> >> solution, which is per-column locale settings within databases.
>
> > Of course, but that solution might be many years ahead.
>
> Peter E. seems to think that it's not an infeasible amount of work.
> (See previous discussion that he mentioned earlier in this thread.)

I don't know how it should work in theory yet, much less what an
implementation would look like.

What happens when you have two columns with different locales and try to
compare them with with the operator <. Is the locale part of the string
type, like text(at)sv_SE(dot)UTF-8(dot) What does that do to overloaded functions.
What would happen when a locale and an encoding does not match. Should one
just assume that it wont happen.

I've got lots of questions like that, some are probably answered by the
sql standard and others maybe don't have an answer.

> Basically, I'd rather see us tackle that than expend effort on
> kluging CREATE DATABASE to allow per-database locales.

Don't think for a second that I don't want this. You are an american that
live in a ASCII world and you wants this. You can not imagine how much I
want it :-)

--
/Dennis Björklund


From: Tatsuo Ishii <t-ishii(at)sra(dot)co(dot)jp>
To: db(at)zigo(dot)dhs(dot)org
Cc: tgl(at)sss(dot)pgh(dot)pa(dot)us, pgman(at)candle(dot)pha(dot)pa(dot)us, pgsql-hackers(at)postgresql(dot)org
Subject: Re: locale
Date: 2004-04-07 23:22:37
Message-ID: 20040408.082237.74752799.t-ishii@sra.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> On Wed, 7 Apr 2004, Tom Lane wrote:
>
> > If that were so, we'd not have a problem. The reason we have to tread
> > very carefully is that we do not know what tables/indexes users might
> > have added to template1.
>
> Aah, now I see the real problem!
>
> > If we copy a text index into a new database and claim that it is sorted
> > by some new locale, we'd be breaking things.
>
> How is this handled for encodings? You can very well have something in
> template1 in an encoding that is not compatible with the encoding you use
> to create a new database.

Are you talking about the sort order? Then there's no problem with
encoding itself.
--
Tatsuo Ishii


From: Dennis Bjorklund <db(at)zigo(dot)dhs(dot)org>
To: Tatsuo Ishii <t-ishii(at)sra(dot)co(dot)jp>
Cc: tgl(at)sss(dot)pgh(dot)pa(dot)us, <pgman(at)candle(dot)pha(dot)pa(dot)us>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: locale
Date: 2004-04-08 05:09:38
Message-ID: Pine.LNX.4.44.0404080708120.4551-100000@zigo.dhs.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, 8 Apr 2004, Tatsuo Ishii wrote:

> Are you talking about the sort order? Then there's no problem with
> encoding itself.

The tables in template1 in encoding E1 are compied into the new database
in encoding E2. Not all encodings are compatable, so you can't even
convert from E1 to E2.

--
/Dennis Björklund


From: Honza Pazdziora <adelton(at)informatics(dot)muni(dot)cz>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: locale
Date: 2004-04-08 07:09:44
Message-ID: 20040408070944.GC3994@anxur.fi.muni.cz
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Apr 07, 2004 at 03:40:57PM -0400, Tom Lane wrote:
>
> In practice, we know that we have seen index failures from altering the
> locale settings (back before we installed the code that locks down
> LC_COLLATE/LC_CTYPE at initdb time). I do not recall having heard any

Cannot the same failure happen if one upgrades their glibc / locales
and the new version implements the locale differently? Perhaps fixing
previous bug, or simply producing different results for strcoll /
strxfrm? If PostgreSQL depends on external locale information for
something as important as indexes, shouldn't it make elementary checks
(upon startup, perhaps) that the current locale settings and the
current locale version produces results compatible with the existing
indexes? And if it does not, reindex?

--
------------------------------------------------------------------------
Honza Pazdziora | adelton(at)fi(dot)muni(dot)cz | http://www.fi.muni.cz/~adelton/
.project: Perl, mod_perl, DBI, Oracle, large Web systems, XML/XSL, ...
Only self-confident people can be simple.


From: Tatsuo Ishii <t-ishii(at)sra(dot)co(dot)jp>
To: db(at)zigo(dot)dhs(dot)org
Cc: tgl(at)sss(dot)pgh(dot)pa(dot)us, pgman(at)candle(dot)pha(dot)pa(dot)us, pgsql-hackers(at)postgresql(dot)org
Subject: Re: locale
Date: 2004-04-08 07:22:10
Message-ID: 20040408.162210.48533966.t-ishii@sra.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> > Are you talking about the sort order? Then there's no problem with
> > encoding itself.
>
> The tables in template1 in encoding E1 are compied into the new database
> in encoding E2. Not all encodings are compatable, so you can't even
> convert from E1 to E2.

In this case you just set your terminal encoding to E1, then SELECT
the table. Point is you do not use set client_encoding or \encoding
command. This will work as long as both E1 and E2 are single byte
encodings.
--
Tatsuo Ishii


From: Dennis Bjorklund <db(at)zigo(dot)dhs(dot)org>
To: Tatsuo Ishii <t-ishii(at)sra(dot)co(dot)jp>
Cc: tgl(at)sss(dot)pgh(dot)pa(dot)us, <pgman(at)candle(dot)pha(dot)pa(dot)us>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: locale
Date: 2004-04-08 07:31:57
Message-ID: Pine.LNX.4.44.0404080922421.4551-100000@zigo.dhs.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, 8 Apr 2004, Tatsuo Ishii wrote:

> > The tables in template1 in encoding E1 are compied into the new database
> > in encoding E2. Not all encodings are compatable, so you can't even
> > convert from E1 to E2.
>
> In this case you just set your terminal encoding to E1, then SELECT
> the table. Point is you do not use set client_encoding or \encoding
> command. This will work as long as both E1 and E2 are single byte
> encodings.

That is not a solution.

As you said, it does not even work for all encodings. If the database is
in Latin1 I'd expect that the strings in the table are just latin1 and not
something else. And for some multibyte encodings that something else might
not just be the wrong characters but an invalid string (think utf-8).

I can also imagine the indexes being wrong when you keep the encoding of
tables when you create a new database. Since the same character can be
represented differently, the sort order also changes if you try to
interpret something with another encoding then what the compare operator
think it is. That makes the index invalid.

It's simply broken if you ask me.

--
/Dennis Björklund


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Dennis Bjorklund <db(at)zigo(dot)dhs(dot)org>
Cc: Tatsuo Ishii <t-ishii(at)sra(dot)co(dot)jp>, pgman(at)candle(dot)pha(dot)pa(dot)us, pgsql-hackers(at)postgresql(dot)org
Subject: Re: locale
Date: 2004-04-08 13:56:06
Message-ID: 13283.1081432566@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Dennis Bjorklund <db(at)zigo(dot)dhs(dot)org> writes:
> I can also imagine the indexes being wrong when you keep the encoding of
> tables when you create a new database. Since the same character can be
> represented differently, the sort order also changes if you try to
> interpret something with another encoding then what the compare operator
> think it is. That makes the index invalid.

See my previous point: the index does not actually fail, in our current
implementation, because strcoll() is unaffected by the database's
encoding setting. You'd be likely to have trouble with I/O translation
and with other encoding-dependent operations like upper()/lower() ...
but not with indexes.

> It's simply broken if you ask me.

It's certainly ungood, but I don't think we can materially improve
things without a fundamental rewrite along the lines of Peter's proposal
to support per-column locale/encoding. Database-level settings are
simply the wrong tool for this.

regards, tom lane


From: Dennis Bjorklund <db(at)zigo(dot)dhs(dot)org>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Tatsuo Ishii <t-ishii(at)sra(dot)co(dot)jp>, <pgman(at)candle(dot)pha(dot)pa(dot)us>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: locale
Date: 2004-04-08 14:16:11
Message-ID: Pine.LNX.4.44.0404081559430.4551-100000@zigo.dhs.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, 8 Apr 2004, Tom Lane wrote:

> See my previous point: the index does not actually fail, in our current
> implementation, because strcoll() is unaffected by the database's
> encoding setting.

How can it be? If I have a utf-8 template1 and a table with an index
sorted according to the utf-8 characters in some locale. Then this table
and index is copied into a Latin1 database. When I interpret these bytes
as Latin1 in the index, the ordering does not have to be the same as it
was before and the index can not be used.

I don't understand what you mean when you say that strcoll() is unaffected
by the database's encoding setting. It interprets characters, how can it
not be?

If it works it must be something more going on that I don't
know/understand yet. If I am I would be happy to be corrected, if not we
have a more broken system then we expected before.

The objection to a per database locale is that we can not copy a table
from the template into the database since the index would not be valid
anymore. To me that is solvable by just reindexing. The current problem
with encodings does not look solvable at all to me (except to not copy
tables when we can not reencode the strings).

--
/Dennis Björklund


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Dennis Bjorklund <db(at)zigo(dot)dhs(dot)org>
Cc: Tatsuo Ishii <t-ishii(at)sra(dot)co(dot)jp>, pgman(at)candle(dot)pha(dot)pa(dot)us, pgsql-hackers(at)postgresql(dot)org
Subject: Re: locale
Date: 2004-04-08 14:39:29
Message-ID: 13779.1081435169@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Dennis Bjorklund <db(at)zigo(dot)dhs(dot)org> writes:
> On Thu, 8 Apr 2004, Tom Lane wrote:
>> See my previous point: the index does not actually fail, in our current
>> implementation, because strcoll() is unaffected by the database's
>> encoding setting.

> How can it be? If I have a utf-8 template1 and a table with an index
> sorted according to the utf-8 characters in some locale. Then this table
> and index is copied into a Latin1 database. When I interpret these bytes
> as Latin1 in the index, the ordering does not have to be the same as it
> was before and the index can not be used.

No, the ordering *will* be the same as it was before, because strcoll()
is still functioning the same. You'd get the same answer from a sort
operation since it depends on the same operators.

Now, you will probably complain that the sort order doesn't appear
correct according to your Latin1 interpretation --- and you're right.
But the index is not corrupt, it is still consistent in its own terms.

> I don't understand what you mean when you say that strcoll() is unaffected
> by the database's encoding setting. It interprets characters, how can it
> not be?

It interprets them according to LC_CTYPE, which does not change.

regards, tom lane


From: Dennis Bjorklund <db(at)zigo(dot)dhs(dot)org>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Tatsuo Ishii <t-ishii(at)sra(dot)co(dot)jp>, <pgman(at)candle(dot)pha(dot)pa(dot)us>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: locale
Date: 2004-04-08 15:31:59
Message-ID: Pine.LNX.4.44.0404081729510.4551-100000@zigo.dhs.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, 8 Apr 2004, Tom Lane wrote:

> No, the ordering *will* be the same as it was before, because strcoll()
> is still functioning the same. You'd get the same answer from a sort
> operation since it depends on the same operators.
>
> It interprets them according to LC_CTYPE, which does not change.

I'm afraid that I don't understand you yet, and would like to have
it explained in more detail if possible. While I feel a bit stupid to not
understand what you are stating, but I'm sure there are more then me who
feels like that :-)

Maybe we can look at an example. Let us take some utf-8 strings correctly
ordered in swedish

Åke
Ära

now, since these are utf-8 they are encoded as

c3 85 6b 65 (Åke)
c3 84 72 61 (Ära)

and that is the order they have in the index.

Now, this index is copied into a new database where
the encoding is Latin1. Now we want to in the above table
lookup the string that in Latin1 is represented as

c3 84 72 61

So we look in the index and see that the first row in the index is
not the same. But, now when we compare these strings as latin1 strings
it's no longer the case that c3 84 72 61 > c3 85 6b 65. As latin1 strings
we compare each character and c3 = c3, and then 84 < 85 (in latin1 84
and 85 are some control characters). Se, we will not find this string
in the index since we think it should have been before the first entry.

We might even insert a new copy of this string in another
position in the index.

So, my question is.

a) What have we gained by copying this table into the latin1 database.
It looks broken to me. As far as I understand we have to rebuild
the index to get something that works at least a little.

b) Maybe one should not just reindex but reencode. In some cases that
works and produces good result. For example from latin1 to utf-8.

c) if we are going to reindex anyway, then why not do that and solve the
per database locale also. This is an independent point from a) and b)
that I still want to understand the first two points even if we don't
talk about per database locale.

--
/Dennis Björklund


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Dennis Bjorklund <db(at)zigo(dot)dhs(dot)org>
Cc: Tatsuo Ishii <t-ishii(at)sra(dot)co(dot)jp>, pgman(at)candle(dot)pha(dot)pa(dot)us, pgsql-hackers(at)postgresql(dot)org
Subject: Re: locale
Date: 2004-04-08 15:47:25
Message-ID: 14589.1081439245@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Dennis Bjorklund <db(at)zigo(dot)dhs(dot)org> writes:
> On Thu, 8 Apr 2004, Tom Lane wrote:
>> No, the ordering *will* be the same as it was before, because strcoll()
>> is still functioning the same. You'd get the same answer from a sort
>> operation since it depends on the same operators.

> But, now when we compare these strings as latin1 strings
> it's no longer the case that c3 84 72 61 > c3 85 6b 65. As latin1 strings
> we compare each character and c3 = c3, and then 84 < 85 (in latin1 84
> and 85 are some control characters).

You're missing the point: strcoll() is not going to compare them as
latin1 strings. It's going to interpret the bytes as utf-8 strings,
because that's what LC_CTYPE will tell it to do. So the sort ordering
of any particular byte string remains the same as it was before, and
the index does not become corrupt.

Whether the index is delivering answers that you find useful is a whole
different question ;-). For example, if you do a "WHERE col = 'foo'"
type of query, you'll be presenting the latin1 encoding of 'foo', which
may well not equal the utf-8 encoding of 'foo', meaning you won't find
that row even if it exists. However this would be true whether you used
the index or not --- it's really a data failure and not an index failure.

> a) What have we gained by copying this table into the latin1 database.
> It looks broken to me.

It looks broken to me too, in terms of user functionality. I was simply
responding to your assertion that the indexes will be corrupt. They
won't be.

AFAICS, to support per-database encoding and locale correctly, CREATE
DATABASE would have to be prepared to re-encode *and* re-index every
textual column in the copied database. I don't really foresee us going
to that much work in order to have a solution that's still half-baked
and non-spec-compliant. It's much more likely that per-column locale
and encoding will get done instead.

regards, tom lane


From: Dennis Bjorklund <db(at)zigo(dot)dhs(dot)org>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Tatsuo Ishii <t-ishii(at)sra(dot)co(dot)jp>, <pgman(at)candle(dot)pha(dot)pa(dot)us>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: locale
Date: 2004-04-08 19:19:32
Message-ID: Pine.LNX.4.44.0404082109340.4551-100000@zigo.dhs.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, 8 Apr 2004, Tom Lane wrote:

> You're missing the point: strcoll() is not going to compare them as
> latin1 strings. It's going to interpret the bytes as utf-8 strings,
> because that's what LC_CTYPE will tell it to do.

My current understanding of what you are saying now is that LC_CTYPE is
always UTF-8 and all comparisons in the new database are going to be
wrong. This since all strings will be compared as if they where UTF-8.
LC_CTYPE is per cluster and not per database as some of the other LC_xxxx.

Yes, this actually makes sense. I really hope that this is the way it work
because I think I can understand this. I don't like it, but I can
understand what pg currently do, which is good (unless pg does something
else :-)

Thanks for the explanation.

--
/Dennis Björklund


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Dennis Bjorklund <db(at)zigo(dot)dhs(dot)org>
Cc: Tatsuo Ishii <t-ishii(at)sra(dot)co(dot)jp>, pgman(at)candle(dot)pha(dot)pa(dot)us, pgsql-hackers(at)postgresql(dot)org
Subject: Re: locale
Date: 2004-04-09 03:25:01
Message-ID: 20298.1081481101@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Dennis Bjorklund <db(at)zigo(dot)dhs(dot)org> writes:
> LC_CTYPE is per cluster and not per database as some of the other LC_xxxx.

Yup, exactly. If we did not force both LC_COLLATE and LC_CTYPE to have
the same values cluster-wide, then we *would* have index corruption
issues.

regards, tom lane


From: Dennis Bjorklund <db(at)zigo(dot)dhs(dot)org>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Tatsuo Ishii <t-ishii(at)sra(dot)co(dot)jp>, <pgman(at)candle(dot)pha(dot)pa(dot)us>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: locale
Date: 2004-04-09 05:18:33
Message-ID: Pine.LNX.4.44.0404090712480.4551-100000@zigo.dhs.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, 8 Apr 2004, Tom Lane wrote:

> Yup, exactly. If we did not force both LC_COLLATE and LC_CTYPE to have
> the same values cluster-wide, then we *would* have index corruption
> issues.

We really show warn people that using another encoding in a database then
what the cluster uses, breaks sorting.

I was under the impression that as long as I've set the right locale when
doing initdb I could then create different databases with different
encodings and it all works, but it does not. I simply trust pg too much
(not without reason since it is an amazing project).

--
/Dennis Björklund


From: Peter Eisentraut <peter_e(at)gmx(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Dennis Bjorklund <db(at)zigo(dot)dhs(dot)org>
Cc: Tatsuo Ishii <t-ishii(at)sra(dot)co(dot)jp>, pgman(at)candle(dot)pha(dot)pa(dot)us, pgsql-hackers(at)postgresql(dot)org
Subject: Re: locale
Date: 2004-04-12 17:31:39
Message-ID: 200404121931.39344.peter_e@gmx.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Tom Lane wrote:
> It's certainly ungood, but I don't think we can materially improve
> things without a fundamental rewrite along the lines of Peter's
> proposal to support per-column locale/encoding. Database-level
> settings are simply the wrong tool for this.

Well, the complete redo is about two years out if you ask me. Allowing
the locale to be set on a database level would already improve things a
lot for many people. Since we have a perfectly good reindex command, I
think the problems that we have discussed are not showstoppers.


From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: Peter Eisentraut <peter_e(at)gmx(dot)net>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Dennis Bjorklund <db(at)zigo(dot)dhs(dot)org>, Tatsuo Ishii <t-ishii(at)sra(dot)co(dot)jp>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: locale
Date: 2004-04-12 18:02:26
Message-ID: 200404121802.i3CI2Q719051@candle.pha.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Peter Eisentraut wrote:
> Tom Lane wrote:
> > It's certainly ungood, but I don't think we can materially improve
> > things without a fundamental rewrite along the lines of Peter's
> > proposal to support per-column locale/encoding. Database-level
> > settings are simply the wrong tool for this.
>
> Well, the complete redo is about two years out if you ask me. Allowing
> the locale to be set on a database level would already improve things a
> lot for many people. Since we have a perfectly good reindex command, I
> think the problems that we have discussed are not showstoppers.

I added a TODO item for fixing per-database locales, so we are ready if
someone wants to code it:

o Allow locale to be set at database creation

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073