Re: WIP patch: Collation support

Lists: pgsql-hackers
From: "Radek Strnad" <radek(dot)strnad(at)gmail(dot)com>
To: pgsql-hackers(at)postgresql(dot)org, "Radek Strnad" <radek(dot)strnad(at)gmail(dot)com>
Subject: WIP patch: Collation support
Date: 2008-08-31 21:32:30
Message-ID: de5165440808311432t5618efb5j2f7b3c0f62da8c01@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

I'm implementing collation support. Proposal can be found at
http://archives.postgresql.org/pgsql-hackers/2008-07/msg00557.php

Progress so far:
- created catalogs pg_collation a pg_charset which are filled with three
standard collations
- initdb changes rows called "DEFAULT" in both catalogs during the bki
bootstrap phase with current system LC_COLLATE and LC_CTYPE or those set by
command line.
- new collations can be defined with command CREATE COLLATION <collation
name> FOR <character set specification> FROM <existing collation name>
[STRCOLFN <fn name>]
[ <pad characteristic> ] [ <case sensitive> ] [ LCCOLLATE <lc_collate> ] [
LCCTYPE <lc_ctype> ]
- because of pg_collation and pg_charset are catalogs individual for each
database, if you want to create a database with collation other than
specified, create it in template1 and then create database
- when connecting to database, it retrieves locales from pg_database and
sets them

Design & functionality changes left:
- move retrieveing collation from pg_database to pg_type
- get case sensitivity and pad characteristic working
- wrap locale setting and text comparison into functions that will be
referenced from pg_collation as regproc (make the code nice)
- when creating database with different collation than database cluster, the
database has to be reindexed. Any idea how to do it? Function
ReindexDatabase works only when database is opened.
- minor tweaks & fixes

Ideas? Comments?

Regards

Radek Strnad

Attachment Content-Type Size
wip-collation.patch text/x-diff 79.9 KB

From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Radek Strnad <radek(dot)strnad(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: WIP patch: Collation support
Date: 2008-08-31 22:08:40
Message-ID: 20080831220840.GC3856@alvh.no-ip.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Radek Strnad escribió:

> - when creating database with different collation than database cluster, the
> database has to be reindexed. Any idea how to do it? Function
> ReindexDatabase works only when database is opened.

We have this Todo item:

Set proper permissions on non-system schemas during db creation
Currently all schemas are owned by the super-user because they are
copied from the template1 database. However, since all objects are
inherited from the template database, it is not clear that setting
schemas to the db owner is correct.

When this was discussed years ago, one proposed idea was that on the
first connection the backend should, as a first task, ensure that
certain administrative chores be done. The first of those was changing
the ownership of schemas. It sounds like this reindexing you propose
falls into the same category.

--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


From: "Radek Strnad" <radek(dot)strnad(at)gmail(dot)com>
To: "Alvaro Herrera" <alvherre(at)commandprompt(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: WIP patch: Collation support
Date: 2008-09-01 09:01:39
Message-ID: de5165440809010201q24129262v807adef65b613752@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Ok, so do you suggest to leave it with a notice "reindex database" or start
to solve it somehow?

Regards

Radek Strnad

On Mon, Sep 1, 2008 at 12:08 AM, Alvaro Herrera
<alvherre(at)commandprompt(dot)com>wrote:

> Radek Strnad escribió:
>
> > - when creating database with different collation than database cluster,
> the
> > database has to be reindexed. Any idea how to do it? Function
> > ReindexDatabase works only when database is opened.
>
> We have this Todo item:
>
> Set proper permissions on non-system schemas during db creation
> Currently all schemas are owned by the super-user because they are
> copied from the template1 database. However, since all objects are
> inherited from the template database, it is not clear that setting
> schemas to the db owner is correct.
>
> When this was discussed years ago, one proposed idea was that on the
> first connection the backend should, as a first task, ensure that
> certain administrative chores be done. The first of those was changing
> the ownership of schemas. It sounds like this reindexing you propose
> falls into the same category.
>
> --
> Alvaro Herrera
> http://www.CommandPrompt.com/
> PostgreSQL Replication, Consulting, Custom Development, 24x7 support
>


From: Peter Eisentraut <peter_e(at)gmx(dot)net>
To: Radek Strnad <radek(dot)strnad(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: WIP patch: Collation support
Date: 2008-09-02 11:50:47
Message-ID: 48BD2897.1020403@gmx.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Radek Strnad wrote:
> - new collations can be defined with command CREATE COLLATION <collation
> name> FOR <character set specification> FROM <existing collation name>
> [STRCOLFN <fn name>]
> [ <pad characteristic> ] [ <case sensitive> ] [ LCCOLLATE <lc_collate> ]
> [ LCCTYPE <lc_ctype> ]

How do you plan to make a collation case sensitive or accent sensitive?
I have previously commented that this is not a realistic view on how
collations work. Since you are apparently planning to use the system
locales, I don't see how you can make this work.


From: Martijn van Oosterhout <kleptog(at)svana(dot)org>
To: Peter Eisentraut <peter_e(at)gmx(dot)net>
Cc: Radek Strnad <radek(dot)strnad(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: WIP patch: Collation support
Date: 2008-09-02 12:00:08
Message-ID: 20080902120008.GE3664@svana.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Sep 02, 2008 at 02:50:47PM +0300, Peter Eisentraut wrote:
> Radek Strnad wrote:
> >- new collations can be defined with command CREATE COLLATION <collation
> >name> FOR <character set specification> FROM <existing collation name>
> >[STRCOLFN <fn name>]
> >[ <pad characteristic> ] [ <case sensitive> ] [ LCCOLLATE <lc_collate> ]
> >[ LCCTYPE <lc_ctype> ]
>
> How do you plan to make a collation case sensitive or accent sensitive?
> I have previously commented that this is not a realistic view on how
> collations work. Since you are apparently planning to use the system
> locales, I don't see how you can make this work.

While it's true POSIX locales don't handle this, other collation
libraries do and we should support them if the user wants.

Have a nice day,
--
Martijn van Oosterhout <kleptog(at)svana(dot)org> http://svana.org/kleptog/
> Please line up in a tree and maintain the heap invariant while
> boarding. Thank you for flying nlogn airlines.


From: "Radek Strnad" <radek(dot)strnad(at)gmail(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: WIP patch: Collation support
Date: 2008-09-02 12:02:42
Message-ID: de5165440809020502g191523c1uf4c5f09a22d84a0c@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

I think at least case sensitivity can be done by comparing two strings
converted to upper case with toupper() function.

Regards

Radek Strnad

>
> On Tue, Sep 2, 2008 at 2:00 PM, Martijn van Oosterhout <kleptog(at)svana(dot)org>wrote:
>
>> On Tue, Sep 02, 2008 at 02:50:47PM +0300, Peter Eisentraut wrote:
>> > Radek Strnad wrote:
>> > >- new collations can be defined with command CREATE COLLATION
>> <collation
>> > >name> FOR <character set specification> FROM <existing collation name>
>> > >[STRCOLFN <fn name>]
>> > >[ <pad characteristic> ] [ <case sensitive> ] [ LCCOLLATE <lc_collate>
>> ]
>> > >[ LCCTYPE <lc_ctype> ]
>> >
>> > How do you plan to make a collation case sensitive or accent sensitive?
>> > I have previously commented that this is not a realistic view on how
>> > collations work. Since you are apparently planning to use the system
>> > locales, I don't see how you can make this work.
>>
>> While it's true POSIX locales don't handle this, other collation
>> libraries do and we should support them if the user wants.
>>
>> Have a nice day,
>> --
>> Martijn van Oosterhout <kleptog(at)svana(dot)org> http://svana.org/kleptog/
>> > Please line up in a tree and maintain the heap invariant while
>> > boarding. Thank you for flying nlogn airlines.
>>
>> -----BEGIN PGP SIGNATURE-----
>> Version: GnuPG v1.4.6 (GNU/Linux)
>>
>> iD8DBQFIvSrIIB7bNG8LQkwRAnkWAJ9FaiR9cOHFN2vkVmQaK5y7N9OJoQCbB+Ks
>> e0E4722hY/Q+Cz8tpzA0CGs=
>> =2Svh
>> -----END PGP SIGNATURE-----
>>
>>
>


From: Peter Eisentraut <peter_e(at)gmx(dot)net>
To: Martijn van Oosterhout <kleptog(at)svana(dot)org>
Cc: Radek Strnad <radek(dot)strnad(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: WIP patch: Collation support
Date: 2008-09-02 13:46:16
Message-ID: 48BD43A8.50807@gmx.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Martijn van Oosterhout wrote:
> On Tue, Sep 02, 2008 at 02:50:47PM +0300, Peter Eisentraut wrote:
>> Radek Strnad wrote:
>>> - new collations can be defined with command CREATE COLLATION <collation
>>> name> FOR <character set specification> FROM <existing collation name>
>>> [STRCOLFN <fn name>]
>>> [ <pad characteristic> ] [ <case sensitive> ] [ LCCOLLATE <lc_collate> ]
>>> [ LCCTYPE <lc_ctype> ]
>> How do you plan to make a collation case sensitive or accent sensitive?
>> I have previously commented that this is not a realistic view on how
>> collations work. Since you are apparently planning to use the system
>> locales, I don't see how you can make this work.
>
> While it's true POSIX locales don't handle this, other collation
> libraries do and we should support them if the user wants.

Do they handle exactly those two attributes specifically? Can you point
out references? Or do you mean, other collation libraries allow their
collations to be configured/customized? I think linguistically it is a
very narrow view of the world to hardcode those two attributes.


From: Martijn van Oosterhout <kleptog(at)svana(dot)org>
To: Peter Eisentraut <peter_e(at)gmx(dot)net>
Cc: Radek Strnad <radek(dot)strnad(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: WIP patch: Collation support
Date: 2008-09-02 15:57:27
Message-ID: 20080902155727.GF3664@svana.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Sep 02, 2008 at 04:46:16PM +0300, Peter Eisentraut wrote:
> >While it's true POSIX locales don't handle this, other collation
> >libraries do and we should support them if the user wants.
>
> Do they handle exactly those two attributes specifically? Can you point
> out references? Or do you mean, other collation libraries allow their
> collations to be configured/customized? I think linguistically it is a
> very narrow view of the world to hardcode those two attributes.

Well, yes. Accents and case are attributes of a character. (I'm using
the unicode model here). So, to do a case insensetive match you take
the characters, strip the attributes and then do the comparison. There
are specialised routines which handle the denormalisation of the string
for you so in theory you could even get specific about which accents
you ignore. In practice I don't think people do that.

Have a nice day,
--
Martijn van Oosterhout <kleptog(at)svana(dot)org> http://svana.org/kleptog/
> Please line up in a tree and maintain the heap invariant while
> boarding. Thank you for flying nlogn airlines.


From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: Martijn van Oosterhout <kleptog(at)svana(dot)org>
Cc: Peter Eisentraut <peter_e(at)gmx(dot)net>, Radek Strnad <radek(dot)strnad(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: WIP patch: Collation support
Date: 2008-09-02 16:42:13
Message-ID: 877i9usdne.fsf@oxford.xeocode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


Martijn van Oosterhout <kleptog(at)svana(dot)org> writes:

> On Tue, Sep 02, 2008 at 04:46:16PM +0300, Peter Eisentraut wrote:
>> >While it's true POSIX locales don't handle this, other collation
>> >libraries do and we should support them if the user wants.

I think that's backwards. We have to go with the lowest common denominator
functionality of those libraries if we're going to be portable. As long as
it's a superset of the SQL standard functionality. If we support features of
some of them that can't be emulated with others then users end up with SQL
code that will only work on some builds and not others. That might be worth it
for some features but I'm not sure this is one.

> Well, yes. Accents and case are attributes of a character. (I'm using
> the unicode model here). So, to do a case insensetive match you take
> the characters, strip the attributes and then do the comparison. There
> are specialised routines which handle the denormalisation of the string
> for you so in theory you could even get specific about which accents
> you ignore. In practice I don't think people do that.

I don't think composable unicode characters are really about collations. I
think it had more to do with representing glyphs in UTF32 before they gave up
on that. Does anyone still use composable characters?

Note that we don't currently support composable characters at all. I'm not
sure if that's a "nobody really cares" issue or a bug we should aim to fix
with real collation support.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's Slony Replication support!


From: Martijn van Oosterhout <kleptog(at)svana(dot)org>
To: Gregory Stark <stark(at)enterprisedb(dot)com>
Cc: Peter Eisentraut <peter_e(at)gmx(dot)net>, Radek Strnad <radek(dot)strnad(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: WIP patch: Collation support
Date: 2008-09-02 17:36:20
Message-ID: 20080902173620.GG3664@svana.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Sep 02, 2008 at 05:42:13PM +0100, Gregory Stark wrote:
> Martijn van Oosterhout <kleptog(at)svana(dot)org> writes:
> > On Tue, Sep 02, 2008 at 04:46:16PM +0300, Peter Eisentraut wrote:
> >> >While it's true POSIX locales don't handle this, other collation
> >> >libraries do and we should support them if the user wants.
>
> I think that's backwards. We have to go with the lowest common denominator
> functionality of those libraries if we're going to be portable.

And I think that's backwards. Why can we only use a feature once every
OS out there implements it? We still run on systems that don't have SSL
support. LC_TYPE settings are not portable between systems, yet that
doesn't bother anyone. Why should we have a problem with collate
settings not being portable?

> I don't think composable unicode characters are really about collations. I
> think it had more to do with representing glyphs in UTF32 before they gave up
> on that. Does anyone still use composable characters?

Lookup the various normalisations forms:
http://en.wikipedia.org/wiki/Unicode_normalization
In particular Normal Form D.

Sure, composable characters have nothing to do with collation, but they
provide a uniform way of doing accent insensetive collation.

> Note that we don't currently support composable characters at all.

Any character which is an accent on a latin character is a decomposable
character. And last I checked we supported those.

Have a niceday,
--
Martijn van Oosterhout <kleptog(at)svana(dot)org> http://svana.org/kleptog/
> Please line up in a tree and maintain the heap invariant while
> boarding. Thank you for flying nlogn airlines.


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Radek Strnad <radek(dot)strnad(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: WIP patch: Collation support
Date: 2008-09-02 18:47:08
Message-ID: 20080902184708.GK12610@alvh.no-ip.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Radek Strnad escribió:
> Ok, so do you suggest to leave it with a notice "reindex database" or start
> to solve it somehow?

I don't know. If there are two tasks that need the same treatment, it
seems a safe conclusion that they need a common solution.

--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Martijn van Oosterhout <kleptog(at)svana(dot)org>
Cc: Gregory Stark <stark(at)enterprisedb(dot)com>, Peter Eisentraut <peter_e(at)gmx(dot)net>, Radek Strnad <radek(dot)strnad(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: WIP patch: Collation support
Date: 2008-09-02 19:01:53
Message-ID: 2801.1220382113@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Martijn van Oosterhout <kleptog(at)svana(dot)org> writes:
> And I think that's backwards. Why can we only use a feature once every
> OS out there implements it? We still run on systems that don't have SSL
> support. LC_TYPE settings are not portable between systems, yet that
> doesn't bother anyone. Why should we have a problem with collate
> settings not being portable?

If we're going to approach it that way, we need a syntax for CREATE
COLLATION that doesn't hard-wire what the possible collation modifiers
are.

regards, tom lane


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Radek Strnad <radek(dot)strnad(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: WIP patch: Collation support
Date: 2008-09-10 08:29:14
Message-ID: 48C7855A.40605@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Radek Strnad wrote:
> Progress so far:
> - created catalogs pg_collation a pg_charset which are filled with three
> standard collations
> - initdb changes rows called "DEFAULT" in both catalogs during the bki
> bootstrap phase with current system LC_COLLATE and LC_CTYPE or those set by
> command line.
> - new collations can be defined with command CREATE COLLATION <collation
> name> FOR <character set specification> FROM <existing collation name>
> [STRCOLFN <fn name>]
> [ <pad characteristic> ] [ <case sensitive> ] [ LCCOLLATE <lc_collate> ] [
> LCCTYPE <lc_ctype> ]
> - because of pg_collation and pg_charset are catalogs individual for each
> database, if you want to create a database with collation other than
> specified, create it in template1 and then create database

I have to wonder, is all that really necessary? The feature you're
trying to implement is to support database-level collation at first, and
perhaps column-level collation later. We don't need support for
user-defined collations and charsets for that.

If leave all that out of the patch for now, we'll have a much slimmer,
and just as useful patch, implementing database-level collation. We can
add those catalogs later if we need them, but I don't think there's much
point in adding all that infrastructure if they just reflect the locales
installed in the operating system.

> - when connecting to database, it retrieves locales from pg_database and
> sets them

This is the real gist of this patch.

> Design & functionality changes left:
> - move retrieveing collation from pg_database to pg_type

I don't understand this item. What will you move?

> - get case sensitivity and pad characteristic working

I feel we should leave this to the collation implementation.

> - when creating database with different collation than database cluster, the
> database has to be reindexed. Any idea how to do it? Function
> ReindexDatabase works only when database is opened.

That's a tricky one. One idea is to prohibit choosing a different
collation than the one in the template database, unless we know it's
safe to do so without reindexing. The problem is that we don't know
whether it's safe. A simple but limiting solution would be to require
that the template database has the same collation as the database that's
being created, except that template0 can always be used as template.
template0 is safe, because there's no indexes on text columns there.

Note that we already have the same problem with encodings. If you create
a database with LATIN1 encoding, load it with data, and then use that as
a template for a database with UTF-8 encoding, the text data will be
incorrectly encoded. We should probably fix that too.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Martijn van Oosterhout <kleptog(at)svana(dot)org>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Radek Strnad <radek(dot)strnad(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: WIP patch: Collation support
Date: 2008-09-10 08:48:20
Message-ID: 20080910084820.GB27812@svana.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Sep 10, 2008 at 11:29:14AM +0300, Heikki Linnakangas wrote:
> Radek Strnad wrote:
> >- because of pg_collation and pg_charset are catalogs individual for each
> >database, if you want to create a database with collation other than
> >specified, create it in template1 and then create database
>
> I have to wonder, is all that really necessary? The feature you're
> trying to implement is to support database-level collation at first, and
> perhaps column-level collation later. We don't need support for
> user-defined collations and charsets for that.

Since the set of collations isn't exactly denumerable, we need some way
to allow the user to specify the collation they want. The only
collation PostgreSQL knows about is the C collation. Anything else is
user-defined.

> >Design & functionality changes left:
> >- move retrieveing collation from pg_database to pg_type
>
> I don't understand this item. What will you move?

Long term, the collation is a property of the type, but I agree, I'm not
sure why this patch needs it.

> That's a tricky one. One idea is to prohibit choosing a different
> collation than the one in the template database, unless we know it's
> safe to do so without reindexing.

But that put us back where we started: every database having the same
collation. We're trying to move away from that. Just reindex everything
and be done with it.

> Note that we already have the same problem with encodings. If you create
> a database with LATIN1 encoding, load it with data, and then use that as
> a template for a database with UTF-8 encoding, the text data will be
> incorrectly encoded. We should probably fix that too.

I'd say forbid more than one encoding in a cluster, but that's just my
opinion :)

Have a nice day,
--
Martijn van Oosterhout <kleptog(at)svana(dot)org> http://svana.org/kleptog/
> Please line up in a tree and maintain the heap invariant while
> boarding. Thank you for flying nlogn airlines.


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Martijn van Oosterhout <kleptog(at)svana(dot)org>
Cc: Radek Strnad <radek(dot)strnad(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: WIP patch: Collation support
Date: 2008-09-10 09:51:02
Message-ID: 48C79886.9030504@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Martijn van Oosterhout wrote:
> On Wed, Sep 10, 2008 at 11:29:14AM +0300, Heikki Linnakangas wrote:
>> Radek Strnad wrote:
>>> - because of pg_collation and pg_charset are catalogs individual for each
>>> database, if you want to create a database with collation other than
>>> specified, create it in template1 and then create database
>> I have to wonder, is all that really necessary? The feature you're
>> trying to implement is to support database-level collation at first, and
>> perhaps column-level collation later. We don't need support for
>> user-defined collations and charsets for that.
>
> Since the set of collations isn't exactly denumerable, we need some way
> to allow the user to specify the collation they want. The only
> collation PostgreSQL knows about is the C collation. Anything else is
> user-defined.

Let's just use the name of the OS locale, like we do now. Having a
pg_collation catalog just moves the problem elsewhere: we'd still need
something in pg_collation to tie the collation to the OS locale.

>>> Design & functionality changes left:
>>> - move retrieveing collation from pg_database to pg_type
>> I don't understand this item. What will you move?
>
> Long term, the collation is a property of the type, ...

You might want to provide a default collation for a type as well, but
the very finest grade is that you can specify collation for every (text)
comparison operator in your query. Of course you don't want to do that
for every query, which is why we should provide defaults at different
levels: columns, tables, database. And perhaps types as well, but that's
not the most interesting case.

I'm not sure what the SQL spec says about that, but I believe it
provides syntax and rules for all that.

>> That's a tricky one. One idea is to prohibit choosing a different
>> collation than the one in the template database, unless we know it's
>> safe to do so without reindexing.
>
> But that put us back where we started: every database having the same
> collation. We're trying to move away from that. Just reindex everything
> and be done with it.

That's easier said than done, unfortunately.

>> Note that we already have the same problem with encodings. If you create
>> a database with LATIN1 encoding, load it with data, and then use that as
>> a template for a database with UTF-8 encoding, the text data will be
>> incorrectly encoded. We should probably fix that too.
>
> I'd say forbid more than one encoding in a cluster, but that's just my
> opinion :)

Yeah, that's pretty useless, at least without support for different
locales on different databases. But might as well keep it unless there's
a pressing reason to drop it.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Radek Strnad <radek(dot)strnad(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: WIP patch: Collation support
Date: 2008-09-10 10:12:05
Message-ID: 48C79D75.7020609@sun.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Heikki Linnakangas napsal(a):

>> Design & functionality changes left:
>> - move retrieveing collation from pg_database to pg_type
>

The problem there is that pg_collation is local catalog, but pg_database
is global catalog. IIRC, It was discussed during last commitfest. I
think it is bad idea to make this kind of dependency. It seems to me
better to implement something what we will use later to avoid useless
and confusing dependency.

Zdenek


From: Martijn van Oosterhout <kleptog(at)svana(dot)org>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Radek Strnad <radek(dot)strnad(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: WIP patch: Collation support
Date: 2008-09-10 10:23:34
Message-ID: 20080910102334.GC27812@svana.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Sep 10, 2008 at 12:51:02PM +0300, Heikki Linnakangas wrote:
> >Since the set of collations isn't exactly denumerable, we need some way
> >to allow the user to specify the collation they want. The only
> >collation PostgreSQL knows about is the C collation. Anything else is
> >user-defined.
>
> Let's just use the name of the OS locale, like we do now. Having a
> pg_collation catalog just moves the problem elsewhere: we'd still need
> something in pg_collation to tie the collation to the OS locale.

There's not a one-to-one mapping between collation and locale name. A
locale name includes information about the charset and a collation may
have paramters like case-sensetivity and pad-attribute which are not
present in the locale name. You need a mapping anyway, which is what
this table is for.

The difference in collation between CHAR() and VARCHAR() is the usual
example here.

> >Long term, the collation is a property of the type, ...
>
> I'm not sure what the SQL spec says about that, but I believe it
> provides syntax and rules for all that.

The spec is quite detailed about and I posted code to do it years ago.
The point is that we don't need to go that far with this patch.

> >But that put us back where we started: every database having the same
> >collation. We're trying to move away from that. Just reindex everything
> >and be done with it.
>
> That's easier said than done, unfortunately.

I don't see an alternative.

Have a nice day,
--
Martijn van Oosterhout <kleptog(at)svana(dot)org> http://svana.org/kleptog/
> Please line up in a tree and maintain the heap invariant while
> boarding. Thank you for flying nlogn airlines.


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Martijn van Oosterhout <kleptog(at)svana(dot)org>
Cc: Radek Strnad <radek(dot)strnad(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: WIP patch: Collation support
Date: 2008-09-16 13:30:15
Message-ID: 48CFB4E7.3030301@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Martijn van Oosterhout wrote:
> On Wed, Sep 10, 2008 at 12:51:02PM +0300, Heikki Linnakangas wrote:
>>> Since the set of collations isn't exactly denumerable, we need some way
>>> to allow the user to specify the collation they want. The only
>>> collation PostgreSQL knows about is the C collation. Anything else is
>>> user-defined.
>> Let's just use the name of the OS locale, like we do now. Having a
>> pg_collation catalog just moves the problem elsewhere: we'd still need
>> something in pg_collation to tie the collation to the OS locale.
>
> There's not a one-to-one mapping between collation and locale name. A
> locale name includes information about the charset and a collation may
> have paramters like case-sensetivity and pad-attribute which are not
> present in the locale name. You need a mapping anyway, which is what
> this table is for.

Ideally, we would delegate the case-sensitivity and padding to the
collation implementation (ie. OS setlocale() or ICU). That said, I don't
think operating systems normally ship case-insensitive variants of
locales by default, so I agree it would be nice if we could implement
that ourselves. Still, we could identify case-sensitive locale names for
example by a suffix, like "en_GB.UTF8.case-insensitive".

I agree we will eventually need a way to give shorthand names for
collations, and a pg_collation catalog will then come handy. But that
can wait until we have the basic infrastructure ready to support column
and query-level collation.

>>> But that put us back where we started: every database having the same
>>> collation. We're trying to move away from that. Just reindex everything
>>> and be done with it.
>> That's easier said than done, unfortunately.
>
> I don't see an alternative.

Well, I proposed disallowing using a different collation than the source
database, except for using template0 as the source. That's pretty
limited, but is trivial to implement and still let's you have databases
with different collations in the same cluster.

I worked a bit on Radek's patch, stripping out all the pg_collate and
pg_charset catalog changes and commands, leaving just the core
functionality of database-level collations. It needs some cleanup and
documentation, but something like this I'd like to commit in this commit
fest. The new catalogs can wait until we have a real need for them.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

Attachment Content-Type Size
wip-collation-nocatalogs-1.patch text/x-diff 27.2 KB

From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Martijn van Oosterhout <kleptog(at)svana(dot)org>, Radek Strnad <radek(dot)strnad(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: WIP patch: Collation support
Date: 2008-09-16 13:52:08
Message-ID: 87zlm8fbav.fsf@oxford.xeocode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com> writes:

> Well, I proposed disallowing using a different collation than the source
> database, except for using template0 as the source. That's pretty limited, but
> is trivial to implement and still let's you have databases with different
> collations in the same cluster.

+ if (strcmp(dbtemplate, "template0") != 0 &&
+ (strcmp(lc_collate, src_collation) || strcmp(lc_ctype, src_ctype)))
+ ereport(NOTICE,
+ (errmsg("database \"%s\" needs to be reindexed manually (REINDEX DATABASE)",
+ dbname)));
+

This isn't what you described but I think I prefer it this way as just a
warning not an error. I can easily imagine cases where the admin knows there
are either no indexes or no data in their database or where they're perfectly
happy to reindex. A flat prohibition seems annoying.

That said it's worth noting that this would tie our hands with proposals like
retail vacuum which risk database corruption if they fail to refind an index
pointer for a tuple...

Incidentally it seems like the warning should actually explain *why* it needs
to be reindexed manually and perhaps what the consequences are until it is.

AFAIK we can't easily connect to the new database and do some fiddling with
it, can we? If we could we could check if there are any non-empty indexes
which depend on the collation and only print the warning if we find any (and
even mark them invalid).

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's Slony Replication support!


From: Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM>
To: Gregory Stark <stark(at)enterprisedb(dot)com>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Martijn van Oosterhout <kleptog(at)svana(dot)org>, Radek Strnad <radek(dot)strnad(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: WIP patch: Collation support
Date: 2008-09-16 15:00:13
Message-ID: 48CFC9FD.408@sun.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Gregory Stark napsal(a):
> Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com> writes:
>

<snip>

> AFAIK we can't easily connect to the new database and do some fiddling with
> it, can we? If we could we could check if there are any non-empty indexes
> which depend on the collation and only print the warning if we find any (and
> even mark them invalid).
>

Autovacum uses InitPostgres function to swith to another database. I'm
not sure how much safe it is in create database command and when we are
already switched we can reindex affected indexes.

Zdenek


From: Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Martijn van Oosterhout <kleptog(at)svana(dot)org>, Radek Strnad <radek(dot)strnad(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: WIP patch: Collation support
Date: 2008-09-16 15:06:55
Message-ID: 48CFCB8F.7050308@sun.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Heikki Linnakangas napsal(a):
> Martijn van Oosterhout wrote:
>> On Wed, Sep 10, 2008 at 12:51:02PM +0300, Heikki Linnakangas wrote:
>>>> Since the set of collations isn't exactly denumerable, we need some way
>>>> to allow the user to specify the collation they want. The only
>>>> collation PostgreSQL knows about is the C collation. Anything else is
>>>> user-defined.
>>> Let's just use the name of the OS locale, like we do now. Having a
>>> pg_collation catalog just moves the problem elsewhere: we'd still
>>> need something in pg_collation to tie the collation to the OS locale.
>>
>> There's not a one-to-one mapping between collation and locale name. A
>> locale name includes information about the charset and a collation may
>> have paramters like case-sensetivity and pad-attribute which are not
>> present in the locale name. You need a mapping anyway, which is what
>> this table is for.
>
> Ideally, we would delegate the case-sensitivity and padding to the
> collation implementation (ie. OS setlocale() or ICU). That said, I don't
> think operating systems normally ship case-insensitive variants of
> locales by default, so I agree it would be nice if we could implement
> that ourselves. Still, we could identify case-sensitive locale names for
> example by a suffix, like "en_GB.UTF8.case-insensitive".

The idea was to call to_upper (or to_lower) before case-sensitive
collation processing. It is difficult to determine from suffix if it is
sensitive or not.

Zdenek

PS: We can discuss it in Prato


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM>
Cc: Gregory Stark <stark(at)enterprisedb(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Martijn van Oosterhout <kleptog(at)svana(dot)org>, Radek Strnad <radek(dot)strnad(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: WIP patch: Collation support
Date: 2008-09-16 15:35:11
Message-ID: 20080916153511.GG9229@alvh.no-ip.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Zdenek Kotala escribió:
> Gregory Stark napsal(a):

>> AFAIK we can't easily connect to the new database and do some fiddling with
>> it, can we? If we could we could check if there are any non-empty indexes
>> which depend on the collation and only print the warning if we find any (and
>> even mark them invalid).
>
> Autovacum uses InitPostgres function to swith to another database. I'm
> not sure how much safe it is in create database command and when we are
> already switched we can reindex affected indexes.

It's only supposed to work if you're not previously connected to any
database. (Autovacuum never "switches to another database"; any
particular worker only connects to a single database).

--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Gregory Stark <stark(at)enterprisedb(dot)com>
Cc: Martijn van Oosterhout <kleptog(at)svana(dot)org>, Radek Strnad <radek(dot)strnad(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: WIP patch: Collation support
Date: 2008-09-18 14:41:16
Message-ID: 48D2688C.4020408@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Gregory Stark wrote:
> Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com> writes:
>
>> Well, I proposed disallowing using a different collation than the source
>> database, except for using template0 as the source. That's pretty limited, but
>> is trivial to implement and still let's you have databases with different
>> collations in the same cluster.
>
> + if (strcmp(dbtemplate, "template0") != 0 &&
> + (strcmp(lc_collate, src_collation) || strcmp(lc_ctype, src_ctype)))
> + ereport(NOTICE,
> + (errmsg("database \"%s\" needs to be reindexed manually (REINDEX DATABASE)",
> + dbname)));
> +
>
> This isn't what you described but I think I prefer it this way as just a
> warning not an error.

Well, I'd prefer to make it an error, but I'm willing to listen if
others feel otherwise. I don't think the inconvenience of having to use
template0 is that big, compared to the potential of strange behavior
people would run into if they ignore the advice to reindex.

One weakness with a straight strcmp comparison is that it won't
recognize aliases of the same locale. For example, "fi_FI.UTF8" and
"fi_FI.UTF-8".

> AFAIK we can't easily connect to the new database and do some fiddling with
> it, can we? If we could we could check if there are any non-empty indexes
> which depend on the collation and only print the warning if we find any (and
> even mark them invalid).

I don't see that happening, unfortunately..

Attached is an updated version of the stripped-down patch. I've cleaned
it up a bit, and added more sanity checks. Documentation is still
missing and I haven't test it much.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

Attachment Content-Type Size
wip-collation-nocatalogs-2.patch text/x-diff 31.2 KB

From: Martijn van Oosterhout <kleptog(at)svana(dot)org>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Gregory Stark <stark(at)enterprisedb(dot)com>, Radek Strnad <radek(dot)strnad(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: WIP patch: Collation support
Date: 2008-09-19 06:23:05
Message-ID: 20080919062305.GA23582@svana.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Sep 18, 2008 at 05:41:16PM +0300, Heikki Linnakangas wrote:
> Attached is an updated version of the stripped-down patch. I've cleaned
> it up a bit, and added more sanity checks. Documentation is still
> missing and I haven't test it much.

FWIW, I disagree with the stripping down and think we should aim for
the whole patch as submitted. Historically, the hardest part of getting
collation support into postgres has been the catalog changes and easier
the support for something other than OS locales. I supported the patch as
it was precisely because it finally did that, and the stripping down
takes us back to square one.

Implementing the COLLATE keyword is much easier once you have the underlying
support and woldn't involve any backward incompatabilities. The
stripped down version gets us to a state which will only be supported
(hopefully) for one release.

For anyone counting, Firebird added support for ICU more than three
years ago.

Have a nice day,
--
Martijn van Oosterhout <kleptog(at)svana(dot)org> http://svana.org/kleptog/
> Please line up in a tree and maintain the heap invariant while
> boarding. Thank you for flying nlogn airlines.


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Martijn van Oosterhout <kleptog(at)svana(dot)org>
Cc: Gregory Stark <stark(at)enterprisedb(dot)com>, Radek Strnad <radek(dot)strnad(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: WIP patch: Collation support
Date: 2008-09-19 07:13:43
Message-ID: 48D35127.7000503@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Martijn van Oosterhout wrote:
> FWIW, I disagree with the stripping down and think we should aim for
> the whole patch as submitted. Historically, the hardest part of getting
> collation support into postgres has been the catalog changes and easier
> the support for something other than OS locales. I supported the patch as
> it was precisely because it finally did that, and the stripping down
> takes us back to square one.

The catalog changes are one issue, but hardly the biggest one.
Database-level collation works just as well without them, and for
finer-grain control, the proposed patch was nowhere near enough. You'll
need further catalog changes to track collation for each index column,
for example, probably something in the operator families or classes etc.

In addition to catalog changes, for finer grained collation you need
changes in the planner and executor as well. In planner, to provide
support for the concept of a sort order using a specific collation, and
track that whenever sort order is handled. In executor, the capability
to compare and sort using different locales at different times. Those
changes seem a lot more difficult to me than adding a couple of system
catalogs, which is pretty straightforward.

It's not like the patch is going to disappear from planet Earth if it
doesn't get committed for 8.4. It's still valuable and available when
the new catalogs are needed.

> Implementing the COLLATE keyword is much easier once you have the underlying
> support and woldn't involve any backward incompatabilities. The
> stripped down version gets us to a state which will only be supported
> (hopefully) for one release.

Yes, I hope so too. Implementation in phases is a good thing in my mind,
but adding a lot of complexity that isn't need until in a future release
is not. We will need to support the 8.4 CREATE DATABASE syntax for
backwards-compatibility, but I don't see a problem with that.

> For anyone counting, Firebird added support for ICU more than three
> years ago.

ICU is orthogonal to this patch. This patch didn't provide ICU support,
and we could start using ICU without the catalog changes.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To:
Cc: Martijn van Oosterhout <kleptog(at)svana(dot)org>, Gregory Stark <stark(at)enterprisedb(dot)com>, Radek Strnad <radek(dot)strnad(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: WIP patch: Collation support
Date: 2008-09-19 13:40:56
Message-ID: 48D3ABE8.3030802@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Here's an updated version of the stripped-down patch, now with
documentation changes, plus a couple of minor bug fixes.

I think this is looking pretty good now, and I would appreciate review
from others before I go ahead committing this. Is there any more places
in the documentation that needs updating, or any grammar mistakes?
Further opinions on whether to allow using a database with different
collation order as template, requiring a manual reindex and facing weird
results if that's neglected, or to error out with a hint suggesting to
use template0 as template (as is in the patch now)?

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

Attachment Content-Type Size
wip-collation-nocatalogs-3.patch text/x-diff 43.9 KB

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Martijn van Oosterhout <kleptog(at)svana(dot)org>, Gregory Stark <stark(at)enterprisedb(dot)com>, Radek Strnad <radek(dot)strnad(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: WIP patch: Collation support
Date: 2008-09-19 21:31:52
Message-ID: 26236.1221859912@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com> writes:
> Here's an updated version of the stripped-down patch, now with
> documentation changes, plus a couple of minor bug fixes.

> I think this is looking pretty good now, and I would appreciate review
> from others before I go ahead committing this.

I guess the $64 question is whether this moves us towards where we want to
be? If the ultimate goal is to get to per-column collations, I think this
patch is likely to be counterproductive, because it will establish syntax
and semantics that we'll have to try to remain compatible with. If we're
going to be satisfied with per-database collations, then great, let's do
it.

As far as reviewing the patch itself goes:

* It's missing pg_dump (or I guess really pg_dumpall) support.

* Please call the new columns datcollate and datctype. Just because
someone ignored the naming convention years ago for pg_database.encoding
doesn't mean we should ignore it here. Also, if you're going to name one
column based on an LC_foo variable's name, do the other one the same way.
I note also that this would match the keywords used in CREATE DATABASE.

* Don't even think of adding system catalog columns without updating
catalogs.sgml.

* You should try to get rid of LOCALE_NAME_BUFLEN altogether. Definitely
the comment about it in pg_control.h is now obsolete.

* You *must* bump PG_CONTROL_VERSION when you change its layout.
(Don't forget catversion too, since you're also changing pg_database)

* This doc sentence reads a bit awkwardly:

An important restriction, however, is that each database character set
must be compatible with the database's <envar>LC_CTYPE</> setting.

Maybe

An important restriction, however, is that each database's character set
must be compatible with the database's <envar>LC_CTYPE</> setting.

Also I wonder whether we shouldn't say that it must be compatible with
LC_CTYPE *and* LC_COLLATE.

* This bit in the CREATE DATABASE ref page is also awkward:

Any character set encoding specified for the new database must be
compatible with the chosen COLLATE and CTYPE settings.

Maybe

The character set encoding used for the new database must be
compatible with the chosen COLLATE and CTYPE settings.

* This bit doesn't look "ready to commit":

***************
*** 754,759 ****
--- 749,756 ----
<option>-l</option> option or the <command>\l</command> command
of <command>psql</command>.

+ XXX It would be nice to show the "korean" database here. And we should
+ modify psql to show locale as well
<screen>
$ <userinput>psql -l</userinput>
List of databases

* This makes sense, but then shouldn't we make the identical restriction
for encoding?

+ The <literal>COLLATE</> and <literal>CTYPE</> settings must match
+ those of the template database, except when template0 is used as
+ template. This is because <literal>COLLATE</> and <literal>CTYPE</>

* I can't tell whether the writer of this bit thought that a blank line
would come out as a paragraph break or not. Either add <para>'s or
remove the blank lines to make it look like one para in the source:

--- 76,105 ----

<para>
<command>initdb</command> initializes the database cluster's default
! locale and character set encoding.
!
! The character set encoding, collation order (<literal>LC_COLLATE</>)
! and character set classes (<literal>LC_CTYPE</>, e.g. upper, lower,
! digit) can be set separately for a database when it is created.
! <command>initdb</command> determines those settings for the
! <literal>template1</literal> database, which will serve as the
! default for all other databases.
!
! To alter the default collation order or character set classes, use the
! <option>--lc-collate</option> and <option>--lc-ctype</option> options.
! Collation orders other than <literal>C</> or <literal>POSIX</> also have
! a performance penalty. For these reasons it is important to choose the
! right locale when running <command>initdb</command>.
!
! The remaining locale categories can be changed later when the server
! is started. You can also use <option>--locale</option> to set the
! default for all locale categories, including collation order and
! character set classes. All server locale values (<literal>lc_*</>) can
! be displayed via <command>SHOW ALL</>.
More details can be found in <xref linkend="locale">.

! To alter the default encoding, use the <option>--encoding</option>.
! More details can be found in <xref linkend="multibyte">.
</para>

* This bit for resetxlog is just wrong now:

--- 62,70 ----
by specifying the <literal>-f</> (force) switch. In this case plausible
values will be substituted for the missing data. Most of the fields can be
expected to match, but manual assistance might be needed for the next OID,
! next transaction ID and epoch, next multitransaction ID and offset, and
! WAL starting address fields.
! The first five of these can be set using the switches discussed below.
If you are not able to determine correct values for all these fields,
<literal>-f</> can still be used, but
the recovered database must be treated with even more suspicion than

I guess it should just say "These fields can be set using the switches
discussed below.", since there aren't any that can't be.

* This is no good:

+ SetConfigOption("lc_collate", collate, PGC_INTERNAL, PGC_S_DATABASE);
+ SetConfigOption("lc_ctype", ctype, PGC_INTERNAL, PGC_S_DATABASE);

The source *must* be PGC_S_OVERRIDE to ensure it can't be overridden.

regards, tom lane


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Martijn van Oosterhout <kleptog(at)svana(dot)org>, Gregory Stark <stark(at)enterprisedb(dot)com>, Radek Strnad <radek(dot)strnad(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: WIP patch: Collation support
Date: 2008-09-19 21:36:26
Message-ID: 48D41B5A.7050705@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Heikki Linnakangas wrote:
> Here's an updated version of the stripped-down patch, now with
> documentation changes, plus a couple of minor bug fixes.

Another update, marching towards committing. Now with pg_dump/pg_dumpall
support, and collation/ctype is also shown in psql \l output.

I wonder if we should provide a shorthand "LOCALE=<localename>" option,
in addition to separate COLLATE and CTYPE options? Most people will
always set them together. It could also set the rest of the locale
categories, lc_numeric, lc_monetary, and so on, as database-specific
options. Like "ALTER DATABASE <dbname> SET lc_messages=<localename> ..."
would.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

Attachment Content-Type Size
wip-collation-nocatalogs-4.patch text/x-diff 53.5 KB

From: Martijn van Oosterhout <kleptog(at)svana(dot)org>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Gregory Stark <stark(at)enterprisedb(dot)com>, Radek Strnad <radek(dot)strnad(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: WIP patch: Collation support
Date: 2008-09-19 21:59:57
Message-ID: 20080919215957.GC2905@svana.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Sep 19, 2008 at 10:13:43AM +0300, Heikki Linnakangas wrote:
> In addition to catalog changes, for finer grained collation you need
> changes in the planner and executor as well. In planner, to provide
> support for the concept of a sort order using a specific collation, and
> track that whenever sort order is handled. In executor, the capability
> to compare and sort using different locales at different times. Those
> changes seem a lot more difficult to me than adding a couple of system
> catalogs, which is pretty straightforward.

To be honest, I think that's the easy part. When I submitted the
collate patch a few years back, getting the sort order working was the
easy part, precisely because COLLATE clauses can affect the generated
plan in precisely one way: can you use index FOO or not. Once you're
past that point you can throw almost all the collation information
away.

As you point out though, the shear volume of catalog changes required
obscure that fact somewhat.

> It's not like the patch is going to disappear from planet Earth if it
> doesn't get committed for 8.4. It's still valuable and available when
> the new catalogs are needed.

I just prefer it as it was because it takes care of a useful subset of
the features people want in a way that is compatable for the future.
Whereas the stripped down version, I'm not sure it gets us anywhere.

Have a nice day,
--
Martijn van Oosterhout <kleptog(at)svana(dot)org> http://svana.org/kleptog/
> Please line up in a tree and maintain the heap invariant while
> boarding. Thank you for flying nlogn airlines.


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Martijn van Oosterhout <kleptog(at)svana(dot)org>
Cc: Gregory Stark <stark(at)enterprisedb(dot)com>, Radek Strnad <radek(dot)strnad(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: WIP patch: Collation support
Date: 2008-09-22 07:22:35
Message-ID: 48D747BB.8020407@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Martijn van Oosterhout wrote:
> On Fri, Sep 19, 2008 at 10:13:43AM +0300, Heikki Linnakangas wrote:
>> It's not like the patch is going to disappear from planet Earth if it
>> doesn't get committed for 8.4. It's still valuable and available when
>> the new catalogs are needed.
>
> I just prefer it as it was because it takes care of a useful subset of
> the features people want in a way that is compatable for the future.
> Whereas the stripped down version, I'm not sure it gets us anywhere.

It gives the capability to have different collations in different
databases within the same cluster. IOW, the same feature as the original
patch. Finer-grained collation would be even better, of course, but
database-level collations is a valuable feature on its own.

The critical question is how much compatibility trouble we're going to
get by having to support the extension to CREATE DATABASE in the
stripped-down patch, when the pg_collation catalog is introduced in a
later version in one form or another. So let's investigate that a bit
further:

In the stripped down version, the CREATE DATABASE syntax is:

CREATE DATABASE <name> WITH COLLATE=<locale name> CTYPE=<locale name>

In the original patch, the CREATE DATABASE syntax is:

CREATE DATABASE <name> WITH COLLATE=<collation name>

The first thing that we see is that the COLLATE keyword means different
things, so it's probably best to change that into:

CREATE DATABASE <name> WITH LC_COLLATE=<locale name> LC_CTYPE=<locale name>

in the stripped-down version. Then we need a way to map the
stripped-down syntax into the one in the original patch. That's just a
matter of looking up the collation in the pg_collation catalog with the
right LC_COLLATE and LC_CTYPE.

Things get slightly more complicated if there is no such collation in
the pg_collation catalog. One option is to simply create it at that point.

BTW, the original patch didn't have any provision for creating rows in
pg_collation reflecting the locales available in the OS, but I think
we'd need that. Otherwise the DBA would need to manually run CREATE
COLLATION for every collation they want users to be able to use.
Assuming we do that, the situation that we can't find a row with given
LC_COLLATE and LC_CTYPE should not arise in practice.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Martijn van Oosterhout <kleptog(at)svana(dot)org>, Gregory Stark <stark(at)enterprisedb(dot)com>, Radek Strnad <radek(dot)strnad(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: WIP patch: Collation support
Date: 2008-09-22 14:03:50
Message-ID: 14708.1222092230@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com> writes:
> The first thing that we see is that the COLLATE keyword means different
> things, so it's probably best to change that into:

> CREATE DATABASE <name> WITH LC_COLLATE=<locale name> LC_CTYPE=<locale name>

> in the stripped-down version. Then we need a way to map the
> stripped-down syntax into the one in the original patch. That's just a
> matter of looking up the collation in the pg_collation catalog with the
> right LC_COLLATE and LC_CTYPE.

It seems to me that in an installation using libc-based collation
support, the collation names are likely to be the same as allowed values
of LC_COLLATE anyway. So inventing different keywords doesn't really
seem necessary.

What might be sensible to ask is whether it is ever actually reasonable
for LC_COLLATE and LC_CTYPE to have different settings. If we were
willing to enforce that they be the same, we could reduce this to just
the standard syntax COLLATE=something and be done with it. Not being
much of a user of anything except C locale, I might be the wrong person
to opine on this; but it seems to me that having them different is far
more likely to be a mistake than desirable.

regards, tom lane


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Martijn van Oosterhout <kleptog(at)svana(dot)org>, Gregory Stark <stark(at)enterprisedb(dot)com>, Radek Strnad <radek(dot)strnad(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: WIP patch: Collation support
Date: 2008-09-22 14:13:47
Message-ID: 48D7A81B.7050602@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Tom Lane wrote:
> What might be sensible to ask is whether it is ever actually reasonable
> for LC_COLLATE and LC_CTYPE to have different settings. If we were
> willing to enforce that they be the same, we could reduce this to just
> the standard syntax COLLATE=something and be done with it. Not being
> much of a user of anything except C locale, I might be the wrong person
> to opine on this; but it seems to me that having them different is far
> more likely to be a mistake than desirable.

Agreed, it doesn't make much sense. I find it hard to imagine anyone
doing that on purpose, but we have supported it at initdb time for ages.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Martijn van Oosterhout <kleptog(at)svana(dot)org>, Gregory Stark <stark(at)enterprisedb(dot)com>, Radek Strnad <radek(dot)strnad(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: WIP patch: Collation support
Date: 2008-09-22 15:02:04
Message-ID: 48D7B36C.6080201@sun.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Heikki Linnakangas napsal(a):

>> For anyone counting, Firebird added support for ICU more than three
>> years ago.
>
> ICU is orthogonal to this patch. This patch didn't provide ICU support,
> and we could start using ICU without the catalog changes.

This patch should allow to use both system catalog and ICU. pg_collate catalog
contains comparing function which is called for string comparing and if somebody
creates function which will use ICU then ICU will be supported. It is advantage
of pg_catalog. Without them you can have system or ICU but probably not both.

Zdenek

--
Zdenek Kotala Sun Microsystems
Prague, Czech Republic http://sun.com/postgresql


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM>
Cc: Martijn van Oosterhout <kleptog(at)svana(dot)org>, Gregory Stark <stark(at)enterprisedb(dot)com>, Radek Strnad <radek(dot)strnad(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: WIP patch: Collation support
Date: 2008-09-22 15:11:04
Message-ID: 48D7B588.4050000@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Zdenek Kotala wrote:
> Heikki Linnakangas napsal(a):
>
>>> For anyone counting, Firebird added support for ICU more than three
>>> years ago.
>>
>> ICU is orthogonal to this patch. This patch didn't provide ICU
>> support, and we could start using ICU without the catalog changes.
>
> This patch should allow to use both system catalog and ICU.

Not without another patch that actually introduces ICU support. What
that would look like, how that would be stored in the catalogs, and
whether we want that is whole another topic. Without that, the STRCOLFN
part of the original patch is pointless, and I would've ripped that out
anyway even if we decided to add the pg_collation catalog in this release.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Martijn van Oosterhout <kleptog(at)svana(dot)org>, Gregory Stark <stark(at)enterprisedb(dot)com>, Radek Strnad <radek(dot)strnad(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: WIP patch: Collation support
Date: 2008-09-23 09:20:49
Message-ID: 48D8B4F1.6040309@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Committed.

Tom Lane wrote:
> * You should try to get rid of LOCALE_NAME_BUFLEN altogether. Definitely
> the comment about it in pg_control.h is now obsolete.

Yep. I removed LOCALE_NAME_BUFLEN. The real max length of a locale name
is now NAMEDATALEN, because it's stored in a name field in pg_database.
NAMEDATALEN is only 64 bytes, whereas LOCALE_NAME_BUFLEN was 128. 64
bytes should be enough for "en_GB.UTF8" style locale names, but I wonder
if it's enough for the longer names used on Windows? Could someone
confirm that, please?

> An important restriction, however, is that each database's character set
> must be compatible with the database's <envar>LC_CTYPE</> setting.
>
> Also I wonder whether we shouldn't say that it must be compatible with
> LC_CTYPE *and* LC_COLLATE.

I think we should, but that's in fact not what is tested. Before the
patch as well, we only tested that the encoding matches LC_CTYPE, but
you could set LC_COLLATE to anything. I'll work on a subsequent patch to
tighten that.

> * This makes sense, but then shouldn't we make the identical restriction
> for encoding?
>
> + The <literal>COLLATE</> and <literal>CTYPE</> settings must match
> + those of the template database, except when template0 is used as
> + template. This is because <literal>COLLATE</> and <literal>CTYPE</>

It wouldn't be as bullet-proof for encoding, because we'd still have the
problem that the encoding in the shared system tables would be
ill-defined. That's a pre-existing problem, though. We could simply
remove support for per-database encodings altogether and fix it at
initdb time, as Martijn suggest earlier, but now that we have
per-database locales, per-database encodings is a lot more useful as well.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: "Dave Page" <dpage(at)pgadmin(dot)org>
To: "Heikki Linnakangas" <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Martijn van Oosterhout" <kleptog(at)svana(dot)org>, "Gregory Stark" <stark(at)enterprisedb(dot)com>, "Radek Strnad" <radek(dot)strnad(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: WIP patch: Collation support
Date: 2008-09-23 10:26:23
Message-ID: 937d27e10809230326g168a7559r5609a2f2b0fda8b7@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Sep 23, 2008 at 10:20 AM, Heikki Linnakangas
<heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
> Committed.

*adds yet another item to his pgAdmin todo list* :-(

> Tom Lane wrote:
>>
>> * You should try to get rid of LOCALE_NAME_BUFLEN altogether. Definitely
>> the comment about it in pg_control.h is now obsolete.
>
> Yep. I removed LOCALE_NAME_BUFLEN. The real max length of a locale name is
> now NAMEDATALEN, because it's stored in a name field in pg_database.
> NAMEDATALEN is only 64 bytes, whereas LOCALE_NAME_BUFLEN was 128. 64 bytes
> should be enough for "en_GB.UTF8" style locale names, but I wonder if it's
> enough for the longer names used on Windows? Could someone confirm that,
> please?

The longest I can find is:

Serbian (Cyrillic)_Bosnia and Herzegovina

at 42 characters.

--
Dave Page
EnterpriseDB UK: http://www.enterprisedb.com


From: Martijn van Oosterhout <kleptog(at)svana(dot)org>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Gregory Stark <stark(at)enterprisedb(dot)com>, Radek Strnad <radek(dot)strnad(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: WIP patch: Collation support
Date: 2008-09-23 10:26:52
Message-ID: 20080923102652.GB6986@svana.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Sep 22, 2008 at 10:22:35AM +0300, Heikki Linnakangas wrote:
> BTW, the original patch didn't have any provision for creating rows in
> pg_collation reflecting the locales available in the OS, but I think
> we'd need that. Otherwise the DBA would need to manually run CREATE
> COLLATION for every collation they want users to be able to use.
> Assuming we do that, the situation that we can't find a row with given
> LC_COLLATE and LC_CTYPE should not arise in practice.

You're assuming collations are denumerable. They're not. There is no way
to find the list of available collations/locales. You may be able to
guess a few but certainly not all of them.

Have a nice day,
--
Martijn van Oosterhout <kleptog(at)svana(dot)org> http://svana.org/kleptog/
> Please line up in a tree and maintain the heap invariant while
> boarding. Thank you for flying nlogn airlines.


From: Martijn van Oosterhout <kleptog(at)svana(dot)org>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM>, Gregory Stark <stark(at)enterprisedb(dot)com>, Radek Strnad <radek(dot)strnad(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: WIP patch: Collation support
Date: 2008-09-23 10:32:07
Message-ID: 20080923103207.GC6986@svana.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Sep 22, 2008 at 06:11:04PM +0300, Heikki Linnakangas wrote:
> >This patch should allow to use both system catalog and ICU.
>
> Not without another patch that actually introduces ICU support. What
> that would look like, how that would be stored in the catalogs, and
> whether we want that is whole another topic. Without that, the STRCOLFN
> part of the original patch is pointless, and I would've ripped that out
> anyway even if we decided to add the pg_collation catalog in this release.

Eh? How you store collations names is easy. Every collation has a
textual name which is what we store in the catalog. I'm not sure why
you'd think it'd be any more complicated than that. And it has
precisely nothing to do with ICU and everything to do with being able
to support multiple source of collation information. We already have
two: builtin (strcmp/memcmp) and C library (strcoll). Letting peope add
more is the goal.

I'm sure once the catalog exists the existing ICU-for-Postgres patch
will be adjusted to use it.

Have a nice day,
--
Martijn van Oosterhout <kleptog(at)svana(dot)org> http://svana.org/kleptog/
> Please line up in a tree and maintain the heap invariant while
> boarding. Thank you for flying nlogn airlines.


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Martijn van Oosterhout <kleptog(at)svana(dot)org>
Cc: Gregory Stark <stark(at)enterprisedb(dot)com>, Radek Strnad <radek(dot)strnad(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: WIP patch: Collation support
Date: 2008-09-23 10:32:38
Message-ID: 48D8C5C6.7010708@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Martijn van Oosterhout wrote:
> On Mon, Sep 22, 2008 at 10:22:35AM +0300, Heikki Linnakangas wrote:
>> BTW, the original patch didn't have any provision for creating rows in
>> pg_collation reflecting the locales available in the OS, but I think
>> we'd need that. Otherwise the DBA would need to manually run CREATE
>> COLLATION for every collation they want users to be able to use.
>> Assuming we do that, the situation that we can't find a row with given
>> LC_COLLATE and LC_CTYPE should not arise in practice.
>
> You're assuming collations are denumerable. They're not. There is no way
> to find the list of available collations/locales. You may be able to
> guess a few but certainly not all of them.

"locale -a" manages to do it somehow...

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Martijn van Oosterhout <kleptog(at)svana(dot)org>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Gregory Stark <stark(at)enterprisedb(dot)com>, Radek Strnad <radek(dot)strnad(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: WIP patch: Collation support
Date: 2008-09-23 10:51:09
Message-ID: 20080923105109.GD6986@svana.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Sep 23, 2008 at 01:32:38PM +0300, Heikki Linnakangas wrote:
> Martijn van Oosterhout wrote:
> >You're assuming collations are denumerable. They're not. There is no way
> >to find the list of available collations/locales. You may be able to
> >guess a few but certainly not all of them.
>
> "locale -a" manages to do it somehow...

Sure, by (on glibc) opening the binary archive and parsing it and then
trying to reverse lookup the alias list. We could ofcourse program
something for each platform to determine a list but who is going to
maintain that? How do you handle the list changing?

Have a nice day,
--
Martijn van Oosterhout <kleptog(at)svana(dot)org> http://svana.org/kleptog/
> Please line up in a tree and maintain the heap invariant while
> boarding. Thank you for flying nlogn airlines.


From: Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM>
To: Martijn van Oosterhout <kleptog(at)svana(dot)org>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Gregory Stark <stark(at)enterprisedb(dot)com>, Radek Strnad <radek(dot)strnad(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: WIP patch: Collation support
Date: 2008-09-23 11:19:35
Message-ID: 48D8D0C7.9080407@sun.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Martijn van Oosterhout napsal(a):
> On Mon, Sep 22, 2008 at 06:11:04PM +0300, Heikki Linnakangas wrote:
>>> This patch should allow to use both system catalog and ICU.
>> Not without another patch that actually introduces ICU support. What
>> that would look like, how that would be stored in the catalogs, and
>> whether we want that is whole another topic. Without that, the STRCOLFN
>> part of the original patch is pointless, and I would've ripped that out
>> anyway even if we decided to add the pg_collation catalog in this release.
>
> Eh? How you store collations names is easy. Every collation has a
> textual name which is what we store in the catalog. I'm not sure why
> you'd think it'd be any more complicated than that. And it has
> precisely nothing to do with ICU and everything to do with being able
> to support multiple source of collation information. We already have
> two: builtin (strcmp/memcmp) and C library (strcoll). Letting peope add
> more is the goal.

pg_collation catalog is also important for pg_dump, because system
collation names are not compatible over OS and pg_dump output should be
portable. pg_collation adds abstract layer which solve this problem.

> I'm sure once the catalog exists the existing ICU-for-Postgres patch
> will be adjusted to use it.

Yes, I agree with Martijn.

Zdenek


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM>
Cc: Martijn van Oosterhout <kleptog(at)svana(dot)org>, Gregory Stark <stark(at)enterprisedb(dot)com>, Radek Strnad <radek(dot)strnad(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: WIP patch: Collation support
Date: 2008-09-23 11:26:32
Message-ID: 48D8D268.2080305@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Zdenek Kotala wrote:
> pg_collation catalog is also important for pg_dump, because system
> collation names are not compatible over OS and pg_dump output should be
> portable. pg_collation adds abstract layer which solve this problem.

That's a valid point. We'll still need a way to map OS locale to
whatever internal names we invent for them, though, so I'm not sure if
the pg_collation catalog helps much, but just moves the problem
elsewhere. The pg_dump output of the CREATE COLLATION statements still
wouldn't be portable from one OS to another.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Martijn van Oosterhout <kleptog(at)svana(dot)org>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Gregory Stark <stark(at)enterprisedb(dot)com>, Radek Strnad <radek(dot)strnad(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: WIP patch: Collation support
Date: 2008-09-23 12:38:26
Message-ID: 25529.1222173506@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Martijn van Oosterhout <kleptog(at)svana(dot)org> writes:
> On Tue, Sep 23, 2008 at 01:32:38PM +0300, Heikki Linnakangas wrote:
>> "locale -a" manages to do it somehow...

> Sure, by (on glibc) opening the binary archive and parsing it and then
> trying to reverse lookup the alias list. We could ofcourse program
> something for each platform to determine a list but who is going to
> maintain that? How do you handle the list changing?

exec("locale -a") ...

I suppose we'd need something else for Windows, but I'm sure there's
a way.

regards, tom lane


From: Magnus Hagander <magnus(at)hagander(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Martijn van Oosterhout <kleptog(at)svana(dot)org>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Gregory Stark <stark(at)enterprisedb(dot)com>, Radek Strnad <radek(dot)strnad(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: WIP patch: Collation support
Date: 2008-09-23 13:27:22
Message-ID: 48D8EEBA.5000705@hagander.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Tom Lane wrote:
> Martijn van Oosterhout <kleptog(at)svana(dot)org> writes:
>> On Tue, Sep 23, 2008 at 01:32:38PM +0300, Heikki Linnakangas wrote:
>>> "locale -a" manages to do it somehow...
>
>> Sure, by (on glibc) opening the binary archive and parsing it and then
>> trying to reverse lookup the alias list. We could ofcourse program
>> something for each platform to determine a list but who is going to
>> maintain that? How do you handle the list changing?
>
> exec("locale -a") ...
>
> I suppose we'd need something else for Windows, but I'm sure there's
> a way.

IIRC, the data is in the registry. Should be enumerable somehow - we'll
have to do it platform specific of course, but it's not the first time
we'd do that for windows...

//Magnus


From: Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Martijn van Oosterhout <kleptog(at)svana(dot)org>, Gregory Stark <stark(at)enterprisedb(dot)com>, Radek Strnad <radek(dot)strnad(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: WIP patch: Collation support
Date: 2008-09-23 15:12:52
Message-ID: 48D90774.3070307@sun.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Heikki Linnakangas napsal(a):
> Zdenek Kotala wrote:
>> pg_collation catalog is also important for pg_dump, because system
>> collation names are not compatible over OS and pg_dump output should
>> be portable. pg_collation adds abstract layer which solve this problem.
>
> That's a valid point. We'll still need a way to map OS locale to
> whatever internal names we invent for them, though, so I'm not sure if
> the pg_collation catalog helps much, but just moves the problem
> elsewhere.

It is true. For names we can for example use RFC479 0IANA register)
http://tools.ietf.org/html/rfc4790#section-7
or use UNICODE terminology CLDR.

> The pg_dump output of the CREATE COLLATION statements still
> wouldn't be portable from one OS to another.
>

I don't think so that pg_collation catalog should be dumped (maybe only
with extra switch).

Zdenek


From: Petr Jelinek <pjmodos(at)pjmodos(dot)net>
To: Magnus Hagander <magnus(at)hagander(dot)net>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP patch: Collation support
Date: 2008-09-24 05:18:47
Message-ID: 48D9CDB7.50004@pjmodos.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Magnus Hagander wrote:
>> exec("locale -a") ...
>>
>> I suppose we'd need something else for Windows, but I'm sure there's
>> a way.
>
> IIRC, the data is in the registry. Should be enumerable somehow - we'll
> have to do it platform specific of course, but it's not the first time
> we'd do that for windows...
>

There is EnumSystemLocales API function in Windows.

--
Regards
Petr Jelinek (PJMODOS)


From: Magnus Hagander <magnus(at)hagander(dot)net>
To: Petr Jelinek <pjmodos(at)pjmodos(dot)net>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP patch: Collation support
Date: 2008-09-24 05:40:03
Message-ID: 48D9D2B3.3090608@hagander.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Petr Jelinek wrote:
> Magnus Hagander wrote:
>>> exec("locale -a") ...
>>>
>>> I suppose we'd need something else for Windows, but I'm sure there's
>>> a way.
>>
>> IIRC, the data is in the registry. Should be enumerable somehow - we'll
>> have to do it platform specific of course, but it's not the first time
>> we'd do that for windows...
>>
>
> There is EnumSystemLocales API function in Windows.

Ha, right. We even use it in the installer :-)
Bottom line remains: we can easily do this in a Windows-specific way if
we need to.

//Magnus