[WIP] patch - Collation at database level

Lists: pgsql-hackers
From: "Radek Strnad" <radek(dot)strnad(at)gmail(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: [WIP] patch - Collation at database level
Date: 2008-07-01 14:58:36
Message-ID: de5165440807010758h147d5fw1fc969bc66dcb300@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hello,

I'm sending part of the code that I've done and is available for reviewing
and I'm asking for your comments and some help because I'm new to
PostgreSQL.

Proposal: http://archives.postgresql.org/pgsql-hackers/2008-05/msg00857.php

The code contains changes of parser, pg_collation catalog, parsenodes.h for
CREATE COLLATION... and DROP COLLATION statements.

Questions that I would like to know answers:

1) Should I move DROP statement from DropCollationStmt to DropStmt? Some
statements are seperated and some are included in DropStmt? Is there any
reason for that?
2) What type should all names in CREATE and DROP statement in gram.y have?
I've chosen qualified_name but I know it's not the best choice.
3) All collations are created from existing collations. How do I ensure that
the collation already exists? Is there any possibility to define it in
gram.y?
4) For further functionality development is there anything more needed than
adding T_CreateCollationStmt and T_DropCollationStmt to
/src/backend/tcop/utility.c, and write functionality into collation.c?
5) Also can you look at the pg_catalog and tell me if anything is wrong with
it?

Thank you for all your replies in advance.

Regards

Radek Strnad

Attachment Content-Type Size
collation.patch text/x-diff 18.7 KB

From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Radek Strnad <radek(dot)strnad(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: [WIP] patch - Collation at database level
Date: 2008-07-02 15:14:23
Message-ID: 20080702151423.GT18252@alvh.no-ip.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Radek Strnad escribió:

> 2) What type should all names in CREATE and DROP statement in gram.y have?
> I've chosen qualified_name but I know it's not the best choice.

I think it should be ColId.

> 3) All collations are created from existing collations. How do I ensure that
> the collation already exists? Is there any possibility to define it in
> gram.y?

Certainly not -- shouldn't they come from a catalog? In that case, it
must come in parse analysis (parser/analyze.c I guess) or perhaps later,
when you actually execute the function to create the new collation.

> 5) Also can you look at the pg_catalog and tell me if anything is wrong with
> it?

Why does a collation have a schema?

What's the "existing collation"?

It seems a bit silly to have enum for what are basically boolean
variables. Why not just use "true" and "false"?

--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Radek Strnad <radek(dot)strnad(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [WIP] patch - Collation at database level
Date: 2008-07-02 16:55:42
Message-ID: 13084.1215017742@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Alvaro Herrera <alvherre(at)commandprompt(dot)com> writes:
> Why does a collation have a schema?

Because the SQL spec says so. Also, if we don't put them in schemas,
we have no nice way to distinguish built-in and user-defined collations,
which creates a problem for pg_dump.

regards, tom lane


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Radek Strnad <radek(dot)strnad(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [WIP] patch - Collation at database level
Date: 2008-07-02 17:02:36
Message-ID: 20080702170236.GV18252@alvh.no-ip.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Tom Lane escribió:
> Alvaro Herrera <alvherre(at)commandprompt(dot)com> writes:
> > Why does a collation have a schema?
>
> Because the SQL spec says so. Also, if we don't put them in schemas,
> we have no nice way to distinguish built-in and user-defined collations,
> which creates a problem for pg_dump.

Oh, I see :-) In that case, qualified_name would seem the right symbol
to use in the parser.

--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Alvaro Herrera" <alvherre(at)commandprompt(dot)com>, "Radek Strnad" <radek(dot)strnad(at)gmail(dot)com>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [WIP] patch - Collation at database level
Date: 2008-07-02 18:22:10
Message-ID: 871w2ccfjh.fsf@oxford.xeocode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

"Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:

> Alvaro Herrera <alvherre(at)commandprompt(dot)com> writes:
>> Why does a collation have a schema?
>
> Because the SQL spec says so. Also, if we don't put them in schemas,
> we have no nice way to distinguish built-in and user-defined collations,
> which creates a problem for pg_dump.

Out of curiosity, what is a "user-defined collation"? Are there SQL statements
to go around declaring what order code points should be sorted in? That seems
like it would be... quite tedious!

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's Slony Replication support!


From: Martijn van Oosterhout <kleptog(at)svana(dot)org>
To: Gregory Stark <stark(at)enterprisedb(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Radek Strnad <radek(dot)strnad(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [WIP] patch - Collation at database level
Date: 2008-07-02 19:11:39
Message-ID: 20080702191139.GC27142@svana.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Jul 02, 2008 at 07:22:10PM +0100, Gregory Stark wrote:
> > Because the SQL spec says so. Also, if we don't put them in schemas,
> > we have no nice way to distinguish built-in and user-defined collations,
> > which creates a problem for pg_dump.
>
> Out of curiosity, what is a "user-defined collation"? Are there SQL statements
> to go around declaring what order code points should be sorted in? That seems
> like it would be... quite tedious!

Not that we'll ever use it, but ICU for example allows users to say:
"use collation X but move this code point somewhere else", essentially
allowing users tweak the collation on a small scale. In any case,
whatever collation library is used, we're unlikely to predefine every
possible collation in the system, there's too many (assuming they're
denumerable).

Have a niceday,
--
Martijn van Oosterhout <kleptog(at)svana(dot)org> http://svana.org/kleptog/
> Please line up in a tree and maintain the heap invariant while
> boarding. Thank you for flying nlogn airlines.


From: "Radek Strnad" <radek(dot)strnad(at)gmail(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: [WIP] patch - Collation at database level
Date: 2008-07-02 19:21:06
Message-ID: de5165440807021221s26b1feaam461319aa1fe94ed2@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

My patch should be "sort of" wrapper that will implement guts for further
development (collation at column level) like catalogs, creating collations
etc. When creating collation user will be able to choose which function to
use (by statement STRCOLFN - not in SQL standard). In the first stage I'll
implement function that will use system locales. Adding ICU or any other
library won't be that big deal.

Radek Strnad

On Wed, Jul 2, 2008 at 8:22 PM, Gregory Stark <stark(at)enterprisedb(dot)com>
wrote:

> "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:
>
> > Alvaro Herrera <alvherre(at)commandprompt(dot)com> writes:
> >> Why does a collation have a schema?
> >
> > Because the SQL spec says so. Also, if we don't put them in schemas,
> > we have no nice way to distinguish built-in and user-defined collations,
> > which creates a problem for pg_dump.
>
> Out of curiosity, what is a "user-defined collation"? Are there SQL
> statements
> to go around declaring what order code points should be sorted in? That
> seems
> like it would be... quite tedious!
>
> --
> Gregory Stark
> EnterpriseDB http://www.enterprisedb.com
> Ask me about EnterpriseDB's Slony Replication support!
>


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Gregory Stark <stark(at)enterprisedb(dot)com>
Cc: "Alvaro Herrera" <alvherre(at)commandprompt(dot)com>, "Radek Strnad" <radek(dot)strnad(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [WIP] patch - Collation at database level
Date: 2008-07-02 23:19:03
Message-ID: 17984.1215040743@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Gregory Stark <stark(at)enterprisedb(dot)com> writes:
> Out of curiosity, what is a "user-defined collation"? Are there SQL statements
> to go around declaring what order code points should be sorted in? That seems
> like it would be... quite tedious!

Hm, that's a good point. SQL99 has

<collation definition> ::=
CREATE COLLATION <collation name> FOR
<character set specification>
FROM <existing collation name>
[ <pad characteristic> ]

<existing collation name> ::= <collation name>

<pad characteristic> ::=
NO PAD
| PAD SPACE

which seems pretty stupid if you ask me --- all the mechanism required
to manage a new object type, just to enable PAD SPACE or not?
(Especially when PAD SPACE itself is an utterly broken, useless concept
... but I digress.) You might as well just provide all the standard
collations in both variants and be done with it.

The statement looks the same in last year's 200n draft, so it's not
like they were just about to add some more capability.

We might be best off to treat collations like index access methods,
ie, they're theoretically add-able but there's no infrastructure for
managing them, and what's expected is that all the ones you need are
created by initdb.

regards, tom lane


From: Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Gregory Stark <stark(at)enterprisedb(dot)com>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Radek Strnad <radek(dot)strnad(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [WIP] patch - Collation at database level
Date: 2008-07-03 10:47:48
Message-ID: 486CAE54.6020003@sun.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Tom Lane napsal(a):
> Gregory Stark <stark(at)enterprisedb(dot)com> writes:
>> Out of curiosity, what is a "user-defined collation"? Are there SQL statements
>> to go around declaring what order code points should be sorted in? That seems
>> like it would be... quite tedious!
>
> Hm, that's a good point. SQL99 has
>
> <collation definition> ::=
> CREATE COLLATION <collation name> FOR
> <character set specification>
> FROM <existing collation name>
> [ <pad characteristic> ]
>
> <existing collation name> ::= <collation name>
>
> <pad characteristic> ::=
> NO PAD
> | PAD SPACE
>
> which seems pretty stupid if you ask me --- all the mechanism required
> to manage a new object type, just to enable PAD SPACE or not?
> (Especially when PAD SPACE itself is an utterly broken, useless concept
> ... but I digress.) You might as well just provide all the standard
> collations in both variants and be done with it.
>
> The statement looks the same in last year's 200n draft, so it's not
> like they were just about to add some more capability.

The proposed syntax of CREATE COLLATION is:

CREATE COLLATION <collation name> FOR <character set specification>
FROM <existing collation name> [STRCOLFN <fn name>]
[ <pad characteristic> ] [ <case sensitive> ] [ <accent sensitive> ]
[ LCCOLLATE <lc_collate> ] [ LCCTYPE <lc_ctype> ]

Which extends ANSI specification.

> We might be best off to treat collations like index access methods,
> ie, they're theoretically add-able but there's no infrastructure for
> managing them, and what's expected is that all the ones you need are
> created by initdb.

I think you cannot create all collation at bootstrap. You can only create record
for actual LC_COLLATION, because I there is not standard way how to obtain
complete list of supported collations and there is also problem if you install
new locales after initdb.

When I looked to another DB (MS SQL, MySQL, DB2, Firebird) then only Firebird
supports CREATE COLLATION command. Other databases has hard coded list of
locales. Hardcoded solution means to use some lib (e.g ICU) with unified names
or has locale name mapping for all supported OS.

I personally prefer open solution when I can create own collation and specify
collation function to handle it.

Zdenek


From: Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Gregory Stark <stark(at)enterprisedb(dot)com>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Radek Strnad <radek(dot)strnad(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [WIP] patch - Collation at database level
Date: 2008-07-08 08:22:32
Message-ID: 487323C8.4030001@sun.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Tom Lane napsal(a):
> Gregory Stark <stark(at)enterprisedb(dot)com> writes:
>> Out of curiosity, what is a "user-defined collation"? Are there SQL statements
>> to go around declaring what order code points should be sorted in? That seems
>> like it would be... quite tedious!
>

<snip>

> We might be best off to treat collations like index access methods,
> ie, they're theoretically add-able but there's no infrastructure for
> managing them, and what's expected is that all the ones you need are
> created by initdb.

I though more about it and I discussed it with Radek yesterday. The problem is
that collation must be created before user want to use CREATE DATABASE ...
COLLATE ... command. It inclines to have have pg_collation as a global catalog,
but ANSI specifies to use schema name in collation specification and schemes are
database specific ... It means that pg_collation have to be non-shared catalog
and new database only inherits collation from template db. And CREATE DATABASE
have to check list of collation in template database :(.

My conclusion is that CREATE COLLATION does not make much sense. I see two
possible solutions:

1) have global an local catalog for collation and have modified variants of
create cmd:

CREATE COLLATION ... GLOBAL|LOCAL

CREATE DATABASE will use only collation from global catalog

Local catalog will be useful when full support of collation will be available
mostly for specifying case sensitivity of collation.

2) Use Tom's suggested approach. Create list of collations in initdb phase. But
there is problem how to obtain list of supported collation on the server. I
think, only what is possible to do is to use default locale for creating default
collation for template1 database.

Any suggestion?

thanks Zdenek

--
Zdenek Kotala Sun Microsystems
Prague, Czech Republic http://sun.com/postgresql


From: Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM>
To:
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Gregory Stark <stark(at)enterprisedb(dot)com>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Radek Strnad <radek(dot)strnad(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [WIP] patch - Collation at database level
Date: 2008-07-08 09:27:35
Message-ID: 48733307.10200@sun.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Zdenek Kotala napsal(a):

>
> I though more about it and I discussed it with Radek yesterday. The
> problem is that collation must be created before user want to use CREATE
> DATABASE ... COLLATE ... command. It inclines to have have pg_collation
> as a global catalog, but ANSI specifies to use schema name in collation
> specification and schemes are database specific ... It means that
> pg_collation have to be non-shared catalog and new database only
> inherits collation from template db. And CREATE DATABASE have to check
> list of collation in template database :(.

thinking more ...
It must be shared catalog because pg_database will depend on it.

Zdenek

--
Zdenek Kotala Sun Microsystems
Prague, Czech Republic http://sun.com/postgresql


From: Martijn van Oosterhout <kleptog(at)svana(dot)org>
To: Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Gregory Stark <stark(at)enterprisedb(dot)com>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Radek Strnad <radek(dot)strnad(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [WIP] patch - Collation at database level
Date: 2008-07-08 09:49:52
Message-ID: 20080708094951.GE16671@svana.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Jul 08, 2008 at 11:27:35AM +0200, Zdenek Kotala wrote:
> Zdenek Kotala napsal(a):
>
> >
> >I though more about it and I discussed it with Radek yesterday. The
> >problem is that collation must be created before user want to use CREATE
> >DATABASE ... COLLATE ... command. It inclines to have have pg_collation
> >as a global catalog, but ANSI specifies to use schema name in collation
> >specification and schemes are database specific ... It means that
> >pg_collation have to be non-shared catalog and new database only
> >inherits collation from template db. And CREATE DATABASE have to check
> >list of collation in template database :(.
>
> thinking more ...
> It must be shared catalog because pg_database will depend on it.

Not necessarily. pg_class is not shared yet without it you can't even
find pg_database. Same deal with pg_type. All it means is that
pg_collation in template1 must contain all the collations used in
template1, which shouldn't be hard to arrange.

Have a nice day,
--
Martijn van Oosterhout <kleptog(at)svana(dot)org> http://svana.org/kleptog/
> Please line up in a tree and maintain the heap invariant while
> boarding. Thank you for flying nlogn airlines.


From: Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM>
To: Martijn van Oosterhout <kleptog(at)svana(dot)org>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Gregory Stark <stark(at)enterprisedb(dot)com>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Radek Strnad <radek(dot)strnad(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [WIP] patch - Collation at database level
Date: 2008-07-08 10:00:34
Message-ID: 48733AC2.5050404@sun.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Martijn van Oosterhout napsal(a):
> On Tue, Jul 08, 2008 at 11:27:35AM +0200, Zdenek Kotala wrote:
>> Zdenek Kotala napsal(a):
>>
>>> I though more about it and I discussed it with Radek yesterday. The
>>> problem is that collation must be created before user want to use CREATE
>>> DATABASE ... COLLATE ... command. It inclines to have have pg_collation
>>> as a global catalog, but ANSI specifies to use schema name in collation
>>> specification and schemes are database specific ... It means that
>>> pg_collation have to be non-shared catalog and new database only
>>> inherits collation from template db. And CREATE DATABASE have to check
>>> list of collation in template database :(.
>> thinking more ...
>> It must be shared catalog because pg_database will depend on it.
>
> Not necessarily. pg_class is not shared yet without it you can't even
> find pg_database. Same deal with pg_type. All it means is that
> pg_collation in template1 must contain all the collations used in
> template1, which shouldn't be hard to arrange.

I think, Collation situation is different, becasue pg_database will contains
column colname. pg_class (and all bootstrap catalog) only contains row which
specify that shared table exists and content is cloned to the new database from
template database. In corner case you can get context specific dependency for
example if Czech collation will have oid=10 in database test01 and Swedish
collation will have oid=10 in database test02. How to handle CREATE DATABASE and
connect database? OK it shouldn't happen in normal situation but ...

Zdenek

--
Zdenek Kotala Sun Microsystems
Prague, Czech Republic http://sun.com/postgresql


From: Martijn van Oosterhout <kleptog(at)svana(dot)org>
To: Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Gregory Stark <stark(at)enterprisedb(dot)com>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Radek Strnad <radek(dot)strnad(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [WIP] patch - Collation at database level
Date: 2008-07-08 10:19:18
Message-ID: 20080708101918.GF16671@svana.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Jul 08, 2008 at 12:00:34PM +0200, Zdenek Kotala wrote:
> >Not necessarily. pg_class is not shared yet without it you can't even
> >find pg_database. Same deal with pg_type. All it means is that
> >pg_collation in template1 must contain all the collations used in
> >template1, which shouldn't be hard to arrange.
>
> I think, Collation situation is different, becasue pg_database will
> contains column colname. pg_class (and all bootstrap catalog) only contains
> row which specify that shared table exists and content is cloned to the new
> database from template database. In corner case you can get context
> specific dependency for example if Czech collation will have oid=10 in
> database test01 and Swedish collation will have oid=10 in database test02.
> How to handle CREATE DATABASE and connect database? OK it shouldn't happen
> in normal situation but ...

Oh I see, you're referring to the storage of the default collation for
a database. I was jumping ahead to the per-column collation state, when
the collation default is attached to columns, types and domains, and
not at the database level. So there the problem does not exist.

To be honest, I'd suggest storing the collation in pg_database as a
string, rather than as an identifier. This sidesteps the problem
entirly.

Have a nice day,
--
Martijn van Oosterhout <kleptog(at)svana(dot)org> http://svana.org/kleptog/
> Please line up in a tree and maintain the heap invariant while
> boarding. Thank you for flying nlogn airlines.


From: Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM>
To: Martijn van Oosterhout <kleptog(at)svana(dot)org>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Gregory Stark <stark(at)enterprisedb(dot)com>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Radek Strnad <radek(dot)strnad(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [WIP] patch - Collation at database level
Date: 2008-07-08 11:23:50
Message-ID: 48734E46.7080907@sun.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Martijn van Oosterhout napsal(a):
>
> Oh I see, you're referring to the storage of the default collation for
> a database. I was jumping ahead to the per-column collation state, when
> the collation default is attached to columns, types and domains, and
> not at the database level. So there the problem does not exist.

Yeah, but you still need one source/one collation list for database, scheme,
table and column. And of course shared tables need also collation for their indexes.

> To be honest, I'd suggest storing the collation in pg_database as a
> string, rather than as an identifier. This sidesteps the problem
> entirly.

I don't think that string is good idea. You need to use same approach on all
levels by my opinion.

Zdenek

--
Zdenek Kotala Sun Microsystems
Prague, Czech Republic http://sun.com/postgresql


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM>
Cc: Martijn van Oosterhout <kleptog(at)svana(dot)org>, Gregory Stark <stark(at)enterprisedb(dot)com>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Radek Strnad <radek(dot)strnad(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [WIP] patch - Collation at database level
Date: 2008-07-08 13:49:19
Message-ID: 7808.1215524959@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM> writes:
> Martijn van Oosterhout napsal(a):
>> Not necessarily. pg_class is not shared yet without it you can't even
>> find pg_database. Same deal with pg_type. All it means is that
>> pg_collation in template1 must contain all the collations used in
>> template1, which shouldn't be hard to arrange.

> I think, Collation situation is different,

All the argument here is based on the premise that we should have
database-level collation specifications, which AFAICS is not required
nor suggested by the SQL spec. I wonder why we are allowing a
nonstandard half-measure to drive our thinking, rather than solving the
real problem which is column-level collations.

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM>
Cc: Martijn van Oosterhout <kleptog(at)svana(dot)org>, Gregory Stark <stark(at)enterprisedb(dot)com>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Radek Strnad <radek(dot)strnad(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [WIP] patch - Collation at database level
Date: 2008-07-08 14:31:44
Message-ID: 8601.1215527504@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM> writes:
> ... And of course shared tables need also collation for their indexes.

No, they don't, because the only textual indexes on shared catalogs are
on "name" columns, which are intentionally not locale aware, and
wouldn't be collation aware either.

regards, tom lane


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM>, Martijn van Oosterhout <kleptog(at)svana(dot)org>, Gregory Stark <stark(at)enterprisedb(dot)com>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Radek Strnad <radek(dot)strnad(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [WIP] patch - Collation at database level
Date: 2008-07-08 14:54:28
Message-ID: 48737FA4.4080308@dunslane.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Tom Lane wrote:
>
> All the argument here is based on the premise that we should have
> database-level collation specifications, which AFAICS is not required
> nor suggested by the SQL spec. I wonder why we are allowing a
> nonstandard half-measure to drive our thinking, rather than solving the
> real problem which is column-level collations.
>
>
>

Agreed. Are we even sure that we want per-database collations as a
half-way house? Unless we can be sure that we want all the required
catalog changes for the full requirement, it seems to me a rather messy
way of getting to where we want to go.

cheers

andrew


From: Martijn van Oosterhout <kleptog(at)svana(dot)org>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM>, Gregory Stark <stark(at)enterprisedb(dot)com>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Radek Strnad <radek(dot)strnad(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [WIP] patch - Collation at database level
Date: 2008-07-08 15:02:51
Message-ID: 20080708150251.GL16671@svana.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Jul 08, 2008 at 10:54:28AM -0400, Andrew Dunstan wrote:
> Agreed. Are we even sure that we want per-database collations as a
> half-way house? Unless we can be sure that we want all the required
> catalog changes for the full requirement, it seems to me a rather messy
> way of getting to where we want to go.

Given that the current projected timeframe for full COLLATE support is
something like 2-3 major releases, I don't really see the problem with
doing this now. Who knows, it might prompt people to do something
sooner.

Have a nice day,
--
Martijn van Oosterhout <kleptog(at)svana(dot)org> http://svana.org/kleptog/
> Please line up in a tree and maintain the heap invariant while
> boarding. Thank you for flying nlogn airlines.


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: [WIP] patch - Collation at database level
Date: 2008-07-08 16:31:41
Message-ID: 200807080931.41652.josh@agliodbs.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Andrew, Tom,

> Agreed. Are we even sure that we want per-database collations as a
> half-way house? Unless we can be sure that we want all the required
> catalog changes for the full requirement, it seems to me a rather messy
> way of getting to where we want to go.

Given that we don't have a delivery date for table or column level collations,
we don't want to turn down database-level collations. If nothing else,
Radek's work will expose what areas of our code are collation-dependant and
hopefully make the work of more granular collations easier. And if it takes
us 3 years to get more granular collations, at least people can use
database-level ones in the meantime so that they don't need to have separate
PostgreSQL binaries for every language they want to support fully.

Also ... this is a Summer of Code Project, which we accepted, which at least
in Google and the student's eyes means we're not going to discard the entire
premise of the patch. I'm not exaggerating when I say doing something like
that could get PostgreSQL permanently banned from Google SoC.

Tom, I think you need to be on the SoC committee in the future, just to raise
objections. Some 15+ PostgreSQL contributors on the SoC committee approved
Radek's project.

--
Josh Berkus
PostgreSQL @ Sun
San Francisco


From: Michael Paesold <mpaesold(at)gmx(dot)at>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM>, Martijn van Oosterhout <kleptog(at)svana(dot)org>, Gregory Stark <stark(at)enterprisedb(dot)com>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Radek Strnad <radek(dot)strnad(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [WIP] patch - Collation at database level
Date: 2008-07-08 16:33:41
Message-ID: EEA0C5EF-E9F7-4D45-BB3F-4B9EE2302CA2@gmx.at
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Tom Lane wrote:

> Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM> writes:
>> Martijn van Oosterhout napsal(a):
>>> Not necessarily. pg_class is not shared yet without it you can't
>>> even
>>> find pg_database. Same deal with pg_type. All it means is that
>>> pg_collation in template1 must contain all the collations used in
>>> template1, which shouldn't be hard to arrange.
>
>> I think, Collation situation is different,
>
> All the argument here is based on the premise that we should have
> database-level collation specifications, which AFAICS is not required
> nor suggested by the SQL spec. I wonder why we are allowing a
> nonstandard half-measure to drive our thinking, rather than solving
> the
> real problem which is column-level collations.

Wouldn't you still need per-database and per-table default collations?
At least MySQL does have such a concept.

Best Regards
Michael Paesold


From: Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Martijn van Oosterhout <kleptog(at)svana(dot)org>, Gregory Stark <stark(at)enterprisedb(dot)com>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Radek Strnad <radek(dot)strnad(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [WIP] patch - Collation at database level
Date: 2008-07-08 16:42:55
Message-ID: 4873990F.1090307@sun.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Tom Lane napsal(a):
> Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM> writes:
>> ... And of course shared tables need also collation for their indexes.
>
> No, they don't, because the only textual indexes on shared catalogs are
> on "name" columns, which are intentionally not locale aware, and
> wouldn't be collation aware either.

Yeah, name uses strcmp, which is not locale aware but from ANSI perspective
there is collation SQL_IDENTIFIER for it which is fortunately implementation
defined.

What I see now as the problem is that we need also to know correct collation for
ORDER - for example:

select * from pg_shdescription order by description;

...thinking...

but it should solve by collation per column which will work well with
pg_attribute cloning for new database as Martijn mentioned.

Zdenek

--
Zdenek Kotala Sun Microsystems
Prague, Czech Republic http://sun.com/postgresql


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: [WIP] patch - Collation at database level
Date: 2008-07-08 18:08:56
Message-ID: 12138.1215540536@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Josh Berkus <josh(at)agliodbs(dot)com> writes:
> Andrew, Tom,
>> Agreed. Are we even sure that we want per-database collations as a
>> half-way house? Unless we can be sure that we want all the required
>> catalog changes for the full requirement, it seems to me a rather messy
>> way of getting to where we want to go.

> Given that we don't have a delivery date for table or column level collations,
> we don't want to turn down database-level collations.

I am one hundred percent prepared to turn them down, if they end up
contorting the design in a way that we will have to undo (with
consequent backwards-compatibility problems) to get to the full feature.

If it's a partial implementation of the full feature, that's fine, but
I'm not getting good vibes about that from the discussions so far.

regards, tom lane


From: Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Martijn van Oosterhout <kleptog(at)svana(dot)org>, Gregory Stark <stark(at)enterprisedb(dot)com>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Radek Strnad <radek(dot)strnad(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [WIP] patch - Collation at database level
Date: 2008-07-08 19:05:11
Message-ID: 4873BA67.6090600@sun.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Tom Lane napsal(a):
> Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM> writes:
>> Martijn van Oosterhout napsal(a):
>>> Not necessarily. pg_class is not shared yet without it you can't even
>>> find pg_database. Same deal with pg_type. All it means is that
>>> pg_collation in template1 must contain all the collations used in
>>> template1, which shouldn't be hard to arrange.
>
>> I think, Collation situation is different,
>
> All the argument here is based on the premise that we should have
> database-level collation specifications, which AFAICS is not required
> nor suggested by the SQL spec.

Yeah, it is not required, but by my opinion it should be derived from CREATE
SCHEMA statement. There is following item:

--- SQL ANSI 2003 page 520 ---

5) If <schema character set specification> is not specified, then a <schema
character set specification> that specifies an implementation-defined character
set that contains at least every character that is in <SQL language character>
is implicit.

----

It is not for collation directly, but if I understand it correctly when you want
to create schema then default charset is inherit from parent instance which is
database (catalog).

Following sentence specified that pg_collation should be database specific.

---- SQL ANSI 2003 page 15 ---
Character sets defined by standards or by SQL-implementations reside in the
Information Schema (named INFORMATION_SCHEMA) in each catalog, as do collations
defined by standards and collations, transliterations, and transcodings defined
by SQL-implementations.
----

> I wonder why we are allowing a
> nonstandard half-measure to drive our thinking, rather than solving the
> real problem which is column-level collations.

I try to determine how to implement collation itself - collation catalog
structure and content and how to create new collation. Column-level collation is
nice but until we will not have basic infrastructure we cannot start implemented it.

Zdenek

--
Zdenek Kotala Sun Microsystems
Prague, Czech Republic http://sun.com/postgresql


From: Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Martijn van Oosterhout <kleptog(at)svana(dot)org>, Gregory Stark <stark(at)enterprisedb(dot)com>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Radek Strnad <radek(dot)strnad(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [WIP] patch - Collation at database level
Date: 2008-07-08 19:16:24
Message-ID: 4873BD08.3010802@sun.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Andrew Dunstan napsal(a):
>
>
> Tom Lane wrote:
>>
>> All the argument here is based on the premise that we should have
>> database-level collation specifications, which AFAICS is not required
>> nor suggested by the SQL spec. I wonder why we are allowing a
>> nonstandard half-measure to drive our thinking, rather than solving the
>> real problem which is column-level collations.
>>
>>
>>
>
> Agreed. Are we even sure that we want per-database collations as a
> half-way house? Unless we can be sure that we want all the required
> catalog changes for the full requirement, it seems to me a rather messy
> way of getting to where we want to go.

Andrew,
I would like also to see full collation implementation rather then collation per
database. But from my point of view split collation into small parts is much
better. Radek's work is mostly about creating infrastructure for full collation
support. When it will be finished, then "only" changes in parser, executor...
will be necessary to complete a job.

If you look on most discussion about collation they fell into ICU yes/no problem
without any real decision how to implemented the feature.

Zdenek

--
Zdenek Kotala Sun Microsystems
Prague, Czech Republic http://sun.com/postgresql


From: Martijn van Oosterhout <kleptog(at)svana(dot)org>
To: Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Gregory Stark <stark(at)enterprisedb(dot)com>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Radek Strnad <radek(dot)strnad(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [WIP] patch - Collation at database level
Date: 2008-07-09 07:10:55
Message-ID: 20080709071055.GA18930@svana.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Jul 08, 2008 at 09:05:11PM +0200, Zdenek Kotala wrote:
> >All the argument here is based on the premise that we should have
> >database-level collation specifications, which AFAICS is not required
> >nor suggested by the SQL spec.
>
> Yeah, it is not required, but by my opinion it should be derived from
> CREATE SCHEMA statement. There is following item:
>
> --- SQL ANSI 2003 page 520 ---
>
> 5) If <schema character set specification> is not specified, then a <schema

Careful, this is a 'character set specification" which has (almost)
nothing to do with collation. It's closer to the encoding field, which
is already in pg_database.

The issue with having a "default database collation" is that it's
unclear where it would be used. In the end the collation is defined by
the types and domains. Columns inherit from the types. I think the only
senseible definition is to decide that all the text/varchar/char types
inherit from the database. It's not in the spec but I think it does
make easier to decide what the default collation is. As an alternative
to:

ALTER TYPE text SET DEFAULT COLLATION TO foo;
<repeat for all other text types>

> I try to determine how to implement collation itself - collation catalog
> structure and content and how to create new collation. Column-level
> collation is nice but until we will not have basic infrastructure we cannot
> start implemented it.

I agree that this patch includes much basic work that needs to be done
for full collation support.

Have a nice day,
--
Martijn van Oosterhout <kleptog(at)svana(dot)org> http://svana.org/kleptog/
> Please line up in a tree and maintain the heap invariant while
> boarding. Thank you for flying nlogn airlines.


From: Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM>
To: Martijn van Oosterhout <kleptog(at)svana(dot)org>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Gregory Stark <stark(at)enterprisedb(dot)com>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Radek Strnad <radek(dot)strnad(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [WIP] patch - Collation at database level
Date: 2008-07-09 13:18:32
Message-ID: 4874BAA8.6070904@sun.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Martijn van Oosterhout napsal(a):
> On Tue, Jul 08, 2008 at 09:05:11PM +0200, Zdenek Kotala wrote:
>>> All the argument here is based on the premise that we should have
>>> database-level collation specifications, which AFAICS is not required
>>> nor suggested by the SQL spec.
>> Yeah, it is not required, but by my opinion it should be derived from
>> CREATE SCHEMA statement. There is following item:
>>
>> --- SQL ANSI 2003 page 520 ---
>>
>> 5) If <schema character set specification> is not specified, then a <schema
>
> Careful, this is a 'character set specification" which has (almost)
> nothing to do with collation. It's closer to the encoding field, which
> is already in pg_database.

Yeah, I'm confuse why for character set is defined behavior when is not
specified but there is not definition how to handle it for collation.

Maybe because charset has default collation.

> The issue with having a "default database collation" is that it's
> unclear where it would be used. In the end the collation is defined by
> the types and domains. Columns inherit from the types. I think the only
> senseible definition is to decide that all the text/varchar/char types
> inherit from the database. It's not in the spec but I think it does
> make easier to decide what the default collation is. As an alternative
> to:
>
> ALTER TYPE text SET DEFAULT COLLATION TO foo;
> <repeat for all other text types>

It is good point I thought about following situation

select 'xxx' < 'yyy';

But how you mentioned because both are text and text type has default collation
there is no reason to have collation per database.

Hmm, It seems to me that expectation should be create all string data types with
default collation during initdb phase. It means modify BKI types definition
according to default locale setting.

Zdenek


From: Peter Eisentraut <peter_e(at)gmx(dot)net>
To: pgsql-hackers(at)postgresql(dot)org
Cc: "Radek Strnad" <radek(dot)strnad(at)gmail(dot)com>
Subject: Re: [WIP] patch - Collation at database level
Date: 2008-07-29 08:41:37
Message-ID: 200807291141.39166.peter_e@gmx.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Am Tuesday, 1. July 2008 schrieb Radek Strnad:
> I'm sending part of the code that I've done and is available for reviewing
> and I'm asking for your comments and some help because I'm new to
> PostgreSQL.
>
> Proposal: http://archives.postgresql.org/pgsql-hackers/2008-05/msg00857.php
>
> The code contains changes of parser, pg_collation catalog, parsenodes.h for
> CREATE COLLATION... and DROP COLLATION statements.

This patch is a small start on the way to adding a catalog and some SQL
statements that add and remove entries from it. But I don't see any
collation support here, which is about 99% of the work that is left to do.
Where are the collations going to come from? Have the various build and
distributions issues been thought about? How are they going to be configured
(not the SQL syntax, but how will the configuration be applied)? How are the
collations going to be applied at run-time? How are you going to handle
locale and encoding conflicts? I also think that the clauses you have
attached to your CREATE COLLATION statement (case-insensitive,
accent-insensitive) are an oversimplification of reality. I suggest you look
up the Unicode collation algorithm to learn about who collations work in
practice.

In my opinion, you are starting this project from the wrong end. I would
suggest you approach it like this:

- Find some collation implementations
- Patch PostgreSQL to link with them
- Patch PostgreSQL to apply them for comparison
- Implement system catalog to configure them
- Implement SQL statements to manipulate the system catalog


From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: "Peter Eisentraut" <peter_e(at)gmx(dot)net>
Cc: <pgsql-hackers(at)postgresql(dot)org>, "Radek Strnad" <radek(dot)strnad(at)gmail(dot)com>
Subject: Re: [WIP] patch - Collation at database level
Date: 2008-07-29 10:56:17
Message-ID: 87fxpt3qni.fsf@oxford.xeocode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

"Peter Eisentraut" <peter_e(at)gmx(dot)net> writes:

> In my opinion, you are starting this project from the wrong end. I would
> suggest you approach it like this:
>
> - Find some collation implementations
> - Patch PostgreSQL to link with them

Well I think the feeling is that we may as well start with the lowest common
denominator of libc's collation implementation. It's the only one everyone's
going to have. Later adding compile-time options to use a different library
and different function calls might be useful but a lot of people are unhappy
about the idea of *requiring* a major outside library for this.

> - Patch PostgreSQL to apply them for comparison

Er, yes. Well we do that already but the tricky bit is keeping track of
multiple collations and applying the right one for each comparison.

So we do need the concept of multiple collations and the syntax to select a
collation for each ordering operation.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's Slony Replication support!


From: "Radek Strnad" <radek(dot)strnad(at)gmail(dot)com>
To: "Peter Eisentraut" <peter_e(at)gmx(dot)net>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [WIP] patch - Collation at database level
Date: 2008-08-02 13:39:18
Message-ID: de5165440808020639j4e7226bbu32e840e225e15c3e@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hello,

the main reason why I've submitted the patch was to start a discussion and
know other people's opinion on this problem.

On Tue, Jul 29, 2008 at 10:41 AM, Peter Eisentraut <peter_e(at)gmx(dot)net> wrote:

>
> Where are the collations going to come from?

There will be two new catalogs - pg_collate and pg_charset. Each of them
will be filled with ANSI standard collations and charsets (ISO8BIT, LATIN1,
UTF-8..) and alternatively with default collation set when creating. For
instance if you create database cluster with initdb and specify en_US.utf8
there will be standard rows (ISO8BIT, LATIN1, UTF-8..) + one row with
en_US.utf8 in template0. Then you can connect to template0 and create other
collations if your POSIX locales support them and use them one per each
database.

Have the various build and distributions issues been thought about?

Yes. Since POSIX locales doesn't guarantee any collation there will be
hard-coded collations implemented regarding ANSI collation standard. Others
can be set by command CREATE COLLATION.

How are they going to be configured (not the SQL syntax, but how will the
> configuration be applied)?

pg_type, pg_attribute, pg_namespace of each database will be extended with
collation oid column that will be specifying collation.

How are the collations going to be applied at run-time?

Collation will be set when connecting to the database with
setlocale(LC_COLLATION, XXX) and setlocale(LC_CTYPE, XXX)

> How are you going to handle locale and encoding conflicts?

Since I'm currently implementing collation support per database I don't
think this is an issue. (It will be in the future I know.)

> I also think that the clauses you have attached to your CREATE COLLATION
> statement (case-insensitive,
> accent-insensitive) are an oversimplification of reality. I suggest you
> look
> up the Unicode collation algorithm to learn about who collations work in
> practice.

I already did in the very beginning of the development. The reason why I'm
not implementing the whole Unicode collation algorithm is that this patch
shold be sort of framework. You'll be able to use different collation
functions not only POSIX locales so further development towards full Unicode
collation algorithm is possible.

At the end of the next week I'll publish my bachelor thesis concerning this
topic where everything will be explained in details so stay tuned.

Regards

Radek Strnad


From: Martijn van Oosterhout <kleptog(at)svana(dot)org>
To: Radek Strnad <radek(dot)strnad(at)gmail(dot)com>
Cc: Peter Eisentraut <peter_e(at)gmx(dot)net>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [WIP] patch - Collation at database level
Date: 2008-08-02 14:07:28
Message-ID: 20080802140727.GB4098@svana.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sat, Aug 02, 2008 at 03:39:18PM +0200, Radek Strnad wrote:
> > I also think that the clauses you have attached to your CREATE
> > COLLATION statement (case-insensitive, accent-insensitive) are an
> > oversimplification of reality. I suggest you look up the Unicode
> > collation algorithm to learn about who collations work in practice.
>
> I already did in the very beginning of the development. The reason why I'm
> not implementing the whole Unicode collation algorithm is that this patch
> shold be sort of framework. You'll be able to use different collation
> functions not only POSIX locales so further development towards full Unicode
> collation algorithm is possible.

Agreed. Ofcourse it's a simplification of reality. POSIX locales are a
simplification of reality, but its the only form of collation currently
available to us. And quite frankly, I don't beleive postgresql should
be in the business of writing collation algorithms, we don't have the
expertese.

FWIW, I think case-insensitive and accent-insensitive are useful modifiers
that we should aim to support in the future.

> At the end of the next week I'll publish my bachelor thesis concerning this
> topic where everything will be explained in details so stay tuned.

Good luck!

Have a nice day,
--
Martijn van Oosterhout <kleptog(at)svana(dot)org> http://svana.org/kleptog/
> Please line up in a tree and maintain the heap invariant while
> boarding. Thank you for flying nlogn airlines.