[WIP] collation support revisited (phase 1)

From: "Radek Strnad" <radek(dot)strnad(at)gmail(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: [WIP] collation support revisited (phase 1)
Date: 2008-07-10 21:24:29
Message-ID: de5165440807101424l14fb535byf43fc665351c4dfd@mail.gmail.com
Lists: pgsql-hackers

Hi,

after a long discussion with Mr. Kotala, we've decided to redesign our
collation support proposal. For those of you who aren't familiar with my WIP
patch and the comments from other hackers, here's the original mail:
http://archives.postgresql.org/pgsql-hackers/2008-07/msg00019.php

In a few sentences: I'm writing collation support for PostgreSQL that is
almost independent of the collating function used. I will implement POSIX
locales, but a switch to ICU will be quite easy. Collations and character sets
defined by the SQL standard will be hard-coded, so that we avoid their being
unavailable in some implementations.

The whole project will be divided into two phases:

phase 1
Implement a "sort of framework" so that PostgreSQL will have the basic guts
(pg_collation & pg_charset catalogs, CREATE COLLATION, collation support added
for each type that needs it) and will support collation at the database level.
This phase has been accepted as a Google Summer of Code project.

phase 2
Implement the rest: full collation at the column level. I will continue
working on this after finishing phase one; it will be my master's thesis.

How will the first part work?

Catalogs
- new catalogs pg_collation and pg_charset will be defined
- pg_collation and pg_charset will contain the SQL standard collations + an
optional default collation (when one other than an SQL standard collation is
set)
- pg_type, pg_attribute, and pg_namespace will be extended with references to
the default records in pg_collation and pg_charset

initdb
- pg_collation & pg_charset will contain the pre-defined records required by
the SQL standard, and optionally one non-standard record set at initdb time
(the one using the system locales)
- these records will be referenced by pg_type, pg_attribute, and pg_namespace
in the columns concerned, and will be considered the default collation, which
will be inherited

CREATE DATABASE ... COLLATE ...
- after the new database is copied, its collation will either be the default
(the same as the cluster collation) or changed by the COLLATE clause; we then
update the pg_type, pg_attribute, and pg_namespace catalogs
- reindex the database

When switching databases, the database collation will be retrieved from the
text type in pg_type. This part should be the only one that will be deleted
when proceeding with phase 2. But that will take a while :-)

Thanks for all your comments

Regards

Radek Strnad


From: Martijn van Oosterhout <kleptog(at)svana(dot)org>
To: Radek Strnad <radek(dot)strnad(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: [WIP] collation support revisited (phase 1)
Date: 2008-07-11 07:29:18
Message-ID: 20080711072918.GA11710@svana.org
Lists: pgsql-hackers

On Thu, Jul 10, 2008 at 11:24:29PM +0200, Radek Strnad wrote:
> Hi,
>
> after long discussion with Mr. Kotala, we've decided to redesign our
> collation support proposal.
> For those of you who aren't familiar with my WIP patch and comments from
> other hackers here's the original mail:
> http://archives.postgresql.org/pgsql-hackers/2008-07/msg00019.php

<snip>
> phase 1
> Implement "sort of framework" so the PostgreSQL will have basic guts
> (pg_collation & pg_charset catalogs, CREATE COLLATION, add collation support
> for each type needed) and will support collation at database level. This
> phase has been accepted as a Google Summer of Code project.

Why bother with pg_charset? I don't think I've seen anyone even asking
for multiple charsets in a single DB and certainly no actual use case.

Have a nice day,
--
Martijn van Oosterhout <kleptog(at)svana(dot)org> http://svana.org/kleptog/
> Please line up in a tree and maintain the heap invariant while
> boarding. Thank you for flying nlogn airlines.


From: Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM>
To: Martijn van Oosterhout <kleptog(at)svana(dot)org>
Cc: Radek Strnad <radek(dot)strnad(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [WIP] collation support revisited (phase 1)
Date: 2008-07-11 19:52:47
Message-ID: 4877BA0F.9030704@sun.com
Lists: pgsql-hackers

Martijn van Oosterhout wrote:
> On Thu, Jul 10, 2008 at 11:24:29PM +0200, Radek Strnad wrote:
>> Hi,
>>
>> after long discussion with Mr. Kotala, we've decided to redesign our
>> collation support proposal.
>> For those of you who aren't familiar with my WIP patch and comments from
>> other hackers here's the original mail:
>> http://archives.postgresql.org/pgsql-hackers/2008-07/msg00019.php
>
> <snip>
>> phase 1
>> Implement "sort of framework" so the PostgreSQL will have basic guts
>> (pg_collation & pg_charset catalogs, CREATE COLLATION, add collation support
>> for each type needed) and will support collation at database level. This
>> phase has been accepted as a Google Summer of Code project.
>
> Why bother with pg_charset? I don't think I've seen anyone even asking
> for multiple charsets in a single DB and certainly no actual use case.

An example is when you have translation data (a vocabulary) in the database.
But the real reason is that the ANSI standard specifies (chapter 4.2) the
charset as part of the string type's descriptor. See below:

— The length or maximum length in characters of the character string type.
— The catalog name, schema name, and character set name of the character set of
the character string type.
— The catalog name, schema name, and collation name of the collation of the
character string type.

If I think about it, the question is whether we need both pieces of
information in pg_type and so on, because a collation automatically defines
its charset.

Any ideas?

Zdenek

--
Zdenek Kotala Sun Microsystems
Prague, Czech Republic http://sun.com/postgresql


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM>
Cc: Martijn van Oosterhout <kleptog(at)svana(dot)org>, Radek Strnad <radek(dot)strnad(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [WIP] collation support revisited (phase 1)
Date: 2008-07-11 21:01:20
Message-ID: 20080711210120.GL4110@alvh.no-ip.org
Lists: pgsql-hackers

Zdenek Kotala wrote:

> The example is when you have translation data (vocabulary) in database.
> But the reason is that ANSI specify (chapter 4.2) charset as a part of
> string descriptor. See below:
>
> — The length or maximum length in characters of the character string type.
> — The catalog name, schema name, and character set name of the character
> set of the character string type.
> — The catalog name, schema name, and collation name of the collation of
> the character string type.

We already support multiple charsets, and are able to do conversions
between them. The set of charsets is hardcoded and it's hard to make a
case that a user needs to create new ones. I concur with Martijn's
suggestion -- there's no need for this to appear in a system catalog.

Perhaps it could be argued that we need to be able to specify the
charset a given string is in -- currently all strings are in the server
encoding (charset) which is fixed at initdb time. Making the system
support multiple server encodings would be a major undertaking in itself
and I'm not sure that there's a point.

--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


From: Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Martijn van Oosterhout <kleptog(at)svana(dot)org>, Radek Strnad <radek(dot)strnad(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [WIP] collation support revisited (phase 1)
Date: 2008-07-12 08:02:24
Message-ID: 48786510.9080502@sun.com
Lists: pgsql-hackers

Alvaro Herrera wrote:
> Zdenek Kotala wrote:
>
>> The example is when you have translation data (vocabulary) in database.
>> But the reason is that ANSI specify (chapter 4.2) charset as a part of
>> string descriptor. See below:
>>
>> — The length or maximum length in characters of the character string type.
>> — The catalog name, schema name, and character set name of the character
>> set of the character string type.
>> — The catalog name, schema name, and collation name of the collation of
>> the character string type.
>
> We already support multiple charsets, and are able to do conversions
> between them. The set of charsets is hardcoded and it's hard to make a
> case that a user needs to create new ones. I concur with Martijn's
> suggestion -- there's no need for this to appear in a system catalog.
>
> Perhaps it could be argued that we need to be able to specify the
> charset a given string is in -- currently all strings are in the server
> encoding (charset) which is fixed at initdb time. Making the system
> support multiple server encodings would be a major undertaking in itself
> and I'm not sure that there's a point.
>

Background:
We specify the encoding in the initdb phase. The ANSI standard specifies
repertoire, charset, encoding, and collation. If I understand it correctly, a
charset is a subset of a repertoire and specifies the list of characters
allowed for a language's collation. An encoding is a mapping of a character
set to a binary format. For example, for the Czech alphabet (charset) there
are six different 8-bit encodings, but on the other side, UTF-8 covers
multiple charsets.
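The charset-versus-encoding distinction can be seen directly with a few lines of Python (a sketch; ISO 8859-2 is just one of the several 8-bit encodings that cover the Czech charset):

```python
# One Czech string (one charset: Czech letters) mapped to bytes
# by two different encodings.
text = "příliš žluťoučký"

utf8 = text.encode("utf-8")         # multi-byte encoding covering all of Unicode
latin2 = text.encode("iso-8859-2")  # 8-bit encoding covering the Czech charset

# The character sequence is identical; only the byte representation differs.
assert utf8 != latin2
assert utf8.decode("utf-8") == latin2.decode("iso-8859-2") == text
```

This is why a collation conceptually belongs to the charset (the characters), not to any particular byte encoding of it.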

I think that if we support the UTF-8 encoding, it makes sense to create our
own charsets, because the system locales could have collations defined for
them. We need a conversion only in the case when the client encoding is not
compatible with the charset and a conversion is not defined.

Any comments?

Zdenek

--
Zdenek Kotala Sun Microsystems
Prague, Czech Republic http://sun.com/postgresql


From: Martijn van Oosterhout <kleptog(at)svana(dot)org>
To: Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM>
Cc: Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Radek Strnad <radek(dot)strnad(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [WIP] collation support revisited (phase 1)
Date: 2008-07-12 13:08:43
Message-ID: 20080712130843.GA12026@svana.org
Lists: pgsql-hackers

On Sat, Jul 12, 2008 at 10:02:24AM +0200, Zdenek Kotala wrote:
> Background:
> We specify encoding in initdb phase. ANSI specify repertoire, charset,
> encoding and collation. If I understand it correctly, then charset is
> subset of repertoire and specify list of allowed characters for
> language->collation. Encoding is mapping of character set to binary format.
> For example for Czech alphabet(charset) we have 6 different encoding for
> 8bit ASCII, but on other side for UTF8 there is specified multi charsets.

Oh, so you're thinking of a charset as a sort of check constraint. If
your locale is Turkish and you have a column marked charset ASCII, then
storing lower('HI') results in an error.

A collation must be defined over all possible characters; it can't
depend on the character set. That doesn't mean sorting in en_US must do
something meaningful with Japanese characters; it does mean it can't
throw an error (the usual procedure is to sort by Unicode code point).

> I think if we support UTF8 encoding, than it make sense to create own
> charsets, because system locales could have defined collation for that. We
> need conversion only in case when client encoding is not compatible with
> charset and conversion is not defined.

The problem is that locales in POSIX are defined on an encoding, not a
charset. The locale en_US.UTF-8 doesn't actually sort any differently
from en_US.latin1; it's just that Japanese characters are not
representable in the latter.

locale-gen can create a locale for any (locale code, encoding) pair;
whether the result is meaningful is another question.

Have a nice day,
--
Martijn van Oosterhout <kleptog(at)svana(dot)org> http://svana.org/kleptog/
> Please line up in a tree and maintain the heap invariant while
> boarding. Thank you for flying nlogn airlines.


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM>
Cc: Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Martijn van Oosterhout <kleptog(at)svana(dot)org>, Radek Strnad <radek(dot)strnad(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [WIP] collation support revisited (phase 1)
Date: 2008-07-12 15:17:03
Message-ID: 13979.1215875823@sss.pgh.pa.us
Lists: pgsql-hackers

Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM> writes:
> I think if we support UTF8 encoding, than it make sense to create own charsets,
> because system locales could have defined collation for that.

Say what? I cannot imagine a scenario in which a user-defined encoding
would be useful. The amount of infrastructure you need for a new
encoding is so large that providing management commands is just silly
--- anyone who can create the infrastructure can do the last little bit
for themselves. The analogy to index access methods is on point, again.

regards, tom lane


From: "Radek Strnad" <radek(dot)strnad(at)gmail(dot)com>
To: pgsql-hackers(at)postgresql(dot)org, "Radek Strnad" <radek(dot)strnad(at)gmail(dot)com>
Subject: Re: [WIP] collation support revisited (phase 1)
Date: 2008-07-21 01:15:56
Message-ID: de5165440807201815m758283eejcd1fb5533783f584@mail.gmail.com
Lists: pgsql-hackers

I was trying to sort out how to avoid creating a new catalog for character
sets, and I came up with the following ideas. Correct me if my ideas are
wrong.

Since a collation has to have a defined character set, I'm suggesting that we
use the already-written encoding infrastructure and the list of encodings in
chklocale.c. Currently, databases are created not with a specified character
set but with a specified encoding. I think that instead of pointing a record
in the collation catalog to another record in a character-set catalog, we
might use only the name (a string) of the encoding.

So each collation will be defined over the set of encodings in chklocale.c.
Each database will be able to use only collations created over the same
("compatible") encodings, according to encoding_match_list. Each standard
(SQL standard) collation will be defined over all possible encodings
(hard-coded).
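The compatibility rule being proposed can be sketched in Python (the table and function names below are hypothetical illustrations, not the actual contents of chklocale.c):

```python
# Hypothetical sketch: each collation carries the name of the encoding
# it was created over; a database may use it only if the database
# encoding is the same or listed as compatible.
COMPATIBLE = {                     # illustrative stand-in for encoding_match_list
    "UTF8": {"UTF8"},
    "LATIN2": {"LATIN2", "WIN1250"},
}

def collation_usable(collation_encoding: str, db_encoding: str) -> bool:
    """True if a collation defined over collation_encoding can be used
    in a database created with db_encoding."""
    return db_encoding in COMPATIBLE.get(collation_encoding, {collation_encoding})

assert collation_usable("LATIN2", "WIN1250")      # compatible 8-bit encodings
assert not collation_usable("UTF8", "LATIN2")     # incompatible encodings
```

A hard-coded SQL standard collation would simply be marked usable with every encoding, bypassing this check.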

Comments?

Regards

Radek Strnad

On Sat, Jul 12, 2008 at 5:17 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM> writes:
> > I think if we support UTF8 encoding, than it make sense to create own
> charsets,
> > because system locales could have defined collation for that.
>
> Say what? I cannot imagine a scenario in which a user-defined encoding
> would be useful. The amount of infrastructure you need for a new
> encoding is so large that providing management commands is just silly
> --- anyone who can create the infrastructure can do the last little bit
> for themselves. The analogy to index access methods is on point, again.
>
> regards, tom lane
>


From: Martijn van Oosterhout <kleptog(at)svana(dot)org>
To: Radek Strnad <radek(dot)strnad(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: [WIP] collation support revisited (phase 1)
Date: 2008-07-22 06:33:21
Message-ID: 20080722063321.GA32691@svana.org
Lists: pgsql-hackers

On Mon, Jul 21, 2008 at 03:15:56AM +0200, Radek Strnad wrote:
> I was trying to sort out the problem with not creating new catalog for
> character sets and I came up following ideas. Correct me if my ideas are
> wrong.
>
> Since collation has to have a defined character set.

Not really. AIUI, at least glibc and ICU define a collation over all
possible characters (i.e. Unicode). When you create a locale you take a
subset and use that. Think about it: if you want to sort strings and
one of them happens to contain a Chinese character, it can't *fail*.
Note that strcoll() has no error return for unknown characters.

> I'm suggesting to use
> already written infrastructure of encodings and to use list of encodings in
> chklocale.c. Currently databases are not created with specified character
> set but with specified encoding. I think instead of pointing a record in
> collation catalog to another record in character set catalog we might use
> only name (string) of the encoding.

That's reasonable. From an abstract point of view, collations and
encodings are orthogonal; it's only when you're using POSIX locales
that there are limitations on how you combine them. I think you can
assume a collation can handle any characters that can be produced by
the encoding.

Have a nice day,
--
Martijn van Oosterhout <kleptog(at)svana(dot)org> http://svana.org/kleptog/
> Please line up in a tree and maintain the heap invariant while
> boarding. Thank you for flying nlogn airlines.


From: Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM>
To: Martijn van Oosterhout <kleptog(at)svana(dot)org>
Cc: Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Radek Strnad <radek(dot)strnad(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [WIP] collation support revisited (phase 1)
Date: 2008-07-22 14:32:39
Message-ID: 4885EF87.4020608@sun.com
Lists: pgsql-hackers

Martijn van Oosterhout wrote:
> On Sat, Jul 12, 2008 at 10:02:24AM +0200, Zdenek Kotala wrote:
>> Background:
>> We specify encoding in initdb phase. ANSI specify repertoire, charset,
>> encoding and collation. If I understand it correctly, then charset is
>> subset of repertoire and specify list of allowed characters for
>> language->collation. Encoding is mapping of character set to binary format.
>> For example for Czech alphabet(charset) we have 6 different encoding for
>> 8bit ASCII, but on other side for UTF8 there is specified multi charsets.
>
> Oh, so you're thinking of a charset as a sort of check constraint. If
> your locale is turkish and you have a column marked charset ASCII then
> storing lower('HI') results in an error.

Yeah, if you use the strcoll function, it fails when an illegal character is
found. See
http://www.opengroup.org/onlinepubs/009695399/functions/strcoll.html

> A collation must be defined over all possible characters, it can't
> depend on the character set. That doesn't mean sorting in en_US must do
> something meaningful with japanese characters, it does mean it can't
> throw an error (the usual procedure is to sort on unicode point).

A collation cannot be defined over arbitrary characters. There is no relation
between Latin and Chinese characters. A collation makes sense when you are
able to specify the < = > operators.

If you need to compare Japanese and Latin characters, then ANSI specifies a
default collation for each repertoire. I think it is usually bitwise
comparison.

Zdenek

--
Zdenek Kotala Sun Microsystems
Prague, Czech Republic http://sun.com/postgresql


From: Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Martijn van Oosterhout <kleptog(at)svana(dot)org>, Radek Strnad <radek(dot)strnad(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [WIP] collation support revisited (phase 1)
Date: 2008-07-22 16:36:13
Message-ID: 48860C7D.7030304@sun.com
Lists: pgsql-hackers

Tom Lane wrote:
> Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM> writes:
>> I think if we support UTF8 encoding, than it make sense to create own charsets,
>> because system locales could have defined collation for that.
>
> Say what? I cannot imagine a scenario in which a user-defined encoding
> would be useful.

I did not mean a user-defined encoding but a user-defined charset. For
example, the cs_CZ.UTF-8@euro locale uses the UTF-8 encoding, and its
collation is defined on the Czech charset, which specifies the list of allowed
characters. If somebody has installed e.g. a Turkish locale, then he will also
want a Turkish charset in Postgres. I guess a charset also defines how
upper/lower case will work (see i/I in Turkish).
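The Turkish i/I problem is easy to demonstrate in Python (a sketch; str.lower() here is locale-independent, which is exactly the issue):

```python
# In most Latin-script languages lower('I') == 'i', but Turkish treats
# dotted and dotless i as distinct letters: Turkish lower('I') is 'ı'.
ascii_lower = "I".lower()          # locale-independent rule: always 'i'
assert ascii_lower == "i"

# What a Turkish-aware case mapping is expected to produce instead:
turkish_expected = "\u0131"        # 'ı' LATIN SMALL LETTER DOTLESS I
assert ascii_lower != turkish_expected
```

So a locale-aware upper/lower implementation needs to know which language's charset rules apply, not just which encoding the bytes are in.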

Please correct me if I am wrong.

thanks Zdenek

--
Zdenek Kotala Sun Microsystems
Prague, Czech Republic http://sun.com/postgresql


From: Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM>
To: Martijn van Oosterhout <kleptog(at)svana(dot)org>
Cc: Radek Strnad <radek(dot)strnad(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [WIP] collation support revisited (phase 1)
Date: 2008-07-22 17:03:26
Message-ID: 488612DE.5060206@sun.com
Lists: pgsql-hackers

Martijn van Oosterhout wrote:
> On Mon, Jul 21, 2008 at 03:15:56AM +0200, Radek Strnad wrote:
>> I was trying to sort out the problem with not creating new catalog for
>> character sets and I came up following ideas. Correct me if my ideas are
>> wrong.
>>
>> Since collation has to have a defined character set.
>
> Not really. AIUI at least glibc and ICU define a collation over all
> possible characters (ie unicode). When you create a locale you take a
> subset and use that. Think about it: if you want to sort strings and
> one of them happens to contain a chinese charater, it can't *fail*.
> Note strcoll() has no error return for unknown characters.

It has.
See http://www.opengroup.org/onlinepubs/009695399/functions/strcoll.html

The strcoll() function may fail if:

[EINVAL]
[CX] The s1 or s2 arguments contain characters outside the domain of
the collating sequence.

>> I'm suggesting to use
>> already written infrastructure of encodings and to use list of encodings in
>> chklocale.c. Currently databases are not created with specified character
>> set but with specified encoding. I think instead of pointing a record in
>> collation catalog to another record in character set catalog we might use
>> only name (string) of the encoding.
>
> That's reasonable. From an abstract point of view collations and
> encodings are orthoginal, it's only when you're using POSIX locales
> that there are limitations on how you combine them. I think you can
> assume a collation can handle any characters that can be produced by
> encoding.

I think you are not correct. You cannot use one collation over all of
Unicode. See http://www.unicode.org/reports/tr10/#Common_Misperceptions. The
same characters can be ordered differently in different languages.
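The point that the same characters sort differently per language can be shown with a toy example (the sort keys below are hand-made illustrations, not real locale data): German sorts 'ä' with 'a', while Swedish sorts it after 'z'.

```python
# Toy per-language sort keys. Real locales encode such rules in
# collation tables; these two functions just mimic the effect.
def german_key(s):
    # German dictionary order treats 'ä' like 'a'.
    return [c.replace("ä", "a") for c in s]

def swedish_key(s):
    # Swedish places 'ä' after 'z'; force it past 'z' in the key.
    return [("z" + c) if c == "ä" else c for c in s]

words = ["äpple", "zon", "apfel"]
assert sorted(words, key=german_key) == ["apfel", "äpple", "zon"]
assert sorted(words, key=swedish_key) == ["apfel", "zon", "äpple"]
```

One fixed order over all of Unicode therefore cannot be correct for every language at once.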

Zdenek

--
Zdenek Kotala Sun Microsystems
Prague, Czech Republic http://sun.com/postgresql


From: Martijn van Oosterhout <kleptog(at)svana(dot)org>
To: Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM>
Cc: Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Radek Strnad <radek(dot)strnad(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [WIP] collation support revisited (phase 1)
Date: 2008-07-24 06:32:35
Message-ID: 20080724063235.GA3152@svana.org
Lists: pgsql-hackers

On Tue, Jul 22, 2008 at 04:32:39PM +0200, Zdenek Kotala wrote:
> >Oh, so you're thinking of a charset as a sort of check constraint. If
> >your locale is turkish and you have a column marked charset ASCII then
> >storing lower('HI') results in an error.
>
> Yeah, if you use strcoll function it fails when illegal character is found.
> See
> http://www.opengroup.org/onlinepubs/009695399/functions/strcoll.html

Weird; at least in glibc and ICU it can't happen, but perhaps there are
other implementations where it can...

> Collation cannot be defined on any character. There is not any relation
> between
> Latin and Chines characters. Collation has sense when you are able to
> specify < = > operators.

There is no standardised relation. However, if your collation library
decides to define all Chinese characters after all Latin characters, it
will have defined a collation that works for all strings containing any
characters... which is basically the approach glibc/ICU takes.

I think the standard is kind of pathetic to say that strcoll can set
errno but has no return value to indicate an error. I wonder how many
platforms actually use that "feature".

Have a nice day,
--
Martijn van Oosterhout <kleptog(at)svana(dot)org> http://svana.org/kleptog/
> Please line up in a tree and maintain the heap invariant while
> boarding. Thank you for flying nlogn airlines.