Proposal - Collation at database level

From: Radek Strnad <radek(dot)strnad(at)gmail(dot)com>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Proposal - Collation at database level
Date: 2008-05-28 00:22:39
Message-ID: 1211934159.7362.14.camel@random
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hello,

I'm working on implementation of collation at database level using
system locales as a Google Summer of Code 2008 project. You can read my
proposal on the wiki page -
http://wiki.postgresql.org/wiki/Gsoc08-collation . I'm building this
over Alexey Slynko's patch sent two years ago
(http://www.activebait.net/msg00019.html). Currently I'm in stage of
creating catalogs for collations, repertoires, encodings and charsets.
Because of every single system is containing different locales we have
to guarantee at least those specified in SQL standard. SQL 2003 standard
can be downloaded at http://www.wiscorp.com/sql_2003_standard.zip (~18
MB). File 5wd-02-foundation-2003-09.pdf is specifying the foundation of
collations. One or more collations must be specified out of these:

— SQL_CHARACTER is an implementation-defined collation. It is applicable
to the SQL_CHARACTER
character repertoire.
— GRAPHIC_IRV is a collation in which the ordering is determined by
treating the code points defined by
ISO 646:1991 as unsigned integers. It is applicable to the GRAPHIC_IRV
character repertoire.
— LATIN1 is a collation in which the ordering is determined by treating
the code points defined by ISO 8859-
1 as unsigned integers. It is applicable to the LATIN1 character
repertoire.
— ISO8BIT is a collation in which the ordering is determined by treating
the code points defined by ISO
8859-1 as unsigned integers. When restricted to the LATIN1 characters,
it produces the same collation as
LATIN1. It is applicable to the ISO8BIT character repertoire.
— UCS_BASIC is a collation in which the ordering is determined entirely
by the Unicode scalar values of
the characters in the strings being sorted. It is applicable to the
UCS character repertoire. Since every
character repertoire is a subset of the UCS repertoire, the UCS_BASIC
collation is potentially applicable
to every character set.
NOTE 11 — The Unicode scalar value of a character is its code point
treated as an unsigned integer.
— UNICODE is the collation in which the ordering is determined by
applying the Unicode Collation Algorithm
with the Default Unicode Collation Element Table, as specified in
[Unicode10]. It is applicable to the UCS
character repertoire. Since every character repertoire is a subset of
the UCS repertoire, the UNICODE
collation is potentially applicable to every character set.
— SQL_TEXT is an implementation-defined collation. It is applicable to
the SQL_TEXT character repertoire.
— SQL_IDENTIFIER is an implementation-defined collation. It is
applicable to the SQL_IDENTIFIER
character repertoire.

I'm thinking of dividing the problem into two parts - in beginning
pg_collation will contain two functions. One will have hard-coded rules
for these basic collations (SQL_CHARACTER, GRAPHIC_IRV, LATIN1, ISO8BIT,
UCS_BASIC). It will compare each string character bitwise and guarantee
that the implementation will meet the SQL standard implemented in
PostgreSQL.

Second one will allow the user to use installed system locales. The set
of these collations will obviously vary between systems. Catalogs will
contain encoding and collation for calling the system locale function.
This will allow us to use collations such as en_US.utf8, cs_CZ.iso88592
etc. if they will be availible.

We will also need to change the way how strings are compared. Regarding
the set database collation the right function will be used.
http://doxygen.postgresql.org/varlena_8c.html#4c7af81f110f9be0bd8eb2bd99525675

This design will make possible switch to ICU or any other implementation
quite simple and will not cause any major rewriting of what I'm coding
right now.

Catalogs specification with SQL 2003 standard SQL commands for creating
tables follows:

=============================
pg_repertoires
=============================
CREATE TABLE CHARACTER_REPERTOIRES (
CHARACTER_REPERTOIRE_NAME INFORMATION_SCHEMA.SQL_IDENTIFIER
CONSTRAINT CHARACTER_REPERTOIRE_NAME_NOT_NULL
NOT NULL,
DEFAULT_COLLATION_CATALOG INFORMATION_SCHEMA.SQL_IDENTIFIER
CONSTRAINT
CHARACTER_REPERTOIRES_DEFAULT_COLLATION_CATALOG_NOT_NULL
NOT NULL,
DEFAULT_COLLATION_SCHEMA INFORMATION_SCHEMA.SQL_IDENTIFIER
CONSTRAINT CHARACTER_REPERTOIRES_DEFAULT_COLLATION_SCHEMA_NOT_NULL
NOT NULL,
DEFAULT_COLLATION_NAME INFORMATION_SCHEMA.SQL_IDENTIFIER
CONSTRAINT CHARACTER_REPERTOIRES_DEFAULT_COLLATION_NAME_NOT_NULL
NOT NULL,
CONSTRAINT CHARACTER_REPERTOIRES_PRIMARY_KEY
PRIMARY KEY ( CHARACTER_REPERTOIRE_NAME ),
CONSTRAINT CHARACTER_REPERTOIRES_FOREIGN_KEY_COLLATIONS
FOREIGN KEY ( DEFAULT_COLLATION_CATALOG, DEFAULT_COLLATION_SCHEMA,
DEFAULT_COLLATION_NAME )
REFERENCES COLLATIONS
)

CATALOG(pg_repertoires, ###)
{
NameData repname; /* repertoire name */
Oid repdefcolloid; /* default collation catalog */
Oid repdefcolschema; /* default collation schema */
NameData repsysname; /* used repertoire - system or hard-coded */

} FormData_pg_repertoires;

=============================
pg_collation
=============================
CREATE TABLE COLLATIONS (
COLLATION_CATALOG INFORMATION_SCHEMA.SQL_IDENTIFIER,
COLLATION_SCHEMA INFORMATION_SCHEMA.SQL_IDENTIFIER,
COLLATION_NAME INFORMATION_SCHEMA.SQL_IDENTIFIER,
PAD_ATTRIBUTE INFORMATION_SCHEMA.CHARACTER_DATA
CONSTRAINT COLLATIONS_PAD_ATTRIBUTE_CHECK
CHECK ( PAD_ATTRIBUTE IN
( 'NO PAD', 'PAD SPACE' ) ),
COLLATION_TYPE INFORMATION_SCHEMA.SQL_IDENTIFIER,
COLLATION_DEFINITION INFORMATION_SCHEMA.CHARACTER_DATA,
COLLATION_DICTIONARY INFORMATION_SCHEMA.CHARACTER_DATA,
CHARACTER_REPERTOIRE_NAME INFORMATION_SCHEMA.SQL_IDENTIFIER
CONSTRAINT CHARACTER_REPERTOIRE_NAME_NOT_NULL
NOT NULL,
CONSTRAINT COLLATIONS_PRIMARY_KEY
PRIMARY KEY ( COLLATION_CATALOG, COLLATION_SCHEMA,
COLLATION_NAME ),
CONSTRAINT COLLATIONS_FOREIGN_KEY_SCHEMATA
FOREIGN KEY ( COLLATION_CATALOG, COLLATION_SCHEMA )
REFERENCES SCHEMATA
)

CATALOG(pg_collations, ###)
{
NameData colname; /* collation name */
Oid colschema; /* collation schema */
bool colpadattribute; /* pad attribute */
bool colcasesensitive; /* case sensitive */
bool colaccent; /* accent sensitive */
regproc colfunc; /* used collation function */
Oid colrepertoire; /* collation repertoire */

} FormData_pg_collations;

- COLLATION_TYPE, COLLATION_DEFINITION, COLLATION_DICTIONARY are by NULL
by standard. Will be created by view

=============================
pg_charset
=============================
CREATE TABLE CHARACTER_SETS (
CHARACTER_SET_CATALOG INFORMATION_SCHEMA.SQL_IDENTIFIER,
CHARACTER_SET_SCHEMA INFORMATION_SCHEMA.SQL_IDENTIFIER,
CHARACTER_SET_NAME INFORMATION_SCHEMA.SQL_IDENTIFIER,
CHARACTER_REPERTOIRE INFORMATION_SCHEMA.SQL_IDENTIFIER,
FORM_OF_USE INFORMATION_SCHEMA.SQL_IDENTIFIER,
NUMBER_OF_CHARACTERS INFORMATION_SCHEMA.CARDINAL_NUMBER,
DEFAULT_COLLATE_CATALOG INFORMATION_SCHEMA.SQL_IDENTIFIER
CONSTRAINT CHARACTER_SETS_DEFAULT_COLLATE_CATALOG_NOT_NULL
NOT NULL,
DEFAULT_COLLATE_SCHEMA INFORMATION_SCHEMA.SQL_IDENTIFIER
CONSTRAINT CHARACTER_SETS_DEFAULT_COLLATE_SCHEMA_NOT_NULL
NOT NULL,
DEFAULT_COLLATE_NAME INFORMATION_SCHEMA.SQL_IDENTIFIER
CONSTRAINT CHARACTER_SETS_DEFAULT_COLLATE_NAME_NOT_NULL
NOT NULL,
CONSTRAINT CHARACTER_SETS_PRIMARY_KEY
PRIMARY KEY ( CHARACTER_SET_CATALOG, CHARACTER_SET_SCHEMA,
CHARACTER_SET_NAME ),
CONSTRAINT CHARACTER_SETS_FOREIGN_KEY_SCHEMATA
FOREIGN KEY ( CHARACTER_SET_CATALOG, CHARACTER_SET_SCHEMA )
REFERENCES SCHEMATA,
CONSTRAINT CHARACTER_SETS_FOREIGN_KEY_CHARACTER_ENCODING_FORMS
FOREIGN KEY (FORM_OF_USE, CHARACTER_REPERTOIRE )
REFERENCES CHARACTER_ENCODING_FORMS,
CONSTRAINT CHARACTER_SETS_CHECK_REFERENCES_COLLATIONS
CHECK ( DEFAULT_COLLATE_CATALOG NOT IN
( SELECT CATALOG_NAME FROM SCHEMATA )schema
OR
( DEFAULT_COLLATE_CATALOG, DEFAULT_COLLATE_SCHEMA,
DEFAULT_COLLATE_NAME ) IN
( SELECT COLLATION_CATALOG, COLLATION_SCHEMA,
COLLATION_NAME
FROM COLLATIONS ) )
)

CATALOG(pg_charset, ###)
{
NameData chaname; /* character set name */
Oid chaschema; /* character set schema */
Oid charepertoire; /* repertoire oid */
NameData chaformofuse; /* character encoding form */
int4 chanumofcharacters; /* number of characters */
Oid chadefcollcatalog; /* default collate catalog */
Oid chadefcollschema; /* default collate schema */
NameData chadefcollname; /* default collate name */

} FormData_pg_charset;

=============================
pg_encoding
=============================
CREATE TABLE CHARACTER_ENCODING_FORMS (
CHARACTER_REPERTOIRE_NAME INFORMATION_SCHEMA.SQL_IDENTIFIER
CONSTRAINT
CHARACTER_ENCODING_FORMS_CHARACTER_REPERTOIRE_NAME_NOT_NULL
NOT NULL,
CHARACTER_ENCODING_FORM_NAME INFORMATION_SCHEMA.SQL_IDENTIFIER
CONSTRAINT
CHARACTER_ENCODING_FORMS_CHARACTER_ENCODING_FORM_NAME_NOT_NULL
NOT NULL,
CONSTRAINT CHARACTER_ENCODING_FORMS_PRIMARY_KEY
PRIMARY KEY ( CHARACTER_ENCODING_FORM_NAME,
CHARACTER_REPERTOIRE_NAME ),
CONSTRAINT
CHARACTER_ENCODING_FORMS_FOREIGN_KEY_CHARACTER_REPERTOIRES
FOREIGN KEY ( CHARACTER_REPERTOIRE_NAME )
REFERENCES CHARACTER_REPERTOIRES
)

CATALOG(pg_encoding, ###)
{
NameData encname; /* encoding name */
NameData encsystemencoding; /* system or built-in encoding */

} FormData_pg_encoding;

Regarding the Alexey Slynko's patch mentioned earlier pg_database has
been also extended with lc_collate and lc_ctype records that currently
sets lc_collate and lc_ctype per each database.

Please let me know if my idea is right or needs some adjusts. Thank you

Regards

Radek Strnad

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tommy Gildseth 2008-05-28 08:23:10 Add dblink function to check if a named connection exists
Previous Message Tom Lane 2008-05-27 23:32:45 Re: Hint Bits and Write I/O