For review: Initial support for COLLATE

From: Martijn van Oosterhout <kleptog(at)svana(dot)org>
To: pgsql-patches(at)postgresql(dot)org
Subject: For review: Initial support for COLLATE
Date: 2005-09-07 16:11:17
Message-ID: 20050907161112.GA10273@svana.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-patches

[Please CC any replies, thanks]

This patch is the beginnings of support for COLLATE. I need to do some
other work for a few days so I'm posting here to get some initial
reviews. Various parts are marked [done] and [not done].

The steps involved are:

- Add COLLATE to grammer as part of expression tree. New CollateClause
node stores the relevent facts. [done]

- parse_expr goes through tree and determines the appropriate COLLATE
state for each node as per SQL spec. [done]

Note: PostgreSQL doesn't really have a way to identify text-like
types and there's no real need to anyway. The implementation allows
any node to have a collate state. This can be used for other things,
see below.

- CREATE COLLATE statement [not done, currently added on the fly]

- Two new datatypes, pg_locale_t and pg_localedata_t. The former
represents a locale (and will eventually have an OID). The latter is
an anonymous cookie that stores locale specific information and is
passed to functions that need it. It may be that once done, this
latter type will vanish again. [done]

- Several utility functions in the new file pg_xlocale.c for use by the
rest of the system. [done]

- Add boolean column 'proislocalized' to pg_proc which indicates if the
output of this function is affected by the LOCALE. eg textcat doesn't
care but textle does. [done] This is for:

a) So the parser can complain about functions that look at the
locale/collate order but it's not clear from the arguments (state
None per SQL spec) and it's not specified explicitly. [not done]

b) So when a column or function is indexed, the index code knows if
the locale is relevent to sorting order. This is particularly
interesting for btree indexes on character strings. [not done]

Currently I've marked 58 functions as being locale sensetive, but
that list will need careful going over. To some extent it can be
checked automatically by examining which backend functions use the
new PG_GETLOCALE() macro.

- Check for correct encoding in loaded locales [not done]

- Check the partial indexes do the right thing when matching
expressions. [not done]

- Docs, regression, etc...

- make check: right now I'm getting some regressions in the rules
and plpgsql, very odd... Possibly due to the fact that rules get
collate nodes with locales that don't persist across invokations.

Goals:

- This setup extends the SQL spec a bit, in the sense that COLLATE can
be attached to anything. It is my intention to allow functions such
as to_char() and to_timestamp() to be localized. eg:

test=# select cash_out('1.00'::money collate 'nl_NL'), cash_out('1.00'::money collate 'en_AU');
cash_out | cash_out
----------+----------
EUR1,00 | $1.00
(1 row)

- Should LOCALE be created as a synonym for COLLATE? It reads more
naturally.

- Currently LC_COLLATE is fixed at initdb and LC_NUMERIC and
LC_CURRENCY can be altered. The idea is that eventually even
LC_COLLATE can be altered anytime (from the users point view anyway),
as any objects that care will store the collate order they want and
can execute functions as appropriate.

- Eventually once the transition to full locale support is complete,
change the backend so it always runs under locale C and only
functions on userdata are affected by locales. This means that
unquoted identifiers when converted to lowercase will be lowered as
per ASCII rules. Judging by [1] this is what people want, but
feedback would be nice.

[1] http://www.mail-archive.com/pgsql-hackers(at)postgresql(dot)org/msg39742.html

Hence, right now, names compare using normal strcmp, but text,
varchar, etc use strcoll.

- The patch as it currently stands won't compile at all without system
xlocale support. The goal is to provide some level of backward
compatability where the COLLATE clause can be used, but only to
affect functions like to_char(). Changing COLLATE order would be
forbidden (ie just like now).

- Eventually, once the parts relevent to locales are sufficiently
abstracted, look into ICU to plug it in. Unfortunatly, it's a
completely different model (all utf-16) so that's another phase
altogether.

- The default type output functions should never be locale specific.
This is to avoid issues with pgdump and frontends. Create a function
'localize(anyelement)' to "do the obvious" to force it to happen.

Download:
Compressed 57K, uncompressed >500K but that's due to rewriting the
whole of pg_proc. The important code is not so big. Against todays
CVS.

http://svana.org/kleptog/pgsql/collate2.patch.gz

Examples:

test=# SELECT text('a') < text('B') COLLATE 'C', text('a') < text('B') COLLATE 'en_US.UTF-8';
?column? | ?column?
----------+----------
f | t
(1 row)

test=# SELECT text('A') < text('b') COLLATE 'C', text('A') < text('b') COLLATE 'en_US.UTF-8';
?column? | ?column?
----------+----------
t | t
(1 row)

test=# SELECT text('A') COLLATE 'en_US.UTF-8' < text('b') COLLATE 'C';
ERROR: Conflicting COLLATE clauses

Have a nice day,
--
Martijn van Oosterhout <kleptog(at)svana(dot)org> http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.

Responses

Browse pgsql-patches by date

  From Date Subject
Next Message Simon Riggs 2005-09-07 17:08:24 Re: statement logging / extended query protocol issues
Previous Message Oliver Jowett 2005-09-06 07:47:28 Re: statement logging / extended query protocol issues