PGDay.it collation discussion notes

From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: Postgres <pgsql-hackers(at)postgresql(dot)org>
Subject: PGDay.it collation discussion notes
Date: 2008-10-18 06:41:12
Message-ID: 87abd21k3b.fsf@oxford.xeocode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers


Radek, Zdenek, and Heikki had an extended discussion here at PGDay.it on
collation support. I was volunteered to be the note-taker (!). Here is the
plan we came up with:

Firstly, the ANSI standard makes collations into schema-qualified names which
pretty much forces us to have the collation catalog table.

The ANSI standard syntax where the COLLATION keyword can follow just about any
string value in an expression and then bubbles up the expression until an
operation needs to pick a collation seemed very weird to us. Hooking that into
the parser was pretty intimidating but we thought we could postpone that till
later and solve the meat of the problem first.

Instead, at least for now, we'll have a guc variable which decides the current
collation. Eventually that would just be the default if the query (or column
or whatever) doesn't override it with an explicitly selected collation. (This
would mean we need to treat the default collation like search_path for things
like plpgsql cached plans etc. Perhaps we could have a flag to indicate
whether plans actually need to be invalidated or are collation-independent)

In a query the current collation would be grabbed by the planner when it's
deciding on operators. The planner would grab that collation oid and track it
in the sort key along with the nulls first/last and asc/desc properties.

It's important it do this immediately in the planner rather than just
reference the guc later because eventually it needs to possible to override it
at this point and also so that if you change the guc during the query
execution (ie, in a function) it would not affect the execution of your
current query.

pg_index would need an additional column to go along with indkey and
indoptions. It would be an oidvector of collation oids.

When planning a query or building an index we need to determine whether the
opclass is collation-aware or not. It seemed best that we not hard code which
data types are collation aware and instead look at either the operator or the
pg_proc entry for the function implementing the sort operator or perhaps the
btproc for the opclass to see if it takes a third argument. Or perhaps we
would have a boolean column in pg_opclass which asserts they do and that it
should be the collation. If so then the collation is stored in the sort key
and the indcollation slots and is passed to the operator or the btproc
function as the third argument.

We think even without the parser changes this would be useful enough to take
on its own. It would allow having indexes built in different collations and
have different sessions use different collations. You could even hack queries
which combine collations by defining a function handle subparts of the query
and attach the guc the same way we attach the search_path to security definer
functions.

To finish the feature the catalog needs to add a default collation to every
column and some other database objects. Then the parser needs to check those
objects first before falling back to the session variable. Then it needs to
bubble that value up as it builds expressions so that it's available at the
comparison operator or sort node. We didn't read the part of the spec which
covered the rules for this but it does have rules which should turn up if you
search for "collation derivation".

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's PostGIS support!

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Simon Riggs 2008-10-18 08:11:02 Re: Hot Standby: First integrated patch
Previous Message Eric Haszlakiewicz 2008-10-18 05:21:49 two servers on the same port