PGDay.it collation discussion notes

Lists: pgsql-hackers
From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: Postgres <pgsql-hackers(at)postgresql(dot)org>
Subject: PGDay.it collation discussion notes
Date: 2008-10-18 06:41:12
Message-ID: 87abd21k3b.fsf@oxford.xeocode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


Radek, Zdenek, and Heikki had an extended discussion here at PGDay.it on
collation support. I was volunteered to be the note-taker (!). Here is the
plan we came up with:

Firstly, the ANSI standard makes collations into schema-qualified names which
pretty much forces us to have the collation catalog table.

The ANSI standard syntax where the COLLATION keyword can follow just about any
string value in an expression and then bubbles up the expression until an
operation needs to pick a collation seemed very weird to us. Hooking that into
the parser was pretty intimidating but we thought we could postpone that till
later and solve the meat of the problem first.

Instead, at least for now, we'll have a guc variable which decides the current
collation. Eventually that would just be the default if the query (or column
or whatever) doesn't override it with an explicitly selected collation. (This
would mean we need to treat the default collation like search_path for things
like plpgsql cached plans etc. Perhaps we could have a flag to indicate
whether plans actually need to be invalidated or are collation-independent)

In a query the current collation would be grabbed by the planner when it's
deciding on operators. The planner would grab that collation oid and track it
in the sort key along with the nulls first/last and asc/desc properties.

It's important it do this immediately in the planner rather than just
reference the guc later because eventually it needs to possible to override it
at this point and also so that if you change the guc during the query
execution (ie, in a function) it would not affect the execution of your
current query.

pg_index would need an additional column to go along with indkey and
indoptions. It would be an oidvector of collation oids.

When planning a query or building an index we need to determine whether the
opclass is collation-aware or not. It seemed best that we not hard code which
data types are collation aware and instead look at either the operator or the
pg_proc entry for the function implementing the sort operator or perhaps the
btproc for the opclass to see if it takes a third argument. Or perhaps we
would have a boolean column in pg_opclass which asserts they do and that it
should be the collation. If so then the collation is stored in the sort key
and the indcollation slots and is passed to the operator or the btproc
function as the third argument.

We think even without the parser changes this would be useful enough to take
on its own. It would allow having indexes built in different collations and
have different sessions use different collations. You could even hack queries
which combine collations by defining a function handle subparts of the query
and attach the guc the same way we attach the search_path to security definer
functions.

To finish the feature the catalog needs to add a default collation to every
column and some other database objects. Then the parser needs to check those
objects first before falling back to the session variable. Then it needs to
bubble that value up as it builds expressions so that it's available at the
comparison operator or sort node. We didn't read the part of the spec which
covered the rules for this but it does have rules which should turn up if you
search for "collation derivation".

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's PostGIS support!


From: Martijn van Oosterhout <kleptog(at)svana(dot)org>
To: Gregory Stark <stark(at)enterprisedb(dot)com>
Cc: Postgres <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: PGDay.it collation discussion notes
Date: 2008-10-18 10:51:20
Message-ID: 20081018105120.GA1320@svana.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sat, Oct 18, 2008 at 07:41:12AM +0100, Gregory Stark wrote:
> The ANSI standard syntax where the COLLATION keyword can follow just about any
> string value in an expression and then bubbles up the expression until an
> operation needs to pick a collation seemed very weird to us. Hooking that into
> the parser was pretty intimidating but we thought we could postpone that till
> later and solve the meat of the problem first.

It's quite straightforward. Every expression has a collation, the COLLATE
keyword just overrides it. And the collation is a parameter of the
operators/functions that want to use it. Implementation is also
straightforward: add expr :: expr COLLATE ident to the bison code and a
collation oid to the Expr node. The code to bubble up the collation ids
would be in the patch I posted a few years back.

> pg_index would need an additional column to go along with indkey and
> indoptions. It would be an oidvector of collation oids.

Have you decided what the collation oid will refer to? If the index is
a descending index, does that affect the collation oid?

> When planning a query or building an index we need to determine whether the
> opclass is collation-aware or not. It seemed best that we not hard code which
> data types are collation aware and instead look at either the operator or the
> pg_proc entry for the function implementing the sort operator or perhaps the
> btproc for the opclass to see if it takes a third argument. Or perhaps we
> would have a boolean column in pg_opclass which asserts they do and that it
> should be the collation. If so then the collation is stored in the sort key
> and the indcollation slots and is passed to the operator or the btproc
> function as the third argument.

The way I approached it was to just mark the opclass (perhaps opfamily
would be better now, it didn't exist at the time). Additionally you
will need to mark the individual operators/functions as to whether
they're sensetive to the collation or not. This is needed because an
Error state in the collation oid is only an error if applied to an
operator that cares. This might not be needed in the first
implementation, though.

> We think even without the parser changes this would be useful enough to take
> on its own. It would allow having indexes built in different collations and
> have different sessions use different collations. You could even hack queries
> which combine collations by defining a function handle subparts of the query
> and attach the guc the same way we attach the search_path to security definer
> functions.

Ugh. I think just implementing the parser changes will be less work.

> To finish the feature the catalog needs to add a default collation to every
> column and some other database objects. Then the parser needs to check those
> objects first before falling back to the session variable. Then it needs to
> bubble that value up as it builds expressions so that it's available at the
> comparison operator or sort node. We didn't read the part of the spec which
> covered the rules for this but it does have rules which should turn up if you
> search for "collation derivation".

Once you've added the default collation to every object, the session
variable becomes redundant. Even unknown text strings will have a
collation oid as soon as they are cast to text, since the text type
will have a default collation (just like it has a default operator
class).

Nice to see some progress being made here.

Have a nice day,
--
Martijn van Oosterhout <kleptog(at)svana(dot)org> http://svana.org/kleptog/
> Please line up in a tree and maintain the heap invariant while
> boarding. Thank you for flying nlogn airlines.


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Martijn van Oosterhout <kleptog(at)svana(dot)org>
Cc: Gregory Stark <stark(at)enterprisedb(dot)com>, Postgres <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: PGDay.it collation discussion notes
Date: 2008-10-18 15:28:41
Message-ID: 9891.1224343721@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Martijn van Oosterhout <kleptog(at)svana(dot)org> writes:
> It's quite straightforward. Every expression has a collation, the COLLATE
> keyword just overrides it. And the collation is a parameter of the
> operators/functions that want to use it. Implementation is also
> straightforward: add expr :: expr COLLATE ident to the bison code and a
> collation oid to the Expr node.

It's fairly irritating to think that a string-specific option is going
to become part of the fundamental type system --- it makes no sense to
distinguish different collations for numeric for instance (and in fact
I would want to see the parser throw an error if you tried to attach a
collation to a non-string type). As for inserting a collation OID in
every single variant of Expr, the answer is "no".

Another objection to this design is that it's completely unclear that
functions from text to text should necessarily yield the same collation
that went into them, but if you treat collation as a hard-wired part of
the expression syntax tree you aren't going to be able to do anything else.
(What will you do about functions/operators taking more than one text
argument?)

I think it would be better to treat the collation indicator as part of
string *values* and let it bubble up through expressions that way.
The "expr COLLATE ident" syntax would be a simple run-time operation
that pokes a new collation into a string value. The notion of a column
having a particular collation would then amount to a check constraint on
the values going into the column.

regards, tom lane


From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Martijn van Oosterhout <kleptog(at)svana(dot)org>, Postgres <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: PGDay.it collation discussion notes
Date: 2008-10-18 16:07:25
Message-ID: 87prlxx4xu.fsf@oxford.xeocode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:

> It's fairly irritating to think that a string-specific option is going
> to become part of the fundamental type system --- it makes no sense to
> distinguish different collations for numeric for instance

Actually I thought of that generality as an advantage. Just because we can't
think of any right now doesn't mean there aren't applications of this. The
only example I could think of was a comparison operator on numeric which
specifies a significant precision. That doesn't sound super useful but there
are a lot of data types out there and I don't see any reason to think text is
the only one in the world that will have more than one reasonable ordering.

> I think it would be better to treat the collation indicator as part of
> string *values* and let it bubble up through expressions that way.
> The "expr COLLATE ident" syntax would be a simple run-time operation
> that pokes a new collation into a string value. The notion of a column
> having a particular collation would then amount to a check constraint on
> the values going into the column.

I'm not super familiar with the spec here but from what I understood I think
this would be very different.

For instance, I think you need to be able to set the default collation on a
whole column after the fact. Rewriting the whole table to handle a collation
change seems like a non-starter.

Also, if the column doesn't have a default collation specified then you need
to use the default collation for a more general object -- I'm not sure if it's
table or schema next.

Thirdly, to handle resolving conflicting default collations you need to track
where the source of the default collation was. Ie, whether it was a default or
an explicit choice by the query.

Collation isn't really a property of the text at all. This design would force
the sources of text to pick a collation that will be used by other parts of
the application that they know nothing about. How is an DBA using COPY to
populate a table going to know what collation the web app which eventually
uses the data in that table will want to use?

The other side of the coin is that given the spec-compliant behaviour you can
always emulate the behaviour you're describing by adding another column. It
would be more useful too since you'll have a "language" column which may be
useful independently from the text content.

And of course the scheme you're describing would waste a huge amount of space
in every string on disk. For short strings it could triple the amount of space
(plus I think the explicit vs implicit collation would make it even worse).

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's RemoteDBA services!


From: Martijn van Oosterhout <kleptog(at)svana(dot)org>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Gregory Stark <stark(at)enterprisedb(dot)com>, Postgres <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: PGDay.it collation discussion notes
Date: 2008-10-18 18:00:40
Message-ID: 20081018180040.GC1320@svana.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sat, Oct 18, 2008 at 11:28:41AM -0400, Tom Lane wrote:
> It's fairly irritating to think that a string-specific option is going
> to become part of the fundamental type system --- it makes no sense to
> distinguish different collations for numeric for instance (and in fact
> I would want to see the parser throw an error if you tried to attach a
> collation to a non-string type). As for inserting a collation OID in
> every single variant of Expr, the answer is "no".

It's hardly string specific. At least geometric types have the concept
of multiple collations. Any datatype can have a set of useful
collations, I'd hate to restrict this to just strings.

> Another objection to this design is that it's completely unclear that
> functions from text to text should necessarily yield the same collation
> that went into them, but if you treat collation as a hard-wired part of
> the expression syntax tree you aren't going to be able to do anything else.
> (What will you do about functions/operators taking more than one text
> argument?)

Well, the standard has plenty of disscussion about that. If the
operator is a comparison the two collations must be coerible to the
same collation. If it's not and the result type isn't text then it
doesn't matter. And if the result is text then you mark it as ambiguous
and thus require the user to specify explicitly.

> I think it would be better to treat the collation indicator as part of
> string *values* and let it bubble up through expressions that way.

I think the storage space considerations make that a difficult option.

Have a nice day,
--
Martijn van Oosterhout <kleptog(at)svana(dot)org> http://svana.org/kleptog/
> Please line up in a tree and maintain the heap invariant while
> boarding. Thank you for flying nlogn airlines.


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Martijn van Oosterhout <kleptog(at)svana(dot)org>, Gregory Stark <stark(at)enterprisedb(dot)com>, Postgres <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: PGDay.it collation discussion notes
Date: 2008-10-20 09:28:45
Message-ID: 48FC4F4D.2040403@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Tom Lane wrote:
> Another objection to this design is that it's completely unclear that
> functions from text to text should necessarily yield the same collation
> that went into them, but if you treat collation as a hard-wired part of
> the expression syntax tree you aren't going to be able to do anything else.
> (What will you do about functions/operators taking more than one text
> argument?)

Whatever the spec says. Collation is intimately associated with the
comparison operations, and doesn't make any sense anywhere else. The way
the default collation for a given operation is determined, by bubbling
up the collation from the operands, through function calls and other
expressions, is just to make life a bit easier for the developer who's
writing the SQL. We could demand that you always explicitly specify a
collation when you use the text equality or inequality operators, but
because that would be quite tiresome, a reasonable default is derived
from the context.

I believe the spec stipulates how that default is derived, so I don't
think we need to fret over it. We'll need it eventually, but the parser
changes is not the critical part. We can start off by deriving the
collation from a GUC variable, for example.

> I think it would be better to treat the collation indicator as part of
> string *values* and let it bubble up through expressions that way.
> The "expr COLLATE ident" syntax would be a simple run-time operation
> that pokes a new collation into a string value. The notion of a column
> having a particular collation would then amount to a check constraint on
> the values going into the column.

Looking at an individual value, collation just doesn't make sense.
Collation is property of the comparison operation, not of a value.

In the parser, we might have to do something like that though, because
according to the standard you can tack the COLLATION keyword to string
constants and have it bubble up. But let's keep that ugliness just
inside the parser.

One, impractical, way to implement collation would be to have one
operator class per collation. In fact you could do that today, with no
backend changes, to support multiple collations. It's totally
impractical, because for starters you'd need different comparison
operators, with different names, for each collation. But it's the right
mental model.

I think the right approach is to invent a new concept called "operator
modifier". It's basically a 3rd argument to operators. It can be
specified explicitly when an operator is used, with syntax like "<left>
Op <right> USING <modifier>", or in case of collation, it's derived from
the context, per SQL spec. The operator modifier is tacked on to OpExprs
and SortClauses in the parser, and passed as a 3rd argument to the
function implementing the operator at execution time.

When an index is created, if the operators in the operator class take an
operator modifier, it's stored at creation time into a new column in
pg_index (needs to be a vector or array to handle multi-column indexes).
The planner needs to check the modifier when it determines whether an
index can be used or not.

BTW, this reminds me of the discussions we had about the tsearch default
configuration. It's different, though, because in full text search,
there's a separate tsvector data type, and the problem was with
expression indexes, not regular ones.

Another consideration is LC_CTYPE. Just like we want to support
different collations, we should support different character
classifications for upper()/lower(). We might want to tie it into
collation, as using different ctype and collation doesn't usually make
sense, but it's something to keep in mind.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: "Dave Gudeman" <dave(dot)gudeman(at)gmail(dot)com>
To: "Heikki Linnakangas" <heikki(dot)linnakangas(at)enterprisedb(dot)com>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Martijn van Oosterhout" <kleptog(at)svana(dot)org>, "Gregory Stark" <stark(at)enterprisedb(dot)com>, Postgres <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: PGDay.it collation discussion notes
Date: 2008-10-22 17:43:06
Message-ID: 7b079fba0810221043o4d205782p883d8a8df84f54f9@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Oct 20, 2008 at 2:28 AM, Heikki Linnakangas <
heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:

> Tom Lane wrote:
>
>> Another objection to this design is that it's completely unclear that
>> functions from text to text should necessarily yield the same collation
>> that went into them, but if you treat collation as a hard-wired part of
>> the expression syntax tree you aren't going to be able to do anything
>> else.
>> (What will you do about functions/operators taking more than one text
>> argument?)
>>
>
> Whatever the spec says. Collation is intimately associated with the
> comparison operations, and doesn't make any sense anywhere else.

Of course the comparison operator is involved in many areas such as index
creation, ORDER BY, GROUP BY, etc. In order to support GROUP BY and hash
joins on values with a collation type, you need to have a hash function
corresponding to the collation.

> The way the default collation for a given operation is determined, by
> bubbling up the collation from the operands, through function calls and
> other expressions, is just to make life a bit easier for the developer who's
> writing the SQL.We could demand that you always explicitly specify a
> collation when you use the text equality or inequality operators, but
> because that would be quite tiresome, a reasonable default is derived from
> the context.

In this sense, collation is no different from any other feature of the
value's type. You could require explicit type annotations everywhere.

> Looking at an individual value, collation just doesn't make sense.
> Collation is property of the comparison operation, not of a value.
>

Collation can't be a property of the comparison operation because you don't
know what comparison to use until you know the collation type of the value.
Collation is a property of string values, just like scale and precision are
properties of numeric values. And like those properties of numeric values,
collation can be statically determined. The rules for determining what
collation to use in an expression are similar in kind to the rules for
determining what the resulting scale and precision of an arithmetic
expression are. If you consider collation as just part of the type, a lot of
things are easier.

>
> In the parser, we might have to do something like that though, because
> according to the standard you can tack the COLLATION keyword to string
> constants and have it bubble up. But let's keep that ugliness just inside
> the parser.

The COLLATION expression is no different in kind from a type cast. It just
works on a restricted part of the type.

> One, impractical, way to implement collation would be to have one operator
> class per collation. In fact you could do that today, with no backend
> changes, to support multiple collations. It's totally impractical, because
> for starters you'd need different comparison operators, with different
> names, for each collation. But it's the right mental model.

You can use that model, but it is simpler to view it as an overloaded
function. You don't conceptually imagine that DECIMAL(10,4) and
DECIMAL(20,2) have different comparison operations, so why would you view
that two strings with different collations have different comparison
operations?

I think the right approach is to invent a new concept called "operator
> modifier". It's basically a 3rd argument to operators. It can be specified
> explicitly when an operator is used, with syntax like "<left> Op <right>
> USING <modifier>", or in case of collation, it's derived from the context,
> per SQL spec. The operator modifier is tacked on to OpExprs and SortClauses
> in the parser, and passed as a 3rd argument to the function implementing the
> operator at execution time.

This is a good way to implement collated comparisons, but it's not a new
concept, just an additional argument to the comparison operator. It isn't
necessary to create new concepts to handle collation when it fits so well
into an existing concept, the type. For example, the difference between two
indexes with collation is a difference in the type of the index --just like
the difference between a DECIMAL(10,4) index and a DECIMAL(20,2) index.

When I added collation to a commercial RDBMS it made things a lot easier to
just fold the collation into the type system. After all, the type defines
the operators that act on it and collation is just a specialization of this
notion. Incidentally, collation can be easily extended to non-string types;
it is just the section of the type information that controls how the values
are compared (and hashed). This could be very useful for datetime values and
user-defined types as well as strings.


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Dave Gudeman <dave(dot)gudeman(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Martijn van Oosterhout <kleptog(at)svana(dot)org>, Gregory Stark <stark(at)enterprisedb(dot)com>, Postgres <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: PGDay.it collation discussion notes
Date: 2008-10-23 06:25:02
Message-ID: 490018BE.4020101@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Dave Gudeman wrote:
> On Mon, Oct 20, 2008 at 2:28 AM, Heikki Linnakangas <
> heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
>> Tom Lane wrote:
>>> Another objection to this design is that it's completely unclear that
>>> functions from text to text should necessarily yield the same collation
>>> that went into them, but if you treat collation as a hard-wired part of
>>> the expression syntax tree you aren't going to be able to do anything
>>> else.
>>> (What will you do about functions/operators taking more than one text
>>> argument?)
>>>
>> Whatever the spec says. Collation is intimately associated with the
>> comparison operations, and doesn't make any sense anywhere else.
>
> Of course the comparison operator is involved in many areas such as index
> creation, ORDER BY, GROUP BY, etc. In order to support GROUP BY and hash
> joins on values with a collation type, you need to have a hash function
> corresponding to the collation.

Yeah, those are all related to comparison operators.

>> Looking at an individual value, collation just doesn't make sense.
>> Collation is property of the comparison operation, not of a value.
>
> Collation can't be a property of the comparison operation because you don't
> know what comparison to use until you know the collation type of the value.
> Collation is a property of string values, just like scale and precision are
> properties of numeric values. And like those properties of numeric values,
> collation can be statically determined. The rules for determining what
> collation to use in an expression are similar in kind to the rules for
> determining what the resulting scale and precision of an arithmetic
> expression are. If you consider collation as just part of the type, a lot of
> things are easier.

Yeah, the typmod of numerics and varchars is a good analogue, in the
parser. The current rules for those are probably not exactly the same
that the spec requires for collation, but it's definitely similar.

> This is a good way to implement collated comparisons, but it's not a new
> concept, just an additional argument to the comparison operator. It isn't
> necessary to create new concepts to handle collation when it fits so well
> into an existing concept, the type. For example, the difference between two
> indexes with collation is a difference in the type of the index --just like
> the difference between a DECIMAL(10,4) index and a DECIMAL(20,2) index.

Hmm. That could work. So collation would be an extra typemod on the
string data types, and casting can be used to force a specific
collation. I think we're missing some pieces, like passing the typmod to
the comparison function; numeric comparison doesn't depend on the scale
and precision, while collation would depend on the typemods.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com