Re: Per-column collation, the finale

From: Noah Misch <noah(at)leadboat(dot)com>
To: Peter Eisentraut <peter_e(at)gmx(dot)net>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Per-column collation, the finale
Date: 2011-02-03 05:10:53
Message-ID: 20110203051053.GA713@tornado.gateway.2wire.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Feb 02, 2011 at 11:20:44PM +0200, Peter Eisentraut wrote:
> On l??r, 2011-01-29 at 02:52 -0500, Noah Misch wrote:
> > The new documentation is helpful. It would be useful to document the
> > implications of marking your user-defined type COLLATABLE. As best I can tell,
> > you should only set COLLATABLE if functions that would be expected to respect
> > collation, particularly members of a B-tree operator family, do check the fmgr
> > collation. Otherwise, the user might specify collations for data of your type
> > and be surprised to get no differentiated behavior. Are there other
> > considerations? Also, should implementers of COLLATABLE types observe any
> > restrictions on when to respect the collation? For example, is equality
> > permitted to be sensitive to collation?
>
> I improved the documentation in the CREATE TYPE man page about this.
> Note, however, that the questions you raise are not new. There already
> is a set of types that observe locale information in a set of associated
> functions. The answers are already hidden somewhere; this feature just
> exposes them in a different way. Over time, we may want to document
> more of this, but at the moment it's not clear how much to actually
> expose.

That's a helpful way to think about it.

> > On that note, the following is accepted; should it be?
> >
> > CREATE TABLE parent (key text COLLATE "sv_SE");
> > CREATE UNIQUE INDEX ON parent(key COLLATE "id_ID");
> > CREATE TABLE child (key text COLLATE "tr_TR" REFERENCES parent(key));
> >
> > Does the system expect any invariants to hold on the platform implementation of
> > collations? For example, that they be consistent with equality in some way? If
> > so, would it be practical and/or helpful to provide a tool for testing a given
> > platform locale for conformance?
>
> We discussed this previously in this thread. Currently, collation
> support does not affect equality. Values are only equal if they are
> bytewise equal. That's why the above is currently OK, but it should
> probably be restricted in the future. I'll make a note about it.

I see that this is true for texteq, bpchareq and citext_eq. However, nothing
stops a user from creating a COLLATABLE type and making the "=" operator for its
B-tree operator class sensitive to collation, right? Could that break any
system assumptions? That is, do we need to document that user-defined hash and
B-tree operator classes must ignore locale in their "=" operators?

> > Though there's no reason it ought to happen with the first commit, it would be
> > really excellent for "make check" to detect when the platform has sufficient
> > support to run the collation tests (HAVE_LOCALE_T, UTF8, sv_SE and tr_TR
> > locales). Manual execution is fine for something like numeric_big, but this is
> > so platform-sensitive that testing it widely is important.
>
> I solicited ideas about this a while ago, but no one came up with a
> solution. If we do get Mac or Windows support, it would be a good time
> to revisit this and devise a platform-independent approach, if possible.

As a rough idea, we could introduce the notion of a "skip condition" SQL command
for a pg_regress test file. When the command throws an error, pg_regress skips
that file. For these tests, the skip condition would resemble:
SELECT 'foo' COLLATE "sv_SE", 'bar' COLLATE "tr_TR"

> > Four call sites gain code like this (example is from varstr_cmp):
> >
> > +#ifdef HAVE_LOCALE_T
> > + locale_t mylocale = pg_newlocale_from_collation(collid);
> > +#else
> > + check_default_collation(collid);
> > +#endif
> > ...
> > +#ifdef HAVE_LOCALE_T
> > + result = strcoll_l(a1p, a2p, mylocale);
> > +#else
> > result = strcoll(a1p, a2p);
> > +#endif
> >
> > Under !HAVE_LOCALE_T, we could define a locale_t, a pg_newlocale_from_collation
> > that merely calls check_default_collation, "#define strcoll_l(a, b, c)
> > strcoll(a, b)", etc. I can see six TODO sites with similar needs, and the
> > abstraction would make the mainline code significantly cleaner. Even so, it
> > might not pay off. Thoughts?
>
> I had considered that, but I'm mainly hesitant about introducing a fake
> locale_t type into the public namespace. Maybe it should be wrapped
> into a pg_locale_t; then we can do whatever we want.

True. There are advantages and disadvantages to both. No strong opinion here.

> > Couldn't check_default_collation be a no-op except on assert-only builds? It's
> > currently only called in !HAVE_LOCALE_T branches, in which case initdb will not
> > have added non-default collations. CREATE COLLATION will presumably be made to
> > fail unconditionally on such builds, too.
>
> Unless you register the HAVE_LOCALE_T state in pg_control, you need to
> be prepared to deal with schemas that contain nondefault collations even
> though the server binary doesn't support it.

Makes sense.

> > [local] test=# SELECT prosrc COLLATE "id_ID", count(*) FROM pg_proc GROUP BY prosrc;
> > -- worked ...
> > [local] test=# SELECT prosrc COLLATE "id_ID", count(*) FROM pg_proc GROUP BY prosrc COLLATE "id_ID";
> > -- worked ...
> > [local] test=# SELECT prosrc, count(*) FROM pg_proc GROUP BY prosrc COLLATE "id_ID";
> > ERROR: column "pg_proc.prosrc" must appear in the GROUP BY clause or be used in an aggregate function
> > LINE 1: SELECT prosrc, count(*) FROM pg_proc GROUP BY prosrc COLLATE...
>
> This is correct, I think. At least if you want to keep open the
> possibility that a collation could also determine equality.

Yes; I agree, now. Replace "COLLATE "id_ID"" with "||''", and you get the same.

> > UNION on disparate collations seems to error or strip them, depending on how
> > it's written:
> >
> > CREATE TABLE t_sv (c text COLLATE "sv_SE");
> > CREATE TABLE t_tr (c text COLLATE "tr_TR");
> > CREATE TABLE t_both AS SELECT c FROM t_sv UNION ALL SELECT c FROM t_tr;
> > -- t_both.c has default collation
> > CREATE TABLE t_otherboth AS SELECT ('foo'::text) COLLATE "sv_SE" UNION ALL SELECT ('bar'::text) COLLATE "tr_TR";
> > -- ERROR: 42P21: collation mismatch between explicit collations "sv_SE" and "tr_TR"
>
> This is correct per SQL standard. In the second case, the explicit
> COLLATE clauses cause a conflict, so you get an error. In the first
> case, you have implicit collation derivations that conflict, so the
> result has actually no collation (not default collation), which means
> you cannot sort the resulting value at all. It's actually a bit strange
> here that we allow the creation of a table based on a query that has a
> no collation derivation, but I did not see anything in the SQL standard
> that prohibits it. Maybe a notice would be useful, like when you create
> a column of type "unknown".

After rereading about collation derivation several times, I think I see why it's
correct. MS SQL Server has the same behaviors, too. (The exception I found:
SQL Server rejects the first UNION ALL above. Both implementations reject it
with UNION ALL changed to UNION.)

> > COLLATE is getting recognized in various places where it does not actually do
> > anything. Probably a few more places need SimpleTypenameWithoutCollation:
> >
> > CREATE OPERATOR =-= (PROCEDURE = texteq, LEFTARG = text COLLATE "id_ID", RIGHTARG = text);
> > ALTER OPERATOR || (text COLLATE "id_ID", text) OWNER TO nm;
> > CREATE CAST (text COLLATE "id_ID" AS mytext) WITHOUT FUNCTION;
> > ALTER AGGREGATE min(text collate "id_ID") OWNER TO nm;
> > CREATE FUNCTION f(varchar collate "id_ID") RETURNS INT LANGUAGE SQL AS 'SELECT 1';
> > CREATE FUNCTION f() RETURNS TABLE (c text COLLATE "id_ID") LANGUAGE SQL AS 'SELECT ''foo''::text';
>
> All of these places also accept useless typmods. The parser is
> currently pretty loose about that. It would need some major
> restructuring to fix that.

Okay. Peripheral to this patch, then.

> > The 13% slowdown with the feature unused seems worrisome, but the additional 16%
> > slowdown when using the feature to adopt a different locale seems fine.
>
> I figured out this problem. It needs some small restructuring in
> varstr_cmp to fix. It kind of relates to how the
> pg_newlocale_from_collation/check_default_collation restructuring is
> done. But it's definitely fixable.
>
> Well, my numbers were actually smaller than 13%, but it was noticeable.
> I'll give you something new to test later.

Great.

> > Concerning standards conformance, ISO 9075-2:2008 sections 4.2.4 and 4.2.6 seem
> > to have a different concept of collation naming. This patch creates a "default"
> > collation that is implicitly the LC_COLLATE/LC_CTYPE of the database. The
> > standard has specific names for specific default collations based on the
> > character repertoire (our server_encoding, I suppose). So, the most-conformant
> > thing would be to set the name of the pg_collation of oid = 100 at CREATE
> > DATABASE time, based on the chosen ENCODING. Note sure how valuable this is.
>
> The problem is, you can't reach across databases when you create a new
> one. This is really the root of all evil with this default collation
> business.

Understood. Doesn't seem important to jump through hoops for this.

> This is good stuff. I'll send you a new patch in a day or three for
> perhaps another round of performance tests. Some of the other issues
> above can perhaps be postponed for follow-up patches.

I agree -- if the performance-when-unused gets solid, none of my other comments
ought to hold things up.

Thanks,
nm

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2011-02-03 05:26:36 Re: sepgsql contrib module
Previous Message Robert Haas 2011-02-03 04:43:35 Re: sepgsql contrib module