contrib/citext versus collations

Lists: pgsql-hackers
From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql-hackers(at)postgreSQL(dot)org
Cc: "David E(dot) Wheeler" <david(at)kineticode(dot)com>, lr(at)pcorp(dot)us
Subject: contrib/citext versus collations
Date: 2011-06-06 20:14:00
Message-ID: 12187.1307391240@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

I've been looking into bug #6053, in which Regina Obe complains that
hash-based DISTINCT queries fail for type "citext". The cause is not
far to seek: the header comment for execGrouping.c states

* Note: we currently assume that equality and hashing functions are not
* collation-sensitive, so the code in this file has no support for passing
* collation settings through from callers. That may have to change someday.

and indeed the failure comes directly from the fact that citext's hash
function *does* expect a collation to be passed to it. I'm a bit
embarrassed to not have noticed that citext was a counterexample for
this assumption, especially since I already fixed one bug that should
have clued me in (commit a0b75a41a907e1582acdb8aa6ebb9cacca39d7d8).

Now, removing this assumption from execGrouping.c is already a pretty
sizable task --- for starters, at least plan node types Agg, Group,
SetOp, Unique, and WindowAgg would need collation attributes that they
don't have today. But the assumption that equality operators are not
collation-sensitive is baked into a number of other places too; for
instance
nodeAgg.c @ line 600
indxpath.c @ line 2200
prepunion.c @ line 640
ri_triggers.c @ line 3000
and that's just places where there's a comment about it :-(.

It's worth noting also that in many of these places, paying attention to
collation is not merely going to need more coding; it will directly
translate to a performance hit, one that is entirely unnecessary for the
normal case where collation doesn't affect equality.

So this leaves us between a rock and a hard place. I think there's just
about no chance of fixing all these things without a serious fresh slip
in the 9.1 schedule. Also, I'm *not* prepared to fix these things
personally. I already regret the amount of time I put into collations
this past winter/spring, and am not willing to drop another several
weeks down that sinkhole right now.

The most workable alternative that I can see is to lobotomize citext so
that it always does lower-casing according to the database's "default"
collation, which would allow us to pretend that its notion of equality
is not collation-sensitive after all. We could hope to improve this in
future release cycles, but not till we've done the infrastructure work
outlined above. One bit of infrastructure that might be a good idea is
a flag to indicate whether an equality operator's behavior is
potentially collation-dependent, so that we could avoid taking
performance hits in the normal case.

Comments, other ideas?

regards, tom lane


From: "David E(dot) Wheeler" <david(at)kineticode(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgreSQL(dot)org, lr(at)pcorp(dot)us
Subject: Re: contrib/citext versus collations
Date: 2011-06-06 20:19:48
Message-ID: 4AADA7A0-BB6E-4B22-B108-9D1036409302@kineticode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Jun 6, 2011, at 1:14 PM, Tom Lane wrote:

> The most workable alternative that I can see is to lobotomize citext so
> that it always does lower-casing according to the database's "default"
> collation, which would allow us to pretend that its notion of equality
> is not collation-sensitive after all.

+1 Seems like the right thing to do for now.

> We could hope to improve this in
> future release cycles, but not till we've done the infrastructure work
> outlined above. One bit of infrastructure that might be a good idea is
> a flag to indicate whether an equality operator's behavior is
> potentially collation-dependent, so that we could avoid taking
> performance hits in the normal case.

That sounds like a good idea.

Best,

David


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "David E(dot) Wheeler" <david(at)kineticode(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org, lr(at)pcorp(dot)us
Subject: Re: contrib/citext versus collations
Date: 2011-06-06 23:35:37
Message-ID: 15559.1307403337@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

"David E. Wheeler" <david(at)kineticode(dot)com> writes:
> On Jun 6, 2011, at 1:14 PM, Tom Lane wrote:
>> ... One bit of infrastructure that might be a good idea is
>> a flag to indicate whether an equality operator's behavior is
>> potentially collation-dependent, so that we could avoid taking
>> performance hits in the normal case.

> That sounds like a good idea.

BTW, it struck me shortly after sending this that we'd already discussed
the idea of a flag in pg_proc showing whether a function pays attention
to collation. We could of course use that for this purpose.

regards, tom lane


From: "David E(dot) Wheeler" <david(at)kineticode(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org, lr(at)pcorp(dot)us
Subject: Re: contrib/citext versus collations
Date: 2011-06-07 06:02:51
Message-ID: E61953F2-26F7-49BD-B8A0-CE286806E481@kineticode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Jun 6, 2011, at 4:35 PM, Tom Lane wrote:

>> That sounds like a good idea.
>
> BTW, it struck me shortly after sending this that we'd already discussed
> the idea of a flag in pg_proc showing whether a function pays attention
> to collation. We could of course use that for this purpose.

Seems like a no-brainer.

Best,

David


From: Greg Stark <gsstark(at)mit(dot)edu>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org, "David E(dot) Wheeler" <david(at)kineticode(dot)com>, lr(at)pcorp(dot)us
Subject: Re: contrib/citext versus collations
Date: 2011-06-07 19:49:23
Message-ID: BANLkTinm6bpkaA-DSOVxnamq5bPRBvAqxA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Jun 6, 2011 at 9:14 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> The most workable alternative that I can see is to lobotomize citext so
> that it always does lower-casing according to the database's "default"
> collation, which would allow us to pretend that its notion of equality
> is not collation-sensitive after all.  We could hope to improve this in
> future release cycles, but not till we've done the infrastructure work
> outlined above.  One bit of infrastructure that might be a good idea is
> a flag to indicate whether an equality operator's behavior is
> potentially collation-dependent, so that we could avoid taking
> performance hits in the normal case.
>
> Comments, other ideas?

That would also mean that 9.1's citext will be no worse than 9.0, it
just won't have the 9.1 collation goodness.

Random thought -- the collation used for citext is not really the same
as the default collation for ordering in sql. Perhaps it could be
stored in the typmod? So you could declare different columns to be
case insensitive according to specific collations. And it would be
free to cast between them but would have to be explicit. I'm not sure
that's actually a good idea, it was just a first thought,

--
greg


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: pgsql-hackers(at)postgresql(dot)org, "David E(dot) Wheeler" <david(at)kineticode(dot)com>, lr(at)pcorp(dot)us
Subject: Re: contrib/citext versus collations
Date: 2011-06-07 20:23:01
Message-ID: 19683.1307478181@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Greg Stark <gsstark(at)mit(dot)edu> writes:
> On Mon, Jun 6, 2011 at 9:14 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> The most workable alternative that I can see is to lobotomize citext so
>> that it always does lower-casing according to the database's "default"
>> collation, which would allow us to pretend that its notion of equality
>> is not collation-sensitive after all.

> That would also mean that 9.1's citext will be no worse than 9.0, it
> just won't have the 9.1 collation goodness.

On further reflection, I'm wondering exactly how much goodness to chop
off there. What I'd originally been thinking was to just lobotomize the
case-folding step, and allow citext's comparison operators to still
respond to input collation when comparing the folded strings. However,
I can imagine that some combinations of languages might produce pretty
weird results if we do that. Should we lobotomize the comparisons too?
Or is the ability to affect the sort order valuable enough to put up
with whatever corner-case funnies there might be?

> Random thought -- the collation used for citext is not really the same
> as the default collation for ordering in sql. Perhaps it could be
> stored in the typmod?

Again, I'm wondering whether that's really a good idea. I think the
currently implemented behavior of citext (fold and compare both act
according to input collation) is really the right thing ... we just
can't do it all yet.

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Greg Stark <gsstark(at)mit(dot)edu>, pgsql-hackers(at)postgresql(dot)org, "David E(dot) Wheeler" <david(at)kineticode(dot)com>, lr(at)pcorp(dot)us
Subject: Re: contrib/citext versus collations
Date: 2011-06-08 19:26:42
Message-ID: 2837.1307561202@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

I wrote:
> On further reflection, I'm wondering exactly how much goodness to chop
> off there. What I'd originally been thinking was to just lobotomize the
> case-folding step, and allow citext's comparison operators to still
> respond to input collation when comparing the folded strings. However,
> I can imagine that some combinations of languages might produce pretty
> weird results if we do that. Should we lobotomize the comparisons too?
> Or is the ability to affect the sort order valuable enough to put up
> with whatever corner-case funnies there might be?

For lack of any comment on this point, I went with the first approach.
Patch is committed.

regards, tom lane