Re: PATCH: CITEXT 2.0 v3

From: "David E(dot) Wheeler" <david(at)kineticode(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: PATCH: CITEXT 2.0 v3
Date: 2008-07-14 17:48:44
Message-ID: EC8BD896-825A-4098-9A6E-6024DBF28078@kineticode.com
Lists: pgsql-hackers

On Jul 14, 2008, at 07:24, Tom Lane wrote:

> "David E. Wheeler" <david(at)kineticode(dot)com> writes:
>> Could I supply two comparison files, one for Mac OS X with en_US.UTF-8
>> and one for everything else, as described in the last three paragraphs
>> here?
>
> The fallacy in that proposal is the assumption that there are only two
> behaviors out there.

Well, no, that's not the assumption at all. The assumption is that the
type works properly with multibyte characters under multibyte-aware
locales. So I want to have tests to ensure that such is true by having
multibyte characters run under a very specific locale and platform. I
don't really care what platform or locale; the point is to make sure
that the type is actually multibyte-aware.

> Let me recalibrate your thoughts a bit: so far
> I have tried citext on three different machines (Mac, Fedora 8, HPUX),
> and I got three different answers from those tests. That's despite
> endeavoring to make the database locales match ... which is less than
> trivial in itself because they use three slightly different spellings of
> "en_US.UTF8".
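
As an illustration of the spelling problem only: the thread does not list
the names each machine actually uses, so the three spellings below are
assumed examples, and normalize_locale_name is a hypothetical helper, not
anything PostgreSQL provides. A minimal Python sketch that treats codeset
spellings differing only in case and hyphens as the same name:

```python
def normalize_locale_name(name: str) -> str:
    """Fold the codeset part of a locale name (hypothetical helper).

    "en_US.UTF-8", "en_US.UTF8" and "en_US.utf8" all become "en_US.utf8".
    Names without a codeset, such as "C", are returned unchanged.
    """
    lang, sep, codeset = name.partition(".")
    if not sep:
        return lang
    # Drop hyphens and case differences in the codeset only.
    return f"{lang}.{codeset.replace('-', '').lower()}"


# Assumed example spellings, not taken from the machines in the thread.
spellings = ["en_US.UTF-8", "en_US.UTF8", "en_US.utf8"]
normalized = {normalize_locale_name(s) for s in spellings}
```

Matching the names this way says nothing about whether the platforms'
collation and case rules agree, which is the harder problem.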

<rant>
This is a truly pitiful state of affairs. Rhetorical question: Why is
there no standardization of locales? I'm sure there are a lot of
opinions out there (should all uppercase chars precede all lowercase
chars, or be mixed in with them?), but I should think that, in this day
and age, there would be some sort of standard defining locales and how
they work -- one that lets such opinions be expressed as different
locales, rather than as the same locale name behaving differently on
different platforms.
</rant>

> Given that you were more or less deliberately testing corner cases,
> I think it's quite likely that the number of observable reactions from
> N platforms would be more nearly O(N) than O(1).

To me they're not corner cases. To me it is just, "given a specific
platform/locale, does CITEXT respect the locale's rules?" I don't care
to test all platforms and locales (I'm not *that* stupid :-)).

> In the real world, to the extent that we are able to control the locale
> of the regression tests, we make it "C" --- and to a large extent we
> can't control it at all, which means you have another uncontrolled
> variable besides platform. So in the current universe there is
> absolutely no value in submitting locale-specific tests for a contrib
> module.

Then how do we know that it will continue to be locale-aware over
time? Someone could replace the comparison function with one that just
lowercases ASCII characters, like CITEXT 1 does, and no tests would
fail. How do you prevent that from happening without being hyper-
vigilant (and never leaving the project, I might add)?
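
To make the worry concrete, here is a minimal Python sketch, not CITEXT's
actual C code. The function names are hypothetical, and Python's
str.lower() (Unicode-aware, but not locale-dependent) merely stands in for
a locale-aware lowercase. The two comparisons agree on plain ASCII input,
so only a test containing non-ASCII characters can tell them apart:

```python
def ascii_lower(s: str) -> str:
    """Lowercase only A-Z; every other character is left untouched."""
    return "".join(chr(ord(c) + 32) if "A" <= c <= "Z" else c for c in s)


def ascii_only_equal(a: str, b: str) -> bool:
    """Case-insensitive comparison that understands ASCII letters only."""
    return ascii_lower(a) == ascii_lower(b)


def multibyte_aware_equal(a: str, b: str) -> bool:
    """Case-insensitive comparison using Unicode-aware lowercasing
    (a stand-in here for a locale-aware implementation)."""
    return a.lower() == b.lower()


# ASCII-only input: both comparisons agree, so such a test cannot
# detect a regression to ASCII-only behavior.
ascii_agrees = ascii_only_equal("PostgreSQL", "postgresql") == \
    multibyte_aware_equal("PostgreSQL", "postgresql")

# Non-ASCII input: only the multibyte-aware comparison matches.
ascii_result = ascii_only_equal("\u00c0", "\u00e0")      # "À" vs "à"
aware_result = multibyte_aware_equal("\u00c0", "\u00e0")
```

A regression suite that only ever feeds ASCII strings exercises the first
case and would pass with either implementation.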

> I see some discussion in the thread about improving the situation, but
> until we are able to decouple database locale from environment locale,
> I doubt we'll be able to do a whole lot about automating this kind
> of test. There are too many variables at the moment.

Is the decoupling of database locale from environment locale likely to
happen anytime soon? Now that I've written CITEXT, I dare say that
such might become my top-desired feature (aside from replication).

Thanks for the discussion, much appreciated, and I'm learning a ton. I
retain the right to be opinionated, however. ;-)

Best,

David
