Re: Unicode Normalization

Lists: pgsql-hackers
From: pg(at)thetdh(dot)com
To: "David E(dot) Wheeler" <david(at)kineticode(dot)com>, "PG Hackers" <pgsql-hackers(at)postgresql(dot)org>
Cc: "Hudson, T(dot) David" <pg1(at)thetdh(dot)com>
Subject: Re: Unicode Normalization
Date: 2009-09-24 13:24:07
Message-ID: W979322959270161253798647@webmail34

In a context using normalization, wouldn't you typically want to store a normalized-text type that could perhaps (depending on locale) take advantage of simpler, more-efficient comparison functions? Whether you're doing INSERT/UPDATE, or importing a flat text file, if you canonicalize characters and substrings of identical meaning when trivial distinctions of encoding are irrelevant, you're better off later. User-invocable normalization functions by themselves don't make much sense. (If Postgres now supports binary- or mixed-binary-and-text flat files, perhaps for restore purposes, the same thing applies.)

David Hudson
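
The idea in the message above, canonicalizing text once at INSERT/UPDATE or import time so that later comparisons can be plain equality checks, can be sketched roughly as follows. This is only an illustration using Python's standard `unicodedata` module, not anything Postgres does; the helper name `normalize_rows` and the choice of NFC are assumptions made for the example.

```python
import unicodedata

def normalize_rows(rows, form="NFC"):
    """Hypothetical ingest step: normalize every incoming string once."""
    return [unicodedata.normalize(form, row) for row in rows]

# The same visible word, encoded two different ways.
composed = "caf\u00e9"      # 'e with acute' as a single code point
decomposed = "cafe\u0301"   # 'e' followed by a combining acute accent

# Before normalization the strings differ code point by code point.
raw_equal = composed == decomposed

# After normalizing at ingest, a plain equality check is sufficient.
stored = normalize_rows([composed, decomposed])
stored_equal = stored[0] == stored[1]
```

The trade-off is that the decomposed input is no longer recoverable from `stored`, which is the concern raised later in the thread.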


From: "David E(dot) Wheeler" <david(at)kineticode(dot)com>
To: pg1(at)thetdh(dot)com
Cc: "PG Hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Unicode Normalization
Date: 2009-09-24 15:36:37
Message-ID: 9BD6C83B-018E-4263-9EC8-33344FEDF655@kineticode.com

On Sep 24, 2009, at 6:24 AM, pg(at)thetdh(dot)com wrote:

> In a context using normalization, wouldn't you typically want to
> store a normalized-text type that could perhaps (depending on
> locale) take advantage of simpler, more-efficient comparison
> functions?

That might be nice, but I'd be wary of a geometric multiplication of
text types. We already have TEXT and CITEXT; what if we had your NTEXT
(normalized text) but I wanted it to also be case-insensitive?

> Whether you're doing INSERT/UPDATE, or importing a flat text file,
> if you canonicalize characters and substrings of identical meaning
> when trivial distinctions of encoding are irrelevant, you're better
> off later. User-invocable normalization functions by themselves
> don't make much sense.

Well, they make sense because there's nothing else right now. It's an
easy way to get some support in, and besides, it's mandated by the SQL
standard.

> (If Postgres now supports binary- or mixed-binary-and-text flat
> files, perhaps for restore purposes, the same thing applies.)

Don't follow this bit.

Best,

David
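
One way to picture the concern about multiplying text types: normalization and case-insensitivity are both just transformations applied before comparing, so they can be combined in a function rather than a new type for every combination. The sketch below uses Python's standard `unicodedata` module and `str.casefold`; the function names are illustrative assumptions, and the NFD/casefold/NFD sequence follows the caseless-matching recipe in Python's Unicode HOWTO rather than anything proposed in this thread.

```python
import unicodedata

def nfd(text):
    """Canonical decomposition (NFD) of a string."""
    return unicodedata.normalize("NFD", text)

def caseless_normalized_equal(left, right):
    """Compare two strings ignoring both case and normalization form.

    Normalization is applied again after case folding because folding
    can produce a string that is no longer in normalized form.
    """
    return nfd(nfd(left).casefold()) == nfd(nfd(right).casefold())

# Different case and different normalization form, same word.
result = caseless_normalized_equal("Caf\u00e9", "CAFE\u0301")
```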


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: "David E(dot) Wheeler" <david(at)kineticode(dot)com>
Cc: pg1(at)thetdh(dot)com, PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Unicode Normalization
Date: 2009-09-24 15:59:09
Message-ID: 4ABB974D.5000104@dunslane.net

David E. Wheeler wrote:
> On Sep 24, 2009, at 6:24 AM, pg(at)thetdh(dot)com wrote:
>
>> In a context using normalization, wouldn't you typically want to
>> store a normalized-text type that could perhaps (depending on locale)
>> take advantage of simpler, more-efficient comparison functions?
>
> That might be nice, but I'd be wary of a geometric multiplication of
> text types. We already have TEXT and CITEXT; what if we had your NTEXT
> (normalized text) but I wanted it to also be case-insensitive?

Actually, I don't think it's necessarily a good idea at all. If a user
inputs a perfectly valid piece of UTF8 text, we should be able to give
it back to them exactly, whether or not it's in normalized form. The
normalized forms are useful for certain comparison purposes, but they
don't affect the validity of the text. CITEXT doesn't mangle what is
stored, just how it's compared.

cheers

andrew
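
The distinction drawn above, leaving the stored value untouched and normalizing only for comparison, can be illustrated with a small wrapper. This is a sketch in Python with the standard `unicodedata` module; the class name `NormCompared` is hypothetical, NFC is an arbitrary choice, and none of this reflects how CITEXT is implemented.

```python
import unicodedata

class NormCompared:
    """Hypothetical wrapper: keeps the original text, compares by NFC form."""

    def __init__(self, text):
        self.text = text  # stored exactly as the user supplied it

    def _key(self):
        # The normalized form is computed only for comparison and hashing.
        return unicodedata.normalize("NFC", self.text)

    def __eq__(self, other):
        if not isinstance(other, NormCompared):
            return NotImplemented
        return self._key() == other._key()

    def __hash__(self):
        return hash(self._key())

a = NormCompared("caf\u00e9")    # precomposed input
b = NormCompared("cafe\u0301")   # decomposed input

same_value = a == b              # equal for comparison purposes
round_trips = b.text == "cafe\u0301"  # original code points are preserved
```

Because `__hash__` uses the same key as `__eq__`, the two values also collapse to a single entry in a set or dictionary, while each object can still hand back its input unchanged.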


From: "David E(dot) Wheeler" <david(at)kineticode(dot)com>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: pg1(at)thetdh(dot)com, PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Unicode Normalization
Date: 2009-09-24 16:05:58
Message-ID: 233B7C57-2096-4C9E-9704-14D1EF2164B4@kineticode.com

On Sep 24, 2009, at 8:59 AM, Andrew Dunstan wrote:

>> That might be nice, but I'd be wary of a geometric multiplication
>> of text types. We already have TEXT and CITEXT; what if we had your
>> NTEXT (normalized text) but I wanted it to also be case-insensitive?
>
> Actually, I don't think it's necessarily a good idea at all. If a
> user inputs a perfectly valid piece of UTF8 text, we should be able
> to give it back to them exactly, whether or not it's in normalized
> form. The normalized forms are useful for certain comparison
> purposes, but they don't affect the validity of the text. CITEXT
> doesn't mangle what is stored, just how it's compared.

Right, I don't think there's a need for a normalized TEXT type.

Best,

David