Re: UTF8 national character data type support WIP patch and list of open issues.

From: "MauMau" <maumau307(at)gmail(dot)com>
To: "Robert Haas" <robertmhaas(at)gmail(dot)com>
Cc: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Boguk, Maksym" <maksymb(at)fast(dot)au(dot)fujitsu(dot)com>, "Heikki Linnakangas" <hlinnakangas(at)vmware(dot)com>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: UTF8 national character data type support WIP patch and list of open issues.
Date: 2013-09-18 22:42:29
Message-ID: B00EACB87A2441069B4EE4DEA6D9282A@maumau
Lists: pgsql-hackers

From: "Robert Haas" <robertmhaas(at)gmail(dot)com>
> On Mon, Sep 16, 2013 at 8:49 AM, MauMau <maumau307(at)gmail(dot)com> wrote:
>> 2. NCHAR/NVARCHAR columns can be used in non-UTF-8 databases and always
>> contain Unicode data.
> ...
>> 3. Store strings in UTF-16 encoding in NCHAR/NVARCHAR columns.
>> Fixed-width encoding may allow faster string manipulation as described in
>> Oracle's manual. But I'm not sure about this, because UTF-16 is not a real
>> fixed-width encoding due to supplementary characters.
>
> It seems to me that these two points here are the real core of your
> proposal. The rest is just syntactic sugar.

No, those are "desirable if possible" features. What's important is to
declare in the manual that PostgreSQL officially supports national character
types, as I stated below.

> 1. Accept NCHAR/NVARCHAR as data type name and N'...' syntactically.
> This is already implemented. PostgreSQL treats NCHAR/NVARCHAR as synonyms
> for CHAR/VARCHAR, and ignores N prefix. But this is not documented.
>
> 2. Declare national character support in the manual.
> Item 1 alone is not sufficient because users don't want to depend on
> undocumented behavior. This is exactly what the TODO item "national
> character support" in the PostgreSQL TODO wiki is about.
>
> 3. Implement NCHAR/NVARCHAR as distinct data types, not as synonyms so
> that:
> - psql \d can display the user-specified data types.
> - pg_dump/pg_dumpall can output NCHAR/NVARCHAR columns as-is, not as
> CHAR/VARCHAR.
> - To implement additional features for NCHAR/NVARCHAR in the future, as
> described below.

And when declaring that, we had better implement NCHAR types as distinct
types with their own OIDs so that we can extend NCHAR behavior in the
future.
As the first stage, I think it's okay to treat NCHAR types exactly the same
as CHAR/VARCHAR types. For example, in ECPG:

switch (type)
{
    case OID_FOR_CHAR:
    case OID_FOR_VARCHAR:
    case OID_FOR_TEXT:
    case OID_FOR_NCHAR:     /* new code */
    case OID_FOR_NVARCHAR:  /* new code */
        /* some processing */
        break;
}

And in JDBC, just call methods for non-national character types.
Currently, those national character methods throw SQLException.

public void setNString(int parameterIndex, String value) throws SQLException
{
    setString(parameterIndex, value);
}

> Let me start with the second one: I don't think there's likely to be
> any benefit in using UTF-16 as the internal encoding. In fact, I
> think it's likely to make things quite a bit more complicated, because
> we have a lot of code that assumes that server encodings have certain
> properties that UTF-16 doesn't - specifically, that any byte with the
> high-bit clear represents the corresponding ASCII character.
>
> As to the first one, if we're going to go to the (substantial) trouble
> of building infrastructure to allow a database to store data in
> multiple encodings, why limit it to storing UTF-8 in non-UTF-8
> databases? What about storing SHIFT-JIS in UTF-8 databases, or
> Windows-yourfavoriteM$codepagehere in UTF-8 databases, or any other
> combination you might care to name?
>
> Whether we go that way or not, I think storing data in one encoding in
> a database with a different encoding is going to be pretty tricky and
> require far-reaching changes. You haven't mentioned any of those
> issues or discussed how you would solve them.

Yes, you are probably right -- I'm not sure UTF-16 really has benefits that
UTF-8 doesn't have. But why did Windows and Java choose UTF-16 for internal
strings rather than UTF-8? Why did Oracle recommend UTF-16 for NCHAR? I
have no clear idea. Anyway, I won't push strongly for UTF-16, since it would
complicate the encoding handling.
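
To make the two UTF-16 concerns in this thread concrete, here is a minimal Java sketch. It is only an illustration, not code from the patch: the class name Utf16Properties and its helper methods are hypothetical. It shows (a) that a supplementary character occupies two UTF-16 code units, so UTF-16 is not truly fixed-width, and (b) that the UTF-16 bytes of a non-ASCII character can include a byte with the high bit clear, which byte-oriented code could mistake for ASCII, whereas the UTF-8 bytes cannot.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class for illustration; not part of PostgreSQL or the patch.
class Utf16Properties {

    // Number of 16-bit UTF-16 code units (Java strings are UTF-16 internally).
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points, counting a surrogate pair as one.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    // True if any byte of the encoded string has its high bit clear,
    // i.e. could be mistaken for an ASCII character by byte-oriented code.
    static boolean hasHighBitClearByte(String s, Charset cs) {
        for (byte b : s.getBytes(cs)) {
            if (b >= 0) {   // Java bytes are signed: >= 0 means 0x00..0x7F
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // U+1F600 is a supplementary character (outside the BMP).
        String supplementary = new String(Character.toChars(0x1F600));
        System.out.println(codeUnits(supplementary));   // 2
        System.out.println(codePoints(supplementary));  // 1

        // U+65E5 is a non-ASCII BMP character.
        String kanji = "\u65E5";
        System.out.println(hasHighBitClearByte(kanji, StandardCharsets.UTF_8));     // false
        System.out.println(hasHighBitClearByte(kanji, StandardCharsets.UTF_16BE));  // true
    }
}
```

In UTF-16BE, U+65E5 is the byte pair 0x65 0xE5, and 0x65 on its own is ASCII 'e'. That is the kind of byte the server-side assumption Robert describes would misread.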

Regards
MauMau
