Re: character encoding in StartupMessage

Lists: pgsql-hackers
From: John DeSoi <desoi(at)pgedit(dot)com>
To: PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: character encoding in StartupMessage
Date: 2006-02-28 05:05:44
Message-ID: EA7A07B8-748E-44D2-99F1-A8483D515C9B@pgedit.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

I could not find anything in the Frontend/Backend protocol docs about
character encoding in the StartupMessage. Assuming it is legal for a
database or user name to have unicode characters, how is this handled
when nothing yet has been said about the client encoding?

Thanks,

John DeSoi, Ph.D.
http://pgedit.com/
Power Tools for PostgreSQL


From: Christopher Kings-Lynne <chriskl(at)familyhealth(dot)com(dot)au>
To: John DeSoi <desoi(at)pgedit(dot)com>
Cc: PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: character encoding in StartupMessage
Date: 2006-02-28 05:18:44
Message-ID: 4403DD34.8020709@familyhealth.com.au
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> I could not find anything in the Frontend/Backend protocol docs about
> character encoding in the StartupMessage. Assuming it is legal for a
> database or user name to have unicode characters, how is this handled
> when nothing yet has been said about the client encoding?

A similar badness is that if you issue CREATE DATABASE from a UTF8
database, the dbname will be stored as UTF8. Then, if you go to a
LATIN1 database and create another it will be stored as LATIN1.

Then, it's impossible to display both database names on the same screen
or webpage as they have different encodings... Not only that but it's
impossible to know what encoding it IS in since it's the encoding of the
database from where you issued the CREATE DATABASE instruction from, not
the encoding of the database itself.

Chris


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Christopher Kings-Lynne <chriskl(at)familyhealth(dot)com(dot)au>
Cc: John DeSoi <desoi(at)pgedit(dot)com>, PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: character encoding in StartupMessage
Date: 2006-02-28 06:38:03
Message-ID: 10937.1141108683@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Christopher Kings-Lynne <chriskl(at)familyhealth(dot)com(dot)au> writes:
>> I could not find anything in the Frontend/Backend protocol docs about
>> character encoding in the StartupMessage. Assuming it is legal for a
>> database or user name to have unicode characters, how is this handled
>> when nothing yet has been said about the client encoding?

> A similar badness is that if you issue CREATE DATABASE from a UTF8
> database, the dbname will be stored as UTF8. Then, if you go to a
> LATIN1 database and create another it will be stored as LATIN1.

Yeah, this has been discussed before. Database and user names both
have this affliction.

I don't see any very nice solution at the moment. Once we get support
for per-column locales, it might be possible to declare that the shared
catalogs are always in UTF8 encoding and get the necessary
conversions to happen automatically.

regards, tom lane


From: Christopher Kings-Lynne <chriskl(at)familyhealth(dot)com(dot)au>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: John DeSoi <desoi(at)pgedit(dot)com>, PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: character encoding in StartupMessage
Date: 2006-02-28 06:45:25
Message-ID: 4403F185.2030808@familyhealth.com.au
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> I don't see any very nice solution at the moment. Once we get support
> for per-column locales, it might be possible to declare that the shared
> catalogs are always in UTF8 encoding and get the necessary
> conversions to happen automatically.

At the very least, could we always convert dbnames and store them as
their own encoding? That way at least in HTML you can probably mark
them out as having particular encodings or something...

Chris


From: Martijn van Oosterhout <kleptog(at)svana(dot)org>
To: Christopher Kings-Lynne <chriskl(at)familyhealth(dot)com(dot)au>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, John DeSoi <desoi(at)pgedit(dot)com>, PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: character encoding in StartupMessage
Date: 2006-02-28 14:44:53
Message-ID: 20060228144453.GC535@svana.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Feb 28, 2006 at 02:45:25PM +0800, Christopher Kings-Lynne wrote:
> >I don't see any very nice solution at the moment. Once we get support
> >for per-column locales, it might be possible to declare that the shared
> >catalogs are always in UTF8 encoding and get the necessary
> >conversions to happen automatically.
>
> At the very least, could we always convert dbnames and store them as
> their own encoding? That way at least in HTML you can probably mark
> them out as having particular encodings or something...

This may be the only solution. Converting everything to UTF-8 has
issues because some encodings are not roundtrip-safe (Enc -> UTF8 -> Enc
gives you a different string than you started with). There's probably
no encoding round-trip safe with every other encoding.

You could probably do things like assume that the database name is in
the same encoding as that database and set \l to output:

select convert(datname,pg_encoding_to_char(encoding),getdatabaseencoding()) from pg_database;

However, my personal preference is to treat the name of the database as
a "bunch of bits" ie, don't consider it encoded at all. To login the
user must provide the same "bunch of bits". This doesn't solve the
issue of how to display the database names to users. Maybe define a
cluster encoding for the shared catalogs...

Have a nice day,
--
Martijn van Oosterhout <kleptog(at)svana(dot)org> http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Martijn van Oosterhout <kleptog(at)svana(dot)org>
Cc: Christopher Kings-Lynne <chriskl(at)familyhealth(dot)com(dot)au>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, John DeSoi <desoi(at)pgedit(dot)com>, PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: character encoding in StartupMessage
Date: 2006-02-28 15:05:17
Message-ID: 20060228150517.GB6343@surnet.cl
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Martijn van Oosterhout wrote:

> This may be the only solution. Converting everything to UTF-8 has
> issues because some encodings are not roundtrip-safe (Enc -> UTF8 -> Enc
> gives you a different string than you started with). There's probably
> no encoding round-trip safe with every other encoding.

Is this still true? If I remember clearly, Tatsuo-san had asserted that
this was the case, but later he said there was some bug in our
conversion routines or the conversion tables. So maybe now that those
things are fixed (they are, aren't they?) there _is_ a safe roundtrip
from anything to UTF8 and back.

--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


From: John DeSoi <desoi(at)pgedit(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Christopher Kings-Lynne <chriskl(at)familyhealth(dot)com(dot)au>, PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: character encoding in StartupMessage
Date: 2006-02-28 15:14:51
Message-ID: 00081276-7339-47F3-85E0-49D1413D3831@pgedit.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


On Feb 28, 2006, at 1:38 AM, Tom Lane wrote:

>>> I could not find anything in the Frontend/Backend protocol docs
>>> about
>>> character encoding in the StartupMessage. Assuming it is legal for a
>>> database or user name to have unicode characters, how is this
>>> handled
>>> when nothing yet has been said about the client encoding?
>
>> A similar badness is that if you issue CREATE DATABASE from a UTF8
>> database, the dbname will be stored as UTF8. Then, if you go to a
>> LATIN1 database and create another it will be stored as LATIN1.
>
> Yeah, this has been discussed before. Database and user names both
> have this affliction.

So are the database/user names in the startup message compared using
the default encoding of the cluster or is just a straight byte
comparison with no consideration of the encoding?

Thanks,

John DeSoi, Ph.D.
http://pgedit.com/
Power Tools for PostgreSQL


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Martijn van Oosterhout <kleptog(at)svana(dot)org>
Cc: Christopher Kings-Lynne <chriskl(at)familyhealth(dot)com(dot)au>, John DeSoi <desoi(at)pgedit(dot)com>, PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: character encoding in StartupMessage
Date: 2006-02-28 15:44:32
Message-ID: 15254.1141141472@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Martijn van Oosterhout <kleptog(at)svana(dot)org> writes:
> However, my personal preference is to treat the name of the database as
> a "bunch of bits" ie, don't consider it encoded at all.

That's essentially what we are doing as far as the StartupMessage is
concerned. Doesn't really solve the problem of post-startup accesses to
the shared catalogs, however.

regards, tom lane


From: Martijn van Oosterhout <kleptog(at)svana(dot)org>
To: Christopher Kings-Lynne <chriskl(at)familyhealth(dot)com(dot)au>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, John DeSoi <desoi(at)pgedit(dot)com>, PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: character encoding in StartupMessage
Date: 2006-02-28 16:14:17
Message-ID: 20060228161417.GE535@svana.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Feb 28, 2006 at 12:05:17PM -0300, Alvaro Herrera wrote:
> Martijn van Oosterhout wrote:
>
> > This may be the only solution. Converting everything to UTF-8 has
> > issues because some encodings are not roundtrip-safe (Enc -> UTF8 -> Enc
> > gives you a different string than you started with). There's probably
> > no encoding round-trip safe with every other encoding.
>
> Is this still true? If I remember clearly, Tatsuo-san had asserted that
> this was the case, but later he said there was some bug in our
> conversion routines or the conversion tables. So maybe now that those
> things are fixed (they are, aren't they?) there _is_ a safe roundtrip
> from anything to UTF8 and back.

I beleive so. If use the ICU Converter Explorer [1] to examine some of
the encodings we support, they have "Contains ambiguous aliases? TRUE".
This means that there are multiple converters that claim to support that
encoding, though they produce different results.

The UTF-8 and Unicode FAQ [2] also lists some issues with EUC-JP saying
that the converters had to be modified to make round-trip conversion
work. However, not all converters work the same.

Anyway, maybe it's not a big problem anymore. The ISO-2022 series is
definitly not round-trip compatable [3] but I don't think we support
them anyway. I think the only issue is if the mappings postgres uses
internally don't match what the user expects, but I don't think there's
much we can do about that...

[1] http://www-950.ibm.com/software/globalization/icu/demo/converters
[2] http://www.cl.cam.ac.uk/~mgk25/unicode.html
[3] http://www.cl.cam.ac.uk/~mgk25/ucs/iso2022-wc.html

Have a nice day,
--
Martijn van Oosterhout <kleptog(at)svana(dot)org> http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Martijn van Oosterhout <kleptog(at)svana(dot)org>
Cc: Christopher Kings-Lynne <chriskl(at)familyhealth(dot)com(dot)au>, John DeSoi <desoi(at)pgedit(dot)com>, PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: character encoding in StartupMessage
Date: 2006-02-28 16:19:02
Message-ID: 15593.1141143542@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Martijn van Oosterhout <kleptog(at)svana(dot)org> writes:
>>> This may be the only solution. Converting everything to UTF-8 has
>>> issues because some encodings are not roundtrip-safe

>> Is this still true?

> I beleive so. If use the ICU Converter Explorer [1] to examine some of
> the encodings we support, they have "Contains ambiguous aliases? TRUE".

Which ones, and are they client-only encodings? If all our server-side
encodings are round-trip safe then I think there's no big issue.

In any case I don't think there's a huge problem if we say that database
and user names had better be chosen from the round-trip-safe subset.

regards, tom lane


From: Martijn van Oosterhout <kleptog(at)svana(dot)org>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Christopher Kings-Lynne <chriskl(at)familyhealth(dot)com(dot)au>, John DeSoi <desoi(at)pgedit(dot)com>, PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: character encoding in StartupMessage
Date: 2006-02-28 16:45:27
Message-ID: 20060228164527.GF535@svana.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Feb 28, 2006 at 11:19:02AM -0500, Tom Lane wrote:
> Martijn van Oosterhout <kleptog(at)svana(dot)org> writes:
> >>> This may be the only solution. Converting everything to UTF-8 has
> >>> issues because some encodings are not roundtrip-safe
>
> >> Is this still true?
>
> > I beleive so. If use the ICU Converter Explorer [1] to examine some of
> > the encodings we support, they have "Contains ambiguous aliases? TRUE".
>
> Which ones, and are they client-only encodings? If all our server-side
> encodings are round-trip safe then I think there's no big issue.
>
> In any case I don't think there's a huge problem if we say that database
> and user names had better be chosen from the round-trip-safe subset.

This is what it says here [1]:

There are only 19 encodings currently used worldwide as legitimate
POSIX multi-byte locale encodings:

UTF-8, ISO-8859-1, ISO-8859-2, ISO-8859-3, ISO-8859-5, ISO-8859-6,
ISO-8859-7, ISO-8859-8, ISO-8859-9, ISO-8859-13, ISO-8859-15,
EUC-JP, EUC-KR, GB2312 (= EUC-CN), KOI8-R, KOI8-U, VISCII,
WINDOWS-1251, WINDOWS-1256

Each of these is fully roundtrip compatible to ISO 10646, therefore
all these locales can be represented nicely in wchar_t as the
equivalent UCS values. The above names and the corresponding defining
documents are listed in the IANA charset registry.

Some of these have multiple definitions according to ICU meaning that
different platforms have implemented them differently in the past
(EUC-JP falls into this catagory), but presumably the IANA charset
registry has proper definitions.

Of the reminaing encodings we support, Big5 is OK, although the term
win-950 which is the windows version has changed over time. GBK has
same problem, win-936 has changed to over time. I don't think we should
concern ourselves with bugs in the windows encodings.

IOW, I think we are mostly safe.

[1] http://www.cl.cam.ac.uk/~mgk25/ucs/iso2022-wc.html
--
Martijn van Oosterhout <kleptog(at)svana(dot)org> http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.


From: John DeSoi <desoi(at)pgedit(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Martijn van Oosterhout <kleptog(at)svana(dot)org>, Christopher Kings-Lynne <chriskl(at)familyhealth(dot)com(dot)au>, PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: character encoding in StartupMessage
Date: 2006-02-28 18:52:48
Message-ID: 179EBEA8-B25D-4963-95E2-2D2250D8D63A@pgedit.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


On Feb 28, 2006, at 11:19 AM, Tom Lane wrote:

> In any case I don't think there's a huge problem if we say that
> database
> and user names had better be chosen from the round-trip-safe subset.

What about the pg_hba.conf file? Is there a provision to specify the
encoding or some other way to deal with non-ascii characters?

Thanks,

John DeSoi, Ph.D.
http://pgedit.com/
Power Tools for PostgreSQL


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: John DeSoi <desoi(at)pgedit(dot)com>
Cc: Martijn van Oosterhout <kleptog(at)svana(dot)org>, Christopher Kings-Lynne <chriskl(at)familyhealth(dot)com(dot)au>, PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: character encoding in StartupMessage
Date: 2006-02-28 19:49:31
Message-ID: 23602.1141156171@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

John DeSoi <desoi(at)pgedit(dot)com> writes:
> On Feb 28, 2006, at 11:19 AM, Tom Lane wrote:
> What about the pg_hba.conf file? Is there a provision to specify the
> encoding or some other way to deal with non-ascii characters?

pg_hba.conf is also processed without any locale considerations,
ie, effectively the "bunch of bits" approach Martijn mentioned.

regards, tom lane