Re: Unicode is not UTF-8. was :psqlODBC-Driver Test / text fields

Lists: pgsql-odbc
From: "Dave Page" <dpage(at)vale-housing(dot)co(dot)uk>
To: "Johann Zuschlag" <zuschlag2(at)online(dot)de>
Cc: "Hiroshi Inoue" <inoue(at)tpf(dot)co(dot)jp>, <pgsql-odbc(at)postgresql(dot)org>
Subject: Re: Unicode is not UTF-8. was :psqlODBC-Driver Test / text fields
Date: 2006-03-30 19:45:43
Message-ID: E7F85A1B5FF8D44C8A1AF6885BC9A0E4011C9946@ratbert.vale-housing.co.uk

> -----Original Message-----
> From: Johann Zuschlag [mailto:zuschlag2(at)online(dot)de]
> Sent: 30 March 2006 20:41
> To: Dave Page
> Cc: Hiroshi Inoue; pgsql-odbc(at)postgresql(dot)org
> Subject: Unicode is not UTF-8. was :psqlODBC-Driver Test / text fields
>
> Dave Page schrieb:
> > If 'Ã¶' is 'ö', then isn't the query above mixing single
> and a multibyte encoding? I.e. it should all be single byte - e.g.
> >
> > select name from kunde where name >= 'ö' order by name asc;
> >
> > Or all multibyte (displayed byte by byte) whatever that results in:
> >
> > s*e*l*e*c*t* *n*a*m*e* *f*r*o*m* *k*u*n*d*e* *w*h*e*r*e* *n*a*m*e*
> > *>*=* *'*ö'*;*
> >
> > Of course, we all know how well I grok encoding issues :-)
> >
> Hi Dave,
>
> I can understand you. These encoding issues drive me crazy
> sometimes, too. :-)
>
> The thing about UTF-8 is that every ASCII character is
> represented by one byte, while non-ASCII characters take two
> to four bytes (German umlauts, for example, take two). That's
> why UTF-8 is called a "variable-length multibyte encoding". In
> a pure two-byte Unicode encoding such as UCS-2, every character
> U+xxxx is represented by two bytes (fixed-length multibyte
> encoding). So Unicode is not the same as UTF-8, even though
> the PostgreSQL documentation says so.
>
> If you like, see: http://www.utf8-chartable.de/ or some
> explanation at http://czyborra.com/utf/
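
The byte counts described in the quoted explanation can be checked directly. This is a minimal Python sketch, not anything taken from psqlODBC or PostgreSQL; the helper name `byte_lengths` is made up for illustration. It compares the encoded length of a few characters in UTF-8 and in two-byte UTF-16, and then shows what the UTF-8 bytes of 'ö' look like when they are misread as ISO-8859-1.

```python
def byte_lengths(ch):
    """Return (UTF-8 length, UTF-16 length) in bytes for one character."""
    # "utf-16-le" is used so that no byte-order mark is prepended.
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-le"))

# 'a' (ASCII), 'ö' (U+00F6) and '€' (U+20AC)
for ch in ("a", "\u00f6", "\u20ac"):
    print(repr(ch), byte_lengths(ch))

# The UTF-8 bytes of 'ö' are 0xC3 0xB6. Read back as ISO-8859-1,
# each byte becomes its own character.
mojibake = "\u00f6".encode("utf-8").decode("iso-8859-1")
print(mojibake)
```

The variable length shows up only in the UTF-8 column; in UTF-16 all three characters occupy two bytes. Characters outside the Basic Multilingual Plane need four bytes in both encodings, so UTF-16 is not strictly fixed-length either.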

Ahh, thanks for the explanation.

> Windows XP supports ANSI, UTF-8, Unicode and Unicode Big Endian.
> Unfortunately (or fortunately?) Windows seems to use UTF-8
> for European languages. Hiroshi can you explain that? I guess
> the Japanese edition of Windows XP is using pure 2 byte Unicode.

Ahh, now I do know that Windows does not fully support UTF-8. That's the very reason why UTF-8 is not supported in PostgreSQL 8.0 on Windows, and why 8.1 and above require conversion routines, added to the server by Magnus Hagander, that convert to UCS2(?) before doing any sorting etc.
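
To make the conversion step concrete, here is a small Python sketch of turning UTF-8 bytes into a two-byte wide encoding. It only illustrates the idea of the conversion and is not the server code; the function name `to_wide` and the choice of little-endian UTF-16 as the target are assumptions made for the example.

```python
def to_wide(utf8_bytes):
    """Decode UTF-8 bytes and re-encode them as little-endian UTF-16."""
    return utf8_bytes.decode("utf-8").encode("utf-16-le")

# The example query from earlier in the thread; \u00f6 is 'ö'.
query = "select name from kunde where name >= '\u00f6';"
utf8_bytes = query.encode("utf-8")
wide_bytes = to_wide(utf8_bytes)

# Character count, UTF-8 byte count, UTF-16 byte count.
print(len(query), len(utf8_bytes), len(wide_bytes))

# In the wide form every ASCII character is followed by a zero byte.
print(wide_bytes[:12])
```

In the UTF-8 form only the 'ö' takes two bytes, so the byte count is one more than the character count. In the wide form every character of this query takes exactly two bytes.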

> I can't say anything about psql. But the new psqlodbc driver
> 7.03.26X seems to handle that situation very well.
>
> So I suppose the test was valid to a certain extent, since
> the characters are handled in this mixed way in Win XP. I
> still have some funny behaviour with Unicode in psql (even
> after setting LC_COLLATE correctly :-) ).
>
> For my production machines I will use ISO-8859-1 (or
> ISO-8859-15) anyway. Then the driver will convert all characters
> to single bytes, avoiding all kinds of problems.
>
> But feel free to ask me for tests... ;-)
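
The single-byte option mentioned in the quoted text can be illustrated with a short Python sketch. The helper `fits_encoding` is a made-up name for this example and has nothing to do with how the driver performs its conversion. It shows that 'ö' is one byte in both ISO-8859-1 and ISO-8859-15, and that the euro sign exists only in ISO-8859-15.

```python
def fits_encoding(text, encoding):
    """Return True if every character of text can be encoded in encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

umlaut = "\u00f6"  # 'ö'
euro = "\u20ac"    # '€'

# One byte per character in either single-byte encoding.
print(umlaut.encode("iso-8859-1"), umlaut.encode("iso-8859-15"))

# The euro sign is missing from ISO-8859-1 but present in ISO-8859-15.
print(fits_encoding(euro, "iso-8859-1"), fits_encoding(euro, "iso-8859-15"))
```

The trade-off is that any character outside the chosen single-byte set cannot be stored at all, which is the price for having exactly one byte per character.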

I'll need to leave that to Hiroshi - we already know we're past my knowledge on the subject :-)

Regards, Dave.


From: Marc Herbert <Marc(dot)Herbert(at)continuent(dot)com>
To: pgsql-odbc(at)postgresql(dot)org
Subject: Re: Unicode is not UTF-8. was :psqlODBC-Driver Test / text fields
Date: 2006-03-31 09:22:55
Message-ID: khjek0jge8g.fsf@meije.emic.fr

"Dave Page" <dpage(at)vale-housing(dot)co(dot)uk> writes:

>
>> Windows XP supports ANSI, UTF-8, Unicode and Unicode Big Endian.
>> Unfortunately (or fortunately?) Windows seems to use UTF-8
>> for European languages. Hiroshi can you explain that? I guess
>> the Japanese edition of Windows XP is using pure 2 byte Unicode.

> Ahh, now I do know that Windows does not fully support UTF-8. That's
> the very reason why UTF-8 is not supported in PostgreSQL 8.0 on
> Windows, and why 8.1 and above require conversion routines, added
> to the server by Magnus Hagander, that convert to UCS2(?) before
> doing any sorting etc.

Do you mean CP_UTF8 does not exist on some Asian releases of Windows?

Thanks in advance.

Cheers,

Marc.