Re: Unicode is not UTF-8. was :psqlODBC-Driver Test / text fields

Lists:	pgsql-odbc

From:	"Dave Page" <dpage(at)vale-housing(dot)co(dot)uk>
To:	"Ludek Finstrle" <luf(at)pzkagis(dot)cz>
Cc:	"Johann Zuschlag" <zuschlag2(at)online(dot)de>, "Hiroshi Inoue" <inoue(at)tpf(dot)co(dot)jp>, <pgsql-odbc(at)postgresql(dot)org>
Subject:	Re: psqlODBC-Driver Test / text fields
Date:	2006-03-29 15:34:52
Message-ID:	E7F85A1B5FF8D44C8A1AF6885BC9A0E4011C98F3@ratbert.vale-housing.co.uk

> -----Original Message-----
> From: Ludek Finstrle [mailto:luf(at)pzkagis(dot)cz]
> Sent: 29 March 2006 16:23
> To: Dave Page
> Cc: Ludek Finstrle; Johann Zuschlag; Hiroshi Inoue;
> pgsql-odbc(at)postgresql(dot)org
> Subject: Re: [ODBC] psqlODBC-Driver Test / text fields
>
> Let's try reading this and its ancestor:
> http://archives.postgresql.org/pgsql-odbc/2006-03/msg00188.php

I'm not sure that test is actually valid anyway. Consider the test query:

select name from kunde where name >= 'Ã¶';

If 'Ã¶' is 'ö', then isn't the query above mixing a single-byte and a multibyte encoding? I.e., it should all be single byte, e.g.

select name from kunde where name >= 'ö' order by name asc;

Or all multibyte (displayed byte by byte) whatever that results in:

s*e*l*e*c*t* *n*a*m*e* *f*r*o*m* *k*u*n*d*e* *w*h*e*r*e* *n*a*m*e* *>*=* *'*Ã¶'*;*

Of course, we all know how well I grok encoding issues :-)

Regards, Dave.
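
The mismatch described above can be reproduced in a few lines. The following is only an illustrative Python sketch, not part of the original test. It assumes that the 'Ã¶' in the query is the UTF-8 encoding of 'ö' shown by a viewer that decodes bytes as Latin-1, and that the "byte by byte" form is a 16-bit encoding whose zero bytes are made visible as '*'.

```python
# Assumption: the log viewer decodes raw bytes as Latin-1 (ISO-8859-1).
utf8_bytes = "ö".encode("utf-8")               # two bytes: 0xC3 0xB6
as_seen_in_log = utf8_bytes.decode("latin-1")  # each byte shown as one character

# A 16-bit encoding puts a zero byte next to every ASCII character.
# Each zero byte is rendered as '*' here to mimic the dump in the message.
utf16_bytes = "name >= 'ö'".encode("utf-16-le")
byte_by_byte = utf16_bytes.decode("latin-1").replace("\x00", "*")

print(utf8_bytes.hex(), as_seen_in_log)
print(byte_by_byte)
```

In a query that is consistently 16-bit, the 'ö' would be followed by a '*' as well, which is not what the dump in the message shows.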

From:	Ludek Finstrle <luf(at)pzkagis(dot)cz>
To:	Dave Page <dpage(at)vale-housing(dot)co(dot)uk>
Cc:	Ludek Finstrle <luf(at)pzkagis(dot)cz>, Johann Zuschlag <zuschlag2(at)online(dot)de>, Hiroshi Inoue <inoue(at)tpf(dot)co(dot)jp>, pgsql-odbc(at)postgresql(dot)org
Subject:	Re: psqlODBC-Driver Test / text fields
Date:	2006-03-29 15:43:30
Message-ID:	20060329154330.GM18148@soptik.pzkagis.cz

> I'm not sure I understand that that test is actually valid anyway.

...

> Of course, we all know how well I grok encoding issues :-)

I understand encoding issues (UTF8, UNICODE, ...) at about the same level as you.
My solid knowledge ends at single-byte encodings.

I have seen no progress for some time, so I hope someone on pgsql-general
(maybe peter_e) can help, since the same (or a similar) problem shows up
in the psql client too.

Regards,

Luf

From:	Johann Zuschlag <zuschlag2(at)online(dot)de>
To:	Dave Page <dpage(at)vale-housing(dot)co(dot)uk>
Cc:	Hiroshi Inoue <inoue(at)tpf(dot)co(dot)jp>, pgsql-odbc(at)postgresql(dot)org
Subject:	Unicode is not UTF-8. was :psqlODBC-Driver Test / text fields
Date:	2006-03-30 19:41:06
Message-ID:	442C3452.5020704@online.de

Dave Page schrieb:
> If 'Ã¶' is 'ö', then isn't the query above mixing single and a multibyte encoding? Ie. It should all be single byte - e.g.
>
> select name from kunde where name >= 'ö' order by name asc;
>
> Or all multibyte (displayed byte by byte) whatever that results in:
>
> s*e*l*e*c*t* *n*a*m*e* *f*r*o*m* *k*u*n*d*e* *w*h*e*r*e* *n*a*m*e* *>*=* *'*Ã¶'*;*
>
> Of course, we all know how well I grok encoding issues :-)
>
Hi Dave,

I can understand you. These encoding issues drive me crazy sometimes,
too. :-)

The problem with UTF-8 is that all ASCII characters are represented by
one byte and all non ASCII characters, e.g. German Umlauts, are
represented by two bytes. That's why UTF-8 is called a "variable-length
multibyte encoding". In a pure Unicode world, e.g. U+xxxx with two
bytes, every character is represented by two bytes (fixed-length
multibyte encoding). So Unicode is not equal to UTF-8, even though the
PostgreSQL documentation states that it is.

If you like, see: http://www.utf8-chartable.de/ or some explanation at
http://czyborra.com/utf/

Windows XP supports ANSI, UTF-8, Unicode and Unicode Big Endian.
Unfortunately (or fortunately?) Windows seems to use UTF-8 for European
languages. Hiroshi, can you explain that? I guess the Japanese edition of
Windows XP is using pure 2 byte Unicode.

I can't say anything about psql. But the new psqlodbc driver 7.03.26X
seems to handle that situation very well.

So I suppose the test was valid to a certain extent, since the
characters are handled in this mixed way in Win XP. I still have some
funny behaviour with Unicode in psql (even after setting LC_COLLATE
correctly :-) ).

For my production machines I will use ISO-8859-1 (or ISO-8859-15)
anyway. Then the driver will convert all characters to single byte,
avoiding all kinds of problems.

But feel free to ask me for tests... ;-)

Regards,
Johann

From:	Hiroshi Inoue <inoue(at)tpf(dot)co(dot)jp>
To:	Johann Zuschlag <zuschlag2(at)online(dot)de>
Cc:	Dave Page <dpage(at)vale-housing(dot)co(dot)uk>, pgsql-odbc(at)postgresql(dot)org
Subject:	Re: Unicode is not UTF-8. was :psqlODBC-Driver Test / text fields
Date:	2006-03-30 21:35:12
Message-ID:	442C4F10.3090004@tpf.co.jp

Johann Zuschlag wrote:

> Dave Page schrieb:
>
>> If 'Ã¶' is 'ö', then isn't the query above mixing single and a
>> multibyte encoding? Ie. It should all be single byte - e.g.
>>
>> select name from kunde where name >= 'ö' order by name asc;
>>
>> Or all multibyte (displayed byte by byte) whatever that results in:
>>
>> s*e*l*e*c*t* *n*a*m*e* *f*r*o*m* *k*u*n*d*e* *w*h*e*r*e* *n*a*m*e*
>> *>*=* *'*Ã¶'*;*
>>
>> Of course, we all know how well I grok encoding issues :-)
>>
>
> Hi Dave,
>
> I can understand you. This encoding issues drive me also crazy some
> times. :-)
>
> The problem with UTF-8 is that all ASCII characters are represented by
> one byte and all non ASCII characters, e.g. German Umlauts, are
> represented by two bytes. That's why UTF-8 is called a
> "variable-length multibyte encoding". In a pure Unicode world, e.g.
> U+xxxx with two bytes, every character is represented by two bytes
> (fixed-length multibyte encoding). So Unicode is not equal to UTF-8,
> even though the PostgreSQL documentation is stating that.
>
> If you like, see: http://www.utf8-chartable.de/ or some explanation at
> http://czyborra.com/utf/
>
> Windows XP supports ANSI, UTF-8, Unicode and Unicode Big Endian.
> Unfortunately (or fortunately?) Windows seems to use UTF-8 for
> European languages. Hiroshi can you explain that? I guess the Japanese
> edition of Windows XP is using pure 2 byte Unicode.

Unicode ODBC drivers handle UCS-2, not UTF-8, even in a European
environment. Unfortunately PostgreSQL doesn't handle UCS-2 directly
(because it could contain NULL bytes in the string), so the Unicode
driver sets client_encoding to UTF-8 automatically and, when sending
queries, converts the UCS-2 data to UTF-8 data, which the PostgreSQL
backend can understand. So what you see in the backend log is UTF-8.
The backend then converts the UTF-8 data to the server encoding.
In the end, the locale setting you need (especially LC_COLLATE) is
the one which matches the backend encoding.

>
> I can't say anything about psql. But the new psqlodbc driver 7.03.26X
> seems to handle that situation very well.
>
> So I suppose the test was valid to a certain extend,

Yes, thanks. I can't test the LATINxx encodings myself.

regards,
Hiroshi Inoue
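
The two conversion steps described above can be sketched roughly as follows. This is an illustration in Python, not driver or backend code: `driver_to_wire` and `backend_to_storage` are hypothetical stand-ins, and the 16-bit data from the application is assumed to be little-endian.

```python
def driver_to_wire(wide_bytes: bytes) -> bytes:
    """Hypothetical stand-in for the driver step: 16-bit data in, UTF-8 out."""
    return wide_bytes.decode("utf-16-le").encode("utf-8")

def backend_to_storage(wire_bytes: bytes, server_encoding: str) -> bytes:
    """Hypothetical stand-in for the backend step: UTF-8 in, server encoding out."""
    return wire_bytes.decode("utf-8").encode(server_encoding)

query = "select name from kunde where name >= 'ö'"
from_application = query.encode("utf-16-le")    # what a Unicode ODBC app hands over
on_the_wire = driver_to_wire(from_application)  # what would appear in the backend log
stored_literal = backend_to_storage("ö".encode("utf-8"), "latin-1")

print(b"\x00" in from_application)  # the 16-bit form contains zero bytes
print(b"\x00" in on_the_wire)       # the UTF-8 form does not
print(stored_literal.hex())         # the single byte a LATIN1 database would hold
```

Under these assumptions, the only stage at which a single-byte locale could compare the literal correctly is the last one, which fits the point about matching LC_COLLATE to the backend encoding.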

From:	Bart Samwel <bart(at)samwel(dot)tk>
To:	Johann Zuschlag <zuschlag2(at)online(dot)de>
Cc:	Dave Page <dpage(at)vale-housing(dot)co(dot)uk>, Hiroshi Inoue <inoue(at)tpf(dot)co(dot)jp>, pgsql-odbc(at)postgresql(dot)org
Subject:	Re: Unicode is not UTF-8. was :psqlODBC-Driver Test / text
Date:	2006-03-30 21:36:44
Message-ID:	442C4F6C.2000607@samwel.tk

Johann Zuschlag wrote:
> The problem with UTF-8 is that all ASCII characters are represented by
> one byte and all non ASCII characters, e.g. German Umlauts, are
> represented by two bytes. That's why UTF-8 is called a "variable-length
> multibyte encoding". In a pure Unicode world, e.g. U+xxxx with two
> bytes, every character is represented by two bytes (fixed-length
> multibyte encoding). So Unicode is not equal to UTF-8, even though the
> PostgreSQL documentation is stating that.

Well, it's actually even more complicated, because Unicode is actually a
32-bit character set. There is actually UTF8 (variable-length multibyte,
8 bits per unit), UTF16 (variable-length multibyte) and UTF32
(fixed-length multibyte). There is also UCS2 (fixed-length 16-bit),
which is limited to the 16 bits of the Basic Multilingual Plane, and
UCS4, which is functionally identical to UTF32. UTF-8 actually supports
up to 4 bytes per character, so it is more complete than the purely
16-bit UCS-2. Any of the variable-length encodings, and the 32-bit
UTF-32 and UCS-4 encodings can represent the whole of the character set.
A pure Unicode world can use any of those encodings, so it's a tradeoff.
If you want a direct relationship between the number of characters in a
string and the number of bytes taken, use a fixed-length encoding. If
you want to be able to encode everything, use a variable-length encoding
or a 32-bit encoding. If you want to use little space, use an 8-bit
encoding. That's it.

> Windows XP supports ANSI, UTF-8, Unicode and Unicode Big Endian.
> Unfortunately (or fortunately?) Windows seems to use UTF-8 for European
> languages. Hiroshi can you explain that? I guess the Japanese edition of
> Windows XP is using pure 2 byte Unicode.

In fact, the Win32 API is UTF-16 even in European languages (it started
out as UCS-2 but became UTF-16 when Unicode went 32-bit :-) ), but it
provides an 8-bit compatibility interface. I don't know whether the 8-bit
encoding is UTF-8 or plain 8-bit code pages, though.

Reference: http://en.wikipedia.org/wiki/Unicode

Cheers,
Bart
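
The size tradeoff between the encodings listed above can be checked with a short Python sketch. The sample characters are chosen here purely for illustration (ASCII, a German umlaut, the euro sign, and one character outside the Basic Multilingual Plane), and `encoded_sizes` is a hypothetical helper name.

```python
# Sample characters: ASCII, a German umlaut, the euro sign, and
# U+1D11E (MUSICAL SYMBOL G CLEF), which lies outside the BMP.
samples = ["A", "Ä", "€", "\U0001D11E"]

def encoded_sizes(ch):
    """Return the number of bytes ch occupies in UTF-8, UTF-16 and UTF-32."""
    return tuple(len(ch.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le"))

for ch in samples:
    print(f"U+{ord(ch):04X}", encoded_sizes(ch))
```

The last sample needs four bytes in every one of these encodings, and it is exactly the kind of character that a purely 16-bit UCS-2 cannot represent.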

From:	Johann Zuschlag <zuschlag2(at)online(dot)de>
To:	Hiroshi Inoue <inoue(at)tpf(dot)co(dot)jp>, Dave Page <dpage(at)vale-housing(dot)co(dot)uk>, pgsql-odbc(at)postgresql(dot)org
Subject:	Re: Unicode is not UTF-8. was :psqlODBC-Driver Test / text
Date:	2006-03-31 16:51:07
Message-ID:	442D5DFB.4080501@online.de

Hiroshi Inoue schrieb:
>
> Unicode ODBC drivers handle UCS-2 not UTF-8 even in European
> environemt. Unfortunately PostgreSQL doesn't handle UCS-2
> directly(because it could contain NULL bytes in the string), the
> unicode driver sets the client_encoding to UTF-8 automatically and
> converts from UCS-2 data to UTF-8 data which the PostgreSQL backend
> can understands when sending queries. So what you
> can see in the backend log is UTF-8. Then the backend converts from
> UTF-8 data to the server encoding data. After all, the locale
> (especially LC_COLLATE) setting you need is the one which matches the
> backend encoding.
>
Hmm..., so Windows XP uses UCS-2 or, to be more correct (as Bart
mentioned), UTF-16 (which is nearly the same, except for the surrogates).
That is converted to UTF-8, sent to the backend and then converted to
the proper locale and stored. I've read about the problems with the NULL
bytes on Unix machines.

Let's have two examples:
1.
backend-1 = ISO8859-1
backend-2 = UTF-8

'A' = U+0041 (does windows use big-endian?)

Win UCS-2: U+0041
ODBC UTF-8: U+41
backend-1 stores = 0x41
backend-2 stores = U+41

2.
'Ä' = U+00C4 (german A-Umlaut)

Win UCS-2: U+00C4
ODBC UTF-8: U+C384
backend-1 stores = 0xC4
backend-2 stores = U+C384

Did I get that right? So I have to be really careful when testing.

Regards,
Johann

From:	Johann Zuschlag <zuschlag2(at)online(dot)de>
To:
Cc:	Hiroshi Inoue <inoue(at)tpf(dot)co(dot)jp>, Dave Page <dpage(at)vale-housing(dot)co(dot)uk>, pgsql-odbc(at)postgresql(dot)org
Subject:	Re: Unicode is not UTF-8. was :psqlODBC-Driver Test / text
Date:	2006-03-31 16:58:57
Message-ID:	442D5FD1.4010909@online.de

Johann Zuschlag schrieb:
> Let's have two examples:
> 1.
> backend-1 = ISO8859-1
> backend-2 = UTF-8
>
> 'A' = U+0041 (does windows use big-endian?)
>
> Win UCS-2: U+0041
> ODBC UTF-8: U+41
> backend-1 stores = 0x41
> backend-2 stores = U+41
>
> 2.
> 'Ä' = U+00C4 (german A-Umlaut)
>
> Win UCS-2: U+00C4
> ODBC UTF-8: U+C384
> backend-1 stores = 0xC4
> backend-2 stores = U+C384
>
> Did I get that right? So I have to be really careful when testing.
>
No, wrong again. Or is it more like this:

1.
a) locale = ISO8859-1
backend-1 = LATIN1

b) locale = UTF-8
backend-2 = Unicode

'A' = U+0041 (does windows use big-endian?)

Win UCS-2: U+0041
ODBC UTF-8: U+41
backend-1 stores = U+41
backend-2 stores = U+0041

2.
'Ä' = U+00C4 (german A-Umlaut)

Win UCS-2: U+00C4
ODBC UTF-8: U+C384
backend-1 stores = 0xC4
backend-2 stores = U+00C4

Did I get that right?

Regards,
Johann
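
As a cross-check on the two examples above, the sketch below prints the actual bytes for 'A' and 'Ä' at each hop. It is an illustration in Python rather than anything taken from the driver; it assumes the Windows side holds 16-bit little-endian data and that a UNICODE database stores UTF-8. `stages` is a hypothetical helper. Note that U+ notation names a code point, whereas the values on the wire and on disk are byte sequences, shown here in hex.

```python
def stages(ch: str) -> dict:
    """Hex bytes of one character at each hop (encoding names are Python's)."""
    return {
        "code point": f"U+{ord(ch):04X}",
        "Windows, 16-bit little-endian": ch.encode("utf-16-le").hex(),
        "client_encoding UTF-8": ch.encode("utf-8").hex(),
        "LATIN1 database": ch.encode("latin-1").hex(),
        "UNICODE (UTF-8) database": ch.encode("utf-8").hex(),
    }

for ch in ("A", "Ä"):
    print(stages(ch))
```

For 'A' every 8-bit stage is the single byte 41; for 'Ä' the UTF-8 stages are the two bytes c3 84 and the LATIN1 stage is the single byte c4.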

From:	Marc Herbert <Marc(dot)Herbert(at)continuent(dot)com>
To:	pgsql-odbc(at)postgresql(dot)org
Subject:	Re: Unicode is not UTF-8. was :psqlODBC-Driver Test / text
Date:	2006-03-31 18:47:05
Message-ID:	khjzmj6ih92.fsf@meije.emic.fr

Johann Zuschlag <zuschlag2(at)online(dot)de> writes:

> 'A' = U+0041 (does windows use big-endian?)

Argh, please do not make it even more complex than it needs to be!

Endianness is, luckily, an _independent_ issue. You only care about it
at the low-low level when dealing with files or network sockets, but
then it's over and you never want to hear about it anymore at a higher
level.

So U+0041 is an integer whose value is:

zero thousand zero hundred forty one

and this is always true, whatever byte ordering the processor uses.
You don't need to know more than this, even when converting
to UTF-8 or anything else.
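
A small Python sketch of that point, offered as an illustration only: the code point is one integer, and byte order appears only when that integer is serialized.

```python
code_point = ord("A")              # the integer 0x41, regardless of CPU

big = "A".encode("utf-16-be")      # bytes 00 41
little = "A".encode("utf-16-le")   # bytes 41 00

# Byte order only matters at the serialization boundary; decoding with the
# matching order recovers the same integer either way.
same = ord(big.decode("utf-16-be")) == ord(little.decode("utf-16-le")) == code_point
print(hex(code_point), big.hex(), little.hex(), same)
```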

From:	Marc Herbert <Marc(dot)Herbert(at)continuent(dot)com>
To:	pgsql-odbc(at)postgresql(dot)org
Subject:	Re: Unicode is not UTF-8. was :psqlODBC-Driver Test / text
Date:	2006-03-31 19:02:38
Message-ID:	khjsloyigj5.fsf@meije.emic.fr

Johann Zuschlag <zuschlag2(at)online(dot)de> writes:

> Hmm..., so Windows XP uses UCS-2 or do be more correct (like Bart
> mentioned) UTF-16 (which is nearly the same, except for the
> surrogates).

It's nearly the same... but that makes a huge difference.

The reason to use a fixed-length character encoding in memory is
speed. It saves a lot of time when computing string lengths,
looking for certain characters (isalnum(), ...), collating, etc.

If you don't care about all this speed, then you had better stay in a
variable-length encoding like UTF-8, which saves you A LOT of space,
especially with small occidental alphabets.

I think that by moving from UCS-2 to UTF-16 you lose on BOTH sides
[insert some missing benchmarks here]

And you can be sure that it brings a lot of bugs: one bug every
time some string code has been "forgotten" and not updated, still
assuming UCS-2.

Anyway those bugs are only for far-away and unknown countries out of
the BMP so who cares? :-/

So it really looks like a poor compatibility hack to me (Java does it
too).
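
The kind of bug mentioned above can be illustrated in Python. This is a sketch with a hypothetical helper, `utf16_code_units`: code that still assumes UCS-2 treats the number of 16-bit units as the number of characters, which stops being true outside the BMP.

```python
def utf16_code_units(text: str) -> int:
    """Number of 16-bit units needed to store text as UTF-16."""
    return len(text.encode("utf-16-le")) // 2

bmp_text = "Äpfel"            # five characters, all inside the BMP
non_bmp_text = "\U0001D11E"   # one character, U+1D11E, outside the BMP

print(len(bmp_text), utf16_code_units(bmp_text))
print(len(non_bmp_text), utf16_code_units(non_bmp_text))
```

For the BMP-only string the two counts agree; for the single non-BMP character the code-unit count is two, because it is stored as a surrogate pair.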

From:	Marc Herbert <Marc(dot)Herbert(at)continuent(dot)com>
To:	pgsql-odbc(at)postgresql(dot)org
Subject:	Re: Unicode is not UTF-8. was :psqlODBC-Driver Test / text
Date:	2006-03-31 19:12:13
Message-ID:	khjodzmig36.fsf@meije.emic.fr

Johann Zuschlag <zuschlag2(at)online(dot)de> writes:

> I've read about the problems with the NULL bytes on Unix machines.

This problem is not related to Unix at all but to the programming
language used. Most standard C functions use the zero byte convention
as a string terminator, so it becomes a forbidden character in C.

On the other hand String objects in C++ and Java use a separate length
field, and having NULLs inside a string is a no brainer there.

The ODBC API has been designed for C and Cobol. Cobol does not forbid
zero as a character either. When browsing the ODBC spec you'll notice
it carefully caters for the two ways.

Guess which programming language PostgreSQL is written in.

I suspect Unicode does not care at all about this. After all, Unicode
is just about characters not about strings.
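
The effect of the zero-byte convention can be simulated in Python. This is only a sketch: `c_strlen` is a hypothetical helper that mimics what a C routine such as strlen() would report, and the 16-bit data is assumed to be little-endian.

```python
def c_strlen(buf: bytes) -> int:
    """Length a C-style routine would see: the bytes before the first zero byte."""
    end = buf.find(b"\x00")
    return len(buf) if end == -1 else end

text = "Käse"
as_utf16 = text.encode("utf-16-le")   # 8 bytes, every second one is zero here
as_utf8 = text.encode("utf-8")        # 5 bytes, none of them zero

print(len(as_utf16), c_strlen(as_utf16))
print(len(as_utf8), c_strlen(as_utf8))
```

The 16-bit form is cut off after the first byte, while the UTF-8 form passes through intact, since UTF-8 only produces a zero byte for U+0000 itself.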

From:	Bart Samwel <bart(at)samwel(dot)tk>
To:	Marc Herbert <Marc(dot)Herbert(at)continuent(dot)com>
Cc:	pgsql-odbc(at)postgresql(dot)org
Subject:	Re: Unicode is not UTF-8. was :psqlODBC-Driver Test / text
Date:	2006-04-01 01:26:58
Message-ID:	442DD6E2.5070500@samwel.tk

Marc Herbert wrote:
> Johann Zuschlag <zuschlag2(at)online(dot)de> writes:
>
>> I've read about the problems with the NULL bytes on Unix machines.
>
> This problem is not related to Unix at all but to the programming
> language used. Most standard C functions use the zero byte convention
> as a string terminator, so it becomes a forbidden character in C.
>
> On the other hand String objects in C++ and Java use a separate length
> field, and having NULLs inside a string is a no brainer there.
>
> The ODBC API has been designed for C and Cobol. Cobol does not forbid
> zero as a character either. When browsing the ODBC spec you'll notice
> it carefully caters for the two ways.
>
>
> Guess which programming language is used PostgreSQL.

C++ even introduced a special alternative character type "wchar_t" for
this, just so that people could handle both 8-bit char* and 16-bit
wchar_t* strings. In wchar_t* strings, 8-bit NULs are not a problem
because only 16-bit NULs count (and AFAIK the Unicode standard does
allow this to be interpreted as a NUL aka end-of-string). The downside
of this solution is that no application actually uses it, and everybody
is stuck with 8-bit ASCII plus a random local codepage unless special
support is added. Why didn't they just upgrade chars to 32 bits and be
done with it... :-/

Cheers,
Bart

From:	Hiroshi Inoue <inoue(at)tpf(dot)co(dot)jp>
To:	Johann Zuschlag <zuschlag2(at)online(dot)de>
Cc:	Dave Page <dpage(at)vale-housing(dot)co(dot)uk>, pgsql-odbc(at)postgresql(dot)org
Subject:	Re: Unicode is not UTF-8. was :psqlODBC-Driver Test / text
Date:	2006-04-01 01:35:37
Message-ID:	442DD8E9.2080007@tpf.co.jp

Johann Zuschlag wrote:
> Johann Zuschlag schrieb:
>
>>
> No, again wrong. Or is it more like this:
>
> 1.
> a) locale = ISO8859-1
> backend-1 = LATIN1
>
> b) locale = UTF-8
> backend-2 = Unicode

What do you mean by "Unicode", and are you really setting b)
as above?

First, note that in PostgreSQL the encoding has nothing to do
with the locale setting. PostgreSQL manages the encoding
settings itself, but for the locale setting it relies completely
on the OS environment. This has been a fundamental flaw from the start.
Anyway, you can choose the encoding per database at
createdb time, but the locale settings LC_COLLATE and LC_CTYPE are
fixed at initdb time.

regards,
Hiroshi Inoue

From:	Marc Herbert <Marc(dot)Herbert(at)continuent(dot)com>
To:	pgsql-odbc(at)postgresql(dot)org
Subject:	Re: Unicode is not UTF-8. was :psqlODBC-Driver Test / text
Date:	2006-04-03 08:55:30
Message-ID:	khjacb3hwcd.fsf@meije.emic.fr

Bart Samwel <bart(at)samwel(dot)tk> writes:
>
> C++ even introduced a special alternative character type "wchar_t" for
> this, just so that people could handle both 8-bit char* and 16-bit
> wchar_t* strings. In wchar_t* strings, 8-bit NULs are not a problem
> because only 16-bit NULs count (and AFAIK the Unicode standard does
> allows this to be interpreted as a NUL aka end-of-string). The
> downside of this solution is that no application actually uses it, and
> everybody is stuck with 8-bit ASCII plus a random local codepage
> unless special support is added.

wchar_t is not defined as 16 bits, but as "wide enough to hold any
character of the platform". For instance if the platform uses UCS-4,
then wchar_t is 32 bits wide.

(UTF-16 wchar_t violates this)

I don't clearly see how you want to use an 8-bit NULL to terminate a
(wider) wchar_t array... ?

> Why didn't they just upgrade chars to 32 bits and be done with
> it... :-/

Because "char" was and is still used to store multibyte /
variable-length / encoded characters.
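
The platform dependence of wchar_t is easy to observe from Python through ctypes. This is a sketch for illustration; the typical values in the comment (16 bits on Windows, 32 bits on most Linux systems) are stated as general expectations, not as something from this thread.

```python
import ctypes

# Width of the C compiler's wchar_t on this platform, in bits.
# Typically 16 on Windows and 32 on most Linux systems.
wchar_bits = ctypes.sizeof(ctypes.c_wchar) * 8
print(wchar_bits)
```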

From:	Bart Samwel <bart(at)samwel(dot)tk>
To:	Marc Herbert <Marc(dot)Herbert(at)continuent(dot)com>
Cc:	pgsql-odbc(at)postgresql(dot)org
Subject:	Re: Unicode is not UTF-8. was :psqlODBC-Driver Test / text
Date:	2006-04-03 09:03:40
Message-ID:	4430E4EC.2010208@samwel.tk

Marc Herbert wrote:
> Bart Samwel <bart(at)samwel(dot)tk> writes:
> wchar_t is not defined as 16-bits, but as "wide enough to hold any
> character of the platform". For instance if the platform uses UCS-4,
> then wchar_t is 32 bits wide.
>
> (UTF-16 wchar_t violates this)

Ahhh, this explains a lot. The same assumption used to be true for char
until they came up with UTF-8 char. And they couldn't just upgrade char
because too much code assumed that char was one byte. Then platforms
started to use UCS-2 wchar_t, then upgraded those to UTF-16 because they
couldn't just upgrade wchar_t because too much code assumed that wchar_t
was two bytes. Same pattern. Time to introduce wwchar_t_t. :-)

> I don't clearly see how you want to use a 8-bit NULL to terminate a
> (wider) wchar_t array... ?

This was a backreference to a situation mentioned earlier in the
discussion, where wchar_t buffers couldn't be "tunneled through" a layer
that used char*, as the wider wchar_t characters may contain NUL bytes.

Cheers,
Bart

From:	Johann Zuschlag <zuschlag2(at)online(dot)de>
To:	Hiroshi Inoue <inoue(at)tpf(dot)co(dot)jp>
Cc:	Dave Page <dpage(at)vale-housing(dot)co(dot)uk>, pgsql-odbc(at)postgresql(dot)org
Subject:	Re: Unicode is not UTF-8. was :psqlODBC-Driver Test / text
Date:	2006-04-03 09:17:03
Message-ID:	4430E80F.1010706@online.de

Hiroshi Inoue schrieb:
> Johann Zuschlag wrote:
>> Johann Zuschlag schrieb:
>>
>>>
>> No, again wrong. Or is it more like this:
>>
>> 1.
>> a) locale = ISO8859-1
>> backend-1 = LATIN1
>>
>> b) locale = UTF-8
>> backend-2 = Unicode
>
> What do you mean by the Unicode and are you really setting b)
> as above ?
>
Oh, typo! On my system (Debian Sarge) it is in fact:

b) locale = de_DE.UTF-8
backend-2 = Unicode

Regards,
Johann

From:	Hiroshi Inoue <inoue(at)tpf(dot)co(dot)jp>
To:	Johann Zuschlag <zuschlag2(at)online(dot)de>
Cc:	Dave Page <dpage(at)vale-housing(dot)co(dot)uk>, pgsql-odbc(at)postgresql(dot)org
Subject:	Re: Unicode is not UTF-8. was :psqlODBC-Driver Test / text
Date:	2006-04-04 14:54:03
Message-ID:	4432888B.30801@tpf.co.jp

Johann Zuschlag wrote:

> Hiroshi Inoue schrieb:
>
>> Johann Zuschlag wrote:
>>
>>> Johann Zuschlag schrieb:
>>>
>>>>
>>> No, again wrong. Or is it more like this:
>>>
>>> 1.
>>> a) locale = ISO8859-1
>>> backend-1 = LATIN1
>>>
>>> b) locale = UTF-8
>>> backend-2 = Unicode
>>
>>
>> What do you mean by the Unicode and are you really setting b)
>> as above ?
>>
> Oh, typo! On my system (Debian Sarge) it is in fact:
>
> b) locale = de_DE.UTF-8
> backend-2 = Unicode

What's the result of
show server_encoding
?

regards,
Hiroshi Inoue

From:	Johann Zuschlag <zuschlag2(at)online(dot)de>
To:	Hiroshi Inoue <inoue(at)tpf(dot)co(dot)jp>
Cc:	Dave Page <dpage(at)vale-housing(dot)co(dot)uk>, pgsql-odbc(at)postgresql(dot)org
Subject:	Re: Unicode is not UTF-8
Date:	2006-04-06 17:57:22
Message-ID:	44355682.7080406@online.de

Hiroshi Inoue schrieb:
> b) locale = de_DE.UTF-8
>> backend-2 = Unicode
>
>
> What's the result of
> show server_encoding
Hi Hiroshi,

When I use a Unicode database it is UNICODE,
for LATIN1 it's of course LATIN1.
(and for LATIN9 it's LATIN9)

Thank you for your explanations. Now I understand what is happening
with a Windows client on one end and PostgreSQL on the other end.

I am just experimenting with the three settings above (initdb with
ISO8859-15, locale ISO8859-15 only). So far it seems to work pretty
well. No problems with an application using the psqlodbc driver. All German
characters including the euro sign (LATIN9, UNICODE) seem to work. Sorting
is OK. pgAdminIII produces the same output. psql works as well, except that
psql in conjunction with Unicode still behaves a little strangely. Maybe my
locale setting is not appropriate, but de_DE.UTF-8 doesn't change it.
Will check my console later.

Unfortunately I couldn't test 7.03.262 (psqlodbc35W.dll) since the zip
file has some kind of error.

Regards,
Johann
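
Why the euro sign is listed for LATIN9 and UNICODE but not for LATIN1 can be seen with Python's codecs. This is a sketch for illustration; the encoding names are Python's, with "iso8859-15" corresponding to LATIN9 and "latin-1" to LATIN1.

```python
euro = "€"  # U+20AC

latin9_bytes = euro.encode("iso8859-15")  # LATIN9 has the euro sign at 0xA4
utf8_bytes = euro.encode("utf-8")         # three bytes in UTF-8

try:
    euro.encode("latin-1")                # LATIN1 has no euro sign
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False

print(latin9_bytes.hex(), utf8_bytes.hex(), latin1_ok)
```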

From:	Hiroshi Inoue <inoue(at)tpf(dot)co(dot)jp>
To:	Johann Zuschlag <zuschlag2(at)online(dot)de>
Cc:	Dave Page <dpage(at)vale-housing(dot)co(dot)uk>, pgsql-odbc(at)postgresql(dot)org
Subject:	Re: Unicode is not UTF-8
Date:	2006-04-07 00:21:59
Message-ID:	4435B0A7.1090508@tpf.co.jp

Johann Zuschlag wrote:
> Hiroshi Inoue schrieb:
>
>> b) locale = de_DE.UTF-8
>>
>>> backend-2 = Unicode
>>
>>
>>
>> What's the result of
>> show server_encoding
>
> Hi Hiroshi,

Hi Johann,

> Unfortunately I couldn't test 7.03.262 (psqlodbc35W.dll) since the zip
> file has some kind of error.

Sorry. Fixed.

regards,
Hiroshi Inoue