Re: Problems with charsets, investigated...

Lists:	pgsql-jdbc

From:	Alexandre Aufrere <alexandre(dot)aufrere(at)inet6(dot)fr>
To:	pgsql-jdbc(at)postgresql(dot)org
Subject:	Problems with charsets, investigated...
Date:	2004-08-06 14:43:35
Message-ID:	20040806144335.933C4400E5@smtp.ies.inet6.fr

Hello,

I am using PostgreSQL 7.4.2 and its JDBC driver, straight out of FC2,
along with JDK 1.4.2 from Sun.
I use the JDBC driver in a web app running on the Enhydra appserver. Java
correctly sets its file.encoding property to the charset specified in the
LANG environment variable. However, whatever I set this variable to, the
JDBC driver seems to use UTF-8.

I dug into the code and saw that, in the AbstractJdbc1Connection.java
class, the encoding is always forced to "UNICODE" (therefore forcing UTF-8
on the Java side).
From that, I patched the code to use the file.encoding system property to
guess the charset.

I didn't dig for very long, and from what I see in cvsweb at gborg all of
this may have changed substantially, so I am not sure this would be useful
to you. However, I downloaded the latest dev builds at
jdbc.postgresql.org, and the bad behaviour still seems to be there.

So, did I miss something somewhere? Are you interested in that (frankly
quite ugly) patch?

Regards,

Alexandre Aufrere

From:	Kris Jurka <books(at)ejurka(dot)com>
To:	Alexandre Aufrere <alexandre(dot)aufrere(at)inet6(dot)fr>
Cc:	pgsql-jdbc(at)postgresql(dot)org
Subject:	Re: Problems with charsets, investigated...
Date:	2004-08-06 16:05:54
Message-ID:	Pine.BSO.4.56.0408061059400.11823@leary.csoft.net

On Fri, 6 Aug 2004, Alexandre Aufrere wrote:

> Java correctly sets its file.encoding property to the charset specified
> in the LANG environment variable. However, it appears that whatever i
> set this variable to, the JDBC driver seems to use UTF-8.
>

I'm not sure what problem or issue you think this is addressing, but it is
not something we want to do. The driver communicates with the server
using UTF-8, so you should not be adjusting this and it is entirely
transparent to the user. What you do after retrieving data is your
business and you are welcome to save it or display it in any encoding you
desire, but the driver wants to communicate with the server using UTF-8.
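
A minimal sketch of what "save it or display it in any encoding you
desire" can look like on the application side. It assumes a modern JDK
(java.nio.charset.StandardCharsets does not exist in JDK 1.4) and uses a
literal string in place of a value returned by the driver; the class and
method names are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

class OutputEncodingSketch {
    // Encode a Java String (UTF-16 internally) for an ISO-8859-1 consumer.
    static byte[] toLatin1(String s) {
        return s.getBytes(StandardCharsets.ISO_8859_1);
    }

    // Encode the same String for a UTF-8 consumer.
    static byte[] toUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for a value obtained via ResultSet.getString().
        String fromDriver = "m\u00e8re";
        System.out.println(toLatin1(fromDriver).length); // 4: one byte per character
        System.out.println(toUtf8(fromDriver).length);   // 5: the accented letter takes two bytes
    }
}
```

The String itself is the same in both cases; only the bytes produced at
the output boundary differ.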

Kris Jurka

From:	Alexandre Aufrere <alexandre(dot)aufrere(at)inet6(dot)fr>
To:	pgsql-jdbc(at)postgresql(dot)org
Subject:	Re: Problems with charsets, investigated...
Date:	2004-08-06 18:32:08
Message-ID:	20040806183208.75F47400E5@smtp.ies.inet6.fr

Well, no, actually I want to use LATIN1/ISO-8859-1 everywhere.
So my appserver should get ISO-8859-1 strings from the driver, not UTF-8.
Why? Because we have a whole bunch of apps developed in ISO-8859-1, as
well as a lot of data in LATIN1, and it's out of the question to put
everything in UTF-8/UNICODE.

For me, the driver should get strings encoded according to the system
properties of the JVM it runs in. Or at least there should be a way to
tell the driver what charset to use. In other words, the current behaviour
is precisely NOT transparent to me, because I end up with a database in
LATIN1 whose data is converted to UTF-8 before I retrieve it from the
JDBC driver, which 1) would give me more work to convert back to
ISO-8859-1, and 2) would not be backward compatible (meaning we'd have to
retest a LOT of apps to check we're breaking nothing).

So my hack just gets the file.encoding Java system property, requests
data from the PostgreSQL server in that encoding, and handles it
accordingly (namely, if file.encoding is ISO-8859-1, it requests LATIN1
and handles everything it gets as ISO-8859-1).
Now, IMHO, ideally, the default behaviour of the JDBC driver should be to
get the encoding from the pg_database table and deduce which encoding to
use for the strings. And of course, there should be an easy way to change
that for people who want it another way.

I don't know exactly how it worked in previous versions; the fact is
that with the LANG environment variables set everywhere to
en_US.ISO-8859-1 and the encoding in pg_database set to 8 (LATIN1), it
just worked (we have been using PostgreSQL+Java+Enhydra for a long, long
time). Any change that would force us to handle the charsets explicitly
might be "ideally" right, but it is not backward compatible and will cause
us a lot of problems (and I'm quite sure not only us).

Lastly, it's highly possible that I missed something somewhere, so I
apologize in advance for being utterly dumb ;-)

Regards,

Alexandre Aufrere

From:	Oliver Jowett <oliver(at)opencloud(dot)com>
To:	alexandre(dot)aufrere(at)inet6(dot)fr
Cc:	pgsql-jdbc(at)postgresql(dot)org
Subject:	Re: Problems with charsets, investigated...
Date:	2004-08-06 22:06:30
Message-ID:	411400E6.8060704@opencloud.com

Alexandre Aufrere wrote:
> Hello,
>
> I am using Postgresql 7.4.2 and its JDBC drivers, straight out from a FC2,
> along with JDK 1.4.2 from Sun.
> I use the JDBC driver in a web app using Enhydra appserver. Java correctly
> sets its file.encoding property to the charset specified in the LANG
> environment variable. However, it appears that whatever i set this
> variable to, the JDBC driver seems to use UTF-8.

This is entirely intentional. See below.

> I have digged into the code, and seen that in the
> AbstractJdbc1Connection.java class, the encoding is always forced to
> "UNICODE" (therefore forcing UTF-8 on Java side).
> From that, i patched the code to correctly use the file.encoding system
> property to guess the charset.
>
> As i didn't dig very long, and as it seems from what i see in cvsweb at
> gborg that all this stuff could have changed deeply, i am not sure that
> this would be useful to you. However i downloaded the latest dev builds at
> jdbc.postgresql.org, and it seems the bad behaviour is still there.
>
> So, did i miss something somewhere ? Are you interested in that (frankly
> quite ugly) patch ?

This change doesn't make sense.

The internal representation of Java strings is UTF-16 always. So it
doesn't really matter whether you do:

db encoding -> UTF-8 (done by the server)
UTF-8 -> UTF-16 Java string (done trivially by the driver)

or:

look up db encoding to know how to transcode
db encoding -> UTF-16 Java string (done by the driver)

other than if you do the second option, you have to do a lot more
(unnecessary) work on the driver side. Either way, you still have to
somehow transcode the DB data into unicode.
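
The equivalence of the two routes can be sketched in plain Java. This is
only an illustration under stated assumptions: the server-side conversion
is simulated in Java (in reality PostgreSQL performs it), the stored bytes
are assumed to be LATIN1, and the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

class TranscodingPathsSketch {
    // Route 1: server converts db encoding -> UTF-8, driver decodes UTF-8.
    static String viaUtf8(byte[] latin1Stored) {
        // Simulated server-side step: LATIN1 bytes re-encoded as UTF-8 for the wire.
        byte[] onTheWire = new String(latin1Stored, StandardCharsets.ISO_8859_1)
                .getBytes(StandardCharsets.UTF_8);
        // Driver-side step: UTF-8 bytes decoded into a Java String.
        return new String(onTheWire, StandardCharsets.UTF_8);
    }

    // Route 2: the driver decodes the db encoding directly.
    static String direct(byte[] latin1Stored) {
        return new String(latin1Stored, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        byte[] stored = {0x6D, (byte) 0xE8, 0x72, 0x65}; // "mère" as LATIN1 bytes
        System.out.println(viaUtf8(stored).equals(direct(stored))); // true
    }
}
```

Both routes end in the same Java String; the difference is only where the
transcoding work is done.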

Using file.encoding as a basis for which encoding to use is horribly
broken anyway -- what if that encoding does not match the actual DB
charset? Whatever transcoding happens really needs to be done based on
the actual DB encoding in use.

I'd suggest that your real problem is that you do not have your database
encoding set correctly. If server_encoding is correct, then the server
will do the correct transcoding to UNICODE and everything will be happy
-- you will get correctly formed Java strings and can then encode those
using whatever output encoding you like. If server_encoding is
SQL_ASCII, everything will break horribly as the server has no idea how
the raw data is actually encoded and can't transcode.

If you're exclusively using JDBC to access the database, a UNICODE
database encoding is the right choice since it means the server does not
need to transcode at all when talking to JDBC. It's probably the right
choice even with mixed clients unless you have other clients that don't
understand client_encoding.

This is getting to be a FAQ -- I'm actually looking at disabling support
for JDBC access to SQL_ASCII databases entirely since it breaks so
unpredictably.

-O

From:	Alexandre Aufrere <alexandre(dot)aufrere(at)inet6(dot)fr>
To:	pgsql-jdbc(at)postgresql(dot)org
Subject:	Re: Problems with charsets, investigated...
Date:	2004-08-07 08:41:25
Message-ID:	20040807084125.A6091400E5@smtp.ies.inet6.fr

OK, it seems I was really unable to explain my problem...

1) The database's encoding is set to LATIN1 (we have SQL_ASCII nowhere)
2) The JDBC driver requests data from the database in UNICODE (hard-coded
in the driver)
3) Strings coming from the database are therefore UTF-8-encoded. And they
are correctly transcoded from LATIN1, as the encoding is correctly
specified in pg_database for that database.
4) Java stores them internally as UTF-16... but that's only the internal
representation. Now there seems to be a problem here (see the description
of the work-around below).
5) Java's file.encoding system property is set to ISO-8859-1 (because we
have other data coming from LDAP or the filesystem, which is encoded in
ISO-8859-1 anyway)
6) Our web app chooses to display Java Strings according to
file.encoding, therefore as ISO-8859-1
7) Bing! Problem: we are now interpreting UTF-8-encoded strings (see
points 2/3) as ISO-8859-1
Therefore all the accented characters go wrong!
In all previous versions of the JDBC driver (we started with the one
that came with the PostgreSQL 7.0 series), coupled with the corresponding
version of PostgreSQL, the data was correctly retrieved.

Now, a working work-around looks like:
String correctString = new String(stringFromJdbcDriver.getBytes("ISO-8859-1"), "UTF-8");
As I interpret it, the Java internal transcoding in the driver, from UTF-8
to UTF-16, didn't occur correctly: for some reason the strings were
interpreted as ISO-8859-1 instead of UTF-8, whereas the server was
correctly sending UTF-8/UNICODE strings as requested. Since ISO-8859-1
and UTF-8 both work on 8-bit units and every byte value is a valid
ISO-8859-1 character, this interpretation is technically possible, but in
practice completely wrong.
Now this quick and dirty work-around is really dirty, and we cannot use it
in production.
My patch eliminates the problem, because the JDBC driver gets ISO-8859-1
(aka LATIN1) strings from the server, therefore Java's internal
transcoding into UTF-16 goes OK...

Is there some property/field/parameter somewhere that we didn't set
correctly?

Oh, and server_encoding is set to LATIN1 in the database. Is that wrong
(our data is in LATIN1)? When running queries from command-line psql, we
still get the data correctly... whether we launch it as UTF-8 or
ISO-8859-1: strings always come in the requested encoding, meaning that
it's 100% sure that the server transcodes correctly.

Regards,

Alexandre Aufrere

From:	Oliver Jowett <oliver(at)opencloud(dot)com>
To:	alexandre(dot)aufrere(at)inet6(dot)fr
Cc:	pgsql-jdbc(at)postgresql(dot)org
Subject:	Re: Problems with charsets, investigated...
Date:	2004-08-08 02:29:15
Message-ID:	41158FFB.9060101@opencloud.com

(I'd appreciate a cc: on list posts as the pgsql lists can be unreliable)

Alexandre Aufrere wrote:
> Ok, seems i was really really unable to explain my problem...
>
> 1) Database's encoding is set to LATIN1 (we have SQL_ASCII nowhere)
> 2) JDBC driver requests data to database in UNICODE (hard-coded in driver)
> 3) String coming from database therefore are UTF-8-encoded. And they are
> correctly transcoded from LATIN1, as the encoding is correctly specified
> in the pg_database for that database.

This all sounds correct.

> 4) Java stores internally as UTF-16... but that's only the internal
> representation. Now there seems to be a problem here (see description of
> the work-around below).

The internal representation is always UTF-16, yes -- you must transcode
on output in general.

> 5) Java's file.encoding system property is set to ISO-8859-1 (because we
> have other data coming from LDAP or filesystem, which are encoded in
> ISO-8859-1 anyway)
> 6) Our web app choses to display Java Strings accordingly to
> file.encoding, therefore as ISO-8859-1
> 7) Bing ! problem: we are now interpreting UTF8-encoded strings (see point
> 2/3) as ISO-8859-1
> Therefore all the accentuated characters go wrong !

This implies that your web app is not transcoding correctly from UTF-16
(internal string representation) to ISO-8859-1.

How does your web app use file.encoding exactly? Note that the
file.encoding property does *not* control the default encoding used by
String.getBytes(), as I understand it; the default encoding is
JVM-controlled from the system's locale settings.
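
One way to see what a given JVM is actually doing is to print both values
side by side. This is a small sketch, assuming a JDK with
java.nio.charset.Charset.defaultCharset() (added after JDK 1.4); how the
property and the effective default relate can vary between JVM versions
and platforms, so the output is not shown. The class name is hypothetical.

```java
import java.nio.charset.Charset;

class DefaultCharsetSketch {
    // Report the file.encoding property next to the charset the JVM
    // really uses for String.getBytes() and new String(byte[]).
    static String describe() {
        return "file.encoding=" + System.getProperty("file.encoding")
                + " defaultCharset=" + Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

Passing an explicit charset to getBytes() and to readers and writers
avoids depending on either value.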

> In all previous versions of the JDBC driver (we started with the one
> coming along with postgresql 7.0 series) coupled with the corresponding
> version of postgresql, the data was correctly retrieved.

I think this is luck of the draw more than anything..

> Now, a working work-around looks like:
> String correctString = new
> String(stringFromJdbcDriver.getBytes("ISO-8859-1"),"UTF-8");

This doesn't make sense at all! This means you are interpreting
ISO-8859-1 encoded bytes as UTF-8, which is nonsense.

> My patch eliminates the problem, because the JDBC driver gets ISO-8859-1
> (aka LATIN1) strings from the server, therefore java internal transcoding
> into UTF-16 goes ok...

It's still the wrong thing to do! I'm sure there is another bug here
that is causing the underlying problem. There should be no problem with
converting from client_encoding = UNICODE to Java's UTF-16.

What driver version *exactly* are you using? It's possible that you've
hit a driver bug of some sort that is fixed in the current driver
(specifically, I think build 302 was broken wrt. UTF-8 conversions --
but it was only available briefly). Have you tried with the current
development driver from jdbc.postgresql.org?

Can you show me the code your web app uses to display the Strings it
gets from the driver in ISO-8859-1?

Can you dump out the *characters* of the problem Strings you get from
the driver, one character at a time, and see what numeric values you're
getting and whether they are the right UTF-16 values you expect? i.e.

for (int i = 0; i < str.length(); ++i) {
    System.out.println(" offset " + i + " value " + (int)str.charAt(i));
}
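
A slightly expanded sketch of the same diagnostic, printing each char as
a hex code unit so that a correctly decoded string and a mis-decoded one
can be compared at a glance. The sample strings are made up for
illustration and the class name is hypothetical.

```java
class CharDumpSketch {
    // Render every UTF-16 code unit of the string as U+XXXX.
    static String dump(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); ++i) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Correctly decoded: the accented letter is a single char, U+00E8.
        System.out.println(dump("m\u00e8re"));       // U+006D U+00E8 U+0072 U+0065
        // UTF-8 bytes misread as ISO-8859-1: two chars, U+00C3 U+00A8.
        System.out.println(dump("m\u00c3\u00a8re")); // U+006D U+00C3 U+00A8 U+0072 U+0065
    }
}
```

If the dump shows U+00C3 followed by another char in the U+0080 to U+00BF
range where one accented letter is expected, the bytes were most likely
decoded with the wrong charset somewhere before the String was built.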

Can you provide a pg_dump (LATIN1 encoding I assume) plus sample
testcase that shows off the problem?

-O

From:	Oliver Jowett <oliver(at)opencloud(dot)com>
To:	alexandre(dot)aufrere(at)inet6(dot)fr
Cc:	pgsql-jdbc(at)postgresql(dot)org
Subject:	Re: Problems with charsets, investigated...
Date:	2004-08-08 02:44:58
Message-ID:	411593AA.3050906@opencloud.com

Oliver Jowett wrote:

> (specifically, I think build 302 was broken wrt. UTF-8 conversions --
> but it was only available briefly).

Sorry, it was build 303 that was broken.

-O

From:	Jan de Visser <jdevisser(at)digitalfairway(dot)com>
To:	pgsql-jdbc(at)postgresql(dot)org
Subject:	Re: Problems with charsets, investigated...
Date:	2004-08-08 03:12:36
Message-ID:	200408072312.36614.jdevisser@digitalfairway.com

On August 7, 2004 10:29 pm, Oliver Jowett wrote:
> > 6) Our web app choses to display Java Strings accordingly to
> > file.encoding, therefore as ISO-8859-1
> > 7) Bing ! problem: we are now interpreting UTF8-encoded strings (see
> > point 2/3) as ISO-8859-1
> > Therefore all the accentuated characters go wrong !
>
> This implies that your web app is not transcoding correctly from UTF-16
> (internal string representation) to ISO-8859-1.
>
> How does your web app use file.encoding exactly? Note that the
> file.encoding property does *not* control the default encoding used by
> String.getBytes(), as I understand it; the default encoding is
> JVM-controlled from the system's locale settings.

Hrm. This rings a bell. We use JBoss 3.2.3, which ships with a broken tomcat.
That particular tomcat version hardcodes a transcoding to LATIN1. Caused us a
lot of hair-pulling, and we fixed it by patching the offending code in tomcat
(in coyote, actually). This is the README in our lib-hacks CVS dir:

"
This directory contains a patch for tomcat/coyote 4.1.29 as shipped with jboss
3.2.3. It will set the default encoding to UTF-8 as opposed to ISO-8859-1,
and will set the encoding on the query string as well as on the request body.

The build script will compile the patched files, and add the patched classes
to the tomcat jars stored in jboss/server/deploy/jbossweb-tomcat41.sar.
"

Could this be his problem?

JdV!!

--
--------------------------------------------------------------
Jan de Visser jdevisser(at)digitalfairway(dot)com

Baruk Khazad! Khazad ai-menu!
--------------------------------------------------------------

From:	Alexandre Aufrere <alexandre(dot)aufrere(at)inet6(dot)fr>
To:	pgsql-jdbc(at)postgresql(dot)org
Cc:	jdevisser(at)digitalfairway(dot)com
Subject:	Re: Problems with charsets, investigated...
Date:	2004-08-08 08:04:02
Message-ID:	20040808080402.ED4E5400E5@smtp.ies.inet6.fr

Well, thanks for the info, but I doubt that's the problem. I do a debug
output (on the command line, therefore not through tomcat) just after
getting the data from Enhydra's DODS (the relational-object layer).
So either the bug is in DODS (but then why didn't it show up with all
previous versions?), or it's something in the JDBC driver. From looking
through DODS' templates, I know that no copy is made (I mean there's no
getBytes() involved), so the problem shouldn't be there...

Thanks for the clue though, it's interesting to know!

Alexandre Aufrere

From:	Alexandre Aufrere <alexandre(dot)aufrere(at)inet6(dot)fr>
To:	oliver(at)opencloud(dot)com
Cc:	pgsql-jdbc(at)postgresql(dot)org
Subject:	Re: Problems with charsets, investigated...
Date:	2004-08-08 08:18:54
Message-ID:	20040808081854.793B0400E5@smtp.ies.inet6.fr

On Sun, 08 Aug 2004 14:29:15 +1200, Oliver Jowett
<oliver(at)opencloud(dot)com> wrote:
> > 5) Java's file.encoding system property is set to ISO-8859-1 (because
> > we have other data coming from LDAP or filesystem, which are encoded in
> > ISO-8859-1 anyway)
> > 6) Our web app choses to display Java Strings accordingly to
> > file.encoding, therefore as ISO-8859-1
> > 7) Bing ! problem: we are now interpreting UTF8-encoded strings (see
> > point 2/3) as ISO-8859-1
> > Therefore all the accentuated characters go wrong !
>
> This implies that your web app is not transcoding correctly from UTF-16
> (internal string representation) to ISO-8859-1.

OK, but to test, we simply do a debug output (nothing more than a
System.out) of the strings. Normally it's Java itself that does the
transcoding there according to the environment variables, no? Moreover,
if we read an ISO-8859-1-encoded string from the filesystem or LDAP, it is
displayed correctly in the debug output.

> How does your web app use file.encoding exactly? Note that the
> file.encoding property does *not* control the default encoding used by
> String.getBytes(), as I understand it; the default encoding is
> JVM-controlled from the system's locale settings.

All system locale settings (i.e. the LANG/LC_* environment variables) are
correctly set to en_US.iso-8859-1. The file.encoding property normally
just reflects that.

> > In all previous versions of the JDBC driver (we started with the one
> > coming along with postgresql 7.0 series) coupled with the
> > corresponding version of postgresql, the data was correctly retrieved.
>
> I think this is luck of the draw more than anything..
>
> > Now, a working work-around looks like:
> > String correctString = new
> > String(stringFromJdbcDriver.getBytes("ISO-8859-1"),"UTF-8");
>
> This doesn't make sense at all! This means you are interpreting
> ISO-8859-1 encoded bytes as UTF-8, which is nonsense.

It makes sense if, on the way into Java, UTF-8 strings were presented to
Java as ISO-8859-1: since every byte value is a valid ISO-8859-1
character, a UTF-8 byte sequence can always be decoded as ISO-8859-1.
For instance, the word 'mère', encoded as UTF-8 and displayed as
ISO-8859-1, gives something like 'mÃ¨re'. Java then transcodes that
into UTF-16, thinking that it's ISO-8859-1 when it's actually UTF-8. That
ugly work-around simply does the reverse job.
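
The mechanism described here can be reproduced without a database. A
minimal sketch, assuming a modern JDK for StandardCharsets; the class and
method names are hypothetical, and the mis-decoding step is simulated
rather than taken from any real driver or library code.

```java
import java.nio.charset.StandardCharsets;

class MojibakeSketch {
    // Simulate the fault: UTF-8 bytes decoded as if they were ISO-8859-1.
    static String misdecode(String original) {
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    // The work-around from the thread: recover the original bytes by
    // encoding as ISO-8859-1, then decode them properly as UTF-8.
    static String repair(String garbled) {
        byte[] utf8 = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(utf8, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String garbled = misdecode("m\u00e8re");
        System.out.println(garbled.length());                    // 5 instead of 4
        System.out.println(repair(garbled).equals("m\u00e8re")); // true
    }
}
```

The repair only works because ISO-8859-1 maps every byte to exactly one
char and back. It hides the symptom; the wrong decoding step is still
somewhere upstream.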

> > My patch eliminates the problem, because the JDBC driver gets
> > ISO-8859-1 (aka LATIN1) strings from the server, therefore java
> > internal transcoding into UTF-16 goes ok...
>
> It's still the wrong thing to do! I'm sure there is another bug here
> that is causing the underlying problem. There should be no problem with
> converting from client_encoding = UNICODE to Java's UTF-16.

Yes, I agree it sounds extremely strange! However, the problem still
seems to be there.

> What driver version *exactly* are you using? It's possible that you've
> hit a driver bug of some sort that is fixed in the current driver
> (specifically, I think build 302 was broken wrt. UTF-8 conversions --
> but it was only available briefly). Have you tried with the current
> development driver from jdbc.postgresql.org?

As I said in my first post, I'm using the driver that comes with FC2, and
I've also tried all the drivers available on jdbc.postgresql.org.

> Can you show me the code your web app uses to display the Strings it
> gets from the driver in ISO-8859-1?
>
> Can you dump out the *characters* of the problem Strings you get from
> the driver, one character at a time, and see what numeric values you're
> getting and whether they are the right UTF-16 values you expect? i.e.
>
> for (int i = 0; i < str.length(); ++i) {
> System.out.println(" offset " + i + " value " + (int)str.charAt(i));
> }
>
> Can you provide a pg_dump (LATIN1 encoding I assume) plus sample
> testcase that shows off the problem?

Well, I'll investigate more tomorrow at work, and try to set up a simple
test program to understand more deeply what's going on.
Currently, we see the problem by doing a debug output (simply a
System.out) from Enhydra's DODS (which is the relational-object layer).
From what I've seen in DODS (though maybe I didn't dig enough), DODS
does not manipulate Strings coming from the JDBC driver when they are of
type VARCHAR, therefore it shouldn't be the source of the problem.
As for the charAt thing, its output is not correct either; I tried...

Thank you for your advice and time,

Alexandre Aufrere

From:	Oliver Jowett <oliver(at)opencloud(dot)com>
To:	alexandre(dot)aufrere(at)inet6(dot)fr
Cc:	pgsql-jdbc(at)postgresql(dot)org
Subject:	Re: Problems with charsets, investigated...
Date:	2004-08-08 11:57:02
Message-ID:	4116150E.2080302@opencloud.com

Alexandre Aufrere wrote:

>>Can you dump out the *characters* of the problem Strings you get from
>>the driver, one character at a time, and see what numeric values you're
>>getting and whether they are the right UTF-16 values you expect? i.e.
>>
>> for (int i = 0; i < str.length(); ++i) {
>> System.out.println(" offset " + i + " value " + (int)str.charAt(i));
>> }
>>
>>Can you provide a pg_dump (LATIN1 encoding I assume) plus sample
>>testcase that shows off the problem?
>
>
> well, i'll investigate more tomorrow, at work, and try to set up a simple
> test program to try to understand deeper what's going on.
> currently, we see the problem by doing a debug output (simply a
> System.out) from Enhydra's DODS (which is the relational-object layer).
> From what i've seen in DODS (maybe, though, i didn't dig enough), DODS
> does not manipulate Strings coming from the JDBC driver when they are of
> type VARCHAR, therefore it shouldn't be the source of the problem.
> about the charAt thing, it is as well not correct, i tried...

Can you show me the charAt output for some sample data that shows the
problem? (and the corresponding string as it's stored in the database)

-O

From:	Alexandre Aufrere <alexandre(dot)aufrere(at)inet6(dot)fr>
To:	oliver(at)opencloud(dot)com
Cc:	pgsql-jdbc(at)postgresql(dot)org
Subject:	Re: Problems with charsets, investigated...
Date:	2004-08-09 10:09:17
Message-ID:	20040809100917.22DB44815C@smtp.ies.inet6.fr

Erm, OK...
I should have investigated more deeply before.
Apparently someone committed an old, buggy version of DODS to CVS, which
was reading the fields with an InputStreamReader, getting the data with
getBinaryStream.
So you can imagine what was happening!
The correct new version uses getCharacterStream, and of course there is no
problem.
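
What was presumably happening can be sketched without a database. The
assumptions are labeled: a ByteArrayInputStream of UTF-8 bytes stands in
for what getBinaryStream returned, a StringReader stands in for
getCharacterStream, and ISO-8859-1 stands in for the platform default
charset that an InputStreamReader picks up when none is given. The class
and method names are hypothetical and this is not the DODS code.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class StreamDecodingSketch {
    // Drain a Reader into a String.
    static String readAll(Reader reader) {
        try {
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[256];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Buggy pattern: raw bytes decoded by the application with a charset
    // that does not match the encoding the bytes are really in.
    static String viaBinaryStream(InputStream rawBytes, Charset assumed) {
        return readAll(new InputStreamReader(rawBytes, assumed));
    }

    public static void main(String[] args) {
        String value = "m\u00e8re";
        // Assumed: the wire bytes are UTF-8, the JVM default is ISO-8859-1.
        InputStream binary = new ByteArrayInputStream(value.getBytes(StandardCharsets.UTF_8));
        System.out.println(viaBinaryStream(binary, StandardCharsets.ISO_8859_1).equals(value)); // false
        // Character stream: decoding has already been done correctly.
        System.out.println(readAll(new StringReader(value)).equals(value)); // true
    }
}
```

With a character stream the application never sees bytes, so there is no
charset for it to get wrong.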

Thanks a lot for your help, it helped me a lot in narrowing down my
investigation!

Regards,

Alexandre Aufrere