Re: 8.0.0beta4: "copy" and "client_encoding"

Lists: pgsql-jdbc
From: mbch67(at)yahoo(dot)com
To: pgsql-jdbc(at)postgresql(dot)org
Subject: 8.0.0beta4: "copy" and "client_encoding"
Date: 2004-11-03 16:03:01
Message-ID: fe065ce.0411030803.4616e8c4@posting.google.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-jdbc

Hi,

The following statement worked fine with 7.4 but fails under 8.0.0beta4:

SET CLIENT_ENCODING TO 'LATIN1';

Error message:

08006
The servers client_encoding parameter was changed to {0}.
The JDBC driver requires client_encoding to be UNICODE for
correct operation.

Is this a bug or a new restriction of the JDBC driver/server backend?

That's the (faulty) 8.0.0beta4 configuration:

Default client encoding: LATIN1
Database encoding: UNICODE
DBMS: 8.0.0beta4
JDBC driver: pgdev.307.jdbc3.jar
File encoding: LATIN1

That's the 7.4.x configuration:

Default client encoding: LATIN1
Database encoding: UNICODE
DBMS: 7.4.x
JDBC driver: pg74.213.jdbc4.jar
File encoding: LATIN1

Adrian


From: Kris Jurka <books(at)ejurka(dot)com>
To: mbch67(at)yahoo(dot)com
Cc: pgsql-jdbc(at)postgresql(dot)org
Subject: Re: 8.0.0beta4: "copy" and "client_encoding"
Date: 2004-11-03 19:42:36
Message-ID: Pine.BSO.4.56.0411031438410.6157@leary.csoft.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-jdbc

On Wed, 3 Nov 2004 mbch67(at)yahoo(dot)com wrote:

> The following statement worked fine with 7.4 but fails under 8.0.0beta4:
>
> SET CLIENT_ENCODING TO 'LATIN1';

You were never supposed to do this, but only in 8.0 do we actually detect
it and stop you. Why do you need to do this in the first place?

Kris Jurka


From: Oliver Jowett <oliver(at)opencloud(dot)com>
To: mbch67(at)yahoo(dot)com
Cc: pgsql-jdbc(at)postgresql(dot)org
Subject: Re: 8.0.0beta4: "copy" and "client_encoding"
Date: 2004-11-03 20:12:56
Message-ID: 41893BC8.6000201@opencloud.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-jdbc

mbch67(at)yahoo(dot)com wrote:
> Hi,
>
> The following statement worked fine with 7.4 but fails under 8.0.0beta4:
>
> SET CLIENT_ENCODING TO 'LATIN1';
>
> Error message:
>
> 08006
> The servers client_encoding parameter was changed to {0}.
> The JDBC driver requires client_encoding to be UNICODE for
> correct operation.
>
> Is this a bug or a new restriction of the JDBC driver/server backend?

It's an old restriction of the JDBC driver. It's just that only the new
driver detects and complains about attempts to change the encoding.

The underlying problem is that currently the driver sets client_encoding
= UNICODE (which matches the internal Java string representation well)
and expects it to stay that way for the duration of the connection. If
you manually change client_encoding, then the driver will happily try to
decode those LATIN1-encoded strings as UTF8-encoded strings, and get
very confused..

Why do you need to change client_encoding?

-O


From: mbch67(at)yahoo(dot)com
To: pgsql-jdbc(at)postgresql(dot)org
Subject: Re: 8.0.0beta4: "copy" and "client_encoding"
Date: 2004-11-04 07:08:43
Message-ID: fe065ce.0411032308.5b7a24b3@posting.google.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-jdbc

Thanks Kris and Oliver for your explanations.

Why I want to set the client_encoding to latin1? That's a fair
question. Unfortunately I removed the statement following the COPY
command to keep the posting short:

SET CLIENT_ENCODING TO 'LATIN1';
COPY temp_test FROM '/home/postgres/temp_test.dat';

I want to load a latin1-encoded file using the COPY command. Not
setting the encoding leads to the following error message:

22021
ERROR: Unicode characters greater than or equal to 0x10000 are
not supported

Executing the same COPY statement with psql works fine.

Cheers,
Adrian


From: Oliver Jowett <oliver(at)opencloud(dot)com>
To: mbch67(at)yahoo(dot)com
Cc: pgsql-jdbc(at)postgresql(dot)org
Subject: Re: 8.0.0beta4: "copy" and "client_encoding"
Date: 2004-11-04 10:17:22
Message-ID: 418A01B2.70104@opencloud.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-jdbc

mbch67(at)yahoo(dot)com wrote:
> Thanks Kris and Oliver for your explanations.
>
> Why I want to set the client_encoding to latin1? That's a fair
> question. Unfortunately I removed the statement following the COPY
> command to keep the posting short:
>
> SET CLIENT_ENCODING TO 'LATIN1';
> COPY temp_test FROM '/home/postgres/temp_test.dat';
>
> I want to load a latin1-encoded file using the COPY command.

Hmm. I am suprised that COPY does not let you specify the encoding of
the input file. Using client_encoding for this seems wrong: there are
some cases it can't handle, e.g. trying to load a LATIN1-encoded file
into a table that has a name that can't be represented in LATIN1.

I suppose that in the absence of backend support for this, we could add
some URL parameter that allows client_encoding to be changed, with
suitably dangerous warnings around using it. Then you can temporarily
flip client_encoding to LATIN1 for the duration of the COPY, and revert
it to UNICODE afterwards.

-O


From: mbch67(at)yahoo(dot)com
To: pgsql-jdbc(at)postgresql(dot)org
Subject: Re: 8.0.0beta4: "copy" and "client_encoding"
Date: 2004-11-05 08:19:32
Message-ID: fe065ce.0411050019.464d6929@posting.google.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-jdbc

> I suppose that in the absence of backend support for this, we could add
> some URL parameter that allows client_encoding to be changed, with
> suitably dangerous warnings around using it. Then you can temporarily
> flip client_encoding to LATIN1 for the duration of the COPY, and revert
> it to UNICODE afterwards.
>

To me a temporarly fix is not really the solution. Shouldn't there be
done some more investigations first, e.g.

1. I set LATIN1 as the database (postgresql.conf) default client
encoding. Why does COPY, executed via JDBC not use the right encoding?
=> To me it seems to be a backend problem. Should this be address in
another posting list?

2. Was the decision to disable the "SET CLIENT_ENCODING" command
really a good idea? What about if I am running a server using UNICODE
to store text, my default client encoding is LATIN1 and I want to
import a Korean encoded text file using COPY via JDBC? There is no way
to tell COPY what encoding the input file based on.
In order to be compliant with PSQL I suggest to reactivate the
disabled "SET CLIENT ENCODING" for JDBC.


From: Ulrich Meis <kenobi(at)halifax(dot)rwth-aachen(dot)de>
To: Oliver Jowett <oliver(at)opencloud(dot)com>
Cc: mbch67(at)yahoo(dot)com, pgsql-jdbc(at)postgresql(dot)org
Subject: Re: 8.0.0beta4: "copy" and "client_encoding"
Date: 2004-11-05 14:39:37
Message-ID: 200411051539.37753.kenobi@halifax.rwth-aachen.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-jdbc

On Thursday 04 November 2004 11:17, Oliver Jowett wrote:
>[...]
> I suppose that in the absence of backend support for this, we could add
> some URL parameter that allows client_encoding to be changed, with
> suitably dangerous warnings around using it. Then you can temporarily
> flip client_encoding to LATIN1 for the duration of the COPY, and revert
> it to UNICODE afterwards.

How about relaxing so that things like this are alright:

st.executeUpdate('SET CLIENT_ENCODING TO 'whatever'; COPY/...; SET
CLIENT_ENCODING TO 'UNICODE');

Or - I'm surprised to see that this doesn't work - calling a function that
does the trick:

create or replace function copy_enc(varchar,varchar) RETURNS void AS ' set
client_encoding TO ''LATIN1'';set client_encoding TO ''UNICODE'';' LANGUAGE
SQL;

Uli


From: Markus Schaber <schabios(at)logi-track(dot)com>
To: Oliver Jowett <oliver(at)opencloud(dot)com>
Cc: pgsql-jdbc(at)postgresql(dot)org, mbch67(at)yahoo(dot)com
Subject: Re: 8.0.0beta4: "copy" and "client_encoding"
Date: 2004-11-05 15:03:24
Message-ID: 20041105160324.43afb9e1@kingfisher.intern.logi-track.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-jdbc

Hi, Oliver,

On Thu, 04 Nov 2004 10:17:22 +0000
Oliver Jowett <oliver(at)opencloud(dot)com> wrote:

> Hmm. I am suprised that COPY does not let you specify the encoding of
> the input file. Using client_encoding for this seems wrong: there are
> some cases it can't handle, e.g. trying to load a LATIN1-encoded file
> into a table that has a name that can't be represented in LATIN1.
>
> I suppose that in the absence of backend support for this, we could add
> some URL parameter that allows client_encoding to be changed, with
> suitably dangerous warnings around using it. Then you can temporarily
> flip client_encoding to LATIN1 for the duration of the COPY, and revert
> it to UNICODE afterwards.

I think you're right. Except COPY from STDIN, client encoding should be
independent from COPY encoding. So I suggest that Adrian requests
backend support for this.

Greets,
Markus

--
markus schaber | dipl. informatiker
logi-track ag | rennweg 14-16 | ch 8001 zürich
phone +41-43-888 62 52 | fax +41-43-888 62 53
mailto:schabios(at)logi-track(dot)com | www.logi-track.com


From: Kris Jurka <books(at)ejurka(dot)com>
To: Markus Schaber <schabios(at)logi-track(dot)com>
Cc: Oliver Jowett <oliver(at)opencloud(dot)com>, pgsql-jdbc(at)postgresql(dot)org, mbch67(at)yahoo(dot)com
Subject: Re: 8.0.0beta4: "copy" and "client_encoding"
Date: 2004-11-05 20:27:51
Message-ID: Pine.BSO.4.56.0411051519360.28172@leary.csoft.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-jdbc

On Fri, 5 Nov 2004, Markus Schaber wrote:

> I think you're right. Except COPY from STDIN, client encoding should be
> independent from COPY encoding. So I suggest that Adrian requests
> backend support for this.
>

Well that's not going to get in for 8.0, so we'll need to do something
especially since the driver is supposed to offer backward compatibility.
>From my perspective we've got three options, just drop the check, Oliver's
add a URL parameter to turn off the check, or add an API so that they
don't issue the COPY command directly, but something like
issueServerCopy(table, file, encoding) where the driver could turn off the
check, switch the encoding, do the copy, switch the encoding back, and
turn on the check.

I think the third option is too much work at the moment, when we do offer
a copy API in the driver we could fold that in, but not now. As to the
first two options, I don't really care. Either is fine with me, so since
Oliver seems to like a URL parameter, that's what we'll probably go with.

Kris Jurka


From: Oliver Jowett <oliver(at)opencloud(dot)com>
To: mbch67(at)yahoo(dot)com
Cc: pgsql-jdbc(at)postgresql(dot)org
Subject: Re: 8.0.0beta4: "copy" and "client_encoding"
Date: 2004-11-05 21:41:43
Message-ID: 418BF397.4070000@opencloud.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-jdbc

mbch67(at)yahoo(dot)com wrote:

> 1. I set LATIN1 as the database (postgresql.conf) default client
> encoding. Why does COPY, executed via JDBC not use the right encoding?
> => To me it seems to be a backend problem. Should this be address in
> another posting list?

The postgresql.conf setting is a default that can be overridden on a
per-client basis. JDBC overrides the default when it connects. This is
normal.

> 2. Was the decision to disable the "SET CLIENT_ENCODING" command
> really a good idea? What about if I am running a server using UNICODE
> to store text, my default client encoding is LATIN1 and I want to
> import a Korean encoded text file using COPY via JDBC? There is no way
> to tell COPY what encoding the input file based on.
> In order to be compliant with PSQL I suggest to reactivate the
> disabled "SET CLIENT ENCODING" for JDBC.

It's a good idea in the sense that if you SET CLIENT_ENCODING, you will
break the JDBC driver in nonobvious ways. The check is there as an extra
safety net. I'd be OK with a URL parameter to disable the check so that
expert users can SET CLIENT_ENCODING at their own risk, but I don't want
the check disabled by default.

It would be theoretically possible for the JDBC driver to track
client_encoding and adjust the encoding it uses accordingly, but:

1) someone needs to actually implement that
2) it is not clear exactly when the encoding changes with respect to
receiving the ParameterStatus message (this is only an issue if there
are encodings where the contents of the ParameterStatus message would
change in the new encoding)
3) it results in an extra round of transcoding (i.e. db encoding ->
client encoding -> unicode, rather than just db encoding -> unicode)

Given that the only thing that we've seen that depends on
client_encoding so far is COPY (and even that has problems), I think the
right solution is to fix COPY, not go to a lot of extra work to support
arbitary client_encoding values.

Are there any other cases where client_encoding needs to be modified by
a JDBC user? It really seems to me that client_encoding is an
implementation detail that JDBC users should not need to worry about,
given that Java already has standard mechanisms for dealing with
encodings (namely "turn everything into unicode strings internally").

====

Also, a couple of workarounds for your case that don't need driver
modifications:

- force use of protocol version 2 by adding "?protocolVersion=2" to your
connection URL; you will lose the benefits of version 3 but it should
also defeat the client_encoding checks.
- transcode the file from LATIN1 to UNICODE (UTF8) on the server side
before issuing the COPY.

-O


From: Markus Schaber <schabios(at)logi-track(dot)com>
To: pgsql-jdbc(at)postgresql(dot)org
Subject: Re: 8.0.0beta4: "copy" and "client_encoding"
Date: 2004-11-09 17:33:02
Message-ID: 20041109183302.63a547d5@kingfisher.intern.logi-track.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-jdbc

Hi, Kris,

On Fri, 5 Nov 2004 15:27:51 -0500 (EST)
Kris Jurka <books(at)ejurka(dot)com> wrote:

> Well that's not going to get in for 8.0, so we'll need to do something
> especially since the driver is supposed to offer backward compatibility.
> From my perspective we've got three options, just drop the check, Oliver's
> add a URL parameter to turn off the check, or add an API so that they
> don't issue the COPY command directly, but something like
> issueServerCopy(table, file, encoding) where the driver could turn off the
> check, switch the encoding, do the copy, switch the encoding back, and
> turn on the check.

I would prefer to adding the URL option, maybe combined with the version
compatibility flag.

Adding an API always carries the disadvantage that the user has to do
runtime casts and add special cased code even when the sql syntax is
compatible for all capable servers.

Have fun,
Markus

--
markus schaber | dipl. informatiker
logi-track ag | rennweg 14-16 | ch 8001 zürich
phone +41-43-888 62 52 | fax +41-43-888 62 53
mailto:schabios(at)logi-track(dot)com | www.logi-track.com


From: Kris Jurka <books(at)ejurka(dot)com>
To: mbch67(at)yahoo(dot)com
Cc: pgsql-jdbc(at)postgresql(dot)org
Subject: Re: 8.0.0beta4: "copy" and "client_encoding"
Date: 2004-12-11 03:38:32
Message-ID: Pine.BSO.4.56.0412102237230.6141@leary.csoft.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-jdbc

On Thu, 3 Nov 2004 mbch67(at)yahoo(dot)com wrote:

> Why I want to set the client_encoding to latin1? That's a fair
> question. Unfortunately I removed the statement following the COPY
> command to keep the posting short:
>
> SET CLIENT_ENCODING TO 'LATIN1';
> COPY temp_test FROM '/home/postgres/temp_test.dat';
>

I have added a URL parameter allowEncodingChanges=true to allow client
code to set the client_encoding variable.

Kris Jurka