Bytea misconceptions

Lists: pgsql-hackers
From: Peter Eisentraut <peter_e(at)gmx(dot)net>
To: PostgreSQL Development <pgsql-hackers(at)postgresql(dot)org>
Subject: Bytea misconceptions
Date: 2003-02-19 11:45:11
Message-ID: Pine.LNX.4.44.0302191235290.1714-100000@peter.localdomain

The bytea type seems to be liable to character set conversions to the
effect that it falsifies the stored data.

Example: Create a cluster with non-C CTYPE, create a LATIN1 database,
create a table with a bytea column, and store something with non-ASCII
characters in it. Then change the client encoding (to UNICODE, say) and
read the data. I stored 'ätsch bätsch' and got 'Àtsch bÀtsch', which is
not a suitable result for bytea data.

The bytea output function uses isprint() to decide which characters to
leave unescaped, so in most locales it does not even produce the
documented results. In general, the only safe solution would be to escape *all* byte
values on output. Then the client can reconstruct the byte sequence based
on the character entities in the delivered string and does not have to
rely on the character codes staying the same during the conversion.
(Alternatively, we do not pass bytea values through the character set
conversion, but that might be unfeasible for other reasons.)
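As a rough illustration of the escape-everything idea, here is a small
Python sketch. It is not PostgreSQL code: convert() is only a stand-in
for the server-to-client character set conversion, and escape_all() and
unescape_all() are hypothetical helper names. It assumes a LATIN1
database and a UNICODE (UTF-8) client, as in the example above.

```python
def escape_all(data: bytes) -> str:
    """Render every byte as a three-digit octal escape, e.g. 0xE4 -> \\344."""
    return "".join("\\%03o" % b for b in data)


def unescape_all(text: str) -> bytes:
    """Inverse of escape_all: parse consecutive \\ooo groups back into bytes."""
    if len(text) % 4 != 0:
        raise ValueError("input is not a sequence of \\ooo groups")
    out = bytearray()
    for i in range(0, len(text), 4):
        group = text[i:i + 4]
        if group[0] != "\\":
            raise ValueError("expected a backslash at offset %d" % i)
        out.append(int(group[1:], 8))  # raises ValueError above \377
    return bytes(out)


def convert(wire: bytes, src: str, dst: str) -> bytes:
    """Stand-in (assumption) for the server's character set conversion."""
    return wire.decode(src).encode(dst)


# The 12 bytes of 'ätsch bätsch' as stored in a LATIN1 database.
original = b"\xe4tsch b\xe4tsch"

# Unescaped high bytes are treated as LATIN1 characters and re-encoded,
# so the client receives a different byte sequence.
mangled = convert(original, "latin-1", "utf-8")

# Fully escaped output is plain ASCII, which the conversion leaves alone,
# so the client can rebuild the original bytes.
escaped = escape_all(original).encode("ascii")
survived = convert(escaped, "latin-1", "utf-8")
recovered = unescape_all(survived.decode("ascii"))

print(mangled != original, recovered == original)
```

The point of the sketch is only that an all-ASCII representation is
invariant under conversion between ASCII-compatible encodings, while
raw high bytes are not.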

--
Peter Eisentraut peter_e(at)gmx(dot)net


From: Joe Conway <mail(at)joeconway(dot)com>
To: Peter Eisentraut <peter_e(at)gmx(dot)net>
Cc: PostgreSQL Development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Bytea misconceptions
Date: 2003-02-19 17:47:10
Message-ID: 3E53C31E.6030707@joeconway.com

Peter Eisentraut wrote:
> In general, the only safe solution would be to escape *all* byte
> values on output. Then the client can reconstruct the byte sequence based
> on the character entities in the delivered string and does not have to
> rely on the character codes staying the same during the conversion.

Seems like this brings us back to using hex for bytea, à la BLOB in
SQL99. What would be the implications of changing byteain and byteaout
to use X'FFFFFF' instead of '\377\377\377'?

I guess backward compatibility is a big problem. Maybe make it
configurable: all octal escaped or all hex. Is it better to create a
completely new datatype?
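To make the two candidate output formats concrete, here is a Python
sketch that renders the same bytes both ways. The function names are
made up for illustration and say nothing about how byteain and byteaout
would actually be changed.

```python
def to_octal_escapes(data: bytes) -> str:
    """All-octal form: every byte becomes \\ooo."""
    return "".join("\\%03o" % b for b in data)


def to_hex_literal(data: bytes) -> str:
    """All-hex form in the X'...' style, two hex digits per byte."""
    return "X'" + data.hex().upper() + "'"


def from_hex_literal(literal: str) -> bytes:
    """Parse an X'...' literal back into bytes."""
    well_formed = (
        len(literal) >= 3
        and literal[:2].upper() == "X'"
        and literal.endswith("'")
    )
    if not well_formed:
        raise ValueError("not an X'...' literal")
    return bytes.fromhex(literal[2:-1])


sample = b"\xff\xff\xff"
print(to_octal_escapes(sample))  # \377\377\377
print(to_hex_literal(sample))    # X'FFFFFF'
```

Both forms are pure ASCII, so either would survive character set
conversion; the hex form needs two characters per byte where the octal
form needs four.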

Joe


From: Peter Eisentraut <peter_e(at)gmx(dot)net>
To: PostgreSQL Development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Bytea misconceptions
Date: 2003-02-21 09:58:33
Message-ID: Pine.LNX.4.44.0302201624410.2544-100000@peter.localdomain

Peter Eisentraut writes:

> Example: Create a cluster with non-C CTYPE, create a LATIN1 database,
> create a table with a bytea column, and store something with non-ASCII
> characters in it. Then change the client encoding (to UNICODE, say) and
> read the data. I stored 'ätsch bätsch' and got 'Àtsch bÀtsch', which is
> not a suitable result for bytea data.

Another point that occurred to me is that if you send bytea input that does
not exclusively contain escape sequences to the server, then you really
don't know what the server will store. Since character set conversion is
supposed to be transparent, the bytea type is broken from the ground up
and should be replaced (probably by the standard blob type).
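The input side can be sketched the same way in Python. This is a
simplified model, not the server's code: server_receive() is a
hypothetical stand-in for the client-to-server conversion, and the
database encoding is assumed to be LATIN1.

```python
def server_receive(wire: bytes, client_encoding: str, db_encoding: str) -> bytes:
    """Stand-in (assumption) for client-to-server character set conversion."""
    return wire.decode(client_encoding).encode(db_encoding)


# Two bytes the client would like to have stored verbatim.
payload = b"\xc3\xa4"

# Sent unescaped, the stored result depends on the client encoding.
stored_from_utf8_client = server_receive(payload, "utf-8", "latin-1")
stored_from_latin1_client = server_receive(payload, "latin-1", "latin-1")

# Sent fully escaped, the input is plain ASCII and arrives unchanged.
escaped = "".join("\\%03o" % b for b in payload).encode("ascii")
escaped_from_utf8_client = server_receive(escaped, "utf-8", "latin-1")
escaped_from_latin1_client = server_receive(escaped, "latin-1", "latin-1")

print(stored_from_utf8_client, stored_from_latin1_client)
```

In this model the UTF-8 client ends up with one byte stored and the
LATIN1 client with two, while the escaped form is identical in both
cases.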

--
Peter Eisentraut peter_e(at)gmx(dot)net