Re: Bug in UTF8-Validation Code?

From: "Albe Laurenz" <all(at)adv(dot)magwien(dot)gv(dot)at>
To: "Mario Weilguni *EXTERN*" <mweilguni(at)sime(dot)com>
Cc: <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Bug in UTF8-Validation Code?
Date: 2007-03-13 14:50:25
Message-ID: AFCCBB403D7E7A4581E48F20AF3E5DB201AC06DC@EXADV1.host.magwien.gv.at
Lists: pgsql-hackers

Mario Weilguni wrote:
>>> Steps to reproduce:
>>> create database testdb with encoding='UTF8';
>>> \c testdb
>>> create table test(x text);
>>> insert into test values ('\244'); ==> Is accepted, even though it is not valid UTF8.
>>
>> This is working as expected, see the remark in
>>
>> http://www.postgresql.org/docs/current/static/sql-syntax-lexical.html#SQL-SYNTAX-STRINGS
>>
>> "It is your responsibility that the byte sequences you create
>> are valid characters in the server character set encoding."
>
> In that case, pg_dump is doing wrong here and should quote the output. IMO it
> cannot be defined as working as expected, when this makes any database dumps
> worthless, without any warnings at dump-time.
>
> pg_dump should output \244 itself in that case.

True. Here is a test case on 8.2.3
(OS, database and client all use UTF8):

test=> CREATE TABLE test(x text);
CREATE TABLE
test=> INSERT INTO test VALUES ('correct: ä');
INSERT 0 1
test=> INSERT INTO test VALUES (E'incorrect: \244');
INSERT 0 1
test=> \q
laurenz:~> pg_dump -d -t test -f test.sql

Here is an excerpt from 'od -c test.sql':

0001040   e   n   z  \n   -   -  \n  \n   I   N   S   E   R   T       I
0001060   N   T   O       t   e   s   t       V   A   L   U   E   S    
0001100   (   '   c   o   r   r   e   c   t   :     303 244   '   )   ;
0001120  \n   I   N   S   E   R   T       I   N   T   O       t   e   s
0001140   t       V   A   L   U   E   S       (   '   i   n   c   o   r
0001160   r   e   c   t   :     244   '   )   ;  \n  \n  \n   -   -  \n

The invalid character (octal 244) is in the INSERT statement!
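
To illustrate why the lone byte is a problem, here is a minimal Python sketch (not PostgreSQL code; the helper name is_valid_utf8 is made up for this example). It assumes that Python's strict UTF-8 decoder is a fair stand-in for the server's check: octal 244 is a continuation byte, so it is only valid after a lead byte such as octal 303.

```python
# Hypothetical helper, not part of PostgreSQL: reports whether a byte
# string is well-formed UTF-8 according to Python's strict decoder.
def is_valid_utf8(data: bytes) -> bool:
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


valid = bytes([0o303, 0o244])   # the two bytes dumped for 'ä'
invalid = bytes([0o244])        # the lone byte produced by E'\244'

print(is_valid_utf8(valid))     # True
print(is_valid_utf8(invalid))   # False
```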

This makes psql gag:

test=> DROP TABLE test;
DROP TABLE
test=> \i test.sql
SET
SET
SET
SET
SET
SET
SET
SET
CREATE TABLE
ALTER TABLE
INSERT 0 1
psql:test.sql:33: ERROR: invalid byte sequence for encoding "UTF8": 0xa4
HINT: This error can also happen if the byte sequence does not match the encoding expected by the server, which is controlled by "client_encoding".

A fix could be one of the following: the server checks escape sequences for
validity, pg_dump outputs invalid bytes as escape sequences, or pg_dump
stops with an error.
I think that the first would be the cleanest.
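
For comparison, the option of having pg_dump output invalid bytes as escape sequences could look roughly like the Python sketch below. This is only an illustration of the idea, not how pg_dump is implemented; the function name dump_literal is hypothetical, and it assumes the dump would use E'' string syntax with octal escapes.

```python
# Hypothetical sketch, not pg_dump code: render a column value as an E''
# literal, writing bytes that are not valid UTF-8 as octal escapes.
def dump_literal(data: bytes) -> str:
    # surrogateescape maps each undecodable byte 0xNN to U+DCNN
    text = data.decode("utf-8", errors="surrogateescape")
    out = []
    for ch in text:
        code = ord(ch)
        if 0xDC80 <= code <= 0xDCFF:
            # invalid byte: emit it as a three-digit octal escape
            out.append("\\%03o" % (code - 0xDC00))
        elif ch == "\\":
            out.append("\\\\")      # a literal backslash must be doubled
        elif ch == "'":
            out.append("''")        # a literal quote must be doubled
        else:
            out.append(ch)
    return "E'" + "".join(out) + "'"


print(dump_literal(b"correct: \xc3\xa4"))   # E'correct: ä'
print(dump_literal(b"incorrect: \xa4"))     # E'incorrect: \244'
```

With output like this the dump file itself would contain only valid UTF-8, so psql could read it back and the restore would recreate the same bytes that were stored.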

Yours,
Laurenz Albe
