Differences in UTF8 between 8.0 and 8.1

Lists: pgsql-hackers
From: Paul Lindner <lindner(at)inuus(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Differences in UTF8 between 8.0 and 8.1
Date: 2005-10-22 15:48:27
Message-ID: 20051022154827.GC27646@inuus.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

I've been doing some test imports of UNICODE databases into Postgres
8.1beta3. The only problem I've seen is that some data from 8.0
databases will not import.

I've generated dumps using pg_dump from 8.0 and 8.1. Attempting to
restore these results in

Invalid UNICODE byte sequence detected near byte ...

Question:

Does the 8.1 Unicode sanity code accept the full set of characters
accepted by the 8.0 Unicode sanity code?

If not we'll see a lot of problems like the one above.

I believe this patch is the one causing the problem I see:

http://www.mail-archive.com/pgsql-patches(at)postgresql(dot)org/msg08198/unicode.diff

Is there any solution other than scrubbing the entire dataset to
conform to the new (8.1) encoding rules?

--
Paul Lindner ||||| | | | | | | | | |
lindner(at)inuus(dot)com


From: Andrew - Supernews <andrew+nonews(at)supernews(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Differences in UTF8 between 8.0 and 8.1
Date: 2005-10-23 05:56:50
Message-ID: slrndlm9h2.g61.andrew+nonews@trinity.supernews.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2005-10-22, Paul Lindner <lindner(at)inuus(dot)com> wrote:
> I've generated dumps using pg_dump from 8.0 and 8.1. Attempting to
> restore these results in
>
> Invalid UNICODE byte sequence detected near byte ...

What were the exact offending bytes?

> Question:
>
> Does the 8.1 Unicode sanity code accept the full set of characters
> accepted by the 8.0 Unicode sanity code?

No. 8.0 and before accepted a lot of stuff that it should never have, and
failed to accept stuff that it should have.

--
Andrew, Supernews
http://www.supernews.com - individual and corporate NNTP services


From: Paul Lindner <lindner(at)inuus(dot)com>
To: andrew(at)supernews(dot)com
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Differences in UTF8 between 8.0 and 8.1
Date: 2005-10-24 03:48:00
Message-ID: 20051024034800.GD27646@inuus.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sun, Oct 23, 2005 at 05:56:50AM -0000, Andrew - Supernews wrote:
> On 2005-10-22, Paul Lindner <lindner(at)inuus(dot)com> wrote:
> > I've generated dumps using pg_dump from 8.0 and 8.1. Attempting to
> > restore these results in
> >
> > Invalid UNICODE byte sequence detected near byte ...
>
> What were the exact offending bytes?

Here's a cut and paste from emacs hexl-mode:

00000000: 3530 3833 6335 3038 330a 3c20 5641 4c55 5083c5083.< VALU
00000010: 4553 2028 3230 3235 3533 2c20 27c1 f9d4 ES (202553, '...
00000020: c2d0 c7d2 b927 2c20 0a2d 2d2d 0a3e 2056 .....', .---.> V
00000030: 414c 5545 5320 2832 3032 3535 332c 2027 ALUES (202553, '
00000040: d2b9 272c 200a 3136 3939 3432 6331 3639 ..', .169942c169
00000050: 3934 320a 3c20 5641 4c55 4553 2028 3833 942.< VALUES (83
00000060: 3031 352c 2027 b7ed a8c6 a448 272c 200a 015, '.....H', .
00000070: 2d2d 2d0a 3e20 5641 4c55 4553 2028 3833 ---.> VALUES (83
00000080: 3031 352c 2027 c6a4 4827 2c20 0a 015, '..H', .

This is of a minimal diff between a UTF8 scrubbed file and the
original dump.

It appears the offending bytes are:

C1 F9 C2 D0 C7

and

B7 ED A8

> > Question:
> >
> > Does the 8.1 Unicode sanity code accept the full set of characters
> > accepted by the 8.0 Unicode sanity code?
>
> No. 8.0 and before accepted a lot of stuff that it should never have, and
> failed to accept stuff that it should have.

Thanks go out to John Hansen, he recommended to run the dump through iconv:

iconv -c -f UTF8 -t UTF8 -o fixed.sql dump.sql

This seems to strip out invalid UTF8 and will allow for a clean
import. Someone should add this to the Release Notes/FAQ..

--
Paul Lindner ||||| | | | | | | | | |
lindner(at)inuus(dot)com


From: Christopher Kings-Lynne <chriskl(at)familyhealth(dot)com(dot)au>
To: Paul Lindner <lindner(at)inuus(dot)com>
Cc: andrew(at)supernews(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Differences in UTF8 between 8.0 and 8.1
Date: 2005-10-24 05:06:02
Message-ID: 435C6BBA.4070004@familyhealth.com.au
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> Thanks go out to John Hansen, he recommended to run the dump through iconv:
>
> iconv -c -f UTF8 -t UTF8 -o fixed.sql dump.sql
>
> This seems to strip out invalid UTF8 and will allow for a clean
> import. Someone should add this to the Release Notes/FAQ..

Yes I think that's extremely important to put in the 'upgrading/back
compatibiliy' section for these release notes.

Chris


From: Andrew - Supernews <andrew+nonews(at)supernews(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Differences in UTF8 between 8.0 and 8.1
Date: 2005-10-24 05:07:40
Message-ID: slrndlor0s.g61.andrew+nonews@trinity.supernews.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2005-10-24, Paul Lindner <lindner(at)inuus(dot)com> wrote:
> Here's a cut and paste from emacs hexl-mode:
>
> 00000000: 3530 3833 6335 3038 330a 3c20 5641 4c55 5083c5083.< VALU
> 00000010: 4553 2028 3230 3235 3533 2c20 27c1 f9d4 ES (202553, '...
> 00000020: c2d0 c7d2 b927 2c20 0a2d 2d2d 0a3e 2056 .....', .---.> V
> 00000030: 414c 5545 5320 2832 3032 3535 332c 2027 ALUES (202553, '
> 00000040: d2b9 272c 200a 3136 3939 3432 6331 3639 ..', .169942c169
> 00000050: 3934 320a 3c20 5641 4c55 4553 2028 3833 942.< VALUES (83
> 00000060: 3031 352c 2027 b7ed a8c6 a448 272c 200a 015, '.....H', .
> 00000070: 2d2d 2d0a 3e20 5641 4c55 4553 2028 3833 ---.> VALUES (83
> 00000080: 3031 352c 2027 c6a4 4827 2c20 0a 015, '..H', .
>
> This is of a minimal diff between a UTF8 scrubbed file and the
> original dump.
>
> It appears the offending bytes are:
>
> C1 F9 C2 D0 C7

I'm inclined to suspect that the whole sequence c1 f9 d4 c2 d0 c7 d2 b9
was never actually a valid utf-8 string, and that the d2 b9 is only valid
by coincidence (it's a Cyrillic letter from Azerbaijani). I know the 8.0
utf-8 check was broken, but I didn't realize it was quite so bad.

> and
>
> B7 ED A8

Likewise, that whole sequence b7 ed a8 c6 a4 was probably never valid;
c6 a4 also isn't a character you'd expect to find in common use.

My guess is that this was data in some non-utf-8 charset that managed to
get past the defective checks in 8.0.

--
Andrew, Supernews
http://www.supernews.com - individual and corporate NNTP services


From: Paul Lindner <lindner(at)inuus(dot)com>
To: andrew(at)supernews(dot)com
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Differences in UTF8 between 8.0 and 8.1
Date: 2005-10-27 00:59:51
Message-ID: 20051027005951.GA27655@inuus.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Oct 24, 2005 at 05:07:40AM -0000, Andrew - Supernews wrote:
>
> I'm inclined to suspect that the whole sequence c1 f9 d4 c2 d0 c7 d2 b9
> was never actually a valid utf-8 string, and that the d2 b9 is only valid
> by coincidence (it's a Cyrillic letter from Azerbaijani). I know the 8.0
> utf-8 check was broken, but I didn't realize it was quite so bad.

Looking at the data it appears that it is a sequence of latin1
characters. They all have the eighth bit set and all seem to pass the
check.

In a million rows I found 2 examples of this.

However I'm running into another problem now. The command:

iconv -c -f UTF8 -t UTF8

does strip out the invalid characters. However, iconv reads the
entire file into memory before it writes out any data. This is not so
good for multi-gigabyte dump files and doesn't allow for it to be used
in a pipe between pg_dump and psql.

Anyone have any other recommendations? GNU recode might do it, but
I'm a bit stymied by the syntax. A quick perl script using
Text::Iconv didn't work either. I'm off to look at some other perl
modules and will try to create a script so I can strip out the invalid
characters.

--
Paul Lindner ||||| | | | | | | | | |
lindner(at)inuus(dot)com


From: Andrej Ricnik-Bay <andrej(dot)groups(at)gmail(dot)com>
To: Paul Lindner <lindner(at)inuus(dot)com>
Cc: andrew(at)supernews(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Differences in UTF8 between 8.0 and 8.1
Date: 2005-10-27 01:40:20
Message-ID: b35603930510261840v17d6a50dwba1e8dd6012654f7@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> does strip out the invalid characters. However, iconv reads the
> entire file into memory before it writes out any data. This is not so
> good for multi-gigabyte dump files and doesn't allow for it to be used
> in a pipe between pg_dump and psql.
>
> Anyone have any other recommendations? GNU recode might do it, but
> I'm a bit stymied by the syntax. A quick perl script using
> Text::Iconv didn't work either. I'm off to look at some other perl
> modules and will try to create a script so I can strip out the invalid
> characters.
How about an ugly kludge ...

split -a 3 -d -b 1048576 ../path/to/dumpfile dumpfile
for i in `ls -1 dumpfile*`; do iconv -c -f UTF8 -t UTF8 $i;done
cat dumpfile* > new_dump

Cheers,
Andrej


From: Christopher Kings-Lynne <chriskl(at)familyhealth(dot)com(dot)au>
To: Paul Lindner <lindner(at)inuus(dot)com>
Cc: andrew(at)supernews(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Differences in UTF8 between 8.0 and 8.1
Date: 2005-10-27 01:49:48
Message-ID: 4360323C.2020701@familyhealth.com.au
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> However I'm running into another problem now. The command:
>
> iconv -c -f UTF8 -t UTF8
>
> does strip out the invalid characters. However, iconv reads the
> entire file into memory before it writes out any data. This is not so
> good for multi-gigabyte dump files and doesn't allow for it to be used
> in a pipe between pg_dump and psql.
>
> Anyone have any other recommendations? GNU recode might do it, but
> I'm a bit stymied by the syntax. A quick perl script using
> Text::Iconv didn't work either. I'm off to look at some other perl
> modules and will try to create a script so I can strip out the invalid
> characters.

recode UTF-8..UTF-8 < dump_in.sql > dump_out.sql

Chris


From: jtv(at)xs4all(dot)nl
To: "Andrej Ricnik-Bay" <andrej(dot)groups(at)gmail(dot)com>
Cc: "Paul Lindner" <lindner(at)inuus(dot)com>, andrew(at)supernews(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Differences in UTF8 between 8.0 and 8.1
Date: 2005-10-27 05:26:17
Message-ID: 24573.125.24.8.136.1130390777.squirrel@125.24.8.136
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Andrej Ricnik-Bay wrote:

> How about an ugly kludge ...
>
> split -a 3 -d -b 1048576 ../path/to/dumpfile dumpfile
> for i in `ls -1 dumpfile*`; do iconv -c -f UTF8 -t UTF8 $i;done
> cat dumpfile* > new_dump

Not with UTF-8... You might break in the middle of a multibyte character.

Jeroen


From: Andrew - Supernews <andrew+nonews(at)supernews(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Differences in UTF8 between 8.0 and 8.1
Date: 2005-10-27 11:56:02
Message-ID: slrndm1g2i.g61.andrew+nonews@trinity.supernews.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2005-10-27, Paul Lindner <lindner(at)inuus(dot)com> wrote:
> On Mon, Oct 24, 2005 at 05:07:40AM -0000, Andrew - Supernews wrote:
>> I'm inclined to suspect that the whole sequence c1 f9 d4 c2 d0 c7 d2 b9
>> was never actually a valid utf-8 string, and that the d2 b9 is only valid
>> by coincidence (it's a Cyrillic letter from Azerbaijani). I know the 8.0
>> utf-8 check was broken, but I didn't realize it was quite so bad.
>
> Looking at the data it appears that it is a sequence of latin1
> characters. They all have the eighth bit set and all seem to pass the
> check.

In latin1 it comes out as total gibberish, so I think you'll find it is
actually in something else. Some googling suggests it is most likely in a
Chinese double-byte charset (GB2312).

--
Andrew, Supernews
http://www.supernews.com - individual and corporate NNTP services


From: Gregory Maxwell <gmaxwell(at)gmail(dot)com>
To: Christopher Kings-Lynne <chriskl(at)familyhealth(dot)com(dot)au>
Cc: Paul Lindner <lindner(at)inuus(dot)com>, andrew(at)supernews(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Differences in UTF8 between 8.0 and 8.1
Date: 2005-10-30 16:49:41
Message-ID: e692861c0510300849x36ce4907r7284b526feeef7a@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 10/26/05, Christopher Kings-Lynne <chriskl(at)familyhealth(dot)com(dot)au> wrote:
> > iconv -c -f UTF8 -t UTF8
> recode UTF-8..UTF-8 < dump_in.sql > dump_out.sql

I've got a file with characters that pg won't accept that recode does
not fix but iconv does. Iconv is fine for my application, so I'm just
posting to the list so that anyone looking for why recode didn't work
for them will find the suggestion to use iconv.


From: Paul Lindner <lindner(at)inuus(dot)com>
To: Gregory Maxwell <gmaxwell(at)gmail(dot)com>
Cc: Christopher Kings-Lynne <chriskl(at)familyhealth(dot)com(dot)au>, Paul Lindner <lindner(at)inuus(dot)com>, andrew(at)supernews(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Differences in UTF8 between 8.0 and 8.1
Date: 2005-11-01 12:15:50
Message-ID: 20051101121550.GE23652@inuus.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sun, Oct 30, 2005 at 11:49:41AM -0500, Gregory Maxwell wrote:
> On 10/26/05, Christopher Kings-Lynne <chriskl(at)familyhealth(dot)com(dot)au> wrote:
> > > iconv -c -f UTF8 -t UTF8
> > recode UTF-8..UTF-8 < dump_in.sql > dump_out.sql
>
> I've got a file with characters that pg won't accept that recode does
> not fix but iconv does. Iconv is fine for my application, so I'm just
> posting to the list so that anyone looking for why recode didn't work
> for them will find the suggestion to use iconv.

recode did not work for my sample data. It passed through the problem
character sequences. I'm still looking for an iconv that doesn't read
the entire file into memory.

At this point I'm looking to use the split command to process input in
10000 line chunks. Sadly that can't be used in a pipe.

BTW, how will sites that use Slony deal with this issue?

--
Paul Lindner ||||| | | | | | | | | |
lindner(at)inuus(dot)com