Re: invalid byte sequence for encoding "UNICODE": 0xd9

Lists: sfpug
From: Eric Walstad <eric(at)ericwalstad(dot)com>
To: SF Postgres <sfpug(at)postgresql(dot)org>
Subject: invalid byte sequence for encoding "UNICODE": 0xd9
Date: 2006-02-13 22:37:45
Message-ID: 200602131437.46440.eric@ericwalstad.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: sfpug

Hi everyone,

Question: How do I keep from receiving the subject error message when
loading data?

I'm working on a project that has a table with about 2.7M records,
mostly address data, in it. I wanted to copy some of this data to a
different machine (call it 'B') for testing purposes. I receive the
subject error message when I try to load the data into B that was
copied from machine 'A'. I've searched google, and the postgresql
docs but didn't find anything obviously helpful. Any pointers to
relevent docs and/or tips are greatly appreciated.

Here's the supporting info (with some stuff changed to protect the
innocent):
Machine 'A', the source machine, is running:
psql --version
psql (PostgreSQL) 7.3.4-RH

Machine 'B', the destination machine, is running:
psql --version
psql (PostgreSQL) 8.0.6

command I used to create the sql file:
pg_dump -U user_name dbname --table=table_name > output.sql

command I used to load the data into B:
psql -q dbname < output.sql

Error message I receive:
ERROR: invalid byte sequence for encoding "UNICODE": 0xd9
CONTEXT: COPY table_name, line 330517, column column_name: "2000 FOO
ST.�

Related comments:
I don't have experience with Unicode. I'm not sure what character
0xd9 is ('Latin capital U with grave'?). The data in question is for
addresses in California, so I don't expect it to have foreign
language characters in it. I've tried deleting the trouble character
by hand, with vim, but after trying this approach 5 times, I decided
I better find out what the real problem is.

Loading the data into another machine running psql (PostgreSQL) 7.4.7
works as expected.

I used 'tar cjf' to compress the data file before transferring it from
A to B:
On Machine A
tar --version
tar (GNU tar) 1.13.25

On Machine B
tar --version
tar (GNU tar) 1.15.1

Both machines are running the same version of bzip2: Version 1.0.2

Machine A is:
RedHat Enterprise Linux, 2.4.21-9.0.1.EL #1 Mon Feb 9 22:26:52 EST
2004 i686 athlon i386 GNU/Linux

Machine B is:
Fedora Core 4, 2.6.15-1.1830_FC4 #1 Thu Feb 2 17:23:41 EST 2006 i686
i686 i386 GNU/Linux

Both machines have postgresql that was installed from RPM binaries.

Thanks in advance,

Eric.


From: David Fetter <david(at)fetter(dot)org>
To: SF Postgres <sfpug(at)postgresql(dot)org>
Subject: Re: invalid byte sequence for encoding "UNICODE": 0xd9
Date: 2006-02-14 05:29:39
Message-ID: 20060214052939.GB31745@fetter.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: sfpug

On Mon, Feb 13, 2006 at 02:37:45PM -0800, Eric Walstad wrote:
> Hi everyone,
>
> Question: How do I keep from receiving the subject error message when
> loading data?

I suspect you'll have to pass iconv over the dump file, as mentioned
in the release notes. You may have had the database encoded in that
abomination hiding under the mask of SQL_ASCII, which isn't really an
encoding. It's more like "any byte string without a null byte in it" :P

HTH :)

Cheers,
D
--
David Fetter david(at)fetter(dot)org http://fetter.org/
phone: +1 415 235 3778

Remember to vote!


From: Eric Walstad <eric(at)ericwalstad(dot)com>
To: sfpug(at)postgresql(dot)org
Subject: Re: invalid byte sequence for encoding "UNICODE": 0xd9
Date: 2006-02-14 21:31:42
Message-ID: 200602141331.43643.eric@ericwalstad.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: sfpug

On Monday 13 February 2006 21:29, David Fetter wrote:
> On Mon, Feb 13, 2006 at 02:37:45PM -0800, Eric Walstad wrote:
> > Hi everyone,
> >
> > Question: How do I keep from receiving the subject error message when
> > loading data?
>
> I suspect you'll have to pass iconv over the dump file, as mentioned
> in the release notes. You may have had the database encoded in that
> abomination hiding under the mask of SQL_ASCII, which isn't really an
> encoding. It's more like "any byte string without a null byte in it" :P
>
> HTH :)
>
> Cheers,
> D

Thanks for pointing me in the right direction, David.

I found the relevant section of the release notes here:
<http://www.postgresql.org/docs/current/interactive/release-8-1.html#AEN72739>

I first split my big dump file into managable chunks:

mkdir tmp
cd tmp
split -C 25000000 ../output.sql

Then I ran iconv on all the split files, using the command line suggested in
the release notes:

for SPLIT_FILE in xa*
do
iconv -f UTF-8 -t UTF-8 $SPLIT_FILE >> converted.sql
done

That, unfortunately, removed some other important bits of data (tabs, I think,
next to the invalid unicode characters). However, iconv did output messages
when it encountered the invalid characters (with byte offsets, I think) which
told me where the problems were located and in which split files. I was then
able to go into each split file and delete the characters by hand with vim,
cat all the split files back together and load all the data successfully.

My postgresql.conf file has the encoding line commented out:

#client_encoding = sql_ascii # actually, defaults to database encoding

No database encoding was specified when I created the database with createdb.
I suspect that means 'sql_ascii' was used, but I didn't find where the
default database encoding is specified so I don't know for sure.

Thanks again,

Eric.