Latin1 to UTF-8 ?

Lists: pgsql-general
From: Aarni Ruuhimäki <aarni(at)kymi(dot)com>
To: pgsql-general(at)postgresql(dot)org
Subject: Latin1 to UTF-8 ?
Date: 2007-08-03 12:37:20
Message-ID: 200708031537.20276.aarni@kymi.com

Hi,

I've set up a new CentOS server with PostgreSQL 8.2.4 and initdb'ed it with
UTF-8.

OK, and it runs fine.

I have a problem with encodings, however, mainly with the Russian Cyrillic
characters.

When I test-dumped some dbs from the old FC / Pg 8.0.2, all Latin1, I noticed
that some of the dumps show in the Konqueror file browser as 'Plain Text
Documents' and some as 'C++ Source Files'. Both have Latin1 as client
encoding at the top of the files. Changing that gives errors, as expected.

Looking into the plain text dumps, I see all Cyrillic characters as &#1056;...
and these go in and display fine from the new server's UTF-8 environment.
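
For reference, &#1056; is an HTML numeric character reference: the number is
the decimal Unicode code point, and 1056 (U+0420) is the Cyrillic capital
letter Р. A minimal Python sketch, assuming one wanted to turn such references
into real characters (the function name decode_references is made up for
illustration and is not part of any PostgreSQL tool):

```python
import html

def decode_references(text: str) -> str:
    """Replace HTML character references such as &#1056; with the characters they name."""
    # html.unescape handles decimal (&#1056;), hex (&#x420;) and named references.
    return html.unescape(text)

# Three decimal references that spell out a short Cyrillic string.
print(decode_references("&#1056;&#1091;&#1089;"))
```

Since browsers render the references directly, leaving them untouched in the
data is also a reasonable choice.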

Some of the 'C++' files have the Cyrillic characters as 'îñåòèòåëåé'. Some have
both 'îñåòèòåëåé' and &#1056;... and of course the 'îñåò' characters come out
wrong and unreadable in the browser. (Not sure if you can see the single-quoted
ones, but they look something like Hebrew or similar.)
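
One possible reading, offered only as a guess, is that strings like
'îñåòèòåëåé' are Windows-1251 (Cyrillic) bytes that were stored and displayed
as Latin1. The Python sketch below tests that assumption on the sample from
this message by turning the Latin1 characters back into bytes and decoding
those bytes as cp1251. The function name latin1_to_cp1251 is hypothetical.

```python
def latin1_to_cp1251(text: str) -> str:
    """Reinterpret text shown as Latin1 as if its bytes were Windows-1251."""
    # Every Latin1 character maps back to exactly one byte, so encoding
    # recovers the bytes as they were stored; decoding them as cp1251 then
    # yields Cyrillic if the guess about the original encoding is right.
    return text.encode("latin-1").decode("cp1251")

sample = "îñåòèòåëåé"
print(latin1_to_cp1251(sample))
```

For this sample the result is readable Cyrillic, which supports the guess for
this string but says nothing about the rest of the data. Rows that hold real
Latin1 text with accented letters would be damaged by the same conversion, so
it would have to be applied selectively.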

I have no idea what browsers / encodings or even keyboard layouts have been
used when the data has been inserted by users through their web
interfaces ...

I tried the -F p switch as the earlier version has no -E for dumps. Same
output. Also with pg_dumpall.

I tried various encodings with iconv too.

So, what would be the proper way to convert the dumps to UTF-8 ? Or any other
solution ? Any other tool to work with the problem files ?

BR,

Aarni
--
Aarni Ruuhimäki


From: Peter Eisentraut <peter_e(at)gmx(dot)net>
To: pgsql-general(at)postgresql(dot)org, aarni(at)kymi(dot)com
Subject: Re: Latin1 to UTF-8 ?
Date: 2007-08-04 15:04:31
Message-ID: 200708041704.31739.peter_e@gmx.net

Aarni Ruuhimäki wrote:
> So, what would be the proper way to convert the dumps to UTF-8 ? Or
> any other solution ? Any other tool to work with the problem files ?

Dump them again but set your client encoding to UTF8.
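
One way to do this without touching the dump files is the libpq environment
variable PGCLIENTENCODING, which sets the client encoding for programs such as
pg_dump. The Python sketch below is an illustration only: the helper names
utf8_dump_env and dump_as_utf8 are made up, and it assumes pg_dump is on the
PATH and that the old server can convert the database encoding to UTF8.

```python
import os
import subprocess

def utf8_dump_env(base_env=None):
    """Return a copy of the environment with the client encoding set to UTF8."""
    env = dict(os.environ if base_env is None else base_env)
    env["PGCLIENTENCODING"] = "UTF8"  # read by libpq when pg_dump connects
    return env

def dump_as_utf8(dbname, outfile):
    """Run pg_dump for dbname and write the plain-text dump to outfile."""
    with open(outfile, "wb") as out:
        subprocess.run(["pg_dump", dbname], env=utf8_dump_env(),
                       stdout=out, check=True)

# Example call (hypothetical database and file names):
# dump_as_utf8("mydb", "mydb_utf8.sql")
```

The server then does the Latin1 to UTF8 conversion while dumping, and the
client encoding line at the top of the dump should read UTF8 instead of Latin1.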

--
Peter Eisentraut
http://developer.postgresql.org/~petere/