Quick Links

Convert string to UNICODE & migration FROM 7.4 to 9.1

Lists:	pgsql-general

From:	Philippe Lang <philippe(dot)lang(at)attiksystem(dot)ch>
To:	"pgsql-general(at)postgresql(dot)org" <pgsql-general(at)postgresql(dot)org>
Subject:	Convert string to UNICODE & migration FROM 7.4 to 9.1
Date:	2011-11-24 18:33:01
Message-ID:	21F9A2B0E25A7F4497D7B87397AD9D7C84264263@SBS1.attiksystem.local
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-general

Hi,

I'm migrating a PG 7.4 database (encoded in LATIN1) to PG 9.1. The upgrade is just fine, except a problem with a conversion to UNICODE for which I was not able to find a solution yet:

I have a function under PG 7.4 that returns an xml structure, with the content encoded in UNICODE.

The function is like:

---------------------------------------------
CREATE OR REPLACE FUNCTION get_xml()
RETURNS text AS
$$
DECLARE
output text;
BEGIN
-- We fill the output variable with xml
...
-- We return it in unicode
RETURN convert(output, 'LATIN1', 'UTF8');
END;
$$
LANGUAGE 'plpgsql';
---------------------------------------------

After migrating to PG 9.1, I noticed that convert now requires a bytea, and not a text. I tried different things:

- Change the output variable to a bytea
- Use RETURN convert(convert_to(output, 'LATIN1'), 'LATIN1', 'UTF8');
- Encode the database in UTF8 instead of LATIN1

... but no output is similar to what I had under PG 7.4.

More precisely, I had under PG 7.4 something like (notice the Name Hélène converted into HÃ©lÃ¨ne)

"<?xml version="1.0" encoding="UTF-8"?>
<DispatchAuftrag>
<Versicherungsnehmer>
<Name>Martelli</Name>
<Vorname>HÃ©lÃ¨ne</Vorname>
<Strasse>rue des Comptes</Strasse>
<Land>Suisse</Land>
<PLZ>123456</PLZ>
<Ort>Fribourg</Ort>
<Email>.</Email>
<TelMobil>.</TelMobil>
<TelPrivat>.</TelPrivat>
<TelGeschaeft>.</TelGeschaeft>
<Fax>.</Fax>
</Versicherungsnehmer>
</DispatchAuftrag>
"

And now I get something like:

"<?xml version="1.0" encoding="UTF-8"?>\012<DispatchAuftrag>\012 </Versicherungsnehmer>\012 <Geschaedigter>\012 <Name>Etat du Valais</Name>\012 <Vorname></Vorname>\012 <Strasse>Indivis / Centre entretien Autoroute</Strasse>\012 <Land>Suisse</Land>\012 <PLZ>1906</PLZ>\012 <Ort>Charrat</Ort>\012 <Email></Email>\012 <TelMobil></TelMobil>\012 <TelPrivat>027 747 61 00</TelPrivat>\012 <TelGeschaeft>.</TelGeschaeft>\012 <Fax></Fax>\012 </Geschaedigter>\012 <Schadendaten>\012 <SchadenDatum>2005-01-23</SchadenDatum>\012 <SchadenNr>JR/41123-208/JPS</SchadenNr>\012 <GeschaetzteSchadenhoehe></GeschaetzteSchadenhoehe>\012 <SchadenAmFz>Dommages aux installations routi\303\250res</SchadenAmFz>\012 <Bemerkung></Bemerkung>\012 </Schadendaten>\012 <Fahrzeugstandort>\012 <Name></Name>\012 <Vorname></Vorname>\012 <Strasse></Strasse>\012 <Land></Land>\012 <PLZ></PLZ>\012 <Ort></Ort>\012 <Telefon></Telefon>\012 <Fax></Fax>\012 <Email></Email>\012 </Fahrzeugstandort>\012</DispatchAuftrag>\012"

Newlines don't seem to be handled properly, and I'm unable to find out how to change that. UTF8 encoding is not good either.

Any idea how to correct that?

Thanks!

Philippe

-------------------------------------------------------------
Attik System web : http://www.attiksystem.ch
Philippe Lang phone : +41 26 422 13 75
rte de la Fonderie 2 gsm : +41 79 351 49 94
1700 Fribourg twitter: @philippelang
pgp : http://keyserver.pgp.com

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Philippe Lang <philippe(dot)lang(at)attiksystem(dot)ch>
Cc:	"pgsql-general(at)postgresql(dot)org" <pgsql-general(at)postgresql(dot)org>
Subject:	Re: Convert string to UNICODE & migration FROM 7.4 to 9.1
Date:	2011-11-24 20:25:50
Message-ID:	28552.1322166350@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-general

Philippe Lang <philippe(dot)lang(at)attiksystem(dot)ch> writes:
> I have a function under PG 7.4 that returns an xml structure, with the content encoded in UNICODE.

Basically, the only reason you got away with that in 7.4 is that 7.4 is
so lax about encodings. In general, in modern releases, all text
strings inside the backend are in the database's specified encoding.
Full stop, no exceptions. If you think you need something different,
you need to rethink your assumptions.

I'd try just dropping the convert() step and see what happens.

regards, tom lane