Using Postgres with Latin1 (ISO8859-1) and Unicode (utf-8) character sets.

From: "J(dot) Michael Crawford" <jmichael(at)gwi(dot)net>
To: pgsql-general(at)postgresql(dot)org, pgsql-jdbc(at)postgresql(dot)org
Subject: Using Postgres with Latin1 (ISO8859-1) and Unicode (utf-8) character sets.
Date: 2004-11-08 16:07:34
Message-ID: 6.1.2.0.2.20041108105839.02fdcfe0@pop.suscom-maine.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general pgsql-jdbc


I am posting this to the General and JDBC groups so that it's in the
archive -- I know many people have had problems with Latin1 characters, and
I'd like to share what has worked for us. If anyone can add this
information to a more permanent FAQ, I'd be much obliged.

---

Using Postgres with Latin1 (ISO8859-1) and Unicode (utf-8) character sets.

This is a general primer for using postgres with alternate character
sets. For those who have done so successfully, the process is obvious in
hindsight. For those who haven't gotten it to work yet, it can be very
complex, especially with web applications where four separate encodings can
come into play.

Postgres is quite intelligent in dealing with character sets, but even
so, it's possible to get into a situation where your application returns
garbage for Latin1 or other extended characters (n with a tilde,
etc.). This is particularly true in Java-based web applications.

There are four steps to ensuring proper character retrieval, with the
fourth applying to Java web applications. They are explained briefly here,
and then in more detail further down in the document:

1. Use a database encoding that will store the appropriate characters.

2. Set the client encoding first, before importing or adding data to
the database.

3. Set the client encoding for each session where a user or program
accesses the database.

4. Set the proper character encoding for the web page.

Do these four steps, and things should work out nicely. Miss even one
of them, and your application will mysteriously return goofy extended
characters.

For anyone who has already done this, the solution is probably
obvious. However, it took us months to figure this out through
trial-and-error, even with the help of the Java and Postgres
communities. After all, we had five different potential encodings to deal
with, in four separate steps, and that yields a LOT of non-working
combinations.

We found dozens of people with the same problem, but most of them never
fully resolved their problems. We tried a heck of a lot of different
strategies, but the only thing that worked was breaking the situation down
into these steps, and finding the right encoding combination for each step,
one-at-a-time. Now that our application works, we want to help minimize
the number of people who have to reinvent this wheel from scratch.

The rest of this document explains the steps in more detail.

---

1. Create the database with the proper encoding.

In our experience, the best thing to do is simply choose Unicode as the
database's character encoding if you think there's a chance of storing
Latin1 or other characters.

You could choose Latin1, and this should work in most cases. However,
there are times when normal-looking characters refuse to be stored in a
Latin1 database, such as character 239, which is the same in Latin1 as it
is is Utf-8. Rather than attempt to beat our heads against this wall, we
went with Unicode because it will hold whatever we need to hold.

---

2. Set client encoding before importing or adding data to the database.

If you do a mass import from another database, or a data retore from
Postgres, make sure the encoding is set first. If you try to restore a
unicode characters to a Latin1 database, the data likely won't look
right. Likewise, if you try to restore Latin1 characters into a Unicode
database without first telling Postgres that it's getting Latin1
characters, the extended characters will come back as garbage,
*even*though* a unicode database can hold them.

The command to set the client encoding for a session is:

"SET CLIENT_ENCODING TO Unicode", "SET CLIENT_ENCODING TO Latin1", and
so forth.

If you find out what kind of characters are coming into the postgres
database, and tell Postgres ahead of time, it should be able to
automatically translate them. As it is designed to do, of course. Just
remember that unicode is a HUGE character set, and there will be unicode
characters that can't be translated into the much smaller Latin1 character set.

You may have to experiment to find out what the right client encoding
is. It will likely depend upon the database from which you're getting
data, such as a backup of a Latin1 postgres database, or data retrieved
from a SQL Server database, which will have its own encoding. It could
also depend upon the character set used by your operating system, such as
the character set of data someone entered into a Microsoft Access database
(which, if they used the US English Windows, will be cp1252). Fortunately
a Unicode or Latin1 client encoding will handle most Windows-related
encoding issues.

---

3. Set the client encoding for each session where a user or program
accesses the database.

If you have users entering data from a Java application, the encoding
will vary from JVM to JVM. A windows-based JVM might have an encoding of
cp1252, while a Linus JVM might be utf-8. A windows-database application
will likely ave a cp1252 or Latin1 character set, but that could depend
upon what application you're using or what version of Windows you're using.

Postgres needs to know what kind of characters to store and return. If
you tell Postgres to return unicode characters to an application that's
using Latin1, extended characters will look like gibberish.

Even in Java, where you can do all sorts of character-encoding
translation, it can be impossible to translate data retrieved from Postgres
if it's in the wrong encoding. We've tried changing the JVM encoding,
altering the jdbc driver, translating encodings on the database read, and
translating encodings after the read while building a new string, to no
avail. We tried 25 combinations of each strategy (five different possible
read encodings and five different possible string encodings), and nothing
worked. We could get an application working in one JVM with one encoding,
but another JVM would break, and no amount of translation would help.

But when we finally told Postgres what to return, everythign worked like
a charm.

Just as with step two, the key is to use the "SET CLIENT_ENCODING TO
(encoding)" sql command. If you're using an application where you can send
SQL to the server, this is all you need. In something like MS Access,
you'll have to move to a passthrough query. For Java, you'll need to send
a command through JDBC:

String DBEncoding = "Unicode" //use a real encoding, either returned from
the jvm or explicitly stated
PreparedStatement statement = dbCon.prepareStatement("SET CLIENT_ENCODING
TO '" + DBEncoding + "'");
statement.execute();

---

4. Set the proper character encoding for the web page.

Web browsers also need to know what kind of characters they are
displaying. If you send them Unicode characters and the browser thinks
it's getting Latin1, extended characters will again come up as
nonsense. If you're writing a web-based application, it's a good idea to
include the character set as part of the html, such as:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

For example, we have an application that serves up pages perfectly from
a Linux JVM, which has a utf-8 encoding, and serves up pages with goofy
extended characters from a Windows JVM, which has a cp1252 encoding. The
solution is either to translate the strings in the Windows JVM into unicode
before serving the page, or alternately, tell the web page to use a
different character set.

---

Those are all the basics. Many times you may not have to worry about
one or more of the steps because your application server or JSP generator
may already handle the character set conversion. But if it doesn't, or if
you're writing Servlets and not JSP's, these steps are good ones to
consider if you work with Latin1 characters.

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Marc G. Fournier 2004-11-08 16:51:50 Re: RFD: comp.databases.postgresql.general
Previous Message Gary L. Burnore 2004-11-08 16:07:07 Re: RFD: comp.databases.postgresql.general

Browse pgsql-jdbc by date

  From Date Subject
Next Message Kris Jurka 2004-11-08 17:15:19 Re: [JDBC] Using Postgres with Latin1 (ISO8859-1) and Unicode (utf-8)
Previous Message Pallav Kalva 2004-11-08 15:01:04 Installing pgbench