From: | Heikki Linnakangas <hlinnakangas(at)vmware(dot)com> |
---|---|
To: | PostgreSQL-development <pgsql-hackers(at)postgreSQL(dot)org> |
Subject: | pg_dump/restore encoding woes |
Date: | 2013-08-26 15:26:52 |
Message-ID: | 521B73BC.3040907@vmware.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
pg_dump and pg_restore don't behave very nicely when the client and
server encodings don't match. Below are three issues that arise from
that. All the examples below use a console with a UTF-8 locale, and the
'latin1db' database uses ISO-8859-1 as the database encoding. In that
database, there is a single table called "pöö".
1. pg_dump verbose output
-------------------------
$ pg_dump -d latin1db -Fc -v -f a.backup
...
pg_dump: finding the columns and types of table "p��"
...
When client encoding is not specified explicitly with the -E option, or
PGCLIENTENCODING env variable, the dump is created in the server encoding.
Alexander Law reported this bug about a year ago, see bug #6742:
http://www.postgresql.org/message-id/E1SrOVd-00028F-Sz@wrigleys.postgresql.org.
Now, you can say that it's the user's fault for not specifying
client_encoding correctly, but see #2.
2. pg_dump -t option doesn't work if client_encoding is not set
---------------------------------------------------------------
$ ./pg_dump -d latin1db -Fc -t pöö -f a.backup
pg_dump: No matching tables were found
$ ./pg_dump -d latin1db -Fc -t pöö -f a.backup -E utf-8
(success)
The table name given in the argument is passed to the server without
translation, so client_encoding needs to be set or the server will not
interpret the table name correctly.
Like #1, this is a user-error - he needs to set client_encoding
correctly. Other client programs like vacuumdb have the same problem.
But we could do better. psql sets client_encoding automatically
(client_encoding='auto') based on the locale. Why don't we do the same
in all the client programs?
However, pg_dump is special, because client encoding affects not only
the encoding used to speak to the server, but it also determines how the
resulting dump is encoded. If you have a UTF-8 server, and a LATIN1
console, there is no way to get a UTF-8 encoded dump of a single table
which has non-ASCII characters in its name. There is a good reason to
want to dump in the server encoding regardless of the encoding of the
client: that avoids the costly encoding conversion during the dump, and
very likely another conversion back on restore. (as a convenience, it
would be nice if you could specify "-E server" to mean "same as server
encoding")
The pg_dump -E option just sets client_encoding, but I think it would be
better for -E to only set the encoding used in the dump, and
PGCLIENTENCODING env variable (if set) was used to determine the
encoding of the command-line arguments. Opinions?
3. pg_restore -t option doesn't work if dump is in different encoding
---------------------------------------------------------------------
$ pg_dump -d latin1db -Fc -f a.backup
$ ./pg_restore -t pöö a.backup
(restores nothing)
pg_restore doesn't convert encodings when it matches the table name
given with -t option with the table names in the dump. Hence in above
example, where the dump is in LATIN1 encoding and the console uses a
UTF-8 locale, the table name is not matched even though there is a table
with that name in the dump.
Unfortunately I don't see any easy solution to this third issue :-(. We
don't have any infrastructure to do encoding conversions in the client.
I guess we could use iconv(3) if it's available, but I'm a bit reluctant
to start using that, given that we've managed to do with out client-side
conversions this far. Or we could do the conversion in the server using
"convert_from()", but that only works if pg_restore is connected to a
database. Perhaps it's best to just throw a warning in if -t is used and
the dump's encoding doesn't match the current locale.
- Heikki
From | Date | Subject | |
---|---|---|---|
Next Message | Tom Lane | 2013-08-26 15:59:02 | Re: pg_dump/restore encoding woes |
Previous Message | Fujii Masao | 2013-08-26 13:48:46 | Re: pg_system_identifier() |