Re: pg_dump/restore encoding woes

From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: pg_dump/restore encoding woes
Date: 2013-08-27 14:36:51
Message-ID: 521CB983.1050509@vmware.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 26.08.2013 18:59, Tom Lane wrote:
> Heikki Linnakangas<hlinnakangas(at)vmware(dot)com> writes:
>> The pg_dump -E option just sets client_encoding, but I think it would be
>> better for -E to only set the encoding used in the dump, and
>> PGCLIENTENCODING env variable (if set) was used to determine the
>> encoding of the command-line arguments. Opinions?
>
> I think this is going to be a lot easier said than done, but feel free
> to see if you can make it work. (As you point out, we don't have
> any client-side encoding conversion infrastructure, but I don't see
> how you're going to make this work without it.)

First set client_encoding to PGCLIENTENCODING (ie. let libpq do its
default thing), and run the queries that fetch the OIDs of any matching
tables. Only after doing that, set client_encoding to that specified by
-E. That's actually pretty easy to do, as pg_dump already issues all the
queries that include user-given strings first.

There's one small wrinkle in that: if the dump fails because the server
sends an error, the error would come from the server in the -E encoding,
because that's used as the client_encoding after the initial queries.
Logically, the errors should be printed in PGCLIENTENCODING. But I think
we can live with that.

The pg_restore part is more difficult, as pg_restore needs to work
without a database connection at all. There the conversion has to be
done client-side.

> A second issue is whether we should divorce -E and PGCLIENTENCODING like
> that, when they have always meant the same thing. You mentioned the
> alternative of looking at pg_dump's locale environment to determine the
> command line encoding --- would that be better?

Hmm. I wasn't thinking of looking at the locale environment as an
alternative to the divorce, just as a way to determine the default for
the client encoding if PGCLIENTENCODING is not set.

It would be good for pg_dump to be consistent with other tools
(reindexdb etc.). If we say that LC_CTYPE determines the encoding of
command-line options, regardless of PGCLIENTENCODING, we should do the
same in other tools, too. And that would be weird for the other tools.
So no, I don't think we should do that.

My plan is to assume that the command-line options are in the "client
encoding", and the client encoding is determined by the following
things, in this order:

1. client_encoding setting given explicitly in the command line, as part
of the connection string.
2. PGCLIENTENCODING
3. LC_CTYPE
4. same as server_encoding

The encoding to be used in the pg_dump output is a different concept,
and there are reasons to *not* want the dump to be in the client
encoding. Hence using server_encoding for that would be a better default
than the current client encoding. This isn't so visible today, as
client_encoding defaults to server_encoding anyway, but it will be if we
set client_encoding='auto' (ie. look at LC_CTYPE) by default.

There are certainly cases where you'd want to use the client encoding as
the dump output encoding. For example if you pipe the pg_dump output to
some custom script that mangles it. Or if you just want to look at the
output, ie. if the output goes to a terminal. But for the usual case of
taking a backup, using the server encoding makes more sense. One option
is to check if the output is a tty (like psql does), and output the dump
in client or server encoding depending on that (if -E is not given).

So yeah, I think we should divorce -E and PGCLIENTENCODING. It feels
that using PGCLIENTENCODING to specify the output encoding was more
accidental than on purpose. Looking at the history, the implementation
has been that way forever, but the docs were adjusted by commit b524cb36
in 2005 to document that fact.

Here is a set of three patches that I've been working on:

0001-Divorce-pg_dump-E-option-from-PGCLIENTENCODING.patch

Separates pg_dump -E from PGCLIENTENCODING.

0002-Set-client_encoding-auto-in-all-client-utilities.patch

Set client_encoding='auto' in all the client utilities, like we already
did in psql. This fixes e.g "reindexdb -t" with a table name with
non-ASCII chars

0003-Convert-object-names-to-archive-encoding-before-matc.patch

Use iconv(3) in pg_restore to do encoding conversion in the client. This
involves a lot of autoconf changes that I'm not 100% sure about, other
than that it's pretty straightforward.

- Heikki

Attachment Content-Type Size
0001-Divorce-pg_dump-E-option-from-PGCLIENTENCODING.patch text/x-diff 6.6 KB
0002-Set-client_encoding-auto-in-all-client-utilities.patch text/x-diff 3.1 KB
0003-Convert-object-names-to-archive-encoding-before-matc.patch text/x-diff 115.6 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andres Freund 2013-08-27 14:40:35 error out when building pg_xlogdump with pgxs
Previous Message Atri Sharma 2013-08-27 14:36:21 Re: Valgrind Memcheck support