Re: pg_dump/restore encoding woes

Lists: pgsql-hackers
From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgreSQL(dot)org>
Subject: pg_dump/restore encoding woes
Date: 2013-08-26 15:26:52
Message-ID: 521B73BC.3040907@vmware.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

pg_dump and pg_restore don't behave very nicely when the client and
server encodings don't match. Below are three issues that arise from
that. All the examples below use a console with a UTF-8 locale, and the
'latin1db' database uses ISO-8859-1 as the database encoding. In that
database, there is a single table called "pöö".

1. pg_dump verbose output
-------------------------

$ pg_dump -d latin1db -Fc -v -f a.backup

...
pg_dump: finding the columns and types of table "p��"
...

When client encoding is not specified explicitly with the -E option, or
PGCLIENTENCODING env variable, the dump is created in the server encoding.

Alexander Law reported this bug about a year ago, see bug #6742:
http://www.postgresql.org/message-id/E1SrOVd-00028F-Sz@wrigleys.postgresql.org.

Now, you can say that it's the user's fault for not specifying
client_encoding correctly, but see #2.

2. pg_dump -t option doesn't work if client_encoding is not set
---------------------------------------------------------------

$ ./pg_dump -d latin1db -Fc -t pöö -f a.backup
pg_dump: No matching tables were found

$ ./pg_dump -d latin1db -Fc -t pöö -f a.backup -E utf-8
(success)

The table name given in the argument is passed to the server without
translation, so client_encoding needs to be set or the server will not
interpret the table name correctly.

Like #1, this is a user-error - he needs to set client_encoding
correctly. Other client programs like vacuumdb have the same problem.
But we could do better. psql sets client_encoding automatically
(client_encoding='auto') based on the locale. Why don't we do the same
in all the client programs?

However, pg_dump is special, because client encoding affects not only
the encoding used to speak to the server, but it also determines how the
resulting dump is encoded. If you have a UTF-8 server, and a LATIN1
console, there is no way to get a UTF-8 encoded dump of a single table
which has non-ASCII characters in its name. There is a good reason to
want to dump in the server encoding regardless of the encoding of the
client: that avoids the costly encoding conversion during the dump, and
very likely another conversion back on restore. (as a convenience, it
would be nice if you could specify "-E server" to mean "same as server
encoding")

The pg_dump -E option just sets client_encoding, but I think it would be
better for -E to only set the encoding used in the dump, and
PGCLIENTENCODING env variable (if set) was used to determine the
encoding of the command-line arguments. Opinions?

3. pg_restore -t option doesn't work if dump is in different encoding
---------------------------------------------------------------------

$ pg_dump -d latin1db -Fc -f a.backup
$ ./pg_restore -t pöö a.backup
(restores nothing)

pg_restore doesn't convert encodings when it matches the table name
given with -t option with the table names in the dump. Hence in above
example, where the dump is in LATIN1 encoding and the console uses a
UTF-8 locale, the table name is not matched even though there is a table
with that name in the dump.

Unfortunately I don't see any easy solution to this third issue :-(. We
don't have any infrastructure to do encoding conversions in the client.
I guess we could use iconv(3) if it's available, but I'm a bit reluctant
to start using that, given that we've managed to do with out client-side
conversions this far. Or we could do the conversion in the server using
"convert_from()", but that only works if pg_restore is connected to a
database. Perhaps it's best to just throw a warning in if -t is used and
the dump's encoding doesn't match the current locale.

- Heikki


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: pg_dump/restore encoding woes
Date: 2013-08-26 15:59:02
Message-ID: 66480.1377532742@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Heikki Linnakangas <hlinnakangas(at)vmware(dot)com> writes:
> When client encoding is not specified explicitly with the -E option, or
> PGCLIENTENCODING env variable, the dump is created in the server encoding.

Yeah, that's intentional as I recall.

> However, pg_dump is special, because client encoding affects not only
> the encoding used to speak to the server, but it also determines how the
> resulting dump is encoded. If you have a UTF-8 server, and a LATIN1
> console, there is no way to get a UTF-8 encoded dump of a single table
> which has non-ASCII characters in its name. There is a good reason to
> want to dump in the server encoding regardless of the encoding of the
> client: that avoids the costly encoding conversion during the dump, and
> very likely another conversion back on restore. (as a convenience, it
> would be nice if you could specify "-E server" to mean "same as server
> encoding")

There's a considerably more compelling reason than speed to default to
avoiding a conversion: doing a conversion carries significant risk of
outright failure, due to not being able to convert some data character
to the client character set.

> The pg_dump -E option just sets client_encoding, but I think it would be
> better for -E to only set the encoding used in the dump, and
> PGCLIENTENCODING env variable (if set) was used to determine the
> encoding of the command-line arguments. Opinions?

I think this is going to be a lot easier said than done, but feel free
to see if you can make it work. (As you point out, we don't have
any client-side encoding conversion infrastructure, but I don't see
how you're going to make this work without it.)

A second issue is whether we should divorce -E and PGCLIENTENCODING like
that, when they have always meant the same thing. You mentioned the
alternative of looking at pg_dump's locale environment to determine the
command line encoding --- would that be better?

regards, tom lane


From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: pg_dump/restore encoding woes
Date: 2013-08-27 14:36:51
Message-ID: 521CB983.1050509@vmware.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 26.08.2013 18:59, Tom Lane wrote:
> Heikki Linnakangas<hlinnakangas(at)vmware(dot)com> writes:
>> The pg_dump -E option just sets client_encoding, but I think it would be
>> better for -E to only set the encoding used in the dump, and
>> PGCLIENTENCODING env variable (if set) was used to determine the
>> encoding of the command-line arguments. Opinions?
>
> I think this is going to be a lot easier said than done, but feel free
> to see if you can make it work. (As you point out, we don't have
> any client-side encoding conversion infrastructure, but I don't see
> how you're going to make this work without it.)

First set client_encoding to PGCLIENTENCODING (ie. let libpq do its
default thing), and run the queries that fetch the OIDs of any matching
tables. Only after doing that, set client_encoding to that specified by
-E. That's actually pretty easy to do, as pg_dump already issues all the
queries that include user-given strings first.

There's one small wrinkle in that: if the dump fails because the server
sends an error, the error would come from the server in the -E encoding,
because that's used as the client_encoding after the initial queries.
Logically, the errors should be printed in PGCLIENTENCODING. But I think
we can live with that.

The pg_restore part is more difficult, as pg_restore needs to work
without a database connection at all. There the conversion has to be
done client-side.

> A second issue is whether we should divorce -E and PGCLIENTENCODING like
> that, when they have always meant the same thing. You mentioned the
> alternative of looking at pg_dump's locale environment to determine the
> command line encoding --- would that be better?

Hmm. I wasn't thinking of looking at the locale environment as an
alternative to the divorce, just as a way to determine the default for
the client encoding if PGCLIENTENCODING is not set.

It would be good for pg_dump to be consistent with other tools
(reindexdb etc.). If we say that LC_CTYPE determines the encoding of
command-line options, regardless of PGCLIENTENCODING, we should do the
same in other tools, too. And that would be weird for the other tools.
So no, I don't think we should do that.

My plan is to assume that the command-line options are in the "client
encoding", and the client encoding is determined by the following
things, in this order:

1. client_encoding setting given explicitly in the command line, as part
of the connection string.
2. PGCLIENTENCODING
3. LC_CTYPE
4. same as server_encoding

The encoding to be used in the pg_dump output is a different concept,
and there are reasons to *not* want the dump to be in the client
encoding. Hence using server_encoding for that would be a better default
than the current client encoding. This isn't so visible today, as
client_encoding defaults to server_encoding anyway, but it will be if we
set client_encoding='auto' (ie. look at LC_CTYPE) by default.

There are certainly cases where you'd want to use the client encoding as
the dump output encoding. For example if you pipe the pg_dump output to
some custom script that mangles it. Or if you just want to look at the
output, ie. if the output goes to a terminal. But for the usual case of
taking a backup, using the server encoding makes more sense. One option
is to check if the output is a tty (like psql does), and output the dump
in client or server encoding depending on that (if -E is not given).

So yeah, I think we should divorce -E and PGCLIENTENCODING. It feels
that using PGCLIENTENCODING to specify the output encoding was more
accidental than on purpose. Looking at the history, the implementation
has been that way forever, but the docs were adjusted by commit b524cb36
in 2005 to document that fact.

Here is a set of three patches that I've been working on:

0001-Divorce-pg_dump-E-option-from-PGCLIENTENCODING.patch

Separates pg_dump -E from PGCLIENTENCODING.

0002-Set-client_encoding-auto-in-all-client-utilities.patch

Set client_encoding='auto' in all the client utilities, like we already
did in psql. This fixes e.g "reindexdb -t" with a table name with
non-ASCII chars

0003-Convert-object-names-to-archive-encoding-before-matc.patch

Use iconv(3) in pg_restore to do encoding conversion in the client. This
involves a lot of autoconf changes that I'm not 100% sure about, other
than that it's pretty straightforward.

- Heikki

Attachment Content-Type Size
0001-Divorce-pg_dump-E-option-from-PGCLIENTENCODING.patch text/x-diff 6.6 KB
0002-Set-client_encoding-auto-in-all-client-utilities.patch text/x-diff 3.1 KB
0003-Convert-object-names-to-archive-encoding-before-matc.patch text/x-diff 115.6 KB

From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: pg_dump/restore encoding woes
Date: 2013-08-27 15:03:58
Message-ID: 521CBFDE.5080508@dunslane.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


On 08/27/2013 10:36 AM, Heikki Linnakangas wrote:
> 0001-Divorce-pg_dump-E-option-from-PGCLIENTENCODING.patch
>
> Separates pg_dump -E from PGCLIENTENCODING.
>
>

Wouldn't it be better to do this another way? Separating these two will
be confusing, to say the least, as well as inconsistent with what os
done elsewhere. Why not provide a new option that specifically allows a
client encoding that only applies during the query phase?

cheers

andrew


From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: pg_dump/restore encoding woes
Date: 2013-08-27 15:14:24
Message-ID: 521CC250.6040101@vmware.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 27.08.2013 18:03, Andrew Dunstan wrote:
>
> On 08/27/2013 10:36 AM, Heikki Linnakangas wrote:
>> 0001-Divorce-pg_dump-E-option-from-PGCLIENTENCODING.patch
>>
>> Separates pg_dump -E from PGCLIENTENCODING.
>
> Wouldn't it be better to do this another way? Separating these two will
> be confusing, to say the least, as well as inconsistent with what os
> done elsewhere.

What would it be inconsistent with? There is no -E option in other
client tools, pg_dump is unique in that. initdb does have a -E option,
but that *is* separate from PGCLIENTENCODING, so if anything the current
situation is inconsistent.

> Why not provide a new option that specifically allows a
> client encoding that only applies during the query phase?

Well, that would be different from all the other client programs. And
doesn't seem any less confusing to me.

- Heikki


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: pg_dump/restore encoding woes
Date: 2013-08-27 16:12:24
Message-ID: 521CCFE8.5050604@dunslane.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


On 08/27/2013 11:14 AM, Heikki Linnakangas wrote:
> On 27.08.2013 18:03, Andrew Dunstan wrote:
>>
>> On 08/27/2013 10:36 AM, Heikki Linnakangas wrote:
>>> 0001-Divorce-pg_dump-E-option-from-PGCLIENTENCODING.patch
>>>
>>> Separates pg_dump -E from PGCLIENTENCODING.
>>
>> Wouldn't it be better to do this another way? Separating these two will
>> be confusing, to say the least, as well as inconsistent with what os
>> done elsewhere.
>
> What would it be inconsistent with? There is no -E option in other
> client tools, pg_dump is unique in that. initdb does have a -E option,
> but that *is* separate from PGCLIENTENCODING, so if anything the
> current situation is inconsistent.

Yeah, you're right, I was probably thinking of initdb, although it
doesn't so much separate these as ignore PGCLIENTENCODING completely.

I guess I'm mainly concerned that we're going to make one of these do
something different, and it will be hard to remember which is which, at
least for me (brain cells are no doubt dying at an ever increasing rate
as I approach my seventh decade.)

cheers

andrew


From: Peter Eisentraut <peter_e(at)gmx(dot)net>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: pg_dump/restore encoding woes
Date: 2013-08-28 04:14:39
Message-ID: 1377663279.14126.5.camel@vanquo.pezone.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, 2013-08-27 at 17:36 +0300, Heikki Linnakangas wrote:
> Here is a set of three patches that I've been working on:

There is a compiler warning:

pg_backup_db.c:243:0: warning: "PARAMS_ARRAY_SIZE" redefined [enabled by default]
pg_backup_db.c:139:0: note: this is the location of the previous definition


From: Amit Khandekar <amit(dot)khandekar(at)enterprisedb(dot)com>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: pg_dump/restore encoding woes
Date: 2013-09-25 07:19:55
Message-ID: CACoZds2K+imiraAZNx85eb4ibxnK2EJ3mNJJ=5RP=XAh2HaWGQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 27 August 2013 20:06, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com> wrote:

> On 26.08.2013 18:59, Tom Lane wrote:
>
>> Heikki Linnakangas<hlinnakangas(at)**vmware(dot)com <hlinnakangas(at)vmware(dot)com>>
>> writes:
>>
>> The pg_dump -E option just sets client_encoding, but I think it would be
>>> better for -E to only set the encoding used in the dump, and
>>> PGCLIENTENCODING env variable (if set) was used to determine the
>>> encoding of the command-line arguments. Opinions?
>>>
>>
>> I think this is going to be a lot easier said than done, but feel free
>> to see if you can make it work. (As you point out, we don't have
>> any client-side encoding conversion infrastructure, but I don't see
>> how you're going to make this work without it.)
>>
>
> First set client_encoding to PGCLIENTENCODING (ie. let libpq do its
> default thing), and run the queries that fetch the OIDs of any matching
> tables. Only after doing that, set client_encoding to that specified by -E.
> That's actually pretty easy to do, as pg_dump already issues all the
> queries that include user-given strings first.
>
> There's one small wrinkle in that: if the dump fails because the server
> sends an error, the error would come from the server in the -E encoding,
> because that's used as the client_encoding after the initial queries.
> Logically, the errors should be printed in PGCLIENTENCODING. But I think we
> can live with that.
>
> The pg_restore part is more difficult, as pg_restore needs to work without
> a database connection at all. There the conversion has to be done
> client-side.
>
>
> A second issue is whether we should divorce -E and PGCLIENTENCODING like
>> that, when they have always meant the same thing. You mentioned the
>> alternative of looking at pg_dump's locale environment to determine the
>> command line encoding --- would that be better?
>>
>
> Hmm. I wasn't thinking of looking at the locale environment as an
> alternative to the divorce, just as a way to determine the default for the
> client encoding if PGCLIENTENCODING is not set.
>
> It would be good for pg_dump to be consistent with other tools (reindexdb
> etc.). If we say that LC_CTYPE determines the encoding of command-line
> options, regardless of PGCLIENTENCODING, we should do the same in other
> tools, too. And that would be weird for the other tools. So no, I don't
> think we should do that.
>
> My plan is to assume that the command-line options are in the "client
> encoding", and the client encoding is determined by the following things,
> in this order:
>
> 1. client_encoding setting given explicitly in the command line, as part
> of the connection string.
> 2. PGCLIENTENCODING
> 3. LC_CTYPE
> 4. same as server_encoding
>
> The encoding to be used in the pg_dump output is a different concept, and
> there are reasons to *not* want the dump to be in the client encoding.
> Hence using server_encoding for that would be a better default than the
> current client encoding. This isn't so visible today, as client_encoding
> defaults to server_encoding anyway, but it will be if we set
> client_encoding='auto' (ie. look at LC_CTYPE) by default.
>
> There are certainly cases where you'd want to use the client encoding as
> the dump output encoding. For example if you pipe the pg_dump output to
> some custom script that mangles it. Or if you just want to look at the
> output, ie. if the output goes to a terminal. But for the usual case of
> taking a backup, using the server encoding makes more sense. One option is
> to check if the output is a tty (like psql does), and output the dump in
> client or server encoding depending on that (if -E is not given).
>
> So yeah, I think we should divorce -E and PGCLIENTENCODING. It feels that
> using PGCLIENTENCODING to specify the output encoding was more accidental
> than on purpose. Looking at the history, the implementation has been that
> way forever, but the docs were adjusted by commit b524cb36 in 2005 to
> document that fact.
>
> Here is a set of three patches that I've been working on:
>
> 0001-Divorce-pg_dump-E-option-**from-PGCLIENTENCODING.patch
>
> Separates pg_dump -E from PGCLIENTENCODING.
>

Since AH->encoding is now used only to represent dump encoding, we should
rename it as AH->dump_encoding.

-----

The standard_conforming_strings parameter is currently set in
setup_connection. You have moved it in setup_dump_encoding(). Is it in any
way related to encoding ? If not, I think we should keep it in
setup_connection().

-----

If a user supplies pg_dump --role=ROLENAME, we do the SET-ROLE in
setup_connection. The role name is in client encoding. In case of parallel
pg_dump, the setup_connection() in the parent process does it right, but in
the worker processes, the client_encoding is already set to the
dump_encoding when -E is given (parent has already updated it's
client_encoding to dump_encoding and I think the child inherits
client_encoding from parent), and then in the child when SET-ROLE is called
through setup_connection, it does not work because role name is in client
encoding, not dump encoding.

$ pg_dump -t "pöö" -E LATIN1 --role=uöö postgres
# Above succeeds

$ `which pg_dump` -t "pöö" -E LATIN1 -j 5 -f dumpdir -Fd postgres
--role=uöö
pg_dump: [archiver (db)] query failed: ERROR: role "uöö" does not exist
pg_dump: [parallel archiver] query was: SET ROLE "uöö"
pg_dump: [archiver (db)] query failed: ERROR: role "uöö" does not exist
pg_dump: [archiver (db)] query failed: ERROR: role "uöö" does not exist
pg_dump: [archiver (db)] query failed: ERROR: role "uöö" does not exist
pg_dump: [archiver (db)] query failed: ERROR: role "uöö" does not exist

>
> 0002-Set-client_encoding-auto-**in-all-client-utilities.patch

> Set client_encoding='auto' in all the client utilities, like we already
> did in psql. This fixes e.g "reindexdb -t" with a table name with non-ASCII
> chars
>
>
This patch looks good, I don't have any issues with this.

> 0003-Convert-object-names-to-**archive-encoding-before-matc.**patch
>
> Use iconv(3) in pg_restore to do encoding conversion in the client. This
> involves a lot of autoconf changes that I'm not 100% sure about, other than
> that it's pretty straightforward.

I haven't looked into this one yet.

>
>
> - Heikki
>
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers(at)postgresql(dot)org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers
>
>


From: Amit Khandekar <amit(dot)khandekar(at)enterprisedb(dot)com>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: pg_dump/restore encoding woes
Date: 2013-10-01 09:54:47
Message-ID: CACoZds2sP+yZ_s707kzaeYdVXB1XnUDMMeRkEn7M0+sBgwVwQA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 25 September 2013 12:49, Amit Khandekar
<amit(dot)khandekar(at)enterprisedb(dot)com>wrote:

>
> 0003-Convert-object-names-to-**archive-encoding-before-matc.**patch
>>
>> Use iconv(3) in pg_restore to do encoding conversion in the client. This
>> involves a lot of autoconf changes that I'm not 100% sure about, other than
>> that it's pretty straightforward.
>
>
> I haven't looked into this one yet.
>

I have verified that the *.m4 files that you have included match the ones
available through gettext. In general, the iconv-related config changes
look good.

---------------

I wanted to compare the new configure file that is included in the patch
with one that I tried generating myself from the new configure.in file
modified by your patch. It gives me :
Autoconf version 2.63 is required.
My autoconf is 2.68. Do we have to use 2.63 only ?

---------------
pg_restore.c:
cli_encoding can be declared in the if block.

---------------
pg_restore.c:
get_command_line_encoding() is called twice here :
cli_encoding = get_command_line_encoding();
if (cli_encoding > 0 && AH->encoding != get_command_line_encoding())

---------------
pg_restore.c:
You actually intended to use convert_simple_string_list() rather than
convert_string() below:
convert_string(cd, &opts->schemaNames);
convert_string(cd, &opts->functionNames);
convert_string(cd, &opts->indexNames);
convert_string(cd, &opts->triggerNames);

>
>>
>>
>> - Heikki
>>
>>
>> --
>> Sent via pgsql-hackers mailing list (pgsql-hackers(at)postgresql(dot)org)
>> To make changes to your subscription:
>> http://www.postgresql.org/mailpref/pgsql-hackers
>>
>>
>