libpq5 8.3 breaks 8.2 compatibility with encodings

Lists: pgsql-bugs
From: Martin Pitt <martin(at)piware(dot)de>
To: PostgreSQL Bugs <pgsql-bugs(at)postgresql(dot)org>
Subject: libpq5 8.3 breaks 8.2 compatibility with encodings
Date: 2007-10-12 14:33:38
Message-ID: 20071012143338.GH5911@piware.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs

Hi PostgreSQL developers,

I'm currently hunting down the last postgresql-common test case
failure that I see with 8.3beta1. It seems the 8.3 version of libpq
changes some internal encoding lists?

If I use the 8.2 programs with the 8.2 library, all is well:

$ LC_ALL=en_US.UTF-8 /usr/lib/postgresql/8.2/bin/initdb --encoding UTF8 -D /tmp/x
[...]
The database cluster will be initialized with locale en_US.UTF-8.
[...]
$ /usr/lib/postgresql/8.2/bin/postgres -D /tmp/x -k /tmp &
$ /usr/lib/postgresql/8.2/bin/psql -Alth /tmp
postgres|martin|UTF8
template0|martin|UTF8
template1|martin|UTF8

However, if I use 8.2 programs with the 8.3 library, things start to
become weird:

$ # kill postgres instance
$ rm -rf /tmp/x; LC_ALL=en_US.UTF-8 /usr/lib/postgresql/8.2/bin/initdb --encoding UTF8 -D /tmp/x
[...]
The database cluster will be initialized with locale en_US.UTF-8.
initdb: warning: encoding mismatch
The encoding you selected (UTF8) and the encoding that the selected
locale uses (UTF-8) are not known to match. This may lead to
misbehavior in various character string processing functions. To fix
this situation, rerun initdb and either do not specify an encoding
explicitly, or choose a matching combination.
[...]
$ /usr/lib/postgresql/8.2/bin/postgres -D /tmp/x -k /tmp &
$ /usr/lib/postgresql/8.2/bin/psql -Alth /tmp
postgres|martin|JOHAB
template0|martin|JOHAB
template1|martin|JOHAB

In the latter configuration, when I do not explicitly specify an
encoding, the initdb output still looks weird, but at least the
result seems to be correct:

$ rm -rf /tmp/x; LC_ALL=en_US.UTF-8 /usr/lib/postgresql/8.2/bin/initdb -D /tmp/x
[...]
The database cluster will be initialized with locale en_US.UTF-8.
The default database encoding has accordingly been set to MULE_INTERNAL.
[...]
$ /usr/lib/postgresql/8.2/bin/postgres -D /tmp/x -k /tmp &
$ /usr/lib/postgresql/8.2/bin/psql -Alth /tmp
postgres|martin|UTF8
template0|martin|UTF8
template1|martin|UTF8

This is a bit unfortunate, since it breaks ABI compatibility without
announcing it in the SONAME. From the previous discussion it is quite
clear that a soname bump is a pain, so could this be changed somehow
to accomodate new encodings while remaining binary compatibility with
earlier releases?

Thanks,

Martin
--
Martin Pitt http://www.piware.de
Ubuntu Developer http://www.ubuntu.com
Debian Developer http://www.debian.org


From: Martin Pitt <martin(at)piware(dot)de>
To: pgsql-bugs(at)postgresql(dot)org
Subject: Re: libpq5 8.3 breaks 8.2 compatibility with encodings
Date: 2007-10-12 15:14:46
Message-ID: 20071012151446.GI5911@piware.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs

Hi,

Martin Pitt [2007-10-12 16:33 +0200]:
> I'm currently hunting down the last postgresql-common test case
> failure that I see with 8.3beta1. It seems the 8.3 version of libpq
> changes some internal encoding lists?

Ah, got it. The ordering in pg_enc2name_tbl[] changed, which makes the
indices jump around. This was introduced in [1], in particular in
those two bits:

http://developer.postgresql.org/cvsweb.cgi/pgsql/src/include/mb/pg_wchar.h.diff?r1=1.71;r2=1.72
http://developer.postgresql.org/cvsweb.cgi/pgsql/src/backend/utils/mb/encnames.c.diff?r1=1.32;r2=1.33

With attached patch (which restores the previous ordering)
compatibility with 8.2 is restored. This has two drawbacks:

* The enum cannot be nicely sorted by internal and client-only
encodings until libpq bumps soname again. This is only a cosmetical
problem, though.

* This patch needs another catalog bump (to "unbump" the
one in [1]). That's unfortunate, but the catalog number got bumped
in between beta and release in earlier versions, too, so I hope
it's not too bad.

The pg_enc2name_tbl declaration should probably have a comment saying
to never alter the order, but only append new stuff at the end. For
encodings which became obsolete (should that happen) there should be
an constant like "INVALID" or "DEPRECATED".

Thank you!

Martin

[1] http://archives.postgresql.org/pgsql-committers/2007-04/msg00198.php

--
Martin Pitt http://www.piware.de
Ubuntu Developer http://www.ubuntu.com
Debian Developer http://www.debian.org

Attachment Content-Type Size
13-libpq8.2-compat.patch text/x-diff 2.0 KB

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Martin Pitt <martin(at)piware(dot)de>
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: libpq5 8.3 breaks 8.2 compatibility with encodings
Date: 2007-10-12 15:50:35
Message-ID: 28072.1192204235@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs

Martin Pitt <martin(at)piware(dot)de> writes:
> Ah, got it. The ordering in pg_enc2name_tbl[] changed, which makes the
> indices jump around.

Sorry, you don't get to put JOHAB back into the portion of the list that
is backend-legal encodings.

It's a bit nasty that this enum is exposed as part of the ABI, but I'm
afraid we may be stuck with that decision.

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Martin Pitt <martin(at)piware(dot)de>
Cc: PostgreSQL Bugs <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: libpq5 8.3 breaks 8.2 compatibility with encodings
Date: 2007-10-12 16:02:44
Message-ID: 28378.1192204964@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs

Martin Pitt <martin(at)piware(dot)de> writes:
> However, if I use 8.2 programs with the 8.3 library, things start to
> become weird:

> $ # kill postgres instance
> $ rm -rf /tmp/x; LC_ALL=3Den_US.UTF-8 /usr/lib/postgresql/8.2/bin/initdb =
> --encoding UTF8 -D /tmp/x

Does anything other than initdb get weird?

For the most part I believe it's the case that libpq's idea of the enum
values is independent of the backend's. I think the issue here is that
initdb is (mis) using libpq's pg_char_to_encoding, etc, and combining
those functions with its own idea of the meanings of the enum values.

Maybe we should stop exporting pg_char_to_encoding and so on from libpq,
though I wonder if that would break any clients.

regards, tom lane


From: Martin Pitt <martin(at)piware(dot)de>
To: pgsql-bugs(at)postgresql(dot)org
Subject: Re: libpq5 8.3 breaks 8.2 compatibility with encodings
Date: 2007-10-12 16:18:45
Message-ID: 20071012161845.GK5911@piware.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs

Hi,

Tom Lane [2007-10-12 11:50 -0400]:
> Martin Pitt <martin(at)piware(dot)de> writes:
> > Ah, got it. The ordering in pg_enc2name_tbl[] changed, which makes the
> > indices jump around.
>
> Sorry, you don't get to put JOHAB back into the portion of the list that
> is backend-legal encodings.

Ah, the PG_ENCODING_BE_LAST magic.

> It's a bit nasty that this enum is exposed as part of the ABI, but I'm
> afraid we may be stuck with that decision.

Well, then I see two options for 8.3:

(1) Change the PG_ENCODING_IS_CLIENT_ONLY and PG_VALID_BE_ENCODING
macros to expliticy disallow encodings which have become client-only
while soname is not bumped. This is a bit ugly, but should work until
the table gets restructured to have a per-locale flag of
internal/clientonly, or the mapping stops being index-based.

I'm happy to check all 9 other places where pg_enc is used for
whether they need adaptions for dropped JOHAB (i. e. make assumptions
about the structure without using above macros).

(2) Bump the soname. That's definitively a huge PITA for
distributors, but it's still better than silently breaking the ABI.

So, with my distro hat on I'd definitively prefer (1), but if you want
(2) for cleanliness' sake, we have to follow and bite the bullet. But
we can't just let it stay like this.

Thank you, and have a good weekend!

Martin

--
Martin Pitt http://www.piware.de
Ubuntu Developer http://www.ubuntu.com
Debian Developer http://www.debian.org


From: Peter Eisentraut <peter_e(at)gmx(dot)net>
To: pgsql-bugs(at)postgresql(dot)org
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Martin Pitt <martin(at)piware(dot)de>
Subject: Re: libpq5 8.3 breaks 8.2 compatibility with encodings
Date: 2007-10-12 16:20:01
Message-ID: 200710121820.01779.peter_e@gmx.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs

Am Freitag, 12. Oktober 2007 schrieb Tom Lane:
> Sorry, you don't get to put JOHAB back into the portion of the list that
> is backend-legal encodings.
>
> It's a bit nasty that this enum is exposed as part of the ABI, but I'm
> afraid we may be stuck with that decision.

Perhaps we should leave a gap where JOHAB used to be and add it at the end.
Otherwise we may have to do a soname exercise.

--
Peter Eisentraut
http://developer.postgresql.org/~petere/


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Peter Eisentraut <peter_e(at)gmx(dot)net>
Cc: pgsql-bugs(at)postgresql(dot)org, Martin Pitt <martin(at)piware(dot)de>
Subject: Re: libpq5 8.3 breaks 8.2 compatibility with encodings
Date: 2007-10-12 16:54:27
Message-ID: 29275.1192208067@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs

Peter Eisentraut <peter_e(at)gmx(dot)net> writes:
> Perhaps we should leave a gap where JOHAB used to be and add it at the end.
> Otherwise we may have to do a soname exercise.

The question to me is whether the encoding numbers used by libpq
should be considered anything other than black-box numbers. initdb
is arguably assuming more than it ought to about what they mean.

regards, tom lane


From: "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Martin Pitt" <martin(at)piware(dot)de>, "PostgreSQL Bugs" <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: libpq5 8.3 breaks 8.2 compatibility with encodings
Date: 2007-10-12 17:00:36
Message-ID: 470FA834.2010908@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs

Tom Lane wrote:
> Martin Pitt <martin(at)piware(dot)de> writes:
>> However, if I use 8.2 programs with the 8.3 library, things start to
>> become weird:
>
>> $ # kill postgres instance
>> $ rm -rf /tmp/x; LC_ALL=3Den_US.UTF-8 /usr/lib/postgresql/8.2/bin/initdb =
>> --encoding UTF8 -D /tmp/x
>
> Does anything other than initdb get weird?
>
> For the most part I believe it's the case that libpq's idea of the enum
> values is independent of the backend's. I think the issue here is that
> initdb is (mis) using libpq's pg_char_to_encoding, etc, and combining
> those functions with its own idea of the meanings of the enum values.
>
> Maybe we should stop exporting pg_char_to_encoding and so on from libpq,
> though I wonder if that would break any clients.

I'm in favor not exporting them in the long term, a normal client
program should have no business calling those functions.

Note that we've also added PG_EUC_JIS_2004 and PG_SJIS (client-only) to
the enumeration. That means that all the previous client-only encodings
have been shifted by two.

If we want to keep the functions compatible when possible, we could:
* replace JOHAB in the enum with a deprecated placeholder, and
* move PG_JOHAB to the end of the enum, instead of the shoving it in the
middle of client-only encodings, and
* move PG_SJIS to the end, to shift the rest of the client-only
encodings by one, to compensate the addition of the new PG_EUC_JIS_2004
encoding.

PG_JOHAB and PG_SJIS would have different encoding identifiers than in
8.2, but the rest would stay put.

But OTOH, I feel we should just stop exporting them now; I don't think
there's a legitimate use case for a client application to use them. The
best I can think of is an application that would want to know what
encodings there is, and show them in a menu or something. But even then,
you shouldn't use the numeric values, just the encoding names.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Martin Pitt <martin(at)piware(dot)de>
To: PostgreSQL Bugs <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: libpq5 8.3 breaks 8.2 compatibility with encodings
Date: 2007-10-12 17:08:19
Message-ID: 20071012170819.GL5911@piware.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs

Hi,

Tom Lane [2007-10-12 12:02 -0400]:
> Does anything other than initdb get weird?

It's hard to tell, my test suite concentrates on hammering initdbs
with various locales, encodings, getting a chain of 7.4->8.{0,1,2,3}
upgrades and testing the conversion of postgresql.conf arguments, etc.
I do not do that much of locale juggling (only some particular tests
to check for the infamous CVE-2006-2313/4).

I'm just afraid there might be other lurking regressions. I can do
some tests with psql and set client_encoding, etc.

> For the most part I believe it's the case that libpq's idea of the enum
> values is independent of the backend's. I think the issue here is that
> initdb is (mis) using libpq's pg_char_to_encoding, etc, and combining
> those functions with its own idea of the meanings of the enum values.
>
> Maybe we should stop exporting pg_char_to_encoding and so on from libpq,
> though I wonder if that would break any clients.

Hm, at least that sounds like a good method to find out what other
parts of the code use this array directly.

Thanks,

Martin

--
Martin Pitt http://www.piware.de
Ubuntu Developer http://www.ubuntu.com
Debian Developer http://www.debian.org


From: "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>
To: "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>
Cc: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Martin Pitt" <martin(at)piware(dot)de>, "PostgreSQL Bugs" <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: libpq5 8.3 breaks 8.2 compatibility with encodings
Date: 2007-10-12 17:09:20
Message-ID: 470FAA40.4010505@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs

Heikki Linnakangas wrote:
> Note that we've also added PG_EUC_JIS_2004 and PG_SJIS (client-only) to
> the enumeration. That means that all the previous client-only encodings
> have been shifted by two.

On closer look, PG_SJIS is not new, PG_SHIFT_JIS-2004 is. But that was
added to the end, so the above should be "shifted by one", not two.

The rest of the mail got that right.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>
Cc: "Martin Pitt" <martin(at)piware(dot)de>, "PostgreSQL Bugs" <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: libpq5 8.3 breaks 8.2 compatibility with encodings
Date: 2007-10-12 17:23:12
Message-ID: 29815.1192209792@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs

"Heikki Linnakangas" <heikki(at)enterprisedb(dot)com> writes:
> But OTOH, I feel we should just stop exporting them now; I don't think
> there's a legitimate use case for a client application to use them.

Actually, that doesn't seem workable as a solution to our immediate
problem: we surely can't remove exported symbols without doing a soname
bump.

Also, in general it seems like a bad idea to try to freeze the values of
the enum: what happens when we want to add some more server-side
encodings? We will never be able to do that without moving the
client-only ones.

I think what we must do here is decouple libpq's values of the enum from
the backend's. As long as a client talking to libpq does not have any
preconceived notions about what the numbers mean, it's fine. (And in
that mindset, we *do* need to export pg_encoding_to_char and so on,
so that clients can use those functions to map to encoding names instead
of making it up themselves.)

I'm becoming more and more convinced that this is initdb's bug not
libpq's. The problem stems from initdb using libpq's functions and
assuming that its numbers match up with pg_wchar.h. But note that
pg_wchar.h is not exported by libpq.

regards, tom lane


From: Martin Pitt <martin(at)piware(dot)de>
To: PostgreSQL Bugs <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: libpq5 8.3 breaks 8.2 compatibility with encodings
Date: 2007-10-12 17:38:21
Message-ID: 20071012173821.GM5911@piware.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs

Hi,

Tom Lane [2007-10-12 13:23 -0400]:
> I'm becoming more and more convinced that this is initdb's bug not
> libpq's. The problem stems from initdb using libpq's functions and
> assuming that its numbers match up with pg_wchar.h. But note that
> pg_wchar.h is not exported by libpq.

Sounds convincing. The hard part is that this then also a bug in 8.2's
initdb, which cannot be changed, so at least for this case we'll need
a workaround.

Martin

--
Martin Pitt http://www.piware.de
Ubuntu Developer http://www.ubuntu.com
Debian Developer http://www.debian.org


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Martin Pitt <martin(at)piware(dot)de>
Cc: PostgreSQL Bugs <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: libpq5 8.3 breaks 8.2 compatibility with encodings
Date: 2007-10-12 17:54:14
Message-ID: 371.1192211654@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs

Martin Pitt <martin(at)piware(dot)de> writes:
> Sounds convincing. The hard part is that this then also a bug in 8.2's
> initdb, which cannot be changed,

Why not? Also, is there really a bug in 8.2 or before? chklocale.c
is new in 8.3.

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Martin Pitt <martin(at)piware(dot)de>
Cc: PostgreSQL Bugs <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: libpq5 8.3 breaks 8.2 compatibility with encodings
Date: 2007-10-12 21:46:45
Message-ID: 5586.1192225605@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs

OK, I spent some time analyzing this problem, and I think there's a
fairly simple way to avoid a libpq ABI break between 8.2 and 8.3.
Although we removed PG_JOHAB as a backend encoding, we added
PG_EUC_JIS_2004. Therefore, if we rearrange things to give
PG_EUC_JIS_2004 the old number of PG_JOHAB, and make sure that PG_JOHAB
and the other new frontend-only encodings are at the end, we will have
numerical compatibility for everything except PG_JOHAB itself --- and
even there the only apparent discrepancy will be in whether
pg_valid_server_encoding thinks it's a backend encoding or not.
(An 8.3 libpq will correctly report that it's not, where 8.2 thought
it was.) So this seems about as close as we can reasonably get.

Moving forward, we definitely want to decouple the libpq and backend
interpretations of the encoding enum so that we won't get burnt like
this again. I propose that we fix things so that henceforth no libpq
client is expected to import pg_wchar.h, and therefore can't have any
hardwired knowledge of the meaning of any given encoding number.
This would primarily mean acknowledging pg_encoding_to_char,
pg_char_to_encoding, and pg_valid_server_encoding as official parts of
libpq's API, by declaring them in libpq-fe.h. I'd also want to fix
initdb so that it doesn't link to libpq.so at all, but compiles its own
copies of the necessary files, thereby ensuring that its numbering is
compatible with chklocale.c (which is also statically linked into
initdb) and with the matching postgres executable.

The remaining stumbling block is psql/mbprint.c, which among other
nastiness has compiled-in dependencies on the numerical value of
PG_UTF8. I'm of the opinion that that entire file is misconceived;
at the very least it needs to be moved someplace else. That however
is more work than I want to tackle for 8.3 --- for one thing, the only
obvious place to put it is libpq.so, which we couldn't do without a
soname bump. So what I propose doing for now is allowing it to continue
to depend on pg_wchar.h, and inserting into the latter a caution that
PG_UTF8 mustn't be renumbered until the mbprint situation is cleaned up.

Comments, better ideas?

regards, tom lane