[bug fix] multibyte messages are displayed incorrectly on the client

Lists: pgsql-hackers
From: "MauMau" <maumau307(at)gmail(dot)com>
To: <pgsql-hackers(at)postgresql(dot)org>
Subject: [bug fix] multibyte messages are displayed incorrectly on the client
Date: 2013-12-13 13:41:17
Message-ID: 60A7F9F5B25B4F1A9888F18E883E5581@maumau

Hello,

The attached patch fixes incorrect message output on the client side. I
guess this problem can happen with any major release. Could you review
this?

[Problem]
When the client's locale differs from the server's message locale, the
messages generated on the server are converted appropriately and sent to the
client. For example, if the server runs on Linux with lc_messages =
'ja_JP.UTF-8' in postgresql.conf, and you run psql on Windows where the
system locale is SJIS, Japanese messages are converted from UTF-8 to SJIS on
the server and sent to psql. psql can display those SJIS messages
correctly. This is no problem.

However, this desirable behavior holds true only after the database session
is established. The error messages during session establishment are
displayed incorrectly, and you cannot recognize the message contents. For
example, run psql -d postgres -U non-existent-username. The displayed
message is unrecognizable.

[Cause]
While the session is being established, the server cannot use the client
encoding for message conversion yet, because it cannot access system
catalogs to retrieve conversion functions. So, the server sends messages to
the client without conversion. In the above example, the server sends
Japanese UTF-8 messages to psql, which expects those messages in SJIS.
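The mismatch can be reproduced outside PostgreSQL. The following is a minimal sketch, assuming Python's cp932 codec stands in for the Windows SJIS locale; the Japanese string is an illustrative message rather than the exact server wording, and as_seen_by_client is a hypothetical helper.

```python
# Illustrative messages; the Japanese wording is an assumption, not the
# exact text the server produces.
message_ja = 'ロール"nobody"は存在しません'
message_en = 'role "nobody" does not exist'


def as_seen_by_client(message, server_encoding="utf-8", client_encoding="cp932"):
    """Encode on the server without conversion, then decode the same bytes
    with the encoding the client expects."""
    raw = message.encode(server_encoding)
    # errors="replace" mimics a terminal that shows a placeholder for
    # byte sequences it cannot interpret.
    return raw.decode(client_encoding, errors="replace")


garbled = as_seen_by_client(message_ja)
readable = as_seen_by_client(message_en)

print(garbled)   # unreadable: UTF-8 bytes interpreted as SJIS
print(readable)  # unchanged: ASCII bytes mean the same in both encodings
```

The ASCII-only message survives because both encodings agree on the ASCII range, which is the property the proposed fix relies on.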

[Fix]
Disable message localization during session startup. In other words,
messages are output in English until the database session is established.

Regards
MauMau

Attachment: no_localize_message_in_startup.patch (application/octet-stream, 4.8 KB)

From: Bruce Momjian <bruce(at)momjian(dot)us>
To: MauMau <maumau307(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: [bug fix] multibyte messages are displayed incorrectly on the client
Date: 2013-12-17 18:42:08
Message-ID: 20131217184208.GH19059@momjian.us
Lists: pgsql-hackers

On Fri, Dec 13, 2013 at 10:41:17PM +0900, MauMau wrote:
> [Cause]
> While the session is being established, the server cannot use the
> client encoding for message conversion yet, because it cannot access
> system catalogs to retrieve conversion functions. So, the server
> sends messages to the client without conversion. In the above
> example, the server sends Japanese UTF-8 messages to psql, which
> expects those messages in SJIS.
>
>
> [Fix]
> Disable message localization during session startup. In other
> words, messages are output in English until the database session is
> established.

I think the question is whether the server encoding or English is
likely to be better for the average client. My bet is that the server
encoding is more likely correct.

However, you are right that English/ASCII at least will always be
viewable, while there are many server/client combinations that will
produce unreadable characters.

I would be interested to hear other people's experience with this.

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ Everyone has their own god. +


From: Noah Misch <noah(at)leadboat(dot)com>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: MauMau <maumau307(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [bug fix] multibyte messages are displayed incorrectly on the client
Date: 2013-12-20 03:07:25
Message-ID: 20131220030725.GA1411150@tornado.leadboat.com
Lists: pgsql-hackers

On Tue, Dec 17, 2013 at 01:42:08PM -0500, Bruce Momjian wrote:
> On Fri, Dec 13, 2013 at 10:41:17PM +0900, MauMau wrote:
> > [Cause]
> > While the session is being established, the server cannot use the
> > client encoding for message conversion yet, because it cannot access
> > system catalogs to retrieve conversion functions. So, the server
> > sends messages to the client without conversion. In the above
> > example, the server sends Japanese UTF-8 messages to psql, which
> > expects those messages in SJIS.

Better to attack that directly. Arrange to apply any client_encoding named in
the startup packet earlier, before authentication. This relates to the TODO
item "Let the client indicate character encoding of database names, user
names, and passwords". (I expect such an endeavor to be tricky.)

> > [Fix]
> > Disable message localization during session startup. In other
> > words, messages are output in English until the database session is
> > established.
>
> I think the question is whether the server encoding or English are
> likely to be better for the average client. My bet is that the server
> encoding is more likely correct.
>
> However, you are right that English/ASCII at least will always be
> viewable, while there are many server/client combinations that will
> produce unreadable characters.
>
> I would be interested to hear other people's experience with this.

I don't have a sufficient sense of multilingualism among our users to know
whether English/ASCII messages would be more useful, on average, than
localized messages in the server encoding. Forcing English/ASCII does worsen
behavior in the frequent situation where client encoding will match server
encoding. I lean toward retaining the status quo of delivering localized
messages in the server encoding.

Thanks,
nm

--
Noah Misch
EnterpriseDB http://www.enterprisedb.com


From: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
To: Noah Misch <noah(at)leadboat(dot)com>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, MauMau <maumau307(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Re: [bug fix] multibyte messages are displayed incorrectly on the client
Date: 2013-12-20 13:46:58
Message-ID: 20131220134658.GS11006@eldon.alvh.no-ip.org
Lists: pgsql-hackers

Noah Misch escribió:
> On Tue, Dec 17, 2013 at 01:42:08PM -0500, Bruce Momjian wrote:
> > On Fri, Dec 13, 2013 at 10:41:17PM +0900, MauMau wrote:

> > > [Fix]
> > > Disable message localization during session startup. In other
> > > words, messages are output in English until the database session is
> > > established.
> >
> > I think the question is whether the server encoding or English are
> > likely to be better for the average client. My bet is that the server
> > encoding is more likely correct.
> >
> > However, you are right that English/ASCII at least will always be
> > viewable, while there are many server/client combinations that will
> > produce unreadable characters.
> >
> > I would be interested to hear other people's experience with this.
>
> I don't have a sufficient sense of multilingualism among our users to know
> whether English/ASCII messages would be more useful, on average, than
> localized messages in the server encoding. Forcing English/ASCII does worsen
> behavior in the frequent situation where client encoding will match server
> encoding. I lean toward retaining the status quo of delivering localized
> messages in the server encoding.

The problem is that if there's an encoding mismatch, the message might
be impossible to figure out. If the message is in English, at least it
can be searched for on the web, or something -- the user might even find
a page in which the English error string appears, with a native-language
explanation.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: "MauMau" <maumau307(at)gmail(dot)com>
To: "Noah Misch" <noah(at)leadboat(dot)com>, "Bruce Momjian" <bruce(at)momjian(dot)us>
Cc: <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [bug fix] multibyte messages are displayed incorrectly on the client
Date: 2013-12-22 10:51:55
Message-ID: 2C7051B8F82C4461A1AB02FE9A2E8CF1@maumau
Lists: pgsql-hackers

From: "Noah Misch" <noah(at)leadboat(dot)com>
> Better to attack that directly. Arrange to apply any client_encoding
> named in
> the startup packet earlier, before authentication. This relates to the
> TODO
> item "Let the client indicate character encoding of database names, user
> names, and passwords". (I expect such an endeavor to be tricky.)

Unfortunately, character set conversion is not possible until the database
session is established, since it requires system catalog access. Please see
the comment in src/backend/utils/mb/mbutils.c:

* During backend startup we can't set client encoding because we (a)
* can't look up the conversion functions, and (b) may not know the database
* encoding yet either. So SetClientEncoding() just accepts anything and
* remembers it for InitializeClientEncoding() to apply later.
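
The deferral that comment describes can be modeled roughly as follows. This is only a sketch of the pattern, not the C code in mbutils.c: the class, its attributes, and the use of Python's codecs.lookup as a stand-in for the catalog lookup of conversion functions are all assumptions made for illustration.

```python
import codecs


class ClientEncodingState:
    """Hypothetical model of 'remember now, apply later'."""

    def __init__(self):
        self.pending = None          # name requested during startup
        self.active = None           # name actually in effect
        self.catalogs_ready = False  # conversion lookup not possible yet

    def set_client_encoding(self, name):
        if not self.catalogs_ready:
            # During startup: accept anything and just remember it.
            self.pending = name
            return
        self._apply(name)

    def initialize_client_encoding(self):
        # Called once the session can look up conversions.
        self.catalogs_ready = True
        if self.pending is not None:
            self._apply(self.pending)
            self.pending = None

    def _apply(self, name):
        codecs.lookup(name)  # raises LookupError for an unknown encoding
        self.active = name


state = ClientEncodingState()
state.set_client_encoding("SJIS")
during_startup = state.active        # still None: nothing converted yet
state.initialize_client_encoding()
after_startup = state.active         # "SJIS"
```

Any message sent while active is still None goes out unconverted, which is the window this thread is about.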

I guess that's why Tom-san suggested the same solution as my patch (as a
compromise) in the thread below, which is also a TODO item:

Re: encoding of PostgreSQL messages
http://www.postgresql.org/message-id/19896.1234107496@sss.pgh.pa.us

From: "Alvaro Herrera" <alvherre(at)2ndquadrant(dot)com>
> The problem is that if there's an encoding mismatch, the message might
> be impossible to figure out. If the message is in english, at least it
> can be searched for in the web, or something -- the user might even find
> a page in which the english error string appears, with a native language
> explanation.

I feel the same way. Being readable in English is better than being
unrecognizable.

Regards
MauMau


From: Noah Misch <noah(at)leadboat(dot)com>
To: MauMau <maumau307(at)gmail(dot)com>
Cc: alvherre(at)2ndquadrant(dot)com, Bruce Momjian <bruce(at)momjian(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [bug fix] multibyte messages are displayed incorrectly on the client
Date: 2013-12-30 03:02:07
Message-ID: 20131230030207.GA1551279@tornado.leadboat.com
Lists: pgsql-hackers

On Sun, Dec 22, 2013 at 07:51:55PM +0900, MauMau wrote:
> From: "Noah Misch" <noah(at)leadboat(dot)com>
> >Better to attack that directly. Arrange to apply any
> >client_encoding named in
> >the startup packet earlier, before authentication. This relates
> >to the TODO
> >item "Let the client indicate character encoding of database names, user
> >names, and passwords". (I expect such an endeavor to be tricky.)
>
> Unfortunately, character set conversion is not possible until the
> database session is established, since it requires system catalog
> access. Please see the comment in src/backend/utils/mb/mbutils.c:
>
> * During backend startup we can't set client encoding because we (a)
> * can't look up the conversion functions, and (b) may not know the database
> * encoding yet either. So SetClientEncoding() just accepts anything and
> * remembers it for InitializeClientEncoding() to apply later.

Yes, changing that is the tricky part.

> I guess that's why Tom-san suggested the same solution as my patch
> (as a compromise) in the below thread, which is also a TODO item:
>
> Re: encoding of PostgreSQL messages
> http://www.postgresql.org/message-id/19896.1234107496@sss.pgh.pa.us

That's fair for the necessarily-earliest messages, like 'invalid value for
parameter "client_encoding"' and messages pertaining to the physical structure
of the startup packet. The client's encoding expectation is unknowable. An
error that mentions "client_encoding" will hopefully put users on the right
track regardless of how we translate and encode the surrounding words. The
other affected messages are quite technical, making a casual user unlikely to
fix or even see them. Not so for authentication messages, so I'm wary of
forcing use of ASCII that late in the handshake.

Note that choosing to use ASCII need not imply wholly declining to translate.
If the build uses GNU libiconv, gettext can emit ASCII approximations for
translations that conform to a Latin-derived alphabet, falling back to no
translation where the alphabet differs too much. pg_perm_setlocale(LC_CTYPE,
"C") requests such behavior. (The inferior iconv //TRANSLIT implementation of
GNU libc will convert non-ASCII characters to question marks, though.)
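
The idea of an ASCII approximation with fallback can be sketched as below. Python has no standard iconv binding, so this swaps in Unicode decomposition (unicodedata) for libiconv's transliteration; it is not what gettext does internally, and the function name and sample strings are hypothetical.

```python
import unicodedata


def ascii_approximation(translated, untranslated):
    """Return an ASCII rendering of the translated message if one can be
    derived by dropping diacritics; otherwise fall back to the
    untranslated (English) message."""
    decomposed = unicodedata.normalize("NFKD", translated)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    try:
        stripped.encode("ascii")
    except UnicodeEncodeError:
        # The alphabet differs too much from ASCII: no translation.
        return untranslated
    return stripped


english = "password authentication failed"
spanish_result = ascii_approximation("la autenticación password falló", english)
japanese_result = ascii_approximation("パスワード認証に失敗しました", english)
```

Here the Spanish sample keeps its wording minus accents, while the Japanese sample falls back to English.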

> From: "Alvaro Herrera" <alvherre(at)2ndquadrant(dot)com>
> >The problem is that if there's an encoding mismatch, the message might
> >be impossible to figure out. If the message is in english, at least it
> >can be searched for in the web, or something -- the user might even find
> >a page in which the english error string appears, with a native language
> >explanation.
>
> I feel like this, too. Being readable in English is better than
> being unrecognizable.

I agree that English consistently beats mojibake. I question whether that
makes up for the loss of translation when encodings do happen to match,
particularly for non-technical errors like a mistyped password. The
everything-UTF8 scenario appears often, perhaps explaining infrequent
complaints about the status quo. If 90% of translated message users have
client_encoding != server_encoding, then +1 for your patch's strategy. If the
figure is only 60%, I'd vote for holding out for a more-extensive fix that
allows us to encoding-convert localized authentication failure messages.

--
Noah Misch
EnterpriseDB http://www.enterprisedb.com


From: "MauMau" <maumau307(at)gmail(dot)com>
To: "Noah Misch" <noah(at)leadboat(dot)com>
Cc: <alvherre(at)2ndquadrant(dot)com>, "Bruce Momjian" <bruce(at)momjian(dot)us>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [bug fix] multibyte messages are displayed incorrectly on the client
Date: 2014-01-05 07:40:17
Message-ID: BBC72D617E0A4973A13783B09DD5CDB7@maumau
Lists: pgsql-hackers

From: "Noah Misch" <noah(at)leadboat(dot)com>
> I agree that English consistently beats mojibake. I question whether that
> makes up for the loss of translation when encodings do happen to match,
> particularly for non-technical errors like a mistyped password. The
> everything-UTF8 scenario appears often, perhaps explaining infrequent
> complaints about the status quo. If 90% of translated message users have
> client_encoding != server_encoding, then +1 for your patch's strategy. If
> the
> figure is only 60%, I'd vote for holding out for a more-extensive fix that
> allows us to encoding-convert localized authentication failure messages.

I agree with you. It would be friendlier to users if more messages were
localized.

Then, as a happy medium, how about disabling message localization only if
the client encoding differs from the server one? That is, compare the
client_encoding value in the startup packet with the result of
GetPlatformEncoding(). If they don't match, call
disable_message_localization().
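
A rough sketch of that comparison, assuming Python's codec registry is an acceptable stand-in for PostgreSQL's encoding-name table (it is not the same table, and names such as SQL_ASCII are unknown to it). The function names are hypothetical.

```python
import codecs


def same_encoding(a, b):
    """Compare two encoding names after normalizing aliases such as
    'UTF8' and 'utf-8'. Unknown names never match."""
    try:
        return codecs.lookup(a).name == codecs.lookup(b).name
    except LookupError:
        return False


def startup_message_language(client_encoding, message_encoding):
    """Keep localized messages only when the client would be able to
    read them without conversion."""
    if same_encoding(client_encoding, message_encoding):
        return "localized"
    return "english"


matching = startup_message_language("UTF8", "utf-8")
mismatched = startup_message_language("SJIS", "UTF8")
```

Treating unknown names as a mismatch errs toward English, which stays readable in either case.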

Regards
MauMau


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: MauMau <maumau307(at)gmail(dot)com>
Cc: Noah Misch <noah(at)leadboat(dot)com>, alvherre(at)2ndquadrant(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [bug fix] multibyte messages are displayed incorrectly on the client
Date: 2014-01-07 02:55:32
Message-ID: 20140107025532.GA30539@momjian.us
Lists: pgsql-hackers

On Sun, Jan 5, 2014 at 04:40:17PM +0900, MauMau wrote:
> From: "Noah Misch" <noah(at)leadboat(dot)com>
> >I agree that English consistently beats mojibake. I question whether that
> >makes up for the loss of translation when encodings do happen to match,
> >particularly for non-technical errors like a mistyped password. The
> >everything-UTF8 scenario appears often, perhaps explaining infrequent
> >complaints about the status quo. If 90% of translated message users have
> >client_encoding != server_encoding, then +1 for your patch's
> >strategy. If the
> >figure is only 60%, I'd vote for holding out for a more-extensive fix that
> >allows us to encoding-convert localized authentication failure messages.
>
> I agree with you. It would be more friendly to users if more
> messages are localized.
>
> Then, as a happy medium, how about disabling message localization
> only if the client encoding differs from the server one? That is,
> compare the client_encoding value in the startup packet with the
> result of GetPlatformEncoding(). If they don't match, call
> disable_message_localization().

I think the problem is we don't know the client and server encodings
at that time.

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ Everyone has their own god. +


From: "MauMau" <maumau307(at)gmail(dot)com>
To: "Bruce Momjian" <bruce(at)momjian(dot)us>
Cc: "Noah Misch" <noah(at)leadboat(dot)com>, <alvherre(at)2ndquadrant(dot)com>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [bug fix] multibyte messages are displayed incorrectly on the client
Date: 2014-01-07 13:56:28
Message-ID: A641A0C673C94FCFA8F3A12886734274@maumau
Lists: pgsql-hackers

From: "Bruce Momjian" <bruce(at)momjian(dot)us>
> On Sun, Jan 5, 2014 at 04:40:17PM +0900, MauMau wrote:
>> Then, as a happy medium, how about disabling message localization
>> only if the client encoding differs from the server one? That is,
>> compare the client_encoding value in the startup packet with the
>> result of GetPlatformEncoding(). If they don't match, call
>> disable_message_localization().
>
> I think the problem is we don't know the client and server encodings
> at that time.

I suppose we know (or at least believe) those encodings during backend
startup:

* client encoding - the client_encoding parameter passed in the startup
packet, or if that's not present, client_encoding GUC value.

* server encoding - the encoding of strings gettext() returns. That is what
GetPlatformEncoding() returns.

Am I correct?

Regards
MauMau


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: MauMau <maumau307(at)gmail(dot)com>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, Noah Misch <noah(at)leadboat(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [bug fix] multibyte messages are displayed incorrectly on the client
Date: 2014-01-09 12:55:57
Message-ID: CA+Tgmob8en1Dkk1c=REQEq8GVLKP0ts-=Q9siKcC_RpiYSERTA@mail.gmail.com
Lists: pgsql-hackers

On Tue, Jan 7, 2014 at 8:56 AM, MauMau <maumau307(at)gmail(dot)com> wrote:
> I suppose we know (or at least believe) those encodings during backend
> startup:
>
> * client encoding - the client_encoding parameter passed in the startup
> packet, or if that's not present, client_encoding GUC value.
>
> * server encoding - the encoding of strings gettext() returns. That is what
> GetPlatformEncoding() returns.

Suppose the startup packet itself is malformed. How will you report the error?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: "MauMau" <maumau307(at)gmail(dot)com>
To: "Robert Haas" <robertmhaas(at)gmail(dot)com>
Cc: "Bruce Momjian" <bruce(at)momjian(dot)us>, "Noah Misch" <noah(at)leadboat(dot)com>, "Alvaro Herrera" <alvherre(at)2ndquadrant(dot)com>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [bug fix] multibyte messages are displayed incorrectly on the client
Date: 2014-01-09 13:12:43
Message-ID: B1A11613008C463996B5E8FBD9612AA0@maumau
Lists: pgsql-hackers

From: "Robert Haas" <robertmhaas(at)gmail(dot)com>
> Suppose the startup packet itself is malformed. How will you report the
> error?

I think we have no choice but to report the error in English, because we
don't know what the client wants.
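
The order of precedence discussed in this exchange could look like the following sketch. The function name and the convention of passing None for an unparseable packet are assumptions for illustration only.

```python
def believed_client_encoding(startup_params, guc_client_encoding):
    """Return the encoding the server believes the client uses, or None
    when the startup packet could not be parsed at all.

    startup_params: dict of parameters from the startup packet, or None
                    if the packet was malformed.
    guc_client_encoding: the server-side default for client_encoding.
    """
    if startup_params is None:
        # Nothing is known about the client: only English is safe.
        return None
    # A value in the startup packet wins over the server-side default.
    return startup_params.get("client_encoding") or guc_client_encoding


from_packet = believed_client_encoding({"user": "alice", "client_encoding": "SJIS"}, "UTF8")
from_guc = believed_client_encoding({"user": "alice"}, "UTF8")
malformed = believed_client_encoding(None, "UTF8")
```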

Regards
MauMau


From: Noah Misch <noah(at)leadboat(dot)com>
To: MauMau <maumau307(at)gmail(dot)com>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, alvherre(at)2ndquadrant(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [bug fix] multibyte messages are displayed incorrectly on the client
Date: 2014-01-11 00:37:38
Message-ID: 20140111003738.GA1710819@tornado.leadboat.com
Lists: pgsql-hackers

On Tue, Jan 07, 2014 at 10:56:28PM +0900, MauMau wrote:
> From: "Bruce Momjian" <bruce(at)momjian(dot)us>
> >On Sun, Jan 5, 2014 at 04:40:17PM +0900, MauMau wrote:
> >>Then, as a happy medium, how about disabling message localization
> >>only if the client encoding differs from the server one? That is,
> >>compare the client_encoding value in the startup packet with the
> >>result of GetPlatformEncoding(). If they don't match, call
> >>disable_message_localization().

I like this proposal. Thanks.

> >I think the problem is we don't know the client and server encodings
> >at that time.
>
> I suppose we know (or at least believe) those encodings during
> backend startup:
>
> * client encoding - the client_encoding parameter passed in the
> startup packet, or if that's not present, client_encoding GUC value.
>
> * server encoding - the encoding of strings gettext() returns. That
> is what GetPlatformEncoding() returns.

Agreed. You would need to poke into the relevant part of the startup packet
much earlier than we do today, but that's tractable. Note that
GetPlatformEncoding() is gone; use GetMessageEncoding().

--
Noah Misch
EnterpriseDB http://www.enterprisedb.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Noah Misch <noah(at)leadboat(dot)com>
Cc: MauMau <maumau307(at)gmail(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, alvherre(at)2ndquadrant(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Re: [bug fix] multibyte messages are displayed incorrectly on the client
Date: 2014-01-11 01:03:00
Message-ID: 20513.1389402180@sss.pgh.pa.us
Lists: pgsql-hackers

Noah Misch <noah(at)leadboat(dot)com> writes:
>> On Sun, Jan 5, 2014 at 04:40:17PM +0900, MauMau wrote:
>>> Then, as a happy medium, how about disabling message localization
>>> only if the client encoding differs from the server one? That is,
>>> compare the client_encoding value in the startup packet with the
>>> result of GetPlatformEncoding(). If they don't match, call
>>> disable_message_localization().

> I like this proposal. Thanks.
> ...
> Agreed. You would need to poke into the relevant part of the startup packet
> much earlier than we do today, but that's tractable.

There's still the problem of what to do before we have a complete startup
packet, or if the packet is defective enough to not contain a recognizable
client encoding.

Perhaps more to the point, what it sounds like this is doing is creating
a third behavioral state, in between what prevails when we're first
reading the packet and what prevails after we've finally adopted the
requested client encoding. I'm less than convinced that's a good thing.

I'm also rather unexcited by the idea of introducing redundant and/or
ad-hoc code to parse the startup packet. That sounds like a recipe for
bugs, some of which might even rise to security issues, considering it
would happen before client authentication.

I think if we're going to do anything like this at all, it'd be best
just to disable localization from postmaster fork up till we've gotten
a client encoding out of the packet in the normal course of events.

regards, tom lane


From: Noah Misch <noah(at)leadboat(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: MauMau <maumau307(at)gmail(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, alvherre(at)2ndquadrant(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Re: [bug fix] multibyte messages are displayed incorrectly on the client
Date: 2014-01-11 01:49:20
Message-ID: 20140111014920.GB1710819@tornado.leadboat.com
Lists: pgsql-hackers

On Fri, Jan 10, 2014 at 08:03:00PM -0500, Tom Lane wrote:
> Noah Misch <noah(at)leadboat(dot)com> writes:
> >> On Sun, Jan 5, 2014 at 04:40:17PM +0900, MauMau wrote:
> >>> Then, as a happy medium, how about disabling message localization
> >>> only if the client encoding differs from the server one? That is,
> >>> compare the client_encoding value in the startup packet with the
> >>> result of GetPlatformEncoding(). If they don't match, call
> >>> disable_message_localization().
>
> > I like this proposal. Thanks.
> > ...
> > Agreed. You would need to poke into the relevant part of the startup packet
> > much earlier than we do today, but that's tractable.
>
> There's still the problem of what to do before we have a complete startup
> packet, or if the packet is defective enough to not contain a recognizable
> client encoding.

MauMau proposed using untranslated messages until we're past that point. I
like that answer fine, because routine mistakes from ordinary users will not
elicit the errors in question. The most interesting message in that group
might be 'invalid value for parameter "client_encoding"', and I think the
presence of the term "client_encoding" will be a sufficient clue regardless of
how we translate and encode the surrounding words.

> Perhaps more to the point, what it sounds like this is doing is creating
> a third behavioral state, in between what prevails when we're first
> reading the packet and what prevails after we've finally adopted the
> requested client encoding. I'm less than convinced that's a good thing.
>
> I'm also rather unexcited by the idea of introducing redundant and/or
> ad-hoc code to parse the startup packet. That sounds like a recipe for
> bugs, some of which might even rise to security issues, considering it
> would happen before client authentication.

Valid worries.

> I think if we're going to do anything like this at all, it'd be best
> just to disable localization from postmaster fork up till we've gotten
> a client encoding out of the packet in the normal course of events.

That was MauMau's original proposal. I opined upthread that it would be
better to change nothing than to do that.

nm

--
Noah Misch
EnterpriseDB http://www.enterprisedb.com


From: "MauMau" <maumau307(at)gmail(dot)com>
To: "MauMau" <maumau307(at)gmail(dot)com>, "Noah Misch" <noah(at)leadboat(dot)com>
Cc: <alvherre(at)2ndquadrant(dot)com>, "Bruce Momjian" <bruce(at)momjian(dot)us>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [bug fix] multibyte messages are displayed incorrectly on the client
Date: 2014-01-20 10:47:24
Message-ID: E90903BBCE674ABAAEF9862DA12BA6DE@maumau
Lists: pgsql-hackers

From: "MauMau" <maumau307(at)gmail(dot)com>
> From: "Noah Misch" <noah(at)leadboat(dot)com>
>> I agree that English consistently beats mojibake. I question whether
>> that
>> makes up for the loss of translation when encodings do happen to match,
>> particularly for non-technical errors like a mistyped password. The
>> everything-UTF8 scenario appears often, perhaps explaining infrequent
>> complaints about the status quo. If 90% of translated message users have
>> client_encoding != server_encoding, then +1 for your patch's strategy.
>> If the
>> figure is only 60%, I'd vote for holding out for a more-extensive fix
>> that
>> allows us to encoding-convert localized authentication failure messages.
>
> I agree with you. It would be more friendly to users if more messages are
> localized.
>
> Then, as a happy medium, how about disabling message localization only if
> the client encoding differs from the server one? That is, compare the
> client_encoding value in the startup packet with the result of
> GetPlatformEncoding(). If they don't match, call
> disable_message_localization().

I did this with the attached patch. I added some code in
BackendInitialize(). I'll update the CommitFest entry in a few days.

Regards
MauMau

Attachment: no_localize_message_in_startup_v2.patch (application/octet-stream, 6.2 KB)

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "MauMau" <maumau307(at)gmail(dot)com>
Cc: "Noah Misch" <noah(at)leadboat(dot)com>, alvherre(at)2ndquadrant(dot)com, "Bruce Momjian" <bruce(at)momjian(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [bug fix] multibyte messages are displayed incorrectly on the client
Date: 2014-04-05 04:56:00
Message-ID: 22855.1396673760@sss.pgh.pa.us
Lists: pgsql-hackers

"MauMau" <maumau307(at)gmail(dot)com> writes:
>> Then, as a happy medium, how about disabling message localization only if
>> the client encoding differs from the server one? That is, compare the
>> client_encoding value in the startup packet with the result of
>> GetPlatformEncoding(). If they don't match, call
>> disable_message_localization().

AFAICT this is not what was agreed to in this thread. It puts far too
much credence in the server-side default for client_encoding, which up to
now has never been thought to be very interesting; indeed I doubt most
people bother to set it at all. The reason that this issue is even on
the table is that that default is too likely to be wrong, no?

Also, whatever possessed you to use pg_get_encoding_from_locale to
identify the server's encoding? That's expensive and seems fairly
unlikely to yield the right answer. I don't remember offhand where we
keep the postmaster's idea of what encoding messages should be in, but I'm
fairly sure it's stored explicitly somewhere. Or if it isn't, we can for
sure do better than recalculating it during every connection attempt.

Having said all that, though, I'm unconvinced that this cure isn't worse
than the disease. Somebody claimed upthread that no very interesting
messages would be delocalized by a change like this, but that's complete
nonsense: in particular, *every* message associated with client
authentication will be sent in English if we go down this path. Given
the nearly complete lack of complaints in the many years that this code
has worked like this, I'm betting that most people will find a change
like this to be a net reduction in friendliness.

Given the changes here to extract client_encoding from the startup packet
ASAP, I wonder whether the right thing isn't just to set the client
encoding immediately when we do that. Most application libraries pass
client encoding in the startup packet anyway (libpq certainly does).
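
For reference, a protocol 3.0 startup packet is a length word, a protocol version word, and a run of NUL-terminated name/value pairs closed by one more NUL. The sketch below builds and parses such a packet; the function names are hypothetical, and decoding the strings as ASCII is a simplification, since the encoding of user and database names is exactly what is not known at this stage.

```python
import struct

PROTOCOL_V3 = 196608  # major version 3, minor version 0 (3 << 16)


def build_startup_packet(params):
    """Serialize parameters the way a client would send them."""
    body = struct.pack("!I", PROTOCOL_V3)
    for name, value in params.items():
        body += name.encode("ascii") + b"\x00" + value.encode("ascii") + b"\x00"
    body += b"\x00"  # terminator after the last pair
    # The length word counts itself.
    return struct.pack("!I", len(body) + 4) + body


def parse_startup_params(packet):
    """Return the parameter dict, or raise ValueError if malformed."""
    if len(packet) < 9:
        raise ValueError("startup packet too short")
    length, protocol = struct.unpack("!II", packet[:8])
    if length != len(packet) or protocol != PROTOCOL_V3:
        raise ValueError("unexpected length or protocol version")
    if packet[-1:] != b"\x00":
        raise ValueError("missing terminator")
    data = packet[8:-1]
    if data and not data.endswith(b"\x00"):
        raise ValueError("unterminated string")
    fields = data.split(b"\x00")[:-1]
    if len(fields) % 2:
        raise ValueError("parameter name without a value")
    return {
        fields[i].decode("ascii"): fields[i + 1].decode("ascii")
        for i in range(0, len(fields), 2)
    }


packet = build_startup_packet({"user": "alice", "client_encoding": "SJIS"})
params = parse_startup_params(packet)
```

Once the pairs are parsed in the normal course of events, client_encoding is available without any second, ad hoc pass over the packet.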

regards, tom lane


From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, MauMau <maumau307(at)gmail(dot)com>
Cc: Noah Misch <noah(at)leadboat(dot)com>, <alvherre(at)2ndquadrant(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [bug fix] multibyte messages are displayed incorrectly on the client
Date: 2014-06-23 13:57:22
Message-ID: 53A83242.9010503@vmware.com
Lists: pgsql-hackers

On 04/05/2014 07:56 AM, Tom Lane wrote:
> "MauMau" <maumau307(at)gmail(dot)com> writes:
>>> Then, as a happy medium, how about disabling message localization only if
>>> the client encoding differs from the server one? That is, compare the
>>> client_encoding value in the startup packet with the result of
>>> GetPlatformEncoding(). If they don't match, call
>>> disable_message_localization().
>
> AFAICT this is not what was agreed to in this thread. It puts far too
> much credence in the server-side default for client_encoding, which up to
> now has never been thought to be very interesting; indeed I doubt most
> people bother to set it at all. The reason that this issue is even on
> the table is that that default is too likely to be wrong, no?
>
> Also, whatever possessed you to use pg_get_encoding_from_locale to
> identify the server's encoding? That's expensive and seems fairly
> unlikely to yield the right answer. I don't remember offhand where we
> keep the postmaster's idea of what encoding messages should be in, but I'm
> fairly sure it's stored explicitly somewhere. Or if it isn't, we can for
> sure do better than recalculating it during every connection attempt.
>
> Having said all that, though, I'm unconvinced that this cure isn't worse
> than the disease. Somebody claimed upthread that no very interesting
> messages would be delocalized by a change like this, but that's complete
> nonsense: in particular, *every* message associated with client
> authentication will be sent in English if we go down this path. Given
> the nearly complete lack of complaints in the many years that this code
> has worked like this, I'm betting that most people will find a change
> like this to be a net reduction in friendliness.
>
> Given the changes here to extract client_encoding from the startup packet
> ASAP, I wonder whether the right thing isn't just to set the client
> encoding immediately when we do that. Most application libraries pass
> client encoding in the startup packet anyway (libpq certainly does).

Based on Tom's comments above, I'm marking this as returned with
feedback in the commitfest. I agree that setting client_encoding as
early as possible seems like the right thing to do.
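
To make "as early as possible" a bit more concrete, here is a minimal
sketch of pulling client_encoding out of the startup packet's parameter
area, which (after the 4-byte protocol version) is a sequence of
NUL-terminated name/value strings ended by an empty name. The function
name find_startup_option and its signature are made up for illustration;
this is not the backend's actual code, just the shape of the lookup under
those assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not a real PostgreSQL function): scan the
 * name/value area of a startup packet for "key" and return a pointer
 * to its value, or NULL if the key is absent or the data is malformed.
 *
 * buf/len describe only the parameter area, i.e. the bytes that follow
 * the protocol version. Pairs are NUL-terminated strings; an empty name
 * marks the end of the list.
 */
static const char *
find_startup_option(const char *buf, size_t len, const char *key)
{
    size_t pos = 0;

    assert(buf != NULL && key != NULL);

    while (pos < len && buf[pos] != '\0')
    {
        const char *name = buf + pos;
        const char *name_end = memchr(name, '\0', len - pos);
        size_t      val_off;
        const char *value;
        const char *val_end;

        if (name_end == NULL)
            return NULL;        /* unterminated name */

        val_off = (size_t) (name_end - buf) + 1;
        if (val_off >= len)
            return NULL;        /* name without a value */

        value = buf + val_off;
        val_end = memchr(value, '\0', len - val_off);
        if (val_end == NULL)
            return NULL;        /* unterminated value */

        if (strcmp(name, key) == 0)
            return value;

        /* Skip past this pair and continue with the next one. */
        pos = (size_t) (val_end - buf) + 1;
    }

    return NULL;                /* reached the terminator: not found */
}
```

With the value in hand before authentication starts, the backend could in
principle pick the conversion right there, provided the conversion lookup
itself does not need the catalogs.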

Earlier in this thread, MauMau pointed out that we can't do encoding
conversions until we have connected to the database because you need to
read pg_conversion for that. That's because we support creating custom
conversions with CREATE CONVERSION. Frankly, I don't think anyone cares
about that feature. If we just dropped the CREATE/DROP CONVERSION
feature altogether and hard-coded the conversions we have, there would
be close to zero complaints. Even if you want to extend something around
encodings and conversions, the CREATE CONVERSION interface is clunky.
Firstly, conversions are per-database, and even schema-qualified, which
just seems like an extra complication. You'll most likely want to modify
the conversion across the whole system. Secondly, rather than define a
new conversion between encodings, you'll likely want to define a whole
new encoding with conversions to/from existing encodings, but you can't
do that anyway without hacking the source code.
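
For illustration only, a hard-coded lookup could be as simple as a static
table searched by encoding name, with no catalog access at all. Everything
in this sketch (conv_fn, conv_entry, builtin_conversions,
find_builtin_conversion, and the toy latin1_to_utf8 routine) is a
hypothetical name invented for the example, not the existing conversion
infrastructure.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical conversion signature: returns bytes written, or -1 if
 * the destination buffer is too small. */
typedef int (*conv_fn) (const unsigned char *src, size_t srclen,
                        unsigned char *dst, size_t dstlen);

/* Toy conversion: every LATIN1 byte maps to one or two UTF-8 bytes. */
static int
latin1_to_utf8(const unsigned char *src, size_t srclen,
               unsigned char *dst, size_t dstlen)
{
    size_t out = 0;

    for (size_t i = 0; i < srclen; i++)
    {
        unsigned char c = src[i];

        if (c < 0x80)
        {
            if (out + 1 > dstlen)
                return -1;
            dst[out++] = c;
        }
        else
        {
            if (out + 2 > dstlen)
                return -1;
            dst[out++] = (unsigned char) (0xC0 | (c >> 6));
            dst[out++] = (unsigned char) (0x80 | (c & 0x3F));
        }
    }
    return (int) out;
}

/* Hypothetical compiled-in table replacing a pg_conversion lookup. */
typedef struct
{
    const char *from;
    const char *to;
    conv_fn     fn;
} conv_entry;

static const conv_entry builtin_conversions[] = {
    {"LATIN1", "UTF8", latin1_to_utf8},
};

/* Linear search; usable before any database connection exists. */
static conv_fn
find_builtin_conversion(const char *from, const char *to)
{
    size_t n = sizeof(builtin_conversions) / sizeof(builtin_conversions[0]);

    assert(from != NULL && to != NULL);

    for (size_t i = 0; i < n; i++)
    {
        if (strcmp(builtin_conversions[i].from, from) == 0 &&
            strcmp(builtin_conversions[i].to, to) == 0)
            return builtin_conversions[i].fn;
    }
    return NULL;
}
```

The point of the sketch is only that such a table is available during
session startup, which is exactly when the catalog-based lookup is not.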

- Heikki


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: MauMau <maumau307(at)gmail(dot)com>, Noah Misch <noah(at)leadboat(dot)com>, alvherre(at)2ndquadrant(dot)com, Bruce Momjian <bruce(at)momjian(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Re: [bug fix] multibyte messages are displayed incorrectly on the client
Date: 2014-06-23 16:08:44
Message-ID: 16933.1403539724@sss.pgh.pa.us
Lists: pgsql-hackers

Heikki Linnakangas <hlinnakangas(at)vmware(dot)com> writes:
> Earlier in this thread, MauMau pointed out that we can't do encoding
> conversions until we have connected to the database because you need to
> read pg_conversion for that. That's because we support creating custom
> conversions with CREATE CONVERSION. Frankly, I don't think anyone cares
> about that feature. If we just dropped the CREATE/DROP CONVERSION
> feature altogether and hard-coded the conversions we have, there would
> be close to zero complaints. Even if you want to extend something around
> encodings and conversions, the CREATE CONVERSION interface is clunky.
> Firstly, conversions are per-database, and even schema-qualified, which
> just seems like an extra complication. You'll most likely want to modify
> the conversion across the whole system. Secondly, rather than define a
> new conversion between encodings, you'll likely want to define a whole
> new encoding with conversions to/from existing encodings, but you can't
> do that anyway without hacking the source code.

There's certainly something to be said for that position. If there were
any prospect of extensions defining new encodings someday, I'd argue for
keeping CREATE CONVERSION. But the performance headaches would be
substantial, and there aren't new encodings coming down the pike often
enough to justify the work involved, so I don't see us ever doing CREATE
ENCODING; and that means that CREATE CONVERSION is of little value.

I'd kind of like to see this go just because having catalog accesses
involved in encoding conversion setup is messy and fragile.

regards, tom lane