Re: [bug fix] strerror() returns ??? in a UTF-8/C database with LC_MESSAGES=non-ASCII

From: Noah Misch <noah(at)leadboat(dot)com>
To: Peter Eisentraut <peter_e(at)gmx(dot)net>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, MauMau <maumau307(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [bug fix] strerror() returns ??? in a UTF-8/C database with LC_MESSAGES=non-ASCII
Date: 2013-09-09 18:57:28
Message-ID: 20130909185728.GA217886@tornado.leadboat.com
Lists: pgsql-hackers

On Mon, Sep 09, 2013 at 08:29:58AM -0400, Peter Eisentraut wrote:
> On 9/6/13 10:37 AM, Tom Lane wrote:
> > BTW: personally, I would say that what you're looking at is a glibc bug.
> > I always thought the contract of gettext was to return the ASCII version
> > if it fails to produce a translated version. That might not be what the
> > end user really wants to see, but surely returning something like "???"
> > is completely useless to anybody.
>
> The question marks come from iconv. Take a look at what this prints:
>
> iconv po/ja.po -f utf-8 -t us-ascii//translit
>
> If you use GNU libiconv, this will print a bunch of question marks.

Actually, GNU libiconv's iconv() decides that //translit is unimplementable
for some of the characters in that file, and it fails the conversion. GNU
libc's iconv(), on the other hand, emits the question marks.
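
To see which behavior a given iconv() has, something like the following
rough sketch can help. The function name translit_to_ascii is made up for
illustration; it only assumes the standard iconv_open()/iconv()/iconv_close()
calls and an implementation that accepts the "//TRANSLIT" suffix. An
implementation that gives up on a character reports an error here, while one
that substitutes a placeholder returns a string containing it.

```c
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: transliterate a UTF-8 string to US-ASCII.
 * Returns the number of bytes written to "out" (which is NUL-terminated),
 * or -1 if the converter could not be opened or the conversion failed.
 */
long
translit_to_ascii(const char *in, char *out, size_t outsize)
{
    iconv_t     cd;
    char       *inp = (char *) in;  /* iconv() takes a non-const pointer */
    size_t      inleft = strlen(in);
    char       *outp = out;
    size_t      outleft;
    size_t      rc;

    if (outsize == 0)
        return -1;
    outleft = outsize - 1;          /* reserve room for the terminator */

    cd = iconv_open("US-ASCII//TRANSLIT", "UTF-8");
    if (cd == (iconv_t) -1)
        return -1;

    rc = iconv(cd, &inp, &inleft, &outp, &outleft);
    iconv_close(cd);

    /* A failing implementation lands here; errno says why. */
    if (rc == (size_t) -1)
        return -1;

    *outp = '\0';
    return (long) (outp - out);
}
```

Feeding it one of the msgstr lines from ja.po and checking for a -1 return
versus a run of question marks should distinguish the two cases.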

> I think the use of //translit by gettext is poor judgement, because my
> experiments show that the quality of the results is poor and not useful
> for a user interface.

It depends on the quality of the //translit implementation. GNU libiconv's
seems pretty good. It gives up for Japanese or Russian characters, so you get
the English messages. For Polish, GNU libiconv transliterates like this:

msgstr "nie można usunąć pliku lub katalogu \"%s\": %s\n"
msgstr "nie mozna usuna'c pliku lub katalogu \"%s\": %s\n"

That's fair, considering what it has to work with. Ideally, (a) GNU libc
should import the smarter transliteration code from GNU libiconv, and (b) GNU
gettext should check for weak //translit implementations and not use
//translit under such circumstances.

> My suggestion in this matter is to disable gettext processing when
> LC_CTYPE is set to C. We could log a warning when LC_MESSAGES is set to
> something and LC_CTYPE is set to C. Or just do the warning and keep
> logging. Something like that.

In an ENCODING=UTF8, LC_CTYPE=C database, no transliteration should need to
happen, and no transliteration does happen for the PG messages. I think
MauMau's original bind_textdomain_codeset() proposal was on the right track.
We would need to do that for every relevant 3rd-party message domain, though.
Ick. This suggests to me that gettext really needs an API for overriding the
default codeset pertaining to message domains not subjected to
bind_textdomain_codeset(). In the meantime, adding bind_textdomain_codeset()
calls for known localized dependencies seems like a fine coping mechanism.
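
As a minimal sketch of that coping mechanism, assuming glibc (whose own
message catalog is, as far as I know, registered under the domain "libc"):
the list of domains and the function name bind_dependency_codesets below are
hypothetical, and a real patch would have to decide which domains belong in
the list and which codeset name matches the database encoding.

```c
#include <libintl.h>
#include <stddef.h>

/*
 * Hypothetical list of message domains belonging to libraries we depend on.
 * "libc" is assumed to be the domain glibc uses for strerror() messages.
 */
static const char *const dependency_domains[] = {
    "libc"
};

/*
 * Ask gettext to deliver messages of each known dependency domain in the
 * given codeset (e.g. "UTF-8"), so that no transliteration to the LC_CTYPE
 * codeset is attempted.  Returns the number of domains bound successfully.
 */
int
bind_dependency_codesets(const char *codeset)
{
    size_t      n = sizeof dependency_domains / sizeof dependency_domains[0];
    int         bound = 0;

    for (size_t i = 0; i < n; i++)
    {
        /* bind_textdomain_codeset() returns NULL on failure */
        if (bind_textdomain_codeset(dependency_domains[i], codeset) != NULL)
            bound++;
    }
    return bound;
}
```

This would be called once, next to wherever the codeset for our own message
domain is set, with the codeset name corresponding to the database encoding.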

If we can reasonably detect when gettext is supplying useless ????? messages,
that's good, too.
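
One possible heuristic, offered only as a sketch: treat a message as useless
when it contains several question marks and no ASCII letters at all. The
function name and the threshold of three are assumptions of mine, not
anything gettext or glibc provides, and the test would misfire on a
legitimate message made up solely of punctuation.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical check: does "msg" look like the output of a transliteration
 * that replaced every non-ASCII character with '?'.  The threshold of three
 * question marks is an arbitrary assumption.
 */
bool
looks_like_failed_translit(const char *msg)
{
    size_t      qmarks = 0;
    size_t      letters = 0;

    for (const unsigned char *p = (const unsigned char *) msg; *p; p++)
    {
        if (*p == '?')
            qmarks++;
        else if ((*p >= 'A' && *p <= 'Z') || (*p >= 'a' && *p <= 'z'))
            letters++;
    }

    /* Several placeholders and not a single readable letter */
    return qmarks >= 3 && letters == 0;
}
```

A caller could fall back to the untranslated English message whenever this
returns true for the translated string.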

Thanks,
nm

--
Noah Misch
EnterpriseDB http://www.enterprisedb.com