More message encoding woes

From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: More message encoding woes
Date: 2009-03-30 12:52:37
Message-ID: 49D0C095.8000304@enterprisedb.com

latin1db=# SELECT version();
                                      version
-----------------------------------------------------------------------------------
 PostgreSQL 8.3.7 on i686-pc-linux-gnu, compiled by GCC gcc (Debian 4.3.3-5) 4.3.3
(1 row)

latin1db=# SELECT name, setting FROM pg_settings where name like 'lc%'
OR name like '%encoding';
      name       | setting
-----------------+---------
 client_encoding | utf8
 lc_collate      | C
 lc_ctype        | C
 lc_messages     | es_ES
 lc_monetary     | C
 lc_numeric      | C
 lc_time         | C
 server_encoding | LATIN1
(8 rows)

latin1db=# SELECT * FROM foo;
ERROR: no existe la relación «foo»

The accented characters are garbled. When I try the same with a database
that's in UTF8 in the same cluster, it works:

utf8db=# SELECT name, setting FROM pg_settings where name like 'lc%' OR
name like '%encoding';
      name       | setting
-----------------+---------
 client_encoding | UTF8
 lc_collate      | C
 lc_ctype        | C
 lc_messages     | es_ES
 lc_monetary     | C
 lc_numeric      | C
 lc_time         | C
 server_encoding | UTF8
(8 rows)

utf8db=# SELECT * FROM foo;
ERROR: no existe la relación «foo»

What is happening is that gettext() returns the message in the encoding
determined by LC_CTYPE, while we expect it to return it in the database
encoding. Starting with PG 8.3 we enforce that the encoding specified in
LC_CTYPE matches the database encoding, but not for the C locale.

In CVS HEAD, we call bind_textdomain_codeset() in SetDatabaseEncoding()
which fixes that, but we only do it on Windows. In earlier versions we
called it on all platforms, but only for UTF-8. It seems that we should
call bind_textdomain_codeset on all platforms and all encodings.
However, there seems to be a reason why we only do it for Windows on CVS
HEAD: we need a mapping from our encoding ID to the OS codeset name, and
the OS codeset names vary.
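
For concreteness, the call we'd like to be able to make on every platform is
roughly this (just a sketch; GetDatabaseEncodingName() standing in for
whatever gives us our canonical spelling, e.g. "LATIN1"):

/* Bind the backend's message domain to the database encoding.  Note that
 * gettext gives no error if it doesn't recognize the codeset name we pass. */
bind_textdomain_codeset("postgres", GetDatabaseEncodingName());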

How can we make this more robust?

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: More message encoding woes
Date: 2009-03-30 13:32:08
Message-ID: 2233.1238419928@sss.pgh.pa.us

Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com> writes:
> In CVS HEAD, we call bind_textdomain_codeset() in SetDatabaseEncoding()
> which fixes that, but we only do it on Windows. In earlier versions we
> called it on all platforms, but only for UTF-8. It seems that we should
> call bind_textdomain_codeset on all platforms and all encodings.

Yes, this problem has been recognized for some time.

> However, there seems to be a reason why we only do it for Windows on CVS
> HEAD: we need a mapping from our encoding ID to the OS codeset name, and
> the OS codeset names vary.

> How can we make this more robust?

One possibility is to assume that the output of nl_langinfo(CODESET)
will be recognized by bind_textdomain_codeset(). Whether that actually
works can only be determined by experiment.
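
The API involved would be just this (a sketch; whether the name reported for
the active locale is the one we want for the *database* encoding is part of
the question):

#include <langinfo.h>

/* nl_langinfo(CODESET) returns the C library's own name for the codeset of
 * the current LC_CTYPE locale, e.g. "UTF-8", or typically "ANSI_X3.4-1968"
 * for the C locale on glibc. */
bind_textdomain_codeset("postgres", nl_langinfo(CODESET));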

Another idea is to try the values listed in our encoding_match_list[]
until bind_textdomain_codeset succeeds. The problem here is that the
GNU documentation is *exceedingly* vague about whether
bind_textdomain_codeset behaves sanely (ie throws a recognizable error)
when given a bad encoding name. (I guess we could look at the source
code.)

regards, tom lane


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: More message encoding woes
Date: 2009-03-30 16:25:36
Message-ID: 49D0F280.2000105@enterprisedb.com

Tom Lane wrote:
> Another idea is to try the values listed in our encoding_match_list[]
> until bind_textdomain_codeset succeeds. The problem here is that the
> GNU documentation is *exceedingly* vague about whether
> bind_textdomain_codeset behaves sanely (ie throws a recognizable error)
> when given a bad encoding name. (I guess we could look at the source
> code.)

Unfortunately it doesn't give any error. The value passed to it is just
stored, and isn't used until gettext(). Quick testing shows that if you
give an invalid encoding name, gettext will simply refrain from
translating anything and revert to English.

We could exploit that to determine if the codeset name we gave
bind_textdomain_codeset was valid: pick a string that is known to be
translated in all translations, like "syntax error", and see if
gettext("syntax error") returns the original string. Something along the
lines of:

const char *teststring = "syntax error";
const struct encoding_match *m = encoding_match_list;

while (m->system_enc_name)
{
    if (m->pg_enc_code != GetDatabaseEncoding())
    {
        m++;
        continue;
    }
    bind_textdomain_codeset("postgres", m->system_enc_name);
    if (gettext(teststring) != teststring)
        break;          /* found a codeset name gettext accepts */
    m++;
}

This feels rather hacky, but if we only do that with the combination of
LC_CTYPE=C and LC_MESSAGES=other than C that we have a problem with, I
think it would be ok. The current behavior is highly unlikely to give
correct results, so I don't think we can do much worse than that.

Another possibility is to just refrain from translating anything if
LC_CTYPE=C. If the above loop fails to find anything that works, that's
what we should fall back to IMHO.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: More message encoding woes
Date: 2009-03-30 16:49:01
Message-ID: 17997.1238431741@sss.pgh.pa.us

Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com> writes:
> Tom Lane wrote:
>> Another idea is to try the values listed in our encoding_match_list[]
>> until bind_textdomain_codeset succeeds. The problem here is that the
>> GNU documentation is *exceedingly* vague about whether
>> bind_textdomain_codeset behaves sanely (ie throws a recognizable error)
>> when given a bad encoding name. (I guess we could look at the source
>> code.)

> Unfortunately it doesn't give any error.

(Man, why are the APIs in this problem space so universally awful?)

Where does it get the default codeset from? Maybe we could constrain
that to match the database encoding, the way we do for LC_COLLATE/CTYPE?

regards, tom lane


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: More message encoding woes
Date: 2009-03-30 17:06:48
Message-ID: 49D0FC28.4040404@enterprisedb.com

Tom Lane wrote:
> Where does it get the default codeset from? Maybe we could constrain
> that to match the database encoding, the way we do for LC_COLLATE/CTYPE?

LC_CTYPE. In 8.3 and up where we constrain that to match the database
encoding, we only have a problem with the C locale.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: More message encoding woes
Date: 2009-03-30 17:44:44
Message-ID: 20389.1238435084@sss.pgh.pa.us

Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com> writes:
> Tom Lane wrote:
>> Where does it get the default codeset from? Maybe we could constrain
>> that to match the database encoding, the way we do for LC_COLLATE/CTYPE?

> LC_CTYPE. In 8.3 and up where we constrain that to match the database
> encoding, we only have a problem with the C locale.

... and even if we wanted to fiddle with it, that just moves the problem
over to finding an LC_CTYPE value that matches the database encoding
:-(.

Yup, it's a mess. We'd have done this long ago if it were easy.

Could we get away with just unconditionally calling
bind_textdomain_codeset with *our* canonical spelling of the encoding
name? If it works, great, and if it doesn't, you get English.

regards, tom lane


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: More message encoding woes
Date: 2009-03-30 17:47:58
Message-ID: 49D105CE.5010705@enterprisedb.com

Tom Lane wrote:
> Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com> writes:
>> Tom Lane wrote:
>>> Where does it get the default codeset from? Maybe we could constrain
>>> that to match the database encoding, the way we do for LC_COLLATE/CTYPE?
>
>> LC_CTYPE. In 8.3 and up where we constrain that to match the database
>> encoding, we only have a problem with the C locale.
>
> ... and even if we wanted to fiddle with it, that just moves the problem
> over to finding an LC_CTYPE value that matches the database encoding
> :-(.
>
> Yup, it's a mess. We'd have done this long ago if it were easy.
>
> Could we get away with just unconditionally calling
> bind_textdomain_codeset with *our* canonical spelling of the encoding
> name? If it works, great, and if it doesn't, you get English.

Yeah, that's better than nothing.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: More message encoding woes
Date: 2009-03-30 18:04:00
Message-ID: 20690.1238436240@sss.pgh.pa.us

Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com> writes:
> Tom Lane wrote:
>> Could we get away with just unconditionally calling
>> bind_textdomain_codeset with *our* canonical spelling of the encoding
>> name? If it works, great, and if it doesn't, you get English.

> Yeah, that's better than nothing.

A quick look at the output of "iconv --list" on Fedora 10 and OSX 10.5.6
says that it would not work quite well enough. The encoding names are
similar but not identical --- in particular I notice a lot of
discrepancies about dash versus underscore vs no separator at all.

What we need is an API equivalent to "iconv --list", but I'm not seeing
one :-(. Do we need to go so far as to try to run that program?
Its output format is poorly standardized, among other problems ...

regards, tom lane


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: More message encoding woes
Date: 2009-03-30 18:36:52
Message-ID: 49D11144.4020507@enterprisedb.com

Tom Lane wrote:
> What we need is an API equivalent to "iconv --list", but I'm not seeing
> one :-(.

There's also "locale -m". Looking at the implementation of that, it just
lists what's in /usr/share/i18n/charmaps. Not too portable either.

> Do we need to go so far as to try to run that program?
> Its output format is poorly standardized, among other problems ...

And doing that at every backend startup is too slow.

I would be happy to just revert to English if the OS doesn't recognize
the name we use for the encoding. What sucks most about that is that the
user has no way to specify the right encoding name even if he knows it.
I don't think we want to introduce a new GUC for that.

One idea is to extract the encoding from LC_MESSAGES. Then call
pg_get_encoding_from_locale() on that and check that it matches
server_encoding. If it does, great, pass it to
bind_textdomain_codeset(). If it doesn't, throw an error.
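
A sketch of what I mean, hand-waving the error details
(pg_get_encoding_from_locale() being the existing helper in chklocale.c):

char       *lc_msgs = setlocale(LC_MESSAGES, NULL);   /* e.g. "fi_FI.iso8859-1" */
char       *dot = strchr(lc_msgs, '.');

if (dot != NULL &&
    pg_get_encoding_from_locale(lc_msgs) == GetDatabaseEncoding())
    bind_textdomain_codeset("postgres", dot + 1);     /* "iso8859-1" */
else
    ereport(ERROR,
            (errmsg("encoding of lc_messages does not match the database encoding")));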

It stretches the conventional meaning of LC_MESSAGES/LC_CTYPE a bit, since
LC_CTYPE usually specifies the codeset to use, but I think it's quite
intuitive.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: More message encoding woes
Date: 2009-03-30 20:26:22
Message-ID: 1238444782.1329.100.camel@localhost


Tom Lane píše v po 30. 03. 2009 v 14:04 -0400:
> Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com> writes:
> > Tom Lane wrote:
> >> Could we get away with just unconditionally calling
> >> bind_textdomain_codeset with *our* canonical spelling of the encoding
> >> name? If it works, great, and if it doesn't, you get English.
>
> > Yeah, that's better than nothing.
>
> A quick look at the output of "iconv --list" on Fedora 10 and OSX 10.5.6
> says that it would not work quite well enough. The encoding names are
> similar but not identical --- in particular I notice a lot of
> discrepancies about dash versus underscore vs no separator at all.

The same problem exists with collations when you try to restore a database
on a different OS. :(

Zdenek


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: More message encoding woes
Date: 2009-03-31 19:18:42
Message-ID: 49D26C92.3080304@enterprisedb.com

Heikki Linnakangas wrote:
> One idea is to extract the encoding from LC_MESSAGES. Then call
> pg_get_encoding_from_locale() on that and check that it matches
> server_encoding. If it does, great, pass it to
> bind_textdomain_codeset(). If it doesn't, throw an error.

I tried to implement this but it gets complicated. First of all, we can
only throw an error when lc_messages is set interactively. If it's set
in postgresql.conf, it might be valid for some databases but not for
others with a different encoding. And that makes a per-user lc_messages
setting quite hard too.

Another complication is what to do if e.g. plpgsql or a 3rd-party module
has called pg_bindtextdomain while lc_messages=C and we don't yet know
the system name for the database encoding, and you later set
lc_messages='fi_FI.iso8859-1' in a latin1 database. In order to
retroactively set the codeset, we'd have to remember all the calls to
pg_bindtextdomain. Not impossible, for sure, but more work.

I'm leaning towards the idea of trying out all the spellings of the
database encoding we have in encoding_match_list. That gives the best
user experience, as it just works, and it doesn't seem that complicated.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: More message encoding woes
Date: 2009-03-31 19:39:08
Message-ID: 4989.1238528348@sss.pgh.pa.us

Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com> writes:
> I'm leaning towards the idea of trying out all the spellings of the
> database encoding we have in encoding_match_list. That gives the best
> user experience, as it just works, and it doesn't seem that complicated.

How were you going to check --- use that idea of translating a string
that's known to have a translation? OK, but you'd better document
somewhere where translators will read it "you must translate this string
first of all". Maybe use a special string "Translate Me First" that
doesn't actually need to be end-user-visible, just so no one sweats over
getting it right in context. (I can see "syntax error" being
problematic in some translations, since translators will know it is
always just a fragment of a larger message ...)

regards, tom lane


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: More message encoding woes
Date: 2009-03-31 19:50:10
Message-ID: 49D273F2.8040000@enterprisedb.com

Tom Lane wrote:
> Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com> writes:
>> I'm leaning towards the idea of trying out all the spellings of the
>> database encoding we have in encoding_match_list. That gives the best
>> user experience, as it just works, and it doesn't seem that complicated.
>
> How were you going to check --- use that idea of translating a string
> that's known to have a translation? OK, but you'd better document
> somewhere where translators will read it "you must translate this string
> first of all". Maybe use a special string "Translate Me First" that
> doesn't actually need to be end-user-visible, just so no one sweats over
> getting it right in context.

Yep, something like that. There seems to be a magic empty string
translation at the beginning of every po file that returns the
meta-information about the translation, like translation author and
date. Assuming that works reliably, I'll use that.
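
Roughly, the probe would then look like this (just a sketch; candidate_name
stands for one of the spellings we try from encoding_match_list):

bind_textdomain_codeset("postgres", candidate_name);

/* gettext("") returns the catalog's header entry (charset, last translator,
 * etc.) when a usable translation is loaded; the hope is that it returns ""
 * itself when the codeset name was not accepted. */
found = (strcmp(gettext(""), "") != 0);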

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: More message encoding woes
Date: 2009-03-31 19:59:24
Message-ID: 5388.1238529564@sss.pgh.pa.us

Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com> writes:
> Tom Lane wrote:
>> Maybe use a special string "Translate Me First" that
>> doesn't actually need to be end-user-visible, just so no one sweats over
>> getting it right in context.

> Yep, something like that. There seems to be a magic empty string
> translation at the beginning of every po file that returns the
> meta-information about the translation, like translation author and
> date. Assuming that works reliably, I'll use that.

At first that sounded like an ideal answer, but I can see a gotcha:
suppose the translation's author's name contains some characters that
don't convert to the database encoding. I suppose that would result in
failure, when we'd prefer it not to. A single-purpose string could be
documented as "whatever you translate this to should be pure ASCII,
never mind if it's sensible".

regards, tom lane


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: More message encoding woes
Date: 2009-03-31 20:06:28
Message-ID: 20090331200628.GC23023@alvh.no-ip.org

Tom Lane wrote:

> At first that sounded like an ideal answer, but I can see a gotcha:
> suppose the translation's author's name contains some characters that
> don't convert to the database encoding. I suppose that would result in
> failure, when we'd prefer it not to. A single-purpose string could be
> documented as "whatever you translate this to should be pure ASCII,
> never mind if it's sensible".

One problem with this idea is that it may be hard to coerce gettext into
putting a particular string at the top of the file :-(

--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: More message encoding woes
Date: 2009-03-31 20:45:11
Message-ID: 10431.1238532311@sss.pgh.pa.us

Alvaro Herrera <alvherre(at)commandprompt(dot)com> writes:
> Tom Lane wrote:
>> At first that sounded like an ideal answer, but I can see a gotcha:
>> suppose the translation's author's name contains some characters that
>> don't convert to the database encoding. I suppose that would result in
>> failure, when we'd prefer it not to. A single-purpose string could be
>> documented as "whatever you translate this to should be pure ASCII,
>> never mind if it's sensible".

> One problem with this idea is that it may be hard to coerce gettext into
> putting a particular string at the top of the file :-(

I doubt we can, which is why the documentation needs to tell translators
about it.

regards, tom lane


From: Peter Eisentraut <peter_e(at)gmx(dot)net>
To: pgsql-hackers(at)postgresql(dot)org
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Subject: Re: More message encoding woes
Date: 2009-03-31 21:12:47
Message-ID: 200904010012.48345.peter_e@gmx.net

On Monday 30 March 2009 21:04:00 Tom Lane wrote:
> Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com> writes:
> > Tom Lane wrote:
> >> Could we get away with just unconditionally calling
> >> bind_textdomain_codeset with *our* canonical spelling of the encoding
> >> name? If it works, great, and if it doesn't, you get English.
> >
> > Yeah, that's better than nothing.
>
> A quick look at the output of "iconv --list" on Fedora 10 and OSX 10.5.6
> says that it would not work quite well enough. The encoding names are
> similar but not identical --- in particular I notice a lot of
> discrepancies about dash versus underscore vs no separator at all.

I seem to recall that the encoding names are normalized by the C library
somewhere, but I can't find the documentation now. It might be worth trying
anyway -- the above might not in fact be a problem.


From: Peter Eisentraut <peter_e(at)gmx(dot)net>
To: pgsql-hackers(at)postgresql(dot)org
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: More message encoding woes
Date: 2009-03-31 21:16:33
Message-ID: 200904010016.34667.peter_e@gmx.net

On Monday 30 March 2009 20:06:48 Heikki Linnakangas wrote:
> Tom Lane wrote:
> > Where does it get the default codeset from? Maybe we could constrain
> > that to match the database encoding, the way we do for LC_COLLATE/CTYPE?
>
> LC_CTYPE. In 8.3 and up where we constrain that to match the database
> encoding, we only have a problem with the C locale.

Why don't we apply the same restriction to the C locale then?


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Peter Eisentraut <peter_e(at)gmx(dot)net>
Cc: pgsql-hackers(at)postgresql(dot)org, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Subject: Re: More message encoding woes
Date: 2009-03-31 21:46:42
Message-ID: 16964.1238536002@sss.pgh.pa.us

Peter Eisentraut <peter_e(at)gmx(dot)net> writes:
> On Monday 30 March 2009 20:06:48 Heikki Linnakangas wrote:
>> LC_CTYPE. In 8.3 and up where we constrain that to match the database
>> encoding, we only have a problem with the C locale.

> Why don't we apply the same restriction to the C locale then?

(1) what would you constrain it to?

(2) historically we've allowed C locale to be used with any encoding,
and there are a *lot* of users depending on that (particularly in the
Far East, I gather).

regards, tom lane


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: More message encoding woes
Date: 2009-04-01 02:32:39
Message-ID: 20090401023239.GA3377@alvh.no-ip.org

Tom Lane wrote:
> Alvaro Herrera <alvherre(at)commandprompt(dot)com> writes:

> > One problem with this idea is that it may be hard to coerce gettext into
> > putting a particular string at the top of the file :-(
>
> I doubt we can, which is why the documentation needs to tell translators
> about it.

I doubt that documenting the issue will be enough (in fact I'm pretty
sure it won't). Maybe we can just supply the string translated in our
POT files, and add a comment that the translator is not supposed to
touch it. This doesn't seem all that difficult -- I think it just
requires that we add a msgmerge step to "make update-po" that uses a
file on which the message has already been translated.

--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: More message encoding woes
Date: 2009-04-01 11:31:33
Message-ID: 49D35095.5020900@enterprisedb.com

Tom Lane wrote:
> Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com> writes:
>> Tom Lane wrote:
>>> Maybe use a special string "Translate Me First" that
>>> doesn't actually need to be end-user-visible, just so no one sweats over
>>> getting it right in context.
>
>> Yep, something like that. There seems to be a magic empty string
>> translation at the beginning of every po file that returns the
>> meta-information about the translation, like translation author and
>> date. Assuming that works reliably, I'll use that.
>
> At first that sounded like an ideal answer, but I can see a gotcha:
> suppose the translation's author's name contains some characters that
> don't convert to the database encoding. I suppose that would result in
> failure, when we'd prefer it not to. A single-purpose string could be
> documented as "whatever you translate this to should be pure ASCII,
> never mind if it's sensible".

I just tried that, and it seems that gettext() does transliteration, so
any characters that have no counterpart in the database encoding will be
replaced with something similar, or question marks. Assuming that's
universal across platforms, and I think it is, using the empty string
should work.

It also means that you can use lc_messages='ja' with
server_encoding='latin1', but it will be unreadable because all the
non-ascii characters are replaced with question marks. For something
like lc_messages='es_ES' and server_encoding='koi8-r', it will still
look quite nice.

Attached is a patch I've been testing. Seems to work quite well. It
would be nice if someone could test it on Windows, which seems to be a
bit special in this regard.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

Attachment: gettext-codeset-1.patch (text/x-diff, 8.5 KB)

From: Hiroshi Inoue <inoue(at)tpf(dot)co(dot)jp>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: More message encoding woes
Date: 2009-04-01 17:14:23
Message-ID: 49D3A0EF.8030505@tpf.co.jp

Heikki Linnakangas wrote:
> Tom Lane wrote:
>> Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com> writes:
>>> Tom Lane wrote:
>>>> Maybe use a special string "Translate Me First" that
>>>> doesn't actually need to be end-user-visible, just so no one sweats
>>>> over
>>>> getting it right in context.
>>
>>> Yep, something like that. There seems to be a magic empty string
>>> translation at the beginning of every po file that returns the
>>> meta-information about the translation, like translation author and
>>> date. Assuming that works reliably, I'll use that.
>>
>> At first that sounded like an ideal answer, but I can see a gotcha:
>> suppose the translation's author's name contains some characters that
>> don't convert to the database encoding. I suppose that would result in
>> failure, when we'd prefer it not to. A single-purpose string could be
>> documented as "whatever you translate this to should be pure ASCII,
>> never mind if it's sensible".
>
> I just tried that, and it seems that gettext() does transliteration, so
> any characters that have no counterpart in the database encoding will be
> replaced with something similar, or question marks.
> Assuming that's
> universal across platforms, and I think it is, using the empty string
> should work.
>
> It also means that you can use lc_messages='ja' with
> server_encoding='latin1', but it will be unreadable because all the
> non-ascii characters are replaced with question marks.

It doesn't occur in the current Windows environment. As for Windows
gnu gettext which we are using, we would see the original msgid when
iconv can't convert the msgstr to the target codeset.

set client_encoding to utf_8;
SET
show server_encoding;
server_encoding
-----------------
LATIN1
(1 row)

show lc_messages;
lc_messages
--------------------
Japanese_Japan.932
(1 row)

1;
ERROR: syntax error at or near "1"
LINE 1: 1;

OTOH when the sever encoding is utf8 then

set client_encoding to utf_8;
SET
show server_encoding;
server_encoding
-----------------
UTF8
(1 row)

show lc_messages;
lc_messages
--------------------
Japanese_Japan.932
(1 row)

1;
ERROR: "1"またはその近辺で構文エラー
LINE 1: 1;
        ^


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Hiroshi Inoue <inoue(at)tpf(dot)co(dot)jp>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: More message encoding woes
Date: 2009-04-01 17:36:40
Message-ID: 8233.1238607400@sss.pgh.pa.us

Hiroshi Inoue <inoue(at)tpf(dot)co(dot)jp> writes:
> Heikki Linnakangas wrote:
>> I just tried that, and it seems that gettext() does transliteration, so
>> any characters that have no counterpart in the database encoding will be
>> replaced with something similar, or question marks.

> It doesn't occur in the current Windows environment. As for Windows
> gnu gettext which we are using, we would see the original msgid when
> iconv can't convert the msgstr to the target codeset.

Well, if iconv has no conversion to the codeset at all then there is no
point in selecting that particular codeset setting anyway. The question
was about whether we can distinguish "no conversion available" from
"conversion available, but the test string has some unconvertible
characters".

regards, tom lane


From: Hiroshi Inoue <inoue(at)tpf(dot)co(dot)jp>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: More message encoding woes
Date: 2009-04-01 18:57:42
Message-ID: 49D3B926.70400@tpf.co.jp

Tom Lane wrote:
> Hiroshi Inoue <inoue(at)tpf(dot)co(dot)jp> writes:
>> Heikki Linnakangas wrote:
>>> I just tried that, and it seems that gettext() does transliteration, so
>>> any characters that have no counterpart in the database encoding will be
>>> replaced with something similar, or question marks.
>
>> It doesn't occur in the current Windows environment. As for Windows
>> gnu gettext which we are using, we would see the original msgid when
>> iconv can't convert the msgstr to the target codeset.
>
> Well, if iconv has no conversion to the codeset at all then there is no
> point in selecting that particular codeset setting anyway. The question
> was about whether we can distinguish "no conversion available" from
> "conversion available, but the test string has some unconvertible
> characters".

What I meant is that we would see no '?' when we use the Windows GNU gettext.
Whether a conversion is available or not depends on the individual msgid.
For example, when the Japanese msgstr corresponding to a msgid happens to
contain no characters other than ASCII, Windows GNU gettext will use that
msgstr, not the original msgid.


From: Hiroshi Inoue <inoue(at)tpf(dot)co(dot)jp>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: More message encoding woes
Date: 2009-04-02 12:03:21
Message-ID: 49D4A989.8020907@tpf.co.jp

Heikki Linnakangas wrote:
> Tom Lane wrote:
>> Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com> writes:
>>> Tom Lane wrote:
>>>> Maybe use a special string "Translate Me First" that
>>>> doesn't actually need to be end-user-visible, just so no one sweats
>>>> over
>>>> getting it right in context.
>>
>>> Yep, something like that. There seems to be a magic empty string
>>> translation at the beginning of every po file that returns the
>>> meta-information about the translation, like translation author and
>>> date. Assuming that works reliably, I'll use that.
>>
>> At first that sounded like an ideal answer, but I can see a gotcha:
>> suppose the translation's author's name contains some characters that
>> don't convert to the database encoding. I suppose that would result in
>> failure, when we'd prefer it not to. A single-purpose string could be
>> documented as "whatever you translate this to should be pure ASCII,
>> never mind if it's sensible".
>
> I just tried that, and it seems that gettext() does transliteration, so
> any characters that have no counterpart in the database encoding will be
> replaced with something similar, or question marks. Assuming that's
> universal across platforms, and I think it is, using the empty string
> should work.
>
> It also means that you can use lc_messages='ja' with
> server_encoding='latin1', but it will be unreadable because all the
> non-ascii characters are replaced with question marks. For something
> like lc_messages='es_ES' and server_encoding='koi8-r', it will still
> look quite nice.
>
> Attached is a patch I've been testing. Seems to work quite well. It
> would be nice if someone could test it on Windows, which seems to be a
> bit special in this regard.

Unfortunately it doesn't seem to work on Windows.

First, any combination of a valid lc_messages and a non-existent encoding
passes the test strcmp(gettext(""), "") != 0.
Second, for example, the combination of ja (lc_messages) and ISO-8859-1
passes the test, but the test fails after I changed the Last-Translator
part of the ja message catalog to contain Japanese kanji characters.

regards,
Hiroshi Inoue


From: Peter Eisentraut <peter_e(at)gmx(dot)net>
To: pgsql-hackers(at)postgresql(dot)org
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Subject: Re: More message encoding woes
Date: 2009-04-02 19:48:38
Message-ID: 200904022248.39545.peter_e@gmx.net

On Monday 30 March 2009 15:52:37 Heikki Linnakangas wrote:
> What is happening is that gettext() returns the message in the encoding
> determined by LC_CTYPE, while we expect it to return it in the database
> encoding. Starting with PG 8.3 we enforce that the encoding specified in
> LC_CTYPE matches the database encoding, but not for the C locale.
>
> In CVS HEAD, we call bind_textdomain_codeset() in SetDatabaseEncoding()
> which fixes that, but we only do it on Windows. In earlier versions we
> called it on all platforms, but only for UTF-8. It seems that we should
> call bind_textdomain_codeset on all platforms and all encodings.
> However, there seems to be a reason why we only do it for Windows on CVS
> HEAD: we need a mapping from our encoding ID to the OS codeset name, and
> the OS codeset names vary.
>
> How can we make this more robust?

Another approach might be to create a new configuration parameter that
basically tells what encoding to call bind_textdomain_codeset() with, say
server_encoding_for_gettext. If that is not set, you just use server_encoding
as is and hope that gettext() takes it (which it would in most cases, I
guess).


From: Hiroshi Inoue <inoue(at)tpf(dot)co(dot)jp>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: More message encoding woes
Date: 2009-04-03 03:32:44
Message-ID: 49D5835C.1010404@tpf.co.jp

Hiroshi Inoue wrote:
> Heikki Linnakangas wrote:
>> Tom Lane wrote:
>>> Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com> writes:
>>>> Tom Lane wrote:
>>>>> Maybe use a special string "Translate Me First" that
>>>>> doesn't actually need to be end-user-visible, just so no one sweats
>>>>> over
>>>>> getting it right in context.
>>>
>>>> Yep, something like that. There seems to be a magic empty string
>>>> translation at the beginning of every po file that returns the
>>>> meta-information about the translation, like translation author and
>>>> date. Assuming that works reliably, I'll use that.
>>>
>>> At first that sounded like an ideal answer, but I can see a gotcha:
>>> suppose the translation's author's name contains some characters that
>>> don't convert to the database encoding. I suppose that would result in
>>> failure, when we'd prefer it not to. A single-purpose string could be
>>> documented as "whatever you translate this to should be pure ASCII,
>>> never mind if it's sensible".
>>
>> I just tried that, and it seems that gettext() does transliteration,
>> so any characters that have no counterpart in the database encoding
>> will be replaced with something similar, or question marks. Assuming
>> that's universal across platforms, and I think it is, using the empty
>> string should work.
>>
>> It also means that you can use lc_messages='ja' with
>> server_encoding='latin1', but it will be unreadable because all the
>> non-ascii characters are replaced with question marks. For something
>> like lc_messages='es_ES' and server_encoding='koi8-r', it will still
>> look quite nice.
>>
>> Attached is a patch I've been testing. Seems to work quite well. It
>> would be nice if someone could test it on Windows, which seems to be a
>> bit special in this regard.
>
> Unfortunately it doesn't seem to work on Windows.

Is it inappropriate to call iconv_open() to check if the codeset is
valid for bind_textdomain_codeset()?

regards,
Hiroshi Inoue


From: Peter Eisentraut <peter_e(at)gmx(dot)net>
To: pgsql-hackers(at)postgresql(dot)org
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Subject: Re: More message encoding woes
Date: 2009-04-06 19:47:37
Message-ID: 200904062247.38095.peter_e@gmx.net

On Monday 30 March 2009 15:52:37 Heikki Linnakangas wrote:
> In CVS HEAD, we call bind_textdomain_codeset() in SetDatabaseEncoding()
> which fixes that, but we only do it on Windows. In earlier versions we
> called it on all platforms, but only for UTF-8. It seems that we should
> call bind_textdomain_codeset on all platforms and all encodings.
> However, there seems to be a reason why we only do it for Windows on CVS
> HEAD: we need a mapping from our encoding ID to the OS codeset name, and
> the OS codeset names vary.

In practice you get either the GNU or the Solaris version of gettext, and at
least the GNU version can cope with all the encoding names that the currently
Windows-only code path produces. So enabling the Windows code path for all
platforms when ENABLE_NLS is on and LC_CTYPE is C would appear to work in
sufficiently many cases.


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Peter Eisentraut <peter_e(at)gmx(dot)net>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: More message encoding woes
Date: 2009-04-07 08:21:25
Message-ID: 49DB0D05.8070101@enterprisedb.com

Peter Eisentraut wrote:
> In practice you get either the GNU or the Solaris version of gettext, and at
> least the GNU version can cope with all the encoding names that the currently
> Windows-only code path produces.

It doesn't. On my laptop running Debian testing:

hlinnaka(at)heikkilaptop:~$ LC_ALL=fi_FI.UTF-8 gettext
gettext: ei riittävästi argumentteja
hlinnaka(at)heikkilaptop:~$ LC_ALL=fi_FI.LATIN1 gettext
gettext: missing arguments
hlinnaka(at)heikkilaptop:~$ LC_ALL=fi_FI.ISO-8859-1 gettext
gettext: ei riitt�v�sti argumentteja

Using the name for the latin1 encoding in the currently Windows-only
mapping table, "LATIN1", you get no translation because that name is not
recognized by the system. Using the other name "ISO-8859-1", it works.
"LATIN1" is not listed in the output of locale -m either.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Peter Eisentraut <peter_e(at)gmx(dot)net>
To: pgsql-hackers(at)postgresql(dot)org
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Subject: Re: More message encoding woes
Date: 2009-04-07 09:38:09
Message-ID: 200904071238.09734.peter_e@gmx.net

On Tuesday 07 April 2009 11:21:25 Heikki Linnakangas wrote:
> Peter Eisentraut wrote:
> > In practice you get either the GNU or the Solaris version of gettext, and
> > at least the GNU version can cope with all the encoding names that the
> > currently Windows-only code path produces.
>
> It doesn't. On my laptop running Debian testing:
>
> hlinnaka(at)heikkilaptop:~$ LC_ALL=fi_FI.UTF-8 gettext
> gettext: ei riittävästi argumentteja
> hlinnaka(at)heikkilaptop:~$ LC_ALL=fi_FI.LATIN1 gettext
> gettext: missing arguments

That is because no locale by the name fi_FI.LATIN1 exists.

> hlinnaka(at)heikkilaptop:~$ LC_ALL=fi_FI.ISO-8859-1 gettext
> gettext: ei riitt�v�sti argumentteja
>
> Using the name for the latin1 encoding in the currently Windows-only
> mapping table, "LATIN1", you get no translation because that name is not
> recognized by the system. Using the other name "ISO-8859-1", it works.
> "LATIN1" is not listed in the output of locale -m either.

You are looking in the wrong place. What we need is for iconv to recognize
the encoding name used by PostgreSQL. iconv --list is the primary hint for
that.

The locale names provided by the operating system are arbitrary and unrelated.


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Hiroshi Inoue <inoue(at)tpf(dot)co(dot)jp>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: More message encoding woes
Date: 2009-04-07 09:41:18
Message-ID: 49DB1FBE.3040001@enterprisedb.com

Hiroshi Inoue wrote:
> Heikki Linnakangas wrote:
>> I just tried that, and it seems that gettext() does transliteration,
>> so any characters that have no counterpart in the database encoding
>> will be replaced with something similar, or question marks. Assuming
>> that's universal across platforms, and I think it is, using the empty
>> string should work.
>>
>> It also means that you can use lc_messages='ja' with
>> server_encoding='latin1', but it will be unreadable because all the
>> non-ascii characters are replaced with question marks. For something
>> like lc_messages='es_ES' and server_encoding='koi8-r', it will still
>> look quite nice.
>>
>> Attached is a patch I've been testing. Seems to work quite well. It
>> would be nice if someone could test it on Windows, which seems to be a
>> bit special in this regard.
>
> Unfortunately it doesn't seem to work on Windows.
>
> First, any combination of a valid lc_messages and a non-existent encoding
> passes the test strcmp(gettext(""), "") != 0.

Now that's strange. Can you check what gettext("") returns in that case
then?

> Second, for example, the combination of ja (lc_messages) and ISO-8859-1
> passes the test, but the test fails after I changed the Last-Translator
> part of the ja message catalog to contain Japanese kanji characters.

Yeah, the inconsistency is not nice. In practice, though, if you try to
use Japanese with an encoding that can't represent kanji characters,
you're better off falling back to English than displaying strings full
of question marks. The same goes for all other languages as well, IMHO.
If you're going to fall back to English for some translations (and in
practice "some" is a pretty high percentage) because the encoding is
missing a character and transliteration is not working, you might as
well not bother translating at all.

If we add the dummy translations to all .po files, we could force
fallback-to-English in situations like that by including some or all of
the non-ASCII characters used in the language in the dummy translation.

I'm thinking of going ahead with this approach, without the dummy
translation, after we have resolved the first issue on Windows. We can
add the dummy translations later if needed, but I don't think anyone
will care.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Peter Eisentraut <peter_e(at)gmx(dot)net>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: More message encoding woes
Date: 2009-04-07 10:09:42
Message-ID: 49DB2666.1050800@enterprisedb.com

Peter Eisentraut wrote:
> On Tuesday 07 April 2009 11:21:25 Heikki Linnakangas wrote:
>> Using the name for the latin1 encoding in the currently Windows-only
>> mapping table, "LATIN1", you get no translation because that name is not
>> recognized by the system. Using the other name "ISO-8859-1", it works.
>> "LATIN1" is not listed in the output of locale -m either.
>
> You are looking in the wrong place. What we need is for iconv to recognize
> the encoding name used by PostgreSQL. iconv --list is the primary hint for
> that.
>
> The locale names provided by the operating system are arbitrary and unrelated.

Oh, ok. I guess we can do the simple fix you proposed then.

Patch attached. Instead of checking for LC_CTYPE == C, I'm checking
"pg_get_encoding_from_locale(NULL) == encoding", which is closer to
what we actually want. The downside is that
pg_get_encoding_from_locale(NULL) isn't exactly free, but the upside is
that we don't need to keep this in sync with the rules we have in CREATE
DATABASE that enforce that locale matches encoding.
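
In pseudo-code, what the patch does is roughly this (lookup_os_codeset_name()
standing in for the lookup in the formerly Windows-only mapping table):

/* If LC_CTYPE's codeset already matches the database encoding, gettext's
 * default behaviour is already right and we can skip the binding. */
if (pg_get_encoding_from_locale(NULL) == GetDatabaseEncoding())
    return;

/* Otherwise hand gettext our best guess at the OS's name for the
 * database encoding. */
bind_textdomain_codeset("postgres",
                        lookup_os_codeset_name(GetDatabaseEncoding()));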

This doesn't include the cleanup to make the mapping table easier to
maintain that Magnus was going to have a look at before I started this
thread.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

Attachment: simple-gettext-clocale-fix-1.patch (text/x-diff, 2.1 KB)

From: Hiroshi Inoue <inoue(at)tpf(dot)co(dot)jp>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: More message encoding woes
Date: 2009-04-07 10:22:47
Message-ID: 49DB2977.7070506@tpf.co.jp

Heikki Linnakangas wrote:
> Hiroshi Inoue wrote:
>> Heikki Linnakangas wrote:
>>> I just tried that, and it seems that gettext() does transliteration,
>>> so any characters that have no counterpart in the database encoding
>>> will be replaced with something similar, or question marks. Assuming
>>> that's universal across platforms, and I think it is, using the empty
>>> string should work.
>>>
>>> It also means that you can use lc_messages='ja' with
>>> server_encoding='latin1', but it will be unreadable because all the
>>> non-ascii characters are replaced with question marks. For something
>>> like lc_messages='es_ES' and server_encoding='koi8-r', it will still
>>> look quite nice.
>>>
>>> Attached is a patch I've been testing. Seems to work quite well. It
>>> would be nice if someone could test it on Windows, which seems to be
>>> a bit special in this regard.
>>
>> Unfortunately it doesn't seem to work on Windows.
>>
>> First, any combination of a valid lc_messages and a non-existent encoding
>> passes the test strcmp(gettext(""), "") != 0.
>
> Now that's strange. Can you check what gettext("") returns in that case
> then?

The translated but not converted string. I'm not sure if it's a bug or not;
I can see no description of what should be returned in such a case.

>> Second, for example, the combination of ja (lc_messages) and ISO-8859-1
>> passes the test, but the test fails after I changed the Last-Translator
>> part of the ja message catalog to contain Japanese kanji characters.
>
> Yeah, the inconsistency is not nice. In practice, though, if you try to
> use an encoding that can't represent kanji characters with Japanese,
> you're better off falling back to English than displaying strings full
> of question marks. The same goes for all other languages as well, IMHO.
> If you're going to fall back to English for some translations (and in
> practice "some" is a pretty high percentage) because the encoding is
> missing a character and transliteration is not working, you might as
> well not bother translating at all.

What is wrong with checking if the codeset is valid using iconv_open()?

regards,
Hiroshi Inoue


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Hiroshi Inoue <inoue(at)tpf(dot)co(dot)jp>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: More message encoding woes
Date: 2009-04-07 10:30:37
Message-ID: 49DB2B4D.9090906@enterprisedb.com

Hiroshi Inoue wrote:
> What is wrong with checking if the codeset is valid using iconv_open()?

That would probably work as well. We'd have to decide what we'd try to
convert from with iconv_open(). Utf-8 might be a safe choice. We don't
currently use iconv_open() anywhere in the backend, though, so I'm
hesitant to add a dependency for this. GNU gettext() uses iconv, but I'm
not sure if that's true for all gettext() implementations.

Peter's suggestion seems the best ATM, though.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Hiroshi Inoue <inoue(at)tpf(dot)co(dot)jp>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: More message encoding woes
Date: 2009-04-07 13:59:13
Message-ID: 616.1239112753@sss.pgh.pa.us

Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com> writes:
> Hiroshi Inoue wrote:
>> What is wrong with checking if the codeset is valid using iconv_open()?

> That would probably work as well. We'd have to decide what we'd try to
> convert from with iconv_open().

The problem I have with that is that you are now guessing at *two*
platform-specific encoding names not one, plus hoping there is a
conversion between the two.

If we knew the encoding name embedded in the .mo file we wanted to use,
then it would be sensible to try to use that as the source codeset.

> GNU gettext() uses iconv, but I'm
> not sure if that's true for all gettext() implementations.

Yeah, that's another problem.

regards, tom lane


From: Peter Eisentraut <peter_e(at)gmx(dot)net>
To: pgsql-hackers(at)postgresql(dot)org
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Subject: Re: More message encoding woes
Date: 2009-04-07 18:24:39
Message-ID: 200904072124.40162.peter_e@gmx.net

On Tuesday 07 April 2009 13:09:42 Heikki Linnakangas wrote:
> Patch attached. Instead of checking for LC_CTYPE == C, I'm checking
> "pg_get_encoding_from_locale(NULL) == encoding" which is more close to
> what we actually want. The downside is that
> pg_get_encoding_from_locale(NULL) isn't exactly free, but the upside is
> that we don't need to keep this in sync with the rules we have in CREATE
> DATABASE that enforce that locale matches encoding.

I would have figured we can skip this whole thing when LC_CTYPE != C, because
it should be guaranteed that LC_CTYPE matches the database encoding in this
case, no?

Other than that, I think this patch is good.


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Peter Eisentraut <peter_e(at)gmx(dot)net>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: More message encoding woes
Date: 2009-04-07 18:49:11
Message-ID: 49DBA027.3010809@enterprisedb.com

Peter Eisentraut wrote:
> On Tuesday 07 April 2009 13:09:42 Heikki Linnakangas wrote:
>> Patch attached. Instead of checking for LC_CTYPE == C, I'm checking
>> "pg_get_encoding_from_locale(NULL) == encoding" which is more close to
>> what we actually want. The downside is that
>> pg_get_encoding_from_locale(NULL) isn't exactly free, but the upside is
>> that we don't need to keep this in sync with the rules we have in CREATE
>> DATABASE that enforce that locale matches encoding.
>
> I would have figured we can skip this whole thing when LC_CTYPE != C, because
> it should be guaranteed that LC_CTYPE matches the database encoding in this
> case, no?

Yes, except if pg_get_encoding_from_locale() couldn't figure out what PG
encoding LC_CTYPE corresponds to. We let CREATE DATABASE go ahead in
that case, trusting that the user knows what he's doing. I suppose we
can extend that trust to this case too, and assume that the encoding of
LC_CTYPE actually matches the database encoding.

Or if the encoding is UTF-8 and you're running on Windows, although on
Windows we want to always call bind_textdomain_codeset(). Or if the
database encoding is SQL_ASCII, although in that case we don't want to
call bind_textdomain_codeset() either.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Hiroshi Inoue <inoue(at)tpf(dot)co(dot)jp>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: More message encoding woes
Date: 2009-04-08 00:29:29
Message-ID: 49DBEFE9.8030703@tpf.co.jp

Tom Lane wrote:
> Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com> writes:
>> Hiroshi Inoue wrote:
>>> What is wrong with checking if the codeset is valid using iconv_open()?
>
>> That would probably work as well. We'd have to decide what we'd try to
>> convert from with iconv_open().
>
> The problem I have with that is that you are now guessing at *two*
> platform-specific encoding names not one, plus hoping there is a
> conversion between the two.

AFAIK iconv_open() supports all combinations of the valid encoding
values. Or we may be able to check it using the same encoding for
both from and to.
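
Something like this is what I mean (just a sketch of the check):

#include <iconv.h>

/* Ask iconv whether it recognizes the codeset name at all, by opening a
 * conversion from the name to itself; (iconv_t) -1 means it does not. */
static bool
codeset_is_recognized(const char *codeset)
{
    iconv_t     cd = iconv_open(codeset, codeset);

    if (cd == (iconv_t) -1)
        return false;
    iconv_close(cd);
    return true;
}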

regards,
Hiroshi Inoue


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Peter Eisentraut <peter_e(at)gmx(dot)net>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: More message encoding woes
Date: 2009-04-08 10:25:29
Message-ID: 49DC7B99.20206@enterprisedb.com

Peter Eisentraut wrote:
> On Tuesday 07 April 2009 13:09:42 Heikki Linnakangas wrote:
>> Patch attached. Instead of checking for LC_CTYPE == C, I'm checking
>> "pg_get_encoding_from_locale(NULL) == encoding" which is more close to
>> what we actually want. The downside is that
>> pg_get_encoding_from_locale(NULL) isn't exactly free, but the upside is
>> that we don't need to keep this in sync with the rules we have in CREATE
>> DATABASE that enforce that locale matches encoding.
>
> I would have figured we can skip this whole thing when LC_CTYPE != C, because
> it should be guaranteed that LC_CTYPE matches the database encoding in this
> case, no?

Ok, committed it like that after all.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com