Re: Encoding and i18n

Lists: pgsql-hackers
From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: PostgreSQL-development Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Encoding and i18n
Date: 2007-10-05 22:18:21
Message-ID: 87myuxnr2q.fsf@oxford.xeocode.com


Reading the commit message about the TZ encoding issue I'm curious why this
isn't a more widespread problem. How does gettext know what encoding we want
messages in? How do we prevent things like to_char(now(),'month') from
producing strings in an encoding different from the database's encoding?

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Gregory Stark <stark(at)enterprisedb(dot)com>
Cc: PostgreSQL-development Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Encoding and i18n
Date: 2007-10-06 15:15:55
Message-ID: 20071006151555.GE5618@alvh.no-ip.org

Gregory Stark wrote:
>
> Reading the commit message about the TZ encoding issue I'm curious why this
> isn't a more widespread problem. How does gettext know what encoding we want
> messages in? How do we prevent things like to_char(now(),'month') from
> producing strings in an encoding different from the database's encoding?

The PO files include encoding information, so it's easy for the server
to recode them from that to the server (or client) encoding, as
appropriate.
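
To make the recoding step concrete, here is a minimal sketch of the idea: read the charset declared in a PO header and convert a translated string from that charset to a target encoding. The function names (`po_charset`, `recode`) and the sample header are hypothetical, and this is not how gettext or the PostgreSQL server implements it; it only shows why the declared charset is enough to make the conversion mechanical.

```python
import re


def po_charset(header: str) -> str:
    """Return the charset declared in a PO header's Content-Type line."""
    match = re.search(r"Content-Type:[^\n]*charset=([A-Za-z0-9._-]+)", header)
    if match is None:
        raise ValueError("PO header has no charset declaration")
    return match.group(1)


def recode(raw_msgstr: bytes, po_header: str, target_encoding: str) -> bytes:
    """Convert a translation from the catalog's charset to the target one."""
    text = raw_msgstr.decode(po_charset(po_header))
    return text.encode(target_encoding)


# Hypothetical header, shown with its escape sequences already expanded.
SAMPLE_HEADER = (
    "Project-Id-Version: example\n"
    "Content-Type: text/plain; charset=ISO-8859-1\n"
)
```

If the translator declares the wrong charset, the `decode` step either fails or silently produces the wrong characters, which is the "up to the translator to get it right" caveat.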

Of course, then it is up to the translator to get it right ... but I
think when he doesn't, people notice fairly quickly.

--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.


From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: "Alvaro Herrera" <alvherre(at)commandprompt(dot)com>
Cc: "PostgreSQL-development Hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Encoding and i18n
Date: 2007-10-06 15:19:43
Message-ID: 87sl4otgmo.fsf@oxford.xeocode.com

"Alvaro Herrera" <alvherre(at)commandprompt(dot)com> writes:

> Gregory Stark wrote:
>>
>> Reading the commit message about the TZ encoding issue I'm curious why this
>> isn't a more widespread problem. How does gettext know what encoding we want
>> messages in? How do we prevent things like to_char(now(),'month') from
>> producing strings in an encoding different from the database's encoding?
>
> The PO files include encoding information, so it's easy for the server
> to recode them from that to the server (or client) encoding, as
> appropriate.

So does the _() macro automatically recode it to the current server encoding?

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Gregory Stark <stark(at)enterprisedb(dot)com>
Cc: PostgreSQL-development Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Encoding and i18n
Date: 2007-10-06 15:25:34
Message-ID: 23680.1191684334@sss.pgh.pa.us

Gregory Stark <stark(at)enterprisedb(dot)com> writes:
> Reading the commit message about the TZ encoding issue I'm curious why this
> isn't a more widespread problem. How does gettext know what encoding we want
> messages in? How do we prevent things like to_char(now(),'month') from
> producing strings in an encoding different from the database's encoding?

The short answer is it's all a house of cards, and if you troll
the archives you will find plenty of bug reports traceable to
misconfiguration in this area. The recent attempt to enforce
that nl_langinfo(CODESET) matches the database encoding is a first
step towards making this more bulletproof, but we're finding out
that even that is harder than it looks.
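
One reason such a check is harder than it looks is that the same codeset is spelled differently by different systems. The following is a rough sketch, under stated assumptions, of comparing a database encoding name with a locale codeset name. The alias table is hypothetical and far from complete, and the function names are made up for illustration; this is not the check PostgreSQL performs.

```python
# Hypothetical alias table mapping one normalized spelling to another.
# A real table would need many more entries.
ALIASES = {"latin1": "iso88591"}


def normalize_codeset(name: str) -> str:
    """Lowercase the name and drop punctuation such as '-' and '_'."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


def canonical(name: str) -> str:
    """Normalize a codeset name, then resolve it through the alias table."""
    norm = normalize_codeset(name)
    return ALIASES.get(norm, norm)


def codeset_matches(db_encoding: str, locale_codeset: str) -> bool:
    """Return True if both names appear to denote the same codeset."""
    return canonical(db_encoding) == canonical(locale_codeset)
```

With this, "UTF8" and "utf-8" compare equal, as do "LATIN1" and "ISO-8859-1", while "UTF8" and "ISO-8859-1" do not.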

regards, tom lane


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Gregory Stark <stark(at)enterprisedb(dot)com>
Cc: PostgreSQL-development Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Encoding and i18n
Date: 2007-10-06 15:33:08
Message-ID: 20071006153308.GI5618@alvh.no-ip.org

Gregory Stark wrote:
> "Alvaro Herrera" <alvherre(at)commandprompt(dot)com> writes:
>
> > Gregory Stark wrote:
> >>
> >> Reading the commit message about the TZ encoding issue I'm curious why this
> >> isn't a more widespread problem. How does gettext know what encoding we want
> >> messages in? How do we prevent things like to_char(now(),'month') from
> >> producing strings in an encoding different from the database's encoding?
> >
> > The PO files include encoding information, so it's easy for the server
> > to recode them from that to the server (or client) encoding, as
> > appropriate.
>
> So does the _() macro automatically recode it to the current server encoding?

Well, I'm not sure if it's _(), elog() or what, but it does get recoded.
If I have a different client_encoding and get a NOTICE, then both the
server and client get a message in the corresponding encoding.

In fact this is the reason for the most common "PANIC: stack overflow"
in the elog.c error stack. When a message needs to be recoded but the
recoding procedure errors out, it wants to report that error; if that
report also fails, you get infinite recursion and nothing can get reported.
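
The failure mode described here can be illustrated with a small sketch. It shows one generic way to bound the recursion: once we are already reporting a conversion failure, stop trying to recode and fall back to plain ASCII. The `report` function and its `_nested` flag are hypothetical and only model the situation; this is not a description of what elog.c does.

```python
def report(message: str, client_encoding: str, _nested: bool = False) -> bytes:
    """Encode a message for the client, reporting conversion failures."""
    if _nested:
        # Already reporting a conversion failure: do not attempt another
        # conversion, since that is what would recurse without bound.
        return message.encode("ascii", errors="replace")
    try:
        return message.encode(client_encoding)
    except (UnicodeEncodeError, LookupError) as exc:
        # The failure itself has to be reported; mark the call as nested.
        return report(f"conversion failed: {exc}", client_encoding, _nested=True)
```

Without the `_nested` guard, a failing conversion inside the error path would call itself again and again, which is the recursion described above.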

--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Gregory Stark <stark(at)enterprisedb(dot)com>
Cc: "Alvaro Herrera" <alvherre(at)commandprompt(dot)com>, "PostgreSQL-development Hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Encoding and i18n
Date: 2007-10-06 15:35:18
Message-ID: 23861.1191684918@sss.pgh.pa.us

Gregory Stark <stark(at)enterprisedb(dot)com> writes:
> So does the _() macro automatically recode it to the current server encoding?

From the gettext manual:

---

gettext not only looks up a translation in a message catalog. It also
converts the translation on the fly to the desired output character
set. This is useful if the user is working in a different character set
than the translator who created the message catalog, because it avoids
distributing variants of message catalogs which differ only in the
character set.

The output character set is, by default, the value of nl_langinfo
(CODESET), which depends on the LC_CTYPE part of the current locale. But
programs which store strings in a locale independent way (e.g. UTF-8)
can request that gettext and related functions return the translations
in that encoding, by use of the bind_textdomain_codeset function.

---

We don't currently call bind_textdomain_codeset, in part because of the
lack of portability of names for codesets.
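
For readers who want to see the default the manual refers to, here is a small sketch that queries nl_langinfo(CODESET) for the LC_CTYPE taken from the environment. It assumes a POSIX platform where Python's `locale.nl_langinfo` is available; the function name `default_output_codeset` is made up. The returned spelling varies between systems, which is the portability problem just mentioned.

```python
import locale


def default_output_codeset() -> str:
    """Return nl_langinfo(CODESET) for the environment's LC_CTYPE."""
    saved = locale.setlocale(locale.LC_CTYPE)  # query the current setting
    try:
        try:
            locale.setlocale(locale.LC_CTYPE, "")  # adopt the environment
        except locale.Error:
            # The environment names a locale that is not installed.
            locale.setlocale(locale.LC_CTYPE, "C")
        return locale.nl_langinfo(locale.CODESET)
    finally:
        locale.setlocale(locale.LC_CTYPE, saved)  # restore the old setting
```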

regards, tom lane


From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: "Alvaro Herrera" <alvherre(at)commandprompt(dot)com>
Cc: "PostgreSQL-development Hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Encoding and i18n
Date: 2007-10-06 16:49:02
Message-ID: 87lkagtcht.fsf@oxford.xeocode.com

"Alvaro Herrera" <alvherre(at)commandprompt(dot)com> writes:

> Gregory Stark wrote:
>
>> So does the _() macro automatically recode it to the current server encoding?
>
> Well, I'm not sure if it's _(), elog() or what, but it does get recoded.
> If I have a different client_encoding and get a NOTICE, then both the
> server and client get a message in the corresponding encoding.

Actually I was thinking about things like formatting.c which take localized
strings and return them as data which can end up in the database. If they're
in the wrong encoding then they'll be invalidly encoded strings in the
database.

> In fact this is the reason for the most common "PANIC: stack overflow"
> in elog.c error stack. When a message needs to be recoded but the
> recoding procedure errors out, it wants to report that and this one also
> fails, you get infinite recursion and nothing can get reported.

Ouch

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Gregory Stark <stark(at)enterprisedb(dot)com>
Cc: PostgreSQL-development Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Encoding and i18n
Date: 2007-10-06 16:52:12
Message-ID: 20071006165212.GE7190@alvh.no-ip.org

Gregory Stark wrote:
> "Alvaro Herrera" <alvherre(at)commandprompt(dot)com> writes:
>
> > Gregory Stark wrote:
> >
> >> So does the _() macro automatically recode it to the current server encoding?
> >
> > Well, I'm not sure if it's _(), elog() or what, but it does get recoded.
> > If I have a different client_encoding and get a NOTICE, then both the
> > server and client get a message in the corresponding encoding.
>
> Actually I was thinking about things like formatting.c which take localized
> strings and return them as data which can end up in the database. If they're
> in the wrong encoding then they'll be invalidly encoded strings in the
> database.

Oh, I didn't think of that. Let me see if I can get an invalid string
into the database that way.

--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Gregory Stark <stark(at)enterprisedb(dot)com>, PostgreSQL-development Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Encoding and i18n
Date: 2007-10-06 17:03:06
Message-ID: 4707BFCA.2040608@dunslane.net

Alvaro Herrera wrote:
>> Actually I was thinking about things like formatting.c which take localized
>> strings and return them as data which can end up in the database. If they're
>> in the wrong encoding then they'll be invalidly encoded strings in the
>> database.
>>
>
> Oh, I didn't think of that. Let me see if I can get an invalid string
> into the database that way.
>
>

I was quite certain when we closed most of these holes recently that we
hadn't caught them all, so this wouldn't surprise me in the least.

cheers

andrew


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: Gregory Stark <stark(at)enterprisedb(dot)com>, PostgreSQL-development Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Encoding and i18n
Date: 2007-10-06 17:43:13
Message-ID: 20071006174313.GF7190@alvh.no-ip.org

Andrew Dunstan wrote:
>
>
> Alvaro Herrera wrote:
>>> Actually I was thinking about things like formatting.c which take
>>> localized
>>> strings and return them as data which can end up in the database. If
>>> they're
>>> in the wrong encoding then they'll be invalidly encoded strings in the
>>> database.
>>
>> Oh, I didn't think of that. Let me see if I can get an invalid string
>> into the database that way.
>
> I was quite certain when we closed most of these holes recently that we
> hadn't caught them all, so this wouldn't surprise me in the least.

It seems to work correctly:

alvherre=# drop table week;
DROP TABLE
alvherre=# create table week (a text);
CREATE TABLE
alvherre=# \encoding utf8
alvherre=# insert into week select to_char(now()-'3 days'::interval, 'tmday');
INSERT 0 1
alvherre=# \encoding latin1
alvherre=# insert into week select to_char(now()-'3 days'::interval, 'tmday');
INSERT 0 1
alvherre=# select * from week;
a
-----------
miércoles
miércoles
(2 lignes)

I tried on both a UTF8 and Latin1 terminal and it works OK in all cases.

--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Andrew Dunstan <andrew(at)dunslane(dot)net>, Gregory Stark <stark(at)enterprisedb(dot)com>, PostgreSQL-development Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Encoding and i18n
Date: 2007-10-06 18:24:28
Message-ID: 27167.1191695068@sss.pgh.pa.us

Alvaro Herrera <alvherre(at)commandprompt(dot)com> writes:
> I tried on both a UTF8 and Latin1 terminal and it works OK in all cases.

The cases that would be interesting involve to_char's locale-specific
format codes (eg Dy) along with LC_TIME settings that are deliberately
incompatible with the database encoding. client_encoding is not relevant.

It's not real clear to me whether, on a Unix machine, there is even
supposed to be any difference between setting LC_TIME=es_ES.iso88591 and
setting it to es_ES.utf8. Since nl_langinfo(CODESET) is supposedly
determined only by LC_CTYPE, you could argue that strftime's results
should be in that encoding regardless, and that the codeset component of
other LC_ variables should be ignored. Some experimentation suggests
that at least in glibc it doesn't work that way, and that there is in
fact no principled way for you to find out what encoding strftime is
giving you :-(.

$ LANG=es_ES.utf8 date
sb oct 6 14:11:30 EDT 2007
$ LANG=es_ES.iso88591 date
sb oct 6 14:11:42 EDT 2007
$ LANG=en_US.iso88591 LC_TIME=es_ES.utf8 date
sb oct 6 14:12:10 EDT 2007
$ LC_CTYPE=en_US.iso88591 LC_TIME=es_ES.utf8 date
sb oct 6 14:12:34 EDT 2007

Perhaps a workable fix for this would be to try to mangle the LC_ settings
we pass to setlocale() so that they all have the same codeset component
(if any). It looks like the convention of ".foo" being a codeset name
is fairly well standardized, even if the spelling of the codeset name is
not ...
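
A minimal sketch of that mangling idea, assuming locale names follow the usual language[_territory][.codeset][@modifier] layout. The function names are hypothetical, and the choice to add a codeset to names that had none is an assumption of this sketch rather than something the proposal settles.

```python
def with_codeset(locale_name: str, codeset: str) -> str:
    """Replace (or add) the codeset component of a locale name."""
    if locale_name in ("", "C", "POSIX"):
        return locale_name  # these names carry no codeset component
    base, sep, modifier = locale_name.partition("@")
    language = base.split(".", 1)[0]  # drop any existing ".codeset"
    return f"{language}.{codeset}{sep}{modifier}"


def unify_codesets(settings: dict, codeset: str) -> dict:
    """Give every LC_* setting the same codeset component."""
    return {name: with_codeset(value, codeset) for name, value in settings.items()}
```

For example, applying `unify_codesets` with codeset "iso88591" to LC_CTYPE=en_US.iso88591 and LC_TIME=es_ES.utf8 leaves the first unchanged and rewrites the second to es_ES.iso88591. Whether the rewritten locale actually exists on the system is a separate question that this sketch does not address.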

regards, tom lane


From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Alvaro Herrera" <alvherre(at)commandprompt(dot)com>, "Andrew Dunstan" <andrew(at)dunslane(dot)net>, "PostgreSQL-development Hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Encoding and i18n
Date: 2007-10-07 00:26:04
Message-ID: 87hcl3u5wj.fsf@oxford.xeocode.com

"Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:

> Since nl_langinfo(CODESET) is supposedly determined only by LC_CTYPE, you
> could argue that strftime's results should be in that encoding regardless,

It seems to me we aren't actually using strftime any more in any case. We seem
to be using things like _("Monday") instead. Except in my tests I don't get
any French dates even when the server is started in French mode. I think we
just don't have localizations for those strings yet.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Gregory Stark <stark(at)enterprisedb(dot)com>
Cc: "Alvaro Herrera" <alvherre(at)commandprompt(dot)com>, "Andrew Dunstan" <andrew(at)dunslane(dot)net>, "PostgreSQL-development Hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Encoding and i18n
Date: 2007-10-07 01:16:47
Message-ID: 2489.1191719807@sss.pgh.pa.us

Gregory Stark <stark(at)enterprisedb(dot)com> writes:
> "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:
>> Since nl_langinfo(CODESET) is supposedly determined only by LC_CTYPE, you
>> could argue that strftime's results should be in that encoding regardless,

> It seems to me we aren't actually using strftime any more in any case.

Sorry, I was using strftime as a generic standin for "everything that
LC_TIME affects". Trace the usage of backend/utils/adt/pg_locale.c
to see what's really at stake there.

The practical issues would likely be things like type money using a
currency symbol that's given in the wrong encoding.

And of course you did get the point that we already know a bogus
LC_MESSAGES setting leads directly to error-stack-overflow PANIC.

regards, tom lane


From: Euler Taveira de Oliveira <euler(at)timbira(dot)com>
To: Gregory Stark <stark(at)enterprisedb(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Andrew Dunstan <andrew(at)dunslane(dot)net>, PostgreSQL-development Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Encoding and i18n
Date: 2007-10-07 04:10:19
Message-ID: 47085C2B.6070001@timbira.com

Gregory Stark wrote:

> It seems to me we aren't actually using strftime any more in any case. We seem
> to be using things like _("Monday") instead. Except in my tests I don't get
> any French dates even when the server is started in French mode. I think we
> just don't have localizations for those strings yet.
>
This was already discussed [1]. I proposed a patch (that was rejected)
because it calls setlocale() in every template pattern in to_char()
IIRC. I coded a patch to implement the setlocale() caching mechanism but
didn't send it. :( I'll take a look at this.
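
The caching idea can be sketched roughly as follows: load the localized names once per LC_TIME value and reuse them until the setting changes, instead of switching locale for every template pattern. The class name and the `loader` callable are hypothetical; this is an illustration of the general technique, not the unsent patch.

```python
class LocaleNameCache:
    """Cache localized names, reloading only when the locale changes."""

    def __init__(self, loader):
        # loader: a callable taking a locale name and returning the
        # localized names (this is where setlocale() would be involved).
        self._loader = loader
        self._locale = None
        self._names = None

    def day_names(self, lc_time: str):
        """Return the names for lc_time, calling the loader only on change."""
        if lc_time != self._locale:
            self._names = self._loader(lc_time)
            self._locale = lc_time
        return self._names
```

Repeated calls with the same LC_TIME value then cost a string comparison rather than a locale switch.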

[1] http://archives.postgresql.org/pgsql-hackers/2006-11/msg00523.php

--
Euler Taveira de Oliveira
http://www.timbira.com/


From: Peter Eisentraut <peter_e(at)gmx(dot)net>
To: pgsql-hackers(at)postgresql(dot)org
Cc: Gregory Stark <stark(at)enterprisedb(dot)com>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Alvaro Herrera" <alvherre(at)commandprompt(dot)com>, "Andrew Dunstan" <andrew(at)dunslane(dot)net>
Subject: Re: Encoding and i18n
Date: 2007-10-08 10:40:14
Message-ID: 200710081240.14867.peter_e@gmx.net

Am Sonntag, 7. Oktober 2007 schrieb Gregory Stark:
> "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:
> > Since nl_langinfo(CODESET) is supposedly determined only by LC_CTYPE, you
> > could argue that strftime's results should be in that encoding
> > regardless,
>
> It seems to me we aren't actually using strftime any more in any case. We
> seem to be using things like _("Monday") instead.

I seem to recall that we don't use strftime *yet*, exactly because of this
sort of issue. This was discussed before the 8.2 release.

--
Peter Eisentraut
http://developer.postgresql.org/~petere/


From: Peter Eisentraut <peter_e(at)gmx(dot)net>
To: pgsql-hackers(at)postgresql(dot)org
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Gregory Stark <stark(at)enterprisedb(dot)com>
Subject: Re: Encoding and i18n
Date: 2007-10-08 10:41:41
Message-ID: 200710081241.42433.peter_e@gmx.net

Am Samstag, 6. Oktober 2007 schrieb Tom Lane:
> It's not real clear to me whether, on a Unix machine, there is even
> supposed to be any difference between setting LC_TIME=es_ES.iso88591 and
> setting it to es_ES.utf8.  Since nl_langinfo(CODESET) is supposedly
> determined only by LC_CTYPE, you could argue that strftime's results
> should be in that encoding regardless, and that the codeset component of
> other LC_ variables should be ignored.  Some experimentation suggests
> that at least in glibc it doesn't work that way, and that there is in
> fact no principled way for you to find out what encoding strftime is
> giving you :-(.

It might be useful to research whether that behavior is following the spec
(POSIX or whatever).

--
Peter Eisentraut
http://developer.postgresql.org/~petere/