Re: Locale + encoding combinations

Lists: pgsql-hackers
From: Dave Page <dpage(at)postgresql(dot)org>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Locale + encoding combinations
Date: 2007-10-09 20:32:42
Message-ID: 470BE56A.2030509@postgresql.org

I'm working on some code for pgInstaller that will check the locale and
encoding selected by the user are a valid combination.

The changes recently added to initdb (which highlighted the UTF-8 issue
on Windows that Tom posted about) appear to only allow the default
encoding for the locale to be selected. For example, for me that would be:

"English_United Kingdom.1252"

However, setlocale() will also accept other valid combinations on
Windows, which initdb will not, for example:

"English_United Kingdom.28591" (Latin1)

Is there any reason not to accept other combinations that setlocale() is
happy with?

Regards, Dave


From: Peter Eisentraut <peter_e(at)gmx(dot)net>
To: pgsql-hackers(at)postgresql(dot)org
Cc: Dave Page <dpage(at)postgresql(dot)org>
Subject: Re: Locale + encoding combinations
Date: 2007-10-09 21:27:08
Message-ID: 200710092327.09090.peter_e@gmx.net

Dave Page wrote:
> Is there any reason not to accept other combinations that setlocale()
> is happy with?

setlocale() sets the locale. How does it "accept" a "combination"?

--
Peter Eisentraut
http://developer.postgresql.org/~petere/


From: Dave Page <dpage(at)postgresql(dot)org>
To: Peter Eisentraut <peter_e(at)gmx(dot)net>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Locale + encoding combinations
Date: 2007-10-09 21:38:47
Message-ID: 470BF4E7.7050107@postgresql.org

Peter Eisentraut wrote:
> Dave Page wrote:
>> Is there any reason not to accept other combinations that setlocale()
>> is happy with?
>
> setlocale() sets the locale. How does it "accept" a "combination"?
>

setlocale(LC_CTYPE, "English_United Kingdom.65001")

will return null (and not change anything) because it doesn't like the
combination of the locale and that encoding (UTF-8).

setlocale(LC_CTYPE, "English_United Kingdom.1252")

will return "English_United Kingdom.1252" and set the locale accordingly
because WIN1252 is a valid encoding for that locale. Similarly, LATIN1
and numerous other encodings are accepted in combination with that locale.
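
The accept/reject behaviour described above can be probed directly. A minimal sketch in Python, whose locale.setlocale() wraps the C library call and raises locale.Error where the C function returns null; probe_locale is a hypothetical helper name, and the answers depend entirely on which locales the host system provides.

```python
import locale


def probe_locale(name):
    """Return the name setlocale() reports for `name`, or None if it is rejected."""
    # Calling setlocale() without a value only queries the current setting.
    saved = locale.setlocale(locale.LC_CTYPE)
    try:
        return locale.setlocale(locale.LC_CTYPE, name)
    except locale.Error:
        # The C library returned null: the locale was not changed.
        return None
    finally:
        # Put the process back the way we found it.
        locale.setlocale(locale.LC_CTYPE, saved)


if __name__ == "__main__":
    for candidate in ("C",
                      "English_United Kingdom.1252",
                      "English_United Kingdom.65001"):
        print(candidate, "->", probe_locale(candidate))
```

Going by the results reported in this message, the .1252 name would be expected to come back unchanged on that Windows machine and the .65001 name to be rejected; on other systems the outcome will differ.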

Should initdb allow any combination that setlocale() accepts, or should
it *only* accept the default encoding for the specified locale?

/D


From: Peter Eisentraut <peter_e(at)gmx(dot)net>
To: Dave Page <dpage(at)postgresql(dot)org>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Locale + encoding combinations
Date: 2007-10-09 22:55:31
Message-ID: 200710100055.31624.peter_e@gmx.net

Dave Page wrote:
> setlocale(LC_CTYPE, "English_United Kingdom.65001")
>
> will return null (and not change anything) because it doesn't like
> the combination of the locale and that encoding (UTF-8).

The reason that that call fails is probably that the operating system
does not provide such a locale. But that's not what we are interested
in. We are interested in compatibility between *existing* operating
system locales and *PostgreSQL* encoding names.

--
Peter Eisentraut
http://developer.postgresql.org/~petere/


From: Dave Page <dpage(at)postgresql(dot)org>
To: Peter Eisentraut <peter_e(at)gmx(dot)net>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Locale + encoding combinations
Date: 2007-10-10 08:11:21
Message-ID: 470C8929.3030803@postgresql.org

Peter Eisentraut wrote:
> Dave Page wrote:
>> setlocale(LC_CTYPE, "English_United Kingdom.65001")
>>
>> will return null (and not change anything) because it doesn't like
>> the combination of the locale and that encoding (UTF-8).
>
> The reason that that call fails is probably that the operating system
> does not provide such a locale.

It doesn't - UTF-8/65001 is a pseudo codepage on Windows with no NLS
file defining collation rules etc. as we already discussed.

> But that's not what we are interested
> in. We are interested in compatibility between *existing* operating
> system locales and *PostgreSQL* encoding names.

Yes.

Let me put my question another way.

Latin1 is a perfectly valid encoding for my locale English_United
Kingdom. It is accepted by setlocale for LC_ALL.

Why does initdb reject it? Why does it insist the encoding is not valid
for the locale?

/D


From: Peter Eisentraut <peter_e(at)gmx(dot)net>
To: Dave Page <dpage(at)postgresql(dot)org>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Locale + encoding combinations
Date: 2007-10-10 08:45:47
Message-ID: 200710101045.47873.peter_e@gmx.net

On Wednesday, 10 October 2007, Dave Page wrote:
> Latin1 is a perfectly valid encoding for my locale English_United
> Kingdom. It is accepted by setlocale for LC_ALL.
>
> Why does initdb reject it? Why does it insist the encoding is not valid
> for the locale?

Because initdb works with a finite list of known matches, and your particular
combination might not be in that list -- yet.

--
Peter Eisentraut
http://developer.postgresql.org/~petere/


From: Dave Page <dpage(at)postgresql(dot)org>
To: Peter Eisentraut <peter_e(at)gmx(dot)net>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Locale + encoding combinations
Date: 2007-10-10 08:51:33
Message-ID: 470C9295.6060805@postgresql.org

Peter Eisentraut wrote:
> On Wednesday, 10 October 2007, Dave Page wrote:
>> Latin1 is a perfectly valid encoding for my locale English_United
>> Kingdom. It is accepted by setlocale for LC_ALL.
>>
>> Why does initdb reject it? Why does it insist the encoding is not valid
>> for the locale?
>
> Because initdb works with a finite list of known matches, and your particular
> combination might not be in that list -- yet.

So is it just a case of us generating a list of matches that may be
Windows specific, or is there more to it than that?

/D


From: Peter Eisentraut <peter_e(at)gmx(dot)net>
To: Dave Page <dpage(at)postgresql(dot)org>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Locale + encoding combinations
Date: 2007-10-10 09:47:08
Message-ID: 200710101147.09291.peter_e@gmx.net

On Wednesday, 10 October 2007, Dave Page wrote:
> So is it just a case of us generating a list of matches that may be
> Windows specific, or is there more to it than that?

You want to peruse src/port/chklocale.c. There is already explicit Windows
support in there, so maybe you just need to add on your particular cases.
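
A "finite list of known matches" can be pictured as a lookup table keyed on the code page suffix of the Windows locale name. This is a minimal sketch in Python, not the contents of src/port/chklocale.c: the table holds only the code page and encoding pairs named elsewhere in this thread, spelled the way they are written here, and encoding_for_locale is a hypothetical function name.

```python
# Code page suffix -> encoding name, as the pairs are written in this thread.
KNOWN_CODEPAGES = {
    "1252": "WIN1252",
    "28591": "LATIN1",
    "28605": "LATIN9",
    "28595": "ISO8859-5",
    "28597": "ISO8859-7",
}


def encoding_for_locale(locale_name):
    """Return the matching encoding name, or None if the suffix is not listed."""
    # Split on the last dot: "English_United Kingdom.1252" -> suffix "1252".
    _base, dot, codepage = locale_name.rpartition(".")
    if not dot:
        return None  # no code page suffix at all
    return KNOWN_CODEPAGES.get(codepage)


if __name__ == "__main__":
    print(encoding_for_locale("English_United Kingdom.28591"))
    print(encoding_for_locale("English_United Kingdom.99999"))
```

With a table like this, a combination is rejected simply because nobody has listed it yet, which is independent of whether setlocale() would accept it.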

--
Peter Eisentraut
http://developer.postgresql.org/~petere/


From: Dave Page <dpage(at)postgresql(dot)org>
To: Peter Eisentraut <peter_e(at)gmx(dot)net>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Locale + encoding combinations
Date: 2007-10-10 11:07:58
Message-ID: 470CB28E.70500@postgresql.org

Peter Eisentraut wrote:
> On Wednesday, 10 October 2007, Dave Page wrote:
>> So is it just a case of us generating a list of matches that may be
>> Windows specific, or is there more to it than that?
>
> You want to peruse src/port/chklocale.c. There is already explicit Windows
> support in there, so maybe you just need to add on your particular cases.

Yup, found that - thanks. I'll look at updating that list.

/D


From: Dave Page <dpage(at)postgresql(dot)org>
To: Peter Eisentraut <peter_e(at)gmx(dot)net>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Locale + encoding combinations
Date: 2007-10-10 11:48:15
Message-ID: 470CBBFF.5070701@postgresql.org

Dave Page wrote:
> Peter Eisentraut wrote:
>> On Wednesday, 10 October 2007, Dave Page wrote:
>>> So is it just a case of us generating a list of matches that may be
>>> Windows specific, or is there more to it than that?
>> You want to peruse src/port/chklocale.c. There is already explicit Windows
>> support in there, so maybe you just need to add on your particular cases.
>
> Yup, found that - thanks. I'll look at updating that list.

OK so I added the appropriate entries (and posted the patch to
-patches), but my original question remains: why can I only select the
*default* encoding for the chosen locale, but not other ones that are
also valid according to setlocale? Is this a bug, or is there some
technical reason?

/D


From: Peter Eisentraut <peter_e(at)gmx(dot)net>
To: Dave Page <dpage(at)postgresql(dot)org>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Locale + encoding combinations
Date: 2007-10-10 12:18:16
Message-ID: 200710101418.16913.peter_e@gmx.net

On Wednesday, 10 October 2007, Dave Page wrote:
> my original question remains: why can I only select the
> *default* encoding for the chosen locale, but not other ones that are
> also valid according to setlocale? Is this a bug, or is there some
> technical reason?

One locale works only with one encoding. There are no "default" or perhaps
alternative encodings for one locale; there is only one. The whole point of
the exercise is to determine what the spelling of that one encoding is in
PostgreSQL.

Perhaps you are confused about the naming. These are all entirely separate
locales:

en_GB.iso88591
en_GB.iso885915
en_GB.utf8

Someone was friendly enough to include the name of the encoding used by the
locale into its name, but that doesn't mean that en_GB has three alternative
encodings or something.

At least that's the model we have on POSIX platforms.
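
The model above, where each full name is its own locale with exactly one encoding, can be illustrated with a small sketch in Python. It assumes the common language_TERRITORY.codeset[@modifier] spelling and uses a hypothetical helper name, split_locale; it only pulls the name apart and does not consult the operating system.

```python
def split_locale(name):
    """Split "en_GB.utf8@euro" into ("en_GB", "utf8"); codeset is None if absent."""
    base, dot, rest = name.partition(".")
    if not dot:
        # No codeset in the name; drop any @modifier from the base part.
        return base.partition("@")[0], None
    # Drop an optional @modifier that follows the codeset.
    codeset = rest.partition("@")[0]
    return base, codeset


# Three separate locales, each mapped to its single encoding.
names = ["en_GB.iso88591", "en_GB.iso885915", "en_GB.utf8"]
encoding_of = {name: split_locale(name)[1] for name in names}

if __name__ == "__main__":
    for name, codeset in encoding_of.items():
        print(name, "->", codeset)
```

The dictionary is keyed on the whole locale name, not on "en_GB": there is no entry under which en_GB offers a choice of encodings.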

--
Peter Eisentraut
http://developer.postgresql.org/~petere/


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Dave Page <dpage(at)postgresql(dot)org>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Locale + encoding combinations
Date: 2007-10-10 12:37:48
Message-ID: 13830.1192019868@sss.pgh.pa.us

Dave Page <dpage(at)postgresql(dot)org> writes:
> However, setlocale() will also accept other valid combinations on
> Windows, which initdb will not, for example:
> "English_United Kingdom.28591" (Latin1)
> Is there any reason not to accept other combinations that setlocale() is
> happy with?

Are you certain that that acceptance actually represents support?
Have you checked that it rejects combinations involving real code
pages (ie, NOT 65001) that don't really work with the locale?

regards, tom lane


From: Dave Page <dpage(at)postgresql(dot)org>
To: Peter Eisentraut <peter_e(at)gmx(dot)net>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Locale + encoding combinations
Date: 2007-10-10 12:49:57
Message-ID: 470CCA75.1070108@postgresql.org

Peter Eisentraut wrote:
> On Wednesday, 10 October 2007, Dave Page wrote:
>> my original question remains: why can I only select the
>> *default* encoding for the chosen locale, but not other ones that are
>> also valid according to setlocale? Is this a bug, or is there some
>> technical reason?
>
> One locale works only with one encoding. There are no "default" or perhaps
> alternative encodings for one locale; there is only one. The whole point of
> the exercise is to determine what the spelling of that one encoding is in
> PostgreSQL.
>
> Perhaps you are confused about the naming. These are all entirely separate
> locales:
>
> en_GB.iso88591
> en_GB.iso885915
> en_GB.utf8
>
> Someone was friendly enough to include the name of the encoding used by the
> locale into its name, but that doesn't mean that en_GB has three alternative
> encodings or something.
>
> At least that's the model we have on POSIX platforms.

OK, sorting out my terminology deficiencies has helped - thanks. The
problem seems to be:

initdb --locale "English_United Kingdom.28591"

works, but

initdb -E LATIN1 --locale "English_United Kingdom"

does not. That's good (albeit inconsistent), I know how to fix
pgInstaller now. What isn't so good is:

============
C:\pg>bin\initdb --locale "English_United Kingdom.99999" -D data
initdb: invalid locale name "English_United Kingdom.99999"
initdb: invalid locale name "English_United Kingdom.99999"
initdb: invalid locale name "English_United Kingdom.99999"
initdb: invalid locale name "English_United Kingdom.99999"
initdb: invalid locale name "English_United Kingdom.99999"
initdb: invalid locale name "English_United Kingdom.99999"
The files belonging to this database system will be owned by user "Dave".
This user must also own the server process.

The database cluster will be initialized with locale English_United
Kingdom.1252
.
The default database encoding has accordingly been set to WIN1252.
===========

Shouldn't that have failed?

Regards, Dave


From: Dave Page <dpage(at)postgresql(dot)org>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Locale + encoding combinations
Date: 2007-10-10 12:58:03
Message-ID: 470CCC5B.6010706@postgresql.org

Tom Lane wrote:
> Dave Page <dpage(at)postgresql(dot)org> writes:
>> However, setlocale() will also accept other valid combinations on
>> Windows, which initdb will not, for example:
>> "English_United Kingdom.28591" (Latin1)
>> Is there any reason not to accept other combinations that setlocale() is
>> happy with?
>
> Are you certain that that acceptance actually represents support?
> Have you checked that it rejects combinations involving real code
> pages (ie, NOT 65001) that don't really work with the locale?

It fails with ones that Microsoft have decided don't belong in my
language group and therefore aren't installed. It accepts all the others
I've tried, but then from the sample I've looked at, they all have
0-9a-zA-Z in them so I guess they're all capable of handling English.

Regards, Dave.


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Dave Page <dpage(at)postgresql(dot)org>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Locale + encoding combinations
Date: 2007-10-10 13:46:56
Message-ID: 15261.1192024016@sss.pgh.pa.us

Dave Page <dpage(at)postgresql(dot)org> writes:
> Tom Lane wrote:
>> Are you certain that that acceptance actually represents support?
>> Have you checked that it rejects combinations involving real code
>> pages (ie, NOT 65001) that don't really work with the locale?

> It fails with ones that Microsoft have decided don't belong in my
> language group and therefore aren't installed. It accepts all the others
> I've tried, but then from the sample I've looked at, they all have
> 0-9a-zA-Z in them so I guess they're all capable of handling English.

That doesn't exactly fill me with confidence. Maybe you need to make
some tests involving a non-English base locale?

regards, tom lane


From: Dave Page <dpage(at)postgresql(dot)org>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Locale + encoding combinations
Date: 2007-10-10 14:08:36
Message-ID: 470CDCE4.4010009@postgresql.org

Tom Lane wrote:
> Dave Page <dpage(at)postgresql(dot)org> writes:
>> Tom Lane wrote:
>>> Are you certain that that acceptance actually represents support?
>>> Have you checked that it rejects combinations involving real code
>>> pages (ie, NOT 65001) that don't really work with the locale?
>
>> It fails with ones that Microsoft have decided don't belong in my
>> language group and therefore aren't installed. It accepts all the others
>> I've tried, but then from the sample I've looked at, they all have
>> 0-9a-zA-Z in them so I guess they're all capable of handling English.
>
> That doesn't exactly fill me with confidence. Maybe you need to make
> some tests involving a non-English base locale?

Hmm, I'm guessing these probably shouldn't work:

Dave(at)SNAKE:~$ setlc "Japanese_Japan.28605"
Japanese_Japan.28605
Dave(at)SNAKE:~$ setlc "Japanese_Japan.28595"
Japanese_Japan.28595
Dave(at)SNAKE:~$ setlc "Russian_Russia.1252"
Russian_Russia.1252
Dave(at)SNAKE:~$ setlc "Russian_Russia.28591"
Russian_Russia.28591

1252 == WIN1252
28591 == LATIN1
28605 == LATIN9
28595 == ISO8859-5 (Cyrillic)
28597 == ISO8859-7 (Greek)

In fact, it looks like it'll allow me to use anything that's installed,
regardless of whether they're likely to be compatible. So much for
trusting setlocale() :-(
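
One rough way to see why these combinations look wrong is to ask whether the code page can represent text in the language at all. A minimal sketch in Python, using Python's own codec tables rather than Windows code pages; the sample strings, the mapping to Python codec names, and the helper name can_encode are assumptions made for illustration, and this is not a check that setlocale() or initdb performs.

```python
# Encoding names as written above -> Python codec names (an assumed mapping).
PY_CODECS = {
    "WIN1252": "cp1252",
    "LATIN1": "latin_1",
    "LATIN9": "iso8859_15",
    "ISO8859-5": "iso8859_5",
    "ISO8859-7": "iso8859_7",
}


def can_encode(text, encoding_name):
    """True if every character of `text` exists in the named encoding."""
    try:
        text.encode(PY_CODECS[encoding_name])
        return True
    except UnicodeEncodeError:
        return False


if __name__ == "__main__":
    samples = [
        ("Привет", "WIN1252"),    # Russian text, Western European code page
        ("Привет", "ISO8859-5"),  # Russian text, Cyrillic code page
        ("日本語", "LATIN9"),      # Japanese text, Western European code page
    ]
    for text, enc in samples:
        print(text, enc, can_encode(text, enc))
```

A test along these lines only shows that a code page lacks the characters; it says nothing about how Windows behaves once setlocale() has accepted the name.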

/D


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Dave Page <dpage(at)postgresql(dot)org>
Cc: Peter Eisentraut <peter_e(at)gmx(dot)net>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Locale + encoding combinations
Date: 2007-10-10 14:09:53
Message-ID: 15707.1192025393@sss.pgh.pa.us

Dave Page <dpage(at)postgresql(dot)org> writes:
> OK so I added the appropriate entries (and posted the patch to
> -patches), but my original question remains: why can I only select the
> *default* encoding for the chosen locale, but not other ones that are
> also valid according to setlocale? Is this a bug, or is there some
> technical reason?

Well, the chklocale code is designed around the assumption that there
*is* only one encoding for which a locale setting will work, with
C/POSIX being a special case.

I think we are talking a bit at cross-purposes here, because the Windows
equivalent to this notion seems to be "English_United Kingdom.1252"
whereas you seem to be defining locale as just "English_United Kingdom".
Does it not work the way you want if you make the installer pass locale
strings of the first form to initdb?
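
The suggestion above amounts to having the installer join the base locale and the code page into one name and pass that as --locale, with no separate encoding switch. A minimal sketch in Python of building such a command line; initdb_command is a hypothetical helper, the --locale and -D switches are the ones shown earlier in this thread, and nothing here actually runs initdb.

```python
def initdb_command(base_locale, codepage, datadir):
    """Build an initdb argument list using the full "Base.codepage" locale name."""
    locale_name = f"{base_locale}.{codepage}"
    # The encoding is left for initdb to derive from the locale name.
    return ["initdb", "--locale", locale_name, "-D", datadir]


if __name__ == "__main__":
    print(initdb_command("English_United Kingdom", 28591, "data"))
```

Keeping the arguments as a list avoids quoting problems with the space in "English_United Kingdom" when the command is handed to a process launcher.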

regards, tom lane


From: Dave Page <dpage(at)postgresql(dot)org>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Peter Eisentraut <peter_e(at)gmx(dot)net>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Locale + encoding combinations
Date: 2007-10-10 14:15:50
Message-ID: 470CDE96.8060500@postgresql.org

Tom Lane wrote:
> Dave Page <dpage(at)postgresql(dot)org> writes:
>> OK so I added the appropriate entries (and posted the patch to
>> -patches), but my original question remains: why can I only select the
>> *default* encoding for the chosen locale, but not other ones that are
>> also valid according to setlocale? Is this a bug, or is there some
>> technical reason?
>
> Well, the chklocale code is designed around the assumption that there
> *is* only one encoding for which a locale setting will work, with
> C/POSIX being a special case.
>
> I think we are talking a bit at cross-purposes here, because the Windows
> equivalent to this notion seems to be "English_United Kingdom.1252"
> whereas you seem to be defining locale as just "English_United Kingdom".
> Does it not work the way you want if you make the installer pass locale
> strings of the first form to initdb?

Yes, it seems it does (see my previous email to Peter):
http://archives.postgresql.org/pgsql-hackers/2007-10/msg00447.php

So I guess that's how I'll fix the installer. There is another issue
though as I mentioned in the post above - that it complains about an
invalid encoding specifier on the encoding name, then ignores it and
uses the default which seems wrong to me.

/D


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Dave Page <dpage(at)postgresql(dot)org>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Locale + encoding combinations
Date: 2007-10-10 14:55:01
Message-ID: 16560.1192028101@sss.pgh.pa.us

Dave Page <dpage(at)postgresql(dot)org> writes:
> In fact, it looks like it'll allow me to use anything that's installed,
> regardless of whether they're likely to be compatible. So much for
> trusting setlocale() :-(

Yech :-(. Count on Microsloth to get this wrong. Anyone have any ideas
on how to tell if a locale setting *really* works on Windows?

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Dave Page <dpage(at)postgresql(dot)org>
Cc: Peter Eisentraut <peter_e(at)gmx(dot)net>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Locale + encoding combinations
Date: 2007-10-10 15:15:58
Message-ID: 16980.1192029358@sss.pgh.pa.us

Dave Page <dpage(at)postgresql(dot)org> writes:
> ... There is another issue
> though as I mentioned in the post above - that it complains about an
> invalid encoding specifier on the encoding name, then ignores it and
> uses the default which seems wrong to me.

Yeah, if you look at chklocale() in initdb.c this is clearly how it
works, but there's a comment
/* should we exit here? */
so whoever wrote it wasn't all that convinced it was the right behavior.

Given that 8.3 is raising the stakes for having a correct locale
specification at initdb time, it seems right to me to error out if a
bogus locale switch is given, rather than whining and then substituting
the environment default. Any objections?
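
The two behaviours being compared here, warn and substitute the default versus error out, can be sketched as follows. This is an illustration in Python, not initdb's code; choose_locale, its parameters and the injected is_valid check are all hypothetical.

```python
def choose_locale(requested, default, is_valid, strict=True):
    """Pick the locale to use for a requested name.

    strict=True  : reject a bogus name outright (the proposed behaviour).
    strict=False : complain, then fall back to the environment default
                   (the behaviour shown in the initdb output earlier).
    """
    if is_valid(requested):
        return requested
    if strict:
        raise ValueError(f'invalid locale name "{requested}"')
    print(f'warning: invalid locale name "{requested}", using "{default}"')
    return default


if __name__ == "__main__":
    known = {"English_United Kingdom.1252"}
    # Lenient mode silently ends up with a locale the user never asked for.
    print(choose_locale("English_United Kingdom.99999",
                        "English_United Kingdom.1252",
                        known.__contains__, strict=False))
```

In the lenient mode the caller gets a working cluster with the wrong locale, which is harder to notice than a failure at initdb time.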

That still leaves us with the problem of how to tell whether a locale
spec is bad on Windows. Judging by your example, Windows checks whether
the code page is present but not whether it is sane for the base locale.
What happens when there's a mismatch --- eg, what encoding do system
messages come out in?

regards, tom lane


From: Dave Page <dpage(at)postgresql(dot)org>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Peter Eisentraut <peter_e(at)gmx(dot)net>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Locale + encoding combinations
Date: 2007-10-12 11:53:18
Message-ID: 470F602E.9060800@postgresql.org

Tom Lane wrote:
> That still leaves us with the problem of how to tell whether a locale
> spec is bad on Windows. Judging by your example, Windows checks whether
> the code page is present but not whether it is sane for the base locale.
> What happens when there's a mismatch --- eg, what encoding do system
> messages come out in?

I'm not sure how to test that specifically, but it seems that accented
characters simply fall back to their undecorated equivalents if the
encoding is not appropriate, eg:

Dave(at)SNAKE:~$ ./setlc French_France.1252
Locale: French_France.1252
The date is: sam. 01 of août 2007
Dave(at)SNAKE:~$ ./setlc French_France.28597
Locale: French_France.28597
The date is: sam. 01 of aout 2007

(the encodings used there are WIN1252 and ISO8859-7 (Greek)).

I'm happy to test further if you can suggest how I can figure out the
encoding actually output.

Regards, Dave.


From: "Trevor Talbot" <quension(at)gmail(dot)com>
To: "Dave Page" <dpage(at)postgresql(dot)org>
Cc: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Peter Eisentraut" <peter_e(at)gmx(dot)net>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Locale + encoding combinations
Date: 2007-10-12 13:03:52
Message-ID: 90bce5730710120603t1d10b20ld689ef41b201026b@mail.gmail.com

On 10/12/07, Dave Page <dpage(at)postgresql(dot)org> wrote:
> Tom Lane wrote:
> > That still leaves us with the problem of how to tell whether a locale
> > spec is bad on Windows. Judging by your example, Windows checks whether
> > the code page is present but not whether it is sane for the base locale.
> > What happens when there's a mismatch --- eg, what encoding do system
> > messages come out in?
>
> I'm not sure how to test that specifically, but it seems that accented
> characters simply fall back to their undecorated equivalents if the
> encoding is not appropriate, eg:
>
> Dave(at)SNAKE:~$ ./setlc French_France.1252
> Locale: French_France.1252
> The date is: sam. 01 of août 2007
> Dave(at)SNAKE:~$ ./setlc French_France.28597
> Locale: French_France.28597
> The date is: sam. 01 of aout 2007
>
> (the encodings used there are WIN1252 and ISO8859-7 (Greek)).
>
> I'm happy to test further if you can suggest how I can figure out the
> encoding actually output.

The encoding output is the one you specified. Keep in mind,
underneath Windows is mostly working with Unicode, so all characters
exist and the locale rules specify their behavior there. The encoding
is just the byte stream it needs to force them all into after doing
whatever it does to them. As you've seen, it uses some sort of
best-fit mapping I don't know the details of. (It will drop accent
marks and choose characters with similar shape where possible, by
default.)
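
The effect seen in the French_France output (août becoming aout under the Greek code page) can be approximated in a few lines. This is a sketch in Python that imitates the visible result by decomposing characters and dropping their accent marks; it is an assumption-laden stand-in, not the best-fit tables Windows actually uses, and best_fit_encode is a hypothetical name.

```python
import unicodedata


def best_fit_encode(text, encoding):
    """Encode `text`, replacing unencodable characters by an unaccented form."""
    out = bytearray()
    for ch in text:
        try:
            out += ch.encode(encoding)
        except UnicodeEncodeError:
            # Decompose (e.g. "û" -> "u" + combining circumflex) and keep
            # only the base characters, dropping the combining marks.
            base = "".join(c for c in unicodedata.normalize("NFKD", ch)
                           if not unicodedata.combining(c))
            # Anything still unencodable becomes "?".
            out += base.encode(encoding, errors="replace")
    return bytes(out)


if __name__ == "__main__":
    print(best_fit_encode("août", "cp1252"))     # code page has û
    print(best_fit_encode("août", "iso8859_7"))  # Greek code page does not
```

The bytes produced are always in the requested encoding; only the choice of characters changes, which matches the observation that the output encoding is the one specified.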

I think it's a bit more complex for input/transform cases where you
operate on the byte stream directly without intermediate conversion to
Unicode, which is why UTF-8 doesn't work as a codepage, but again I
don't have the details nearby. I can try to do more digging if
needed.


From: Dave Page <dpage(at)postgresql(dot)org>
To: Trevor Talbot <quension(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Peter Eisentraut <peter_e(at)gmx(dot)net>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Locale + encoding combinations
Date: 2007-10-12 14:26:00
Message-ID: 470F83F8.5020503@postgresql.org

Trevor Talbot wrote:
> The encoding output is the one you specified.

OK.

> Keep in mind,
> underneath Windows is mostly working with Unicode, so all characters
> exist and the locale rules specify their behavior there. The encoding
> is just the byte stream it needs to force them all into after doing
> whatever it does to them. As you've seen, it uses some sort of
> best-fit mapping I don't know the details of. (It will drop accent
> marks and choose characters with similar shape where possible, by
> default.)

Right, that makes sense. The codepages used by setlocale etc. are just
translation tables to/from the internal Unicode representation.

> I think it's a bit more complex for input/transform cases where you
> operate on the byte stream directly without intermediate conversion to
> Unicode, which is why UTF-8 doesn't work as a codepage, but again I
> don't have the details nearby. I can try to do more digging if
> needed.

It does (sort of) work as a codepage, it just doesn't have the NLS file
to define how things like UPPER() and LOWER() should work.

Regards, Dave


From: Magnus Hagander <magnus(at)hagander(dot)net>
To: Trevor Talbot <quension(at)gmail(dot)com>
Cc: Dave Page <dpage(at)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Peter Eisentraut <peter_e(at)gmx(dot)net>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Locale + encoding combinations
Date: 2007-10-12 14:45:10
Message-ID: 20071012144510.GH6334@svr2.hagander.net

On Fri, Oct 12, 2007 at 06:03:52AM -0700, Trevor Talbot wrote:
> On 10/12/07, Dave Page <dpage(at)postgresql(dot)org> wrote:
> > Tom Lane wrote:
> > > That still leaves us with the problem of how to tell whether a locale
> > > spec is bad on Windows. Judging by your example, Windows checks whether
> > > the code page is present but not whether it is sane for the base locale.
> > > What happens when there's a mismatch --- eg, what encoding do system
> > > messages come out in?
> >
> > I'm not sure how to test that specifically, but it seems that accented
> > characters simply fall back to their undecorated equivalents if the
> > encoding is not appropriate, eg:
> >
> > Dave(at)SNAKE:~$ ./setlc French_France.1252
> > Locale: French_France.1252
> > The date is: sam. 01 of août 2007
> > Dave(at)SNAKE:~$ ./setlc French_France.28597
> > Locale: French_France.28597
> > The date is: sam. 01 of aout 2007
> >
> > (the encodings used there are WIN1252 and ISO8859-7 (Greek)).
> >
> > I'm happy to test further if you can suggest how I can figure out the
> > encoding actually output.
>
> The encoding output is the one you specified. Keep in mind,
> underneath Windows is mostly working with Unicode, so all characters
> exist and the locale rules specify their behavior there. The encoding
> is just the byte stream it needs to force them all into after doing
> whatever it does to them. As you've seen, it uses some sort of
> best-fit mapping I don't know the details of. (It will drop accent
> marks and choose characters with similar shape where possible, by
> default.)
>
> I think it's a bit more complex for input/transform cases where you
> operate on the byte stream directly without intermediate conversion to
> Unicode, which is why UTF-8 doesn't work as a codepage, but again I
> don't have the details nearby. I can try to do more digging if
> needed.

Just so the non-Windows-savvy people get it: when Windows documentation or
users refer to Unicode, they mean UTF-16.
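
For readers more used to UTF-8, the difference is easy to see by counting. A minimal sketch in Python, with a hypothetical helper name, showing the same text as UTF-16 code units and as UTF-8 bytes:

```python
def utf16_code_units(text):
    """Number of 16-bit code units needed to hold `text` in UTF-16."""
    # Little-endian variant without a byte order mark; two bytes per unit.
    return len(text.encode("utf-16-le")) // 2


if __name__ == "__main__":
    word = "août"
    print("UTF-16 code units:", utf16_code_units(word))
    print("UTF-8 bytes:", len(word.encode("utf-8")))
```

Characters outside the Basic Multilingual Plane take two code units (a surrogate pair) in UTF-16, so code units and characters are not always one to one.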

//Magnus