Re: Locale + encoding combinations

From: Magnus Hagander <magnus(at)hagander(dot)net>
To: Trevor Talbot <quension(at)gmail(dot)com>
Cc: Dave Page <dpage(at)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Peter Eisentraut <peter_e(at)gmx(dot)net>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Locale + encoding combinations
Date: 2007-10-12 14:45:10
Message-ID: 20071012144510.GH6334@svr2.hagander.net
Lists: pgsql-hackers

On Fri, Oct 12, 2007 at 06:03:52AM -0700, Trevor Talbot wrote:
> On 10/12/07, Dave Page <dpage(at)postgresql(dot)org> wrote:
> > Tom Lane wrote
> > > That still leaves us with the problem of how to tell whether a locale
> > > spec is bad on Windows. Judging by your example, Windows checks whether
> > > the code page is present but not whether it is sane for the base locale.
> > > What happens when there's a mismatch --- eg, what encoding do system
> > > messages come out in?
> >
> > I'm not sure how to test that specifically, but it seems that accented
> > characters simply fall back to their undecorated equivalents if the
> > encoding is not appropriate, eg:
> >
> > Dave(at)SNAKE:~$ ./setlc French_France.1252
> > Locale: French_France.1252
> > The date is: sam. 01 of août 2007
> > Dave(at)SNAKE:~$ ./setlc French_France.28597
> > Locale: French_France.28597
> > The date is: sam. 01 of aout 2007
> >
> > (the encodings used there are WIN1252 and ISO8859-7 (Greek)).
> >
> > I'm happy to test further if you can suggest how I can figure out the
> > encoding actually output.
>
> The encoding output is the one you specified. Keep in mind,
> underneath Windows is mostly working with Unicode, so all characters
> exist and the locale rules specify their behavior there. The encoding
> is just the byte stream it needs to force them all into after doing
> whatever it does to them. As you've seen, it uses some sort of
> best-fit mapping I don't know the details of. (It will drop accent
> marks and choose characters with similar shape where possible, by
> default.)
>
> I think it's a bit more complex for input/transform cases where you
> operate on the byte stream directly without intermediate conversion to
> Unicode, which is why UTF-8 doesn't work as a codepage, but again I
> don't have the details nearby. I can try to do more digging if
> needed.
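
The accent-dropping seen in Dave's output can be approximated outside
Windows. The following is a minimal sketch in Python, assuming that Unicode
decomposition followed by stripping combining marks is a reasonable stand-in
for the Windows best-fit tables. It is not the actual Windows algorithm, and
best_fit_encode is a hypothetical helper name used only for illustration.

```python
import unicodedata


def best_fit_encode(text: str, encoding: str) -> bytes:
    """Encode text, falling back to unaccented base letters where needed.

    This only imitates the effect seen in the thread; the real Windows
    best-fit mapping tables are not reproduced here.
    """
    out = bytearray()
    for ch in text:
        try:
            # Character exists in the target encoding: emit it unchanged.
            out += ch.encode(encoding)
            continue
        except UnicodeEncodeError:
            pass
        # Decompose (e.g. U+00FB -> 'u' + U+0302) and drop combining marks.
        base = "".join(
            c for c in unicodedata.normalize("NFKD", ch)
            if not unicodedata.combining(c)
        )
        # Anything still unrepresentable becomes '?'.
        out += base.encode(encoding, errors="replace")
    return bytes(out)


print(best_fit_encode("août", "cp1252"))     # û exists in Windows-1252
print(best_fit_encode("août", "iso8859_7"))  # û does not exist in ISO 8859-7
```

The first call keeps the accented letter as a single byte; the second falls
back to a plain "u", matching the "aout" in the French_France.28597 run.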

Just so the non-Windows-savvy people get it: when Windows documentation or
users refer to Unicode, they mean UTF-16.
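
To make that concrete, here is a small sketch (Python is used only for
illustration; nothing in this thread depends on it) that encodes the month
name from Dave's example both as UTF-16LE, the layout used by Windows
wide-character strings, and as the single-byte WIN1252 code page.

```python
text = "août"

utf16 = text.encode("utf-16-le")  # two bytes per BMP character, no BOM
cp1252 = text.encode("cp1252")    # one byte per character

# Character count, then byte counts in each encoding.
print(len(text), len(utf16), len(cp1252))

# Hex dumps make the difference in the byte streams visible.
print(utf16.hex(" "))
print(cp1252.hex(" "))
```

The same four characters occupy eight bytes in UTF-16 and four in WIN1252,
which is the "force them into a byte stream" step Trevor describes.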

//Magnus
