Re: Continuing encoding fun....

Lists:	pgsql-odbc

From:	"Dave Page" <dpage(at)vale-housing(dot)co(dot)uk>
To:	<pgsql-odbc(at)postgresql(dot)org>
Cc:	"Hiroshi Saito" <saito(at)inetrt(dot)skcapi(dot)co(dot)jp>, "Marko Ristola" <Marko(dot)Ristola(at)kolumbus(dot)fi>, "Johann Zuschlag" <zuschlag2(at)online(dot)de>
Subject:	Continuing encoding fun....
Date:	2005-09-03 19:47:38
Message-ID:	E7F85A1B5FF8D44C8A1AF6885BC9A0E4AC9E4A@ratbert.vale-housing.co.uk

I've been thinking about this whilst getting dragged round the shops
today, and having read Marko's, Johann's, Hiroshi's and other emails,
not to mention bits of the ODBC spec, here's where I think we stand.

1) The current driver works as expected with Unicode apps.

2) 7 bit ASCII apps work correctly. The driver manager maps the ANSI
functions to the Unicode ones, and this works because (as I think Marko
pointed out) the basic Latin chars map directly into the lower Unicode
characters (see http://www.unicode.org/charts/PDF/U0000.pdf).

3) Some other single byte LATIN encodings do not work. This is because
the characters do not map directly into Unicode 80-FF
(http://www.unicode.org/charts/PDF/U0080.pdf).

4) Multibyte apps do not work. I believe that in fact they never will
with a Unicode driver, because multibyte characters simply won't map
into Unicode in the same way that ASCII does. The user cannot opt to use
the non-wide functions, because the DM automatically maps them to the
Unicode versions.
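
To make points 2 and 3 concrete, here is a minimal sketch of what happens when single-byte text is widened by plain zero-extension rather than by a real code page conversion. ISO-8859-15 is chosen here only as an illustration of a single-byte Latin encoding; the function names are hypothetical and are not taken from the driver or from any driver manager.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Widen one byte to a code point by plain zero-extension (a "static cast"). */
uint32_t widen_by_cast(unsigned char byte)
{
    return (uint32_t)byte;
}

/* Decode one ISO-8859-15 byte to its real Unicode code point.
 * Only the eight positions where ISO-8859-15 differs from ISO-8859-1
 * need special handling; every other byte equals its code point. */
uint32_t decode_latin9(unsigned char byte)
{
    switch (byte) {
    case 0xA4: return 0x20AC; /* EURO SIGN */
    case 0xA6: return 0x0160; /* S WITH CARON */
    case 0xA8: return 0x0161; /* s with caron */
    case 0xB4: return 0x017D; /* Z WITH CARON */
    case 0xB8: return 0x017E; /* z with caron */
    case 0xBC: return 0x0152; /* OE LIGATURE */
    case 0xBD: return 0x0153; /* oe ligature */
    case 0xBE: return 0x0178; /* Y WITH DIAERESIS */
    default:   return (uint32_t)byte;
    }
}

/* Count the bytes in s[0..n) that zero-extension maps to the wrong character. */
size_t count_mismatches(const unsigned char *s, size_t n)
{
    size_t bad = 0;
    assert(s != NULL);
    for (size_t i = 0; i < n; i++)
        if (widen_by_cast(s[i]) != decode_latin9(s[i]))
            bad++;
    return bad;
}
```

For pure ASCII input count_mismatches() returns 0, which matches point 2. A byte such as 0xA4 is widened to U+00A4 (CURRENCY SIGN) instead of U+20AC (EURO SIGN), which is the kind of breakage described in point 3.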

Because the Driver Manager forces the user to use the *W functions if
they exist, I cannot see any way to make 3 or 4 work with a Unicode
driver. If we were to try to detect what encoding to use based on the OS
settings and convert on the fly, we would most likely break any apps
that try to do the right thing by using Unicode themselves. Does that
sound reasonable?

Therefore, it seems to me that the only thing to do is to reinstate the
#ifdef UNICODE preprocessor definitions in the source code (that I now
wish I hadn't removed!), and ship 2 versions of the driver - a Unicode
one, and an ANSI/Multibyte version (i.e. what 07.xx was).

Thoughts/comments? Hiroshi, what do other vendors do for the Japanese
market?

Regards, Dave.

From:	Marc Herbert <Marc(dot)Herbert(at)emicnetworks(dot)com>
To:	pgsql-odbc(at)postgresql(dot)org
Subject:	Re: Continuing encoding fun....
Date:	2005-09-07 18:16:03
Message-ID:	87acio7nh8.fsf@meije.emic.fr
Lists:	pgsql-odbc

"Dave Page" <dpage(at)vale-housing(dot)co(dot)uk> writes:

> I've been thinking about this whilst getting dragged round the shops
> today, and having read Marko's, Johann's, Hiroshi's and other emails,
> not to mention bits of the ODBC spec, here's where I think we stand.
>
> 1) The current driver works as expected with Unicode apps.
>
> 2) 7 bit ASCII apps work correctly. The driver manager maps the ANSI
> functions to the Unicode ones, and because (as I think Marko pointed
> out) the basic latin chars map directly into the lower Unicode
> characters (see http://www.unicode.org/charts/PDF/U0000.pdf).
>
> 3) Some other single byte LATIN encodings do not work. This is because
> the characters do not map directly into Unicode 80-FF
> (http://www.unicode.org/charts/PDF/U0080.pdf).
>
> 4) Multibyte apps do not work. I believe that in fact they never will
> with a Unicode driver, because multibyte characters simply won't map
> into Unicode in the same way that ASCII does. The user cannot opt to use
> the non-wide functions, because the DM automatically maps them to the
> Unicode versions.
>
> Because the Driver Manager forces the user to use the *W functions if
> they exist, I cannot see any way to make 3 or 4 work with a Unicode
> driver. If we were to try to detect what encoding to use based on the OS
> settings and convert on the fly, we would most likely break any apps
> that try to do the right thing by using Unicode themselves.

In a perfect world there are no "unicode apps", the internal encoding
is set by the system, properly written apps use abstract TCHAR/wchar_t
characters without knowing anything about what encoding they use, and
programs communicating with the outside (such as a database driver)
should query the system encoding using something like "setlocale()"
and perform any appropriate conversion on the fly.
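
A minimal sketch of that idea on a POSIX system, assuming <langinfo.h> is available (it is not on Windows). The helper names system_codeset() and to_wide() are hypothetical; only setlocale(), nl_langinfo() and mbstowcs() are standard calls.

```c
#include <assert.h>
#include <langinfo.h>
#include <locale.h>
#include <stdlib.h>
#include <wchar.h>

/* Name of the character set selected by the current LC_CTYPE locale. */
const char *system_codeset(void)
{
    return nl_langinfo(CODESET);
}

/* Convert a string in the locale's encoding to wide characters.
 * Returns the number of wide characters written (terminator excluded),
 * or (size_t)-1 if the input is not valid in the current locale. */
size_t to_wide(const char *src, wchar_t *dst, size_t dst_len)
{
    assert(src != NULL && dst != NULL && dst_len > 0);
    size_t n = mbstowcs(dst, src, dst_len - 1);
    if (n != (size_t)-1)
        dst[n] = L'\0';   /* mbstowcs does not terminate when it fills the buffer */
    return n;
}
```

A program would call setlocale(LC_CTYPE, "") once at startup so that both helpers follow the user's environment. Without that call the process stays in the "C" locale, where only ASCII input is guaranteed to convert.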

Excerpt from "info libc - Character Set Handling" of GNU libc 2.3.2

<http://www.gnu.org/software/libc/manual/html_node/Character-Set-Handling.html>

The question remaining is: how to select the character set or
encoding to use. The answer: you cannot decide about it yourself,
it is decided by the developers of the system or the majority of the
users. Since the goal is interoperability one has to use whatever
the other people one works with use.

<http://www.faqs.org/docs/Linux-HOWTO/Unicode-HOWTO.html#s6>
says the same thing:

"Avoid direct access with Unicode. This is a task of the platform's
internationalization framework."

Of course those two quotes are targeted at application
developers. They imply that a driver communicating with the outside
world/database should carry out any conversion task.

However, I have no idea how far this theory is from reality, from
the ODBC API, and from Windows, sorry :-( I was just woken up by
the "unicode apps" phrase. I tried to follow the discussions here but
got lost.

My 2 cents.

From:	Marko Ristola <Marko(dot)Ristola(at)kolumbus(dot)fi>
To:
Cc:	pgsql-odbc(at)postgresql(dot)org
Subject:	Re: Continuing encoding fun....
Date:	2005-09-08 17:38:38
Message-ID:	4320771E.4070400@kolumbus.fi
Lists:	pgsql-odbc

There is one thing that might be good for you to know:

I tried
wprintf("%s",char_text) and printf("%ls",wchar_text) methods.
They don't work with LATIN1 under Linux.

gcc does not support NON-ASCII multibyte conversions.
gcc leaves that responsibility to library functions.

That is so even for GCC 4.0.

So, at least libiconv is a good way to handle the multibyte conversions
robustly under Linux. That works if and only if the libiconv library works.
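
A minimal sketch of such a conversion through the iconv interface, assuming the encoding names "ISO-8859-1" and "UTF-8" are accepted by the installed implementation (glibc's built-in iconv and GNU libiconv both expose the same three calls). The wrapper name latin1_to_utf8() is hypothetical.

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>

/* Convert in_len bytes of Latin-1 text to UTF-8.
 * Returns the number of bytes written to out, or (size_t)-1 on failure
 * (converter unavailable, or out_cap too small). */
size_t latin1_to_utf8(const char *in, size_t in_len, char *out, size_t out_cap)
{
    assert(in != NULL && out != NULL);

    iconv_t cd = iconv_open("UTF-8", "ISO-8859-1");   /* (to, from) */
    if (cd == (iconv_t)-1)
        return (size_t)-1;

    char *src = (char *)in;   /* iconv() takes char ** but only reads the input */
    char *dst = out;
    size_t src_left = in_len;
    size_t dst_left = out_cap;

    size_t rc = iconv(cd, &src, &src_left, &dst, &dst_left);
    iconv_close(cd);

    if (rc == (size_t)-1)
        return (size_t)-1;
    return out_cap - dst_left;
}
```

Unlike the printf/wprintf route, this does not depend on the process locale: the source and target encodings are named explicitly.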

libiconv is LGPL licensed.

Regards,
Marko Ristola

>However, I have no idea how this theory is far from reality, far from
>the ODBC API, and far from Windows, sorry :-( I just was woken up by
>the "unicode apps" word. I tried to follow the discussions here but
>got lost.
>
>

From:	Marc Herbert <Marc(dot)Herbert(at)emicnetworks(dot)com>
To:	pgsql-odbc(at)postgresql(dot)org
Subject:	Re: Continuing encoding fun....
Date:	2005-09-13 20:11:24
Message-ID:	87wtlk4tjn.fsf@mail.emicnetworks.com
Lists:	pgsql-odbc

Marko Ristola <Marko(dot)Ristola(at)kolumbus(dot)fi> writes:

> There is one thing, that might be good for you to know:
>
> I tried
> wprintf("%s",char_text) and printf("%ls",wchar_text) methods.
> They don't work with LATIN1 under Linux.

What do you mean by that? Could you post a short sample code?

Since wchar_t is 32 bits wide in glibc, wchar_text cannot be LATIN1,
which uses 8 bits per character...

> gcc does not support NON-ASCII multibyte conversions.

Well, I would find it weird for a compiler to perform such conversions.

From:	Marc Herbert <Marc(dot)Herbert(at)continuent(dot)com>
To:	pgsql-odbc(at)postgresql(dot)org
Subject:	Re: Continuing encoding fun....
Date:	2005-11-21 17:19:15
Message-ID:	87hda52a24.fsf@meije.emic.fr
Lists:	pgsql-odbc

"Dave Page" <dpage(at)vale-housing(dot)co(dot)uk> writes:

I agree that 4) can never work, because ODBC does not seem compatible
with multibyte apps by design. ODBC caters for "ANSI" and "Unicode"
strings, that's all.
<http://blogs.msdn.com/oldnewthing/archive/2004/05/31/144893.aspx>

However, I don't get why 3) does not work. From here:
<http://msdn.microsoft.com/library/default.asp?url=/library/en-us/odbc/htm/odbcunicode_function_arguments.asp>

If the driver is a Unicode driver, the Driver Manager makes function
calls as follows:
- Converts an ANSI function (with the A suffix) to a Unicode function
(with the W suffix) by converting the string arguments into Unicode
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
characters and passes the Unicode function to the driver.

Are you saying in 3) that the "converting" underlined above is
actually just a static cast?!

Is this "bug" true for every driver manager out there?