Re: Win32 unicode vs ICU

Lists: pgsql-hackerspgsql-patches
From: "Magnus Hagander" <mha(at)sollentuna(dot)net>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: <pgsql-hackers(at)postgreSQL(dot)org>, "Palle Girgensohn" <girgen(at)pingpong(dot)net>
Subject: Re: Win32 unicode vs ICU
Date: 2005-08-23 10:28:35
Message-ID: 6BCB9D8A16AC4241919521715F4D8BCE6C78E3@algol.sollentuna.se
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

.. back home again after a couple of days ..

<snip>
> >> I am unsure of how to proceed. As I see it there are three paths:
> >> 1) Use native win32 functionality only on win32
> >> 2) Use ICU functionality only on win32
> >> 3) Allow both ICU and native functionality, compile time
> >> switch --with-icu (same as unix with the ICU patch)
<snip>

> I feel it makes sense to apply the smaller patch in any case,
> so that there's a Win32 solution not requiring ICU (ie, I
> can't see an argument for doing (2) rather than (3)).
>
> Comments?

Sounds reasonable to me - couldn't really find a reasonable argument for
(2), but it was an option :-)

Though as it seems to be needed on FreeBSD as well, we should definitly
look at (3) as a long-term option. Considering Palle has been running it
in production for quite a while now IIRC, the ICU part should be fairly
stable. but see below...

> > And anohter question - my native patch touches the same
> functions as
> > the ICU patch. Can somebody who knows the internals confirm or deny
> > that these are all the required locations, or do we need to modify
> > more?
>
> There is a strxfrm() call in
> src/backend/utils/adt/selfuncs.c, which probably needs to be
> looked at too.

Ok. Will look into that. Do you have a hint as to how to test that?
Considering I've been unable to show incorrect function without donig
it, and apparantly so has Palle (with the ICU patch) running it in
production for a long time, I clearly don't know how to provoke a
failure with the current code.

Which brings up another point - there are clearly no regression tests
for this (considering we missed the unicode stuff early in the 8.0
cycle). Is there a reasonable way to add something along this line that
could be used, or will that be too complex? (I guess we can never test
completely as it depends on the OS locale handling, but some?) I'm
thinking it might be too complex as we'd need a new initdb with a
different encoding, but perhaps it's worth it anyway?

//Magnus


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Magnus Hagander" <mha(at)sollentuna(dot)net>
Cc: pgsql-hackers(at)postgreSQL(dot)org, "Palle Girgensohn" <girgen(at)pingpong(dot)net>
Subject: Re: Win32 unicode vs ICU
Date: 2005-08-23 13:48:25
Message-ID: 1890.1124804905@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

"Magnus Hagander" <mha(at)sollentuna(dot)net> writes:
>> There is a strxfrm() call in
>> src/backend/utils/adt/selfuncs.c, which probably needs to be
>> looked at too.

> Ok. Will look into that. Do you have a hint as to how to test that?

Any problems would manifest as a bogus interpolation between histogram
elements for a scalar-inequality selectivity estimate in a text column.
For instance, if you insert all 676 2-letter combinations AA, AB, AC,
..., ZY, ZZ into a text column, ANALYZE, and then try cases like
"EXPLAIN SELECT * FROM tab WHERE col < 'QW'", ideally the row estimate
should be pretty nearly dead on. Being pure-ASCII this test would
probably still work in a broken Unicode context, but if you did a
similar experiment with 26 non-ASCII characters it would be likely to
come out with silly results. You could increase the obviousness of the
bad result by reducing the statistics target, since the silliness will
be bounded by the histogram bin size.

(Just looking at it again, the code in convert_string_to_scalar is
pretty bogus for multibyte encodings in any case. Possibly we need to
rethink the whole approach.)

> Which brings up another point - there are clearly no regression tests
> for this (considering we missed the unicode stuff early in the 8.0
> cycle).

src/test/locale? src/test/mb? I've never used either, but they're
there ...

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Magnus Hagander" <mha(at)sollentuna(dot)net>
Cc: pgsql-hackers(at)postgresql(dot)org, "Palle Girgensohn" <girgen(at)pingpong(dot)net>
Subject: Re: Win32 unicode vs ICU
Date: 2005-08-23 16:03:44
Message-ID: 3044.1124813024@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

I wrote:
> (Just looking at it again, the code in convert_string_to_scalar is
> pretty bogus for multibyte encodings in any case. Possibly we need to
> rethink the whole approach.)

After studying this some more, I think the code is really so bogus for
any non-ASCII situation that it's probably not worth worrying about
too much. It's effectively assuming that the output of strxfrm() is
still in an ASCII-superset encoding ... but I don't see anything in
strxfrm's API that guarantees any such thing.

As long as strxfrm() doesn't fail completely for Windows Unicode,
I'd recommend just leaving this alone. As previously noted, the
worst that can happen is an estimation error that's bounded by the
histogram bin size anyhow.

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Magnus Hagander" <mha(at)sollentuna(dot)net>
Cc: pgsql-patches(at)postgreSQL(dot)org, "Palle Girgensohn" <girgen(at)pingpong(dot)net>
Subject: Re: Win32 unicode vs ICU
Date: 2005-08-23 17:16:10
Message-ID: 3649.1124817370@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

I looked over the proposed patch a bit and found some problems --- in
particular, if I read M$'s documentation about MultiByteToWideChar
correctly, they chose an API that fails for zero-length input, and
so you gotta program around that. Also, varstr_cmp() cannot assume
it gets null-terminated input.

I cannot test the attached revised patch; please check it out.

regards, tom lane

Attachment Content-Type Size
unknown_filename text/plain 5.9 KB

From: "Kevin McArthur" <postgresql-list(at)stormtide(dot)ca>
To: <pgsql-hackers(at)postgresql(dot)org>
Subject: FreeBSD ICU was Win32 unicode vs ICU
Date: 2005-08-24 16:58:59
Message-ID: 004601c5a8cd$1f9bbf70$0701a8c0@kdesktop
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

I was reviewing this thread about its lack of collation support in freebsd.

As some of you may or may not know the PHP project is also currently working
heavily on unicode support. (For PHP6)

I had the chance to ask Andrei Zmievski of the php project about their
support for unicode. The key items are as follows.

<StormTide> with the new unicode support, is there any support for unicode
collation
<andrei> StormTide, there will be
<StormTide> is it imported by the platform or custom done for php
<StormTide> (cuz freebsd seeems to have issues with its collation support)
<andrei> StormTide, not OS-dependent
<andrei> StormTide, uses CLDR

Should the postgresql project also be looking at CLDR for cross-platform
unicode support?

http://www.unicode.org/cldr/

Kevin McArthur
Digifonica Canada

----- Original Message -----
From: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Magnus Hagander" <mha(at)sollentuna(dot)net>
Cc: <pgsql-hackers(at)postgresql(dot)org>; "Palle Girgensohn" <girgen(at)pingpong(dot)net>
Sent: Tuesday, August 23, 2005 9:03 AM
Subject: Re: [HACKERS] Win32 unicode vs ICU

>I wrote:
>> (Just looking at it again, the code in convert_string_to_scalar is
>> pretty bogus for multibyte encodings in any case. Possibly we need to
>> rethink the whole approach.)
>
> After studying this some more, I think the code is really so bogus for
> any non-ASCII situation that it's probably not worth worrying about
> too much. It's effectively assuming that the output of strxfrm() is
> still in an ASCII-superset encoding ... but I don't see anything in
> strxfrm's API that guarantees any such thing.
>
> As long as strxfrm() doesn't fail completely for Windows Unicode,
> I'd recommend just leaving this alone. As previously noted, the
> worst that can happen is an estimation error that's bounded by the
> histogram bin size anyhow.
>
> regards, tom lane
>
> ---------------------------(end of broadcast)---------------------------
> TIP 5: don't forget to increase your free space map settings
>