PostgreSQL 8.3.7: soundex function returns UTF-16 characters

Lists: pgsql-bugs
From: Frans <frans(at)geodan(dot)nl>
To: pgsql-bugs(at)postgresql(dot)org
Subject: PostgreSQL 8.3.7: soundex function returns UTF-16 characters
Date: 2009-04-06 15:07:38
Message-ID: 49DA1ABA.8090900@geodan.nl
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs

Hello,

We have just discovered a problem with the soundex function in
PostgreSQL 8.3.7. The problem is easy to reproduce. The following query
returns the ASCII code of the soundex representation of the Greek letter Pi:

select ascii (soundex('Π'));

In PostgreSQL 8.2.6 the result would be 0 (character null). In
PostgreSQL 8.3.7 the return value is 944, which is the UTF-16 code of
this letter. Because PostgreSQL does not support UTF-16, this behaviour
causes problems when the return values from the soundex function are
stored in a database column. For instance, creating an index on such a
column or doing VACUUM FULL on the table results in errors like:

ERROR: could not convert string to UTF-16

Regards,
Frans


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Frans <frans(at)geodan(dot)nl>
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: PostgreSQL 8.3.7: soundex function returns UTF-16 characters
Date: 2009-04-06 15:34:30
Message-ID: 6392.1239032070@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs

Frans <frans(at)geodan(dot)nl> writes:
> We have just discovered a problem with the soundex function in
> PostgreSQL 8.3.7. The problem is easy to reproduce. The following query
> returns the ASCII code of the soundex representation of the Greek letter Pi:

> select ascii (soundex(''));

> In PostgreSQL 8.2.6 the result would be 0 (character null). In
> PostgreSQL 8.3.7 the return value is 944, which is the UTF-16 code of
> this letter.

Hm, I take it you are working in database encoding utf8? The
fuzzystrmatch module doesn't really work with utf8 (nor any other
multibyte encoding), because it depends on the <ctype.h> functions.
What you'll probably get when applying it to non-ascii utf8 is
an invalidly encoded string.

This is a known limitation that probably should be better documented.
It was just as broken in 8.2 (and every previous version), though.

regards, tom lane


From: Frans <frans(at)geodan(dot)nl>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: PostgreSQL 8.3.7: soundex function returns UTF-16 characters
Date: 2009-04-06 16:10:44
Message-ID: 49DA2984.3070904@geodan.nl
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs

Tom Lane wrote:
> Frans <frans(at)geodan(dot)nl> writes:
>
>> We have just discovered a problem with the soundex function in
>> PostgreSQL 8.3.7. The problem is easy to reproduce. The following query
>> returns the ASCII code of the soundex representation of the Greek letter Pi:
>>
>
>
>> select ascii (soundex('Î '));
>>
>
>
>> In PostgreSQL 8.2.6 the result would be 0 (character null). In
>> PostgreSQL 8.3.7 the return value is 944, which is the UTF-16 code of
>> this letter.
>>
>
> Hm, I take it you are working in database encoding utf8?
That is correct. I should have mentioned it. It is the default encoding
we use because we often deal with non-English languages. And it is
because of multilingualism that the fuzzystrmatch functions are handy.
> The
> fuzzystrmatch module doesn't really work with utf8 (nor any other
> multibyte encoding), because it depends on the <ctype.h> functions.
> What you'll probably get when applying it to non-ascii utf8 is
> an invalidly encoded string.
>
Well, in 8.2.6 the result for non-ASCII UTF-8 was an empty string (ASCII
code 0). You could argue that this is a valid way of expressing that the
input string could not be processed (especially if it were documented).
The nice thing about this approach is that the result is valid ASCII
(and UTF-8).
> This is a known limitation that probably should be better documented.
> It was just as broken in 8.2 (and every previous version), though.
>
But it seems there has been a recent change in the handling of non-ASCII
strings. And the result of this change is that further handling or
storing of the function output has become more difficult.
> regards, tom lane
>

Best regards,
Frans


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Frans <frans(at)geodan(dot)nl>
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: PostgreSQL 8.3.7: soundex function returns UTF-16 characters
Date: 2009-04-06 17:18:48
Message-ID: 8241.1239038328@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs

Frans <frans(at)geodan(dot)nl> writes:
> Tom Lane wrote:
>> The
>> fuzzystrmatch module doesn't really work with utf8 (nor any other
>> multibyte encoding), because it depends on the <ctype.h> functions.
>> What you'll probably get when applying it to non-ascii utf8 is
>> an invalidly encoded string.
>>
> Well, in 8.2.6 the result for non-ASCII UTF-8 was an empty string (ASCII
> code 0).

A comparison of the 8.2 and 8.3 fuzzystrmatch sources shows no
difference. The behavior of the ascii() function has indeed changed,
but soundex() is no more nor less broken than it was before.

[ thinks for a bit... ] If you are seeing a difference in what soundex
itself does, the most likely explanation is a difference in the behavior
of isalpha() or perhaps toupper(). Are you running on the same
underlying C library as before? Are you quite sure you have the same
encoding and locale selected?

regards, tom lane


From: Frans <frans(at)geodan(dot)nl>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: PostgreSQL 8.3.7: soundex function returns UTF-16 characters
Date: 2009-04-07 10:23:38
Message-ID: 49DB29AA.5050502@geodan.nl
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs

Tom Lane wrote:
> Frans <frans(at)geodan(dot)nl> writes:
>
>> Tom Lane wrote:
>>
>>> The
>>> fuzzystrmatch module doesn't really work with utf8 (nor any other
>>> multibyte encoding), because it depends on the <ctype.h> functions.
>>> What you'll probably get when applying it to non-ascii utf8 is
>>> an invalidly encoded string.
>>>
>>>
>> Well, in 8.2.6 the result for non-ASCII UTF-8 was an empty string (ASCII
>> code 0).
>>
>
> A comparison of the 8.2 and 8.3 fuzzystrmatch sources shows no
> difference. The behavior of the ascii() function has indeed changed,
> but soundex() is no more nor less broken than it was before.
>
> [ thinks for a bit... ] If you are seeing a difference in what soundex
> itself does, the most likely explanation is a difference in the behavior
> of isalpha() or perhaps toupper(). Are you running on the same
> underlying C library as before? Are you quite sure you have the same
> encoding and locale selected?
>

Thank you for pointing me in the right direction. I have done some more
research now.. I have installed 8.2.13 and 8.3.7 on the same
workstation, selecting locale=C and character encoding=UTF-8 in both
cases. In both cases soundex() behaved as desired, i.e. it produces a
null string if it can not handle the input. It looks like the difference
in behaviour I noticed was not caused by the different PostgreSQL
versions after all, but by a different locale setting. I see (in
postgresql.ini) that for the database in which soundex() produces the
'wrong' output the locale apparently was set to 'Dutch_Netherlands'. I
can not recall consciously selecting this locale but it might have been
set by the installer. Does it make sense that the locale setting
influences the workings of the soundex function?

In the database where I noticed the undesired soundex() behaviour I did
a further test, using the bit_length() function. The query "(select
bit_length(soundex('?'))" returns a value of 0 where ascii() also
returns 0 but it returns a value of 32 in the other case (where ascii()
returns 944). So it seems soundex() really has a different output in
both cases.

I don't know now if this issue should still be regarded as a bug.. At
least it seems to me that the locale setting is also affecting the
soundex function should be documented.

> regards, tom lane
>


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Frans <frans(at)geodan(dot)nl>
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: PostgreSQL 8.3.7: soundex function returns UTF-16 characters
Date: 2009-04-07 14:06:21
Message-ID: 802.1239113181@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs

Frans <frans(at)geodan(dot)nl> writes:
> Does it make sense that the locale setting
> influences the workings of the soundex function?

Yeah, it absolutely would, because soundex depends on the C library's
isalpha() and toupper() functions, and those are influenced by locale.

It is clear from looking at the code that soundex isn't expecting
isalpha() to return true for anything except the ASCII letters A-Z,a-z.
That's true in the standard C locale but typically not true in others.
In your example with pi, I think the code would've indexed off the end
of its letter array and gotten unpredictable results. We could/should
tighten that up, I think, even if we're not willing to rewrite the
code for full multibyte support just yet.

regards, tom lane