dmetaphone woes

Lists: pgsql-hackers
From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: dmetaphone woes
Date: 2010-04-05 01:42:23
Message-ID: 4BB93FFF.6000504@dunslane.net


While testing pgindent the other day, I found some infelicities in
contrib/fuzzystrmatch/dmetaphone.c. From pgindent's point of view, the
problem is that the code contains two characters in case labels with the
high bits set, and this blows pgindent up on my Linux box if the locale
happens to be en_US.utf8 instead of C. Now, we can fix that easily enough
by replacing those characters with the equivalent hexadecimal escapes.
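
As a rough sketch of that kind of fix (is_special_latin1() is a made-up helper for illustration, not the actual dmetaphone.c code), the case labels can spell the two bytes as hex escapes so that the source file itself stays pure ASCII:

```c
/*
 * Illustrative only: is_special_latin1() is a hypothetical helper,
 * not a function from contrib/fuzzystrmatch.
 *
 * The labels use hex escapes instead of literal high-bit characters,
 * so tools that read the file under a UTF-8 locale see only ASCII.
 */
static int
is_special_latin1(char c)
{
	switch (c)
	{
		case '\xC7':			/* Latin1 byte for C with cedilla */
		case '\xD1':			/* Latin1 byte for N with tilde */
			return 1;
		default:
			return 0;
	}
}
```

The octal spellings of the same two bytes would be '\307' and '\321'. Either form changes only how the source is written; the compiled comparison is still against single Latin1 bytes.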

However, that doesn't solve the fundamental problem, which is that the
code in question is pretty much broken for any encoding but Latin1. (In
my defence I plead that when I created the module, by porting code from
a perl module, I was working with pure ASCII data and was much more
ignorant than I am now about encoding issues.) The rest of the code
deals in pure ASCII characters, and so it should be safe, I think.

I'm not exactly sure why the algorithm treats these two characters
(U+00C7 and U+00D1, C with a cedilla, and N with a tilde respectively)
specially.
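
To illustrate why a byte-wise test like this only works for Latin1, here is a minimal sketch (count_latin1_c_cedilla() is a hypothetical helper, not code from the module). In UTF-8, U+00C7 is encoded as the two bytes 0xC3 0x87, so a comparison against the single byte 0xC7 never matches:

```c
#include <stddef.h>

/*
 * Hypothetical helper for illustration: counts the bytes in s that equal
 * the Latin1 code for C with cedilla (0xC7).  It inspects one byte at a
 * time, which is the assumption that breaks for multibyte encodings.
 */
static size_t
count_latin1_c_cedilla(const char *s)
{
	size_t		n = 0;

	for (; *s != '\0'; s++)
	{
		if (*s == '\xC7')
			n++;
	}
	return n;
}
```

Called on the Latin1 string "GAR\xC7ON" this finds one match; called on the UTF-8 form "GAR\xC3\x87ON" it finds none, so the special case is silently skipped.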

The code has been there for some time, and nobody has bitched about it
that I know of, so I'm not in a hurry to fix it, unless people think we
should do that before 9.0. Making the code properly encoding-aware would
probably involve a non-trivial amount of surgery. If not, I'm inclined
to fix the issue that affects pgindent, and leave the rest as a TODO
item for 9.1.

Thoughts?

cheers

andrew


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: dmetaphone woes
Date: 2010-04-05 02:04:21
Message-ID: 422.1270433061@sss.pgh.pa.us

Andrew Dunstan <andrew(at)dunslane(dot)net> writes:
> While testing pgindent the other day, I found some infelicities in
> contrib/fuzzystrmatch/dmetaphone.c. From pgindent's point of view, the
> problem is that the code contains two characters in case labels with the
> high bits set, and this blows pgindent up on my Linux box if the locale
> happens be en_US.utf8 instead of C.

Not only pgindent ...
http://archives.postgresql.org/pgsql-hackers/2008-10/msg00308.php

> However, that doesn't solve the fundamental problem, which is that the
> code in question is pretty much broken for any encoding but Latin1.

Yeah. I don't see an easy fix for it either, but there should be a
TODO entry about it. In the meantime I'm surprised we didn't insert
octal escapes already.

regards, tom lane


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: dmetaphone woes
Date: 2010-04-05 02:56:33
Message-ID: 4BB95161.2090602@dunslane.net

Tom Lane wrote:
>> However, that doesn't solve the fundamental problem, which is that the
>> code in question is pretty much broken for any encoding but Latin1.
>
> Yeah. I don't see an easy fix for it either, but there should be a
> TODO entry about it. In the meantime I'm surprised we didn't insert
> octal escapes already.

Escapes done, TODO added.

cheers

andrew