Quick Links

Re: multibyte charater set in levenshtein function

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Itagaki Takahiro <itagaki(dot)takahiro(at)gmail(dot)com>
Cc:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: multibyte charater set in levenshtein function
Date:	2010-07-21 01:54:57
Message-ID:	AANLkTikk-kbbKT=JHRrstSZTpUuei9JHTj7-5FAFs8nQ@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

On Tue, Jul 20, 2010 at 3:37 AM, Itagaki Takahiro
<itagaki(dot)takahiro(at)gmail(dot)com> wrote:
> 2010/7/13 Alexander Korotkov <aekorotkov(at)gmail(dot)com>:
>> Anyway I think that overhead is not ignorable. That's why I have splited
>> levenshtein_internal into levenshtein_internal and levenshtein_internal_mb,
>> and levenshtein_less_equal_internal into levenshtein_less_equal_internal and
>> levenshtein_less_equal_internal_mb.
>
> Thank you for good measurement! Then, it's reasonable to have multiple
> implementations. It also has documentation. I'll change status of the
> patch to "Ready for Committer".
>
> The patch is good enough except argument types for some functions.
> For example:
> - char* vs. const char*
> - text* vs. const char* + length
> I hope committers would check whether there are better types for them.

This patch still needs some work. It includes a bunch of stylistic
changes that aren't relevant to the purpose of the patch. There's no
reason that I can see to change the existing levenshtein_internal
function to take text arguments instead of char *, or to change !m to
m == 0 in existing code, or to change the whitespace in the comments
of that function. All of those need to be reverted before we can
consider committing this.

There is a huge amount of duplicated code here. I think it makes
sense to have the multibyte version of the function be separate, but
perhaps we can merge the less-than-or-equal to bits into the main
code, so that we only have two copies instead of four. Perhaps we
can't just add a max_d argument max_distance to levenshtein_internal;
and if this value is >=0 then it represents the max allowable
distance, but if it is <0 then there is no limit. Sure, that might
slow down the existing code a bit, but it might not be significant.
I'd at least like to see some numbers showing that it is significant
before we go to this much trouble.

The code doesn't follow the project coding style. Braces must be
uncuddled. Comment paragraphs will be reflowed unless they begin and
end with ------. Function definitions should have the type
declaration on one line and the function name at the start of the
next.

Freeing memory with pfree is likely a waste of time; is there any
reason not to just rely on the memory context reset, as the original
coding did?

I think we might need to remove the acknowledgments section from this
code. If everyone who touches this code adds their name, we're
quickly going to have a mess. If we're not going to remove the
acknowledgments section, then please add my name, too, because I've
already patched this code once...

I'm going to set this back to "Waiting on Author".

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

In response to

Re: multibyte charater set in levenshtein function at 2010-07-20 07:37:12 from Itagaki Takahiro

Responses

Re: multibyte charater set in levenshtein function at 2010-07-21 11:40:33 from Alexander Korotkov

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Amber	2010-07-21 02:52:04	pg_config problem on Solaris 10u7 X64
Previous Message	Robert Haas	2010-07-21 01:00:00	Re: Query optimization problem