Re: multibyte charater set in levenshtein function

From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Itagaki Takahiro <itagaki(dot)takahiro(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: multibyte charater set in levenshtein function
Date: 2010-07-21 11:40:33
Message-ID: AANLkTiltvmu_iNzyl5ZHBKrdSgVA2yNoDpdgZfy9wOxA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Jul 21, 2010 at 5:54 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

> This patch still needs some work. It includes a bunch of stylistic
> changes that aren't relevant to the purpose of the patch. There's no
> reason that I can see to change the existing levenshtein_internal
> function to take text arguments instead of char *, or to change !m to
> m == 0 in existing code, or to change the whitespace in the comments
> of that function. All of those need to be reverted before we can
> consider committing this.
>
I changed arguments of function from char * to text * in order to avoid
text_to_cstring call. Same benefit can be achived by replacing char * with
char * and length.
I changed !m to m == 0 because Itagaki asked me to make it conforming coding
style. Do you think there is no reason to fix coding style in existing
code?

> There is a huge amount of duplicated code here. I think it makes
> sense to have the multibyte version of the function be separate, but
> perhaps we can merge the less-than-or-equal to bits into the main
> code, so that we only have two copies instead of four. Perhaps we
> can't just add a max_d argument max_distance to levenshtein_internal;
> and if this value is >=0 then it represents the max allowable
> distance, but if it is <0 then there is no limit. Sure, that might
> slow down the existing code a bit, but it might not be significant.
> I'd at least like to see some numbers showing that it is significant
> before we go to this much trouble.
>
In these case we should add many checks of max_d in levenshtein_internal
function which make code more complex.
Actually, we can merge all four functions into one function. But such
function will have many checks about multibyte encoding and max_d. So, I see
four cases here:
1) one function with checks for multibyte encoding and max_d
2) two functions with checks for multibyte encoding
3) two functions with checks for max_d
4) four separate functions
If you prefer case number 3 you should argue your position little more.

> The code doesn't follow the project coding style. Braces must be
> uncuddled. Comment paragraphs will be reflowed unless they begin and
> end with ------. Function definitions should have the type
> declaration on one line and the function name at the start of the
> next.
>
> Freeing memory with pfree is likely a waste of time; is there any
> reason not to just rely on the memory context reset, as the original
> coding did?
>
Ok, I'll fix this things.

> I think we might need to remove the acknowledgments section from this
> code. If everyone who touches this code adds their name, we're
> quickly going to have a mess. If we're not going to remove the
> acknowledgments section, then please add my name, too, because I've
> already patched this code once...
>
In that case I think we can leave original acknowledgments section.

----
With best regards,
Alexander Korotkov.

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Abhijit Menon-Sen 2010-07-21 11:53:25 Re: managing git disk space usage
Previous Message Pavel Stehule 2010-07-21 11:39:15 Re: patch: to_string, to_array functions