Re: multibyte character set in levenshtein function

From: Itagaki Takahiro <itagaki(dot)takahiro(at)gmail(dot)com>
To: Alexander Korotkov <aekorotkov(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: multibyte character set in levenshtein function
Date: 2010-07-12 04:46:25
Message-ID: AANLkTiksZIMvgthlIbu5S30OXA3Q2QnUNM0XUAr9ESQ_@mail.gmail.com

Hi, I'm reviewing the "Multibyte character set in levenshtein function" patch.
https://commitfest.postgresql.org/action/patch_view?id=304

The main logic seems to be good, but I have some comments about
the coding style and refactoring.

* levenshtein_internal() and levenshtein_less_equal_internal() are very
similar. Can you merge the code? We can always use less_equal_internal()
if the overhead is negligible. Did you compare them?

* There are many "if (!multibyte_encoding)" in levenshtein_internal().
How about splitting the function into two functions for single-byte and
multi-byte characters (e.g. levenshtein_internal_mb())? Or, we can always
use the multi-byte version if the overhead is small.
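For example, the split could be dispatched once on the database encoding
(a rough sketch, not taken from the patch):

    /* rough sketch: choose the variant once, based on the database encoding */
    if (pg_database_encoding_max_length() > 1)
        d = levenshtein_internal_mb(s, t, ins_c, del_c, sub_c);
    else
        d = levenshtein_internal(s, t, ins_c, del_c, sub_c);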

* I prefer a struct rather than an array. "4 * m" and "3 * m" look like magic
numbers to me. Could you name the entries by defining a struct?
/*
* For multibyte encoding we'll also store array of lengths of
* characters and array with character offsets in first string
* in order to avoid great number of pg_mblen calls.
*/
prev = (int *) palloc(4 * m * sizeof(int));

* There are some compiler warnings. Avoid them if possible.
fuzzystrmatch.c: In function ‘levenshtein_less_equal_internal’:
fuzzystrmatch.c:428: warning: ‘char_lens’ may be used uninitialized in
this function
fuzzystrmatch.c:428: warning: ‘offsets’ may be used uninitialized in
this function
fuzzystrmatch.c:430: warning: ‘curr_right’ may be used uninitialized
in this function
fuzzystrmatch.c: In function ‘levenshtein_internal’:
fuzzystrmatch.c:222: warning: ‘char_lens’ may be used uninitialized in
this function

* Coding style: Use "if (m == 0)" instead of "if (!m)" when the type
of 'm' is an integer.

* Need to fix the caution in docs.
http://developer.postgresql.org/pgdocs/postgres/fuzzystrmatch.html
| Caution: At present, fuzzystrmatch does not work well with
| multi-byte encodings (such as UTF-8).
but now levenshtein supports multi-byte encodings! We should mention
which functions support multibyte characters, so as not to confuse users.

* (Not an issue for the patch, but...)
Could you rewrite PG_GETARG_TEXT_P, VARDATA, and VARSIZE to
PG_GETARG_TEXT_PP, VARDATA_ANY, and VARSIZE_ANY_EXHDR?
The versions that avoid unpacking make the code a bit faster.
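(For illustration, the difference is roughly the following; a sketch with a
hypothetical text argument:)

    /* Before: PG_GETARG_TEXT_P detoasts and also forces an aligned copy */
    text       *s = PG_GETARG_TEXT_P(0);
    char       *data = VARDATA(s);
    int         len = VARSIZE(s) - VARHDRSZ;

    /* After: the _PP/_ANY forms also accept packed (short-header) values */
    text       *t = PG_GETARG_TEXT_PP(0);
    char       *tdata = VARDATA_ANY(t);
    int         tlen = VARSIZE_ANY_EXHDR(t);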

--
Itagaki Takahiro


From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Itagaki Takahiro <itagaki(dot)takahiro(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: multibyte character set in levenshtein function
Date: 2010-07-13 07:04:16
Message-ID: AANLkTilE4gCI4NnYhe3MxBurK9OiESYju8POczysrVcv@mail.gmail.com

Hi!

> * levenshtein_internal() and levenshtein_less_equal_internal() are very
> similar. Can you merge the code? We can always use less_equal_internal()
> if the overhead is negligible. Did you compare them?
>
With a big value of max_d the overhead is significant. Here is an example with
the american-english dictionary from OpenOffice.

test=# select sum(levenshtein('qweqweqweqweqwe',word)) from words;
sum
---------
1386456
(1 row)

Time: 195,083 ms
test=# select sum(levenshtein_less_equal('qweqweqweqweqwe',word,100)) from
words;
sum
---------
1386456
(1 row)

Time: 317,821 ms

> * There are many "if (!multibyte_encoding)" in levenshtein_internal().
> How about splitting the function into two functions for single-byte and
> multi-byte characters (e.g. levenshtein_internal_mb())? Or, we can always
> use the multi-byte version if the overhead is small.
>
The multi-byte version was about 4 times slower. But I have rewritten my
CHAR_CMP macro as an inline function, and now it's only about 1.5 times
slower.

In a database with a multi-byte encoding:
test=# select * from words where levenshtein('qweqweqwe',word)<=5;
id | word
-------+----------
69053 | peewee
69781 | pewee
81279 | sequence
88421 | sweetie
(4 rows)

Time: 136,742 ms

In a database with a single-byte encoding:
test2=# select * from words where levenshtein('qweqweqwe',word)<=5;
id | word
-------+----------
69053 | peewee
69781 | pewee
81279 | sequence
88421 | sweetie
(4 rows)

Time: 88,471 ms

Anyway, I think that overhead is not negligible. That's why I have split
levenshtein_internal into levenshtein_internal and levenshtein_internal_mb,
and levenshtein_less_equal_internal into levenshtein_less_equal_internal and
levenshtein_less_equal_internal_mb.

> * I prefer a struct rather than an array. "4 * m" and "3 * m" look like
> magic numbers to me. Could you name the entries by defining a struct?
> /*
> * For multibyte encoding we'll also store array of lengths of
> * characters and array with character offsets in first string
> * in order to avoid great number of pg_mblen calls.
> */
> prev = (int *) palloc(4 * m * sizeof(int));
>
In this line of code memory is allocated for 4 arrays: prev, curr, offsets,
and char_lens. So I have joined offsets and char_lens into a struct. But
I can't join prev and curr because of this trick:
temp = curr;
curr = prev;
prev = temp;
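(For reference, the combined struct, as it appears in the code later in this
thread, is presumably along these lines:)

    /* one entry per character of the first string */
    typedef struct
    {
        int     length;     /* byte length of the character, from pg_mblen() */
        int     offset;     /* byte offset of the character within the string */
    } CharLengthAndOffset;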

> * There are some compiler warnings. Avoid them if possible.
> fuzzystrmatch.c: In function ‘levenshtein_less_equal_internal’:
> fuzzystrmatch.c:428: warning: ‘char_lens’ may be used uninitialized in
> this function
> fuzzystrmatch.c:428: warning: ‘offsets’ may be used uninitialized in
> this function
> fuzzystrmatch.c:430: warning: ‘curr_right’ may be used uninitialized
> in this function
> fuzzystrmatch.c: In function ‘levenshtein_internal’:
> fuzzystrmatch.c:222: warning: ‘char_lens’ may be used uninitialized in
> this function
>
Fixed.

> * Coding style: Use "if (m == 0)" instead of "if (!m)" when the type
> of 'm' is an integer.
>
Fixed.

> * Need to fix the caution in docs.
> http://developer.postgresql.org/pgdocs/postgres/fuzzystrmatch.html
> | Caution: At present, fuzzystrmatch does not work well with
> | multi-byte encodings (such as UTF-8).
> but now levenshtein supports multi-byte encoding! We should
> mention which function supports mbchars not to confuse users.
>
I've updated this caution. Also I've added documentation for the
levenshtein_less_equal function.

> * (Not an issue for the patch, but...)
> Could you rewrite PG_GETARG_TEXT_P, VARDATA, and VARSIZE to
> PG_GETARG_TEXT_PP, VARDATA_ANY, and VARSIZE_ANY_EXHDR?
> The versions that avoid unpacking make the code a bit faster.
>
Fixed.

With best regards,
Alexander Korotkov.

Attachment Content-Type Size
fuzzystrmatch-0.4.diff.gz application/x-gzip 5.2 KB

From: Itagaki Takahiro <itagaki(dot)takahiro(at)gmail(dot)com>
To: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: multibyte character set in levenshtein function
Date: 2010-07-20 07:37:12
Message-ID: AANLkTimUH_YIs-T_KlEJzoGJuImxh74bBgJH6WVpCRS9@mail.gmail.com

2010/7/13 Alexander Korotkov <aekorotkov(at)gmail(dot)com>:
> Anyway, I think that overhead is not negligible. That's why I have split
> levenshtein_internal into levenshtein_internal and levenshtein_internal_mb,
> and levenshtein_less_equal_internal into levenshtein_less_equal_internal and
> levenshtein_less_equal_internal_mb.

Thank you for the good measurements! Then it's reasonable to have multiple
implementations. It also has documentation. I'll change the status of the
patch to "Ready for Committer".

The patch is good enough except for the argument types of some functions.
For example:
- char* vs. const char*
- text* vs. const char* + length
I hope committers would check whether there are better types for them.

--
Itagaki Takahiro


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Itagaki Takahiro <itagaki(dot)takahiro(at)gmail(dot)com>
Cc: Alexander Korotkov <aekorotkov(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: multibyte character set in levenshtein function
Date: 2010-07-21 01:54:57
Message-ID: AANLkTikk-kbbKT=JHRrstSZTpUuei9JHTj7-5FAFs8nQ@mail.gmail.com

On Tue, Jul 20, 2010 at 3:37 AM, Itagaki Takahiro
<itagaki(dot)takahiro(at)gmail(dot)com> wrote:
> 2010/7/13 Alexander Korotkov <aekorotkov(at)gmail(dot)com>:
>> Anyway, I think that overhead is not negligible. That's why I have split
>> levenshtein_internal into levenshtein_internal and levenshtein_internal_mb,
>> and levenshtein_less_equal_internal into levenshtein_less_equal_internal and
>> levenshtein_less_equal_internal_mb.
>
> Thank you for the good measurements! Then it's reasonable to have multiple
> implementations. It also has documentation. I'll change the status of the
> patch to "Ready for Committer".
>
> The patch is good enough except for the argument types of some functions.
> For example:
>  - char* vs. const char*
>  - text* vs. const char* + length
> I hope committers would check whether there are better types for them.

This patch still needs some work. It includes a bunch of stylistic
changes that aren't relevant to the purpose of the patch. There's no
reason that I can see to change the existing levenshtein_internal
function to take text arguments instead of char *, or to change !m to
m == 0 in existing code, or to change the whitespace in the comments
of that function. All of those need to be reverted before we can
consider committing this.

There is a huge amount of duplicated code here. I think it makes
sense to have the multibyte version of the function be separate, but
perhaps we can merge the less-than-or-equal-to bits into the main
code, so that we only have two copies instead of four. Perhaps we can
just add a max_distance argument to levenshtein_internal; if this
value is >= 0 then it represents the max allowable distance, but if it
is < 0 then there is no limit. Sure, that might slow down the existing
code a bit, but it might not be significant. I'd at least like to see
some numbers showing that it is significant before we go to this much
trouble.

The code doesn't follow the project coding style. Braces must be
uncuddled. Comment paragraphs will be reflowed unless they begin and
end with ------. Function definitions should have the type
declaration on one line and the function name at the start of the
next.
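(To illustrate those three rules with a trivial, hypothetical function:)

    /*------
     * pgindent will not reflow this comment paragraph, because it begins
     * and ends with the dash markers.
     *------
     */
    static int
    example_function(int x)
    {
        if (x == 0)
        {
            /* braces uncuddled: each brace on its own line */
            return 1;
        }
        return x;
    }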

Freeing memory with pfree is likely a waste of time; is there any
reason not to just rely on the memory context reset, as the original
coding did?

I think we might need to remove the acknowledgments section from this
code. If everyone who touches this code adds their name, we're
quickly going to have a mess. If we're not going to remove the
acknowledgments section, then please add my name, too, because I've
already patched this code once...

I'm going to set this back to "Waiting on Author".

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company


From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Itagaki Takahiro <itagaki(dot)takahiro(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: multibyte character set in levenshtein function
Date: 2010-07-21 11:40:33
Message-ID: AANLkTiltvmu_iNzyl5ZHBKrdSgVA2yNoDpdgZfy9wOxA@mail.gmail.com

On Wed, Jul 21, 2010 at 5:54 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

> This patch still needs some work. It includes a bunch of stylistic
> changes that aren't relevant to the purpose of the patch. There's no
> reason that I can see to change the existing levenshtein_internal
> function to take text arguments instead of char *, or to change !m to
> m == 0 in existing code, or to change the whitespace in the comments
> of that function. All of those need to be reverted before we can
> consider committing this.
>
I changed the arguments of the function from char * to text * in order to avoid
a text_to_cstring call. The same benefit can be achieved by replacing char *
with char * and a length.
I changed !m to m == 0 because Itagaki asked me to make it conform to the
coding style. Do you think there is no reason to fix the coding style in
existing code?

> There is a huge amount of duplicated code here. I think it makes
> sense to have the multibyte version of the function be separate, but
> perhaps we can merge the less-than-or-equal-to bits into the main
> code, so that we only have two copies instead of four. Perhaps we can
> just add a max_distance argument to levenshtein_internal; if this
> value is >= 0 then it represents the max allowable distance, but if it
> is < 0 then there is no limit. Sure, that might slow down the existing
> code a bit, but it might not be significant. I'd at least like to see
> some numbers showing that it is significant before we go to this much
> trouble.
>
In that case we should add many checks of max_d to the levenshtein_internal
function, which makes the code more complex.
Actually, we can merge all four functions into one function. But such a
function will have many checks for multibyte encoding and max_d. So I see
four cases here:
1) one function with checks for multibyte encoding and max_d
2) two functions with checks for multibyte encoding
3) two functions with checks for max_d
4) four separate functions
If you prefer case number 3, you should argue your position a little more.

> The code doesn't follow the project coding style. Braces must be
> uncuddled. Comment paragraphs will be reflowed unless they begin and
> end with ------. Function definitions should have the type
> declaration on one line and the function name at the start of the
> next.
>
> Freeing memory with pfree is likely a waste of time; is there any
> reason not to just rely on the memory context reset, as the original
> coding did?
>
Ok, I'll fix this things.

> I think we might need to remove the acknowledgments section from this
> code. If everyone who touches this code adds their name, we're
> quickly going to have a mess. If we're not going to remove the
> acknowledgments section, then please add my name, too, because I've
> already patched this code once...
>
In that case I think we can leave original acknowledgments section.

----
With best regards,
Alexander Korotkov.


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc: Itagaki Takahiro <itagaki(dot)takahiro(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: multibyte character set in levenshtein function
Date: 2010-07-21 18:25:47
Message-ID: AANLkTi=Jhb5Ke4ST1HRQjYfsDPQypbtEPQfDNxtGvC4P@mail.gmail.com

On Wed, Jul 21, 2010 at 7:40 AM, Alexander Korotkov
<aekorotkov(at)gmail(dot)com> wrote:
> On Wed, Jul 21, 2010 at 5:54 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>> This patch still needs some work.  It includes a bunch of stylistic
>> changes that aren't relevant to the purpose of the patch.  There's no
>> reason that I can see to change the existing levenshtein_internal
>> function to take text arguments instead of char *, or to change !m to
>> m == 0 in existing code, or to change the whitespace in the comments
>> of that function.  All of those need to be reverted before we can
>> consider committing this.
>
> I changed the arguments of the function from char * to text * in order to avoid
> a text_to_cstring call.

*scratches head* Aren't you just moving the same call to a different place?

> The same benefit can be achieved by replacing char *
> with char * and a length.
> I changed !m to m == 0 because Itagaki asked me to make it conform to the
> coding style. Do you think there is no reason to fix the coding style in
> existing code?

Yeah, we usually try to avoid changing that sort of thing in existing
code, unless there's a very good reason.

>> There is a huge amount of duplicated code here. I think it makes
>> sense to have the multibyte version of the function be separate, but
>> perhaps we can merge the less-than-or-equal-to bits into the main
>> code, so that we only have two copies instead of four. Perhaps we can
>> just add a max_distance argument to levenshtein_internal; if this
>> value is >= 0 then it represents the max allowable distance, but if it
>> is < 0 then there is no limit. Sure, that might slow down the existing
>> code a bit, but it might not be significant. I'd at least like to see
>> some numbers showing that it is significant before we go to this much
>> trouble.
>
> In that case we should add many checks of max_d to the levenshtein_internal
> function, which makes the code more complex.

When you say "many" checks, how many?

> Actually, we can merge all four functions into one function. But such a
> function will have many checks for multibyte encoding and max_d. So I see
> four cases here:
> 1) one function with checks for multibyte encoding and max_d
> 2) two functions with checks for multibyte encoding
> 3) two functions with checks for max_d
> 4) four separate functions
> If you prefer case number 3, you should argue your position a little more.

I'm somewhat convinced that separating the multibyte case out has a
performance benefit both by intuition and because you posted some
numbers, but I haven't seen any argument for separating out the other
case, so I'm asking if you've checked and whether there is an effect
and whether it's significant. The default is always to try to avoid
maintaining multiple copies of substantially identical code, due to
the danger that a future patch might fail to update all of them and
thus introduce a bug.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company


From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Itagaki Takahiro <itagaki(dot)takahiro(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: multibyte character set in levenshtein function
Date: 2010-07-21 18:47:21
Message-ID: AANLkTin8t2ZlF8AjAdQxa6-Wm9ZgiTinyfTEPkNwPgNt@mail.gmail.com

On Wed, Jul 21, 2010 at 10:25 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

> *scratches head* Aren't you just moving the same call to a different
> place?
>
So, where can you find this different place? :) In this patch
null-terminated strings are not used at all.

> Yeah, we usually try to avoid changing that sort of thing in existing
> code, unless there's a very good reason.
>
Ok.

> > In that case we should add many checks of max_d to the levenshtein_internal
> > function, which makes the code more complex.
>
> When you say "many" checks, how many?
>
> > Actually, we can merge all four functions into one function. But such a
> > function will have many checks for multibyte encoding and max_d. So I see
> > four cases here:
> > 1) one function with checks for multibyte encoding and max_d
> > 2) two functions with checks for multibyte encoding
> > 3) two functions with checks for max_d
> > 4) four separate functions
> > If you prefer case number 3, you should argue your position a little more.
>
> I'm somewhat convinced that separating the multibyte case out has a
> performance benefit both by intuition and because you posted some
> numbers, but I haven't seen any argument for separating out the other
> case, so I'm asking if you've checked and whether there is an effect
> and whether it's significant. The default is always to try to avoid
> maintaining multiple copies of substantially identical code, due to
> the danger that a future patch might fail to update all of them and
> thus introduce a bug.
>

I've tested it with a big value of max_d, and I thought it was evident that
checking for a negative value of max_d would not produce a significant
benefit. Anyway, I tried to add a check for negative max_d to the
levenshtein_less_equal_internal_mb function:

static int
levenshtein_less_equal_internal_mb(char *s, char *t, int s_len, int t_len,
                                   int ins_c, int del_c, int sub_c, int max_d)
{
    int         m,
                n;
    int        *prev;
    int        *curr;
    int         i,
                j;
    const char *x;
    const char *y;
    CharLengthAndOffset *lengths_and_offsets;
    int         y_char_len;
    int         curr_left,
                curr_right,
                prev_left,
                prev_right,
                d;
    int         delta,
                min_d;

    /*
     * We should calculate the number of characters for multibyte encodings.
     */
    m = pg_mbstrlen_with_len(s, s_len);
    n = pg_mbstrlen_with_len(t, t_len);

    /*
     * We can transform an empty s into t with n insertions, or a non-empty t
     * into an empty s with m deletions.
     */
    if (m == 0)
        return n * ins_c;
    if (n == 0)
        return m * del_c;

    /*
     * We can find the minimal possible distance from the difference of the
     * lengths.
     */
    delta = m - n;
    if (delta > 0)
        min_d = delta * del_c;
    else if (delta < 0)
        min_d = -delta * ins_c;
    else
        min_d = 0;
    if (max_d >= 0 && min_d > max_d)
        return max_d + 1;

    /*
     * For security concerns, restrict excessive CPU+RAM usage. (This
     * implementation uses O(m) memory and has O(mn) complexity.)
     */
    if (m > MAX_LEVENSHTEIN_STRLEN ||
        n > MAX_LEVENSHTEIN_STRLEN)
        ereport(ERROR,
                (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
                 errmsg("argument exceeds the maximum length of %d bytes",
                        MAX_LEVENSHTEIN_STRLEN)));

    /* One more cell for the initialization column and row. */
    ++m;
    ++n;

    /*
     * Instead of building an (m+1)x(n+1) array, we'll use two different
     * arrays of size m+1 for storing accumulated values. At each step one
     * represents the "previous" row and one is the "current" row of the
     * notional large array. For multibyte encodings we'll also store an
     * array of lengths of the characters and an array of character offsets
     * in the first string, in order to avoid a great number of pg_mblen
     * calls.
     */
    prev = (int *) palloc((2 * sizeof(int) + sizeof(CharLengthAndOffset)) * m);
    curr = prev + m;
    lengths_and_offsets = (CharLengthAndOffset *) (prev + 2 * m);
    lengths_and_offsets[0].offset = 0;
    for (i = 0, x = s; i < m - 1; i++)
    {
        lengths_and_offsets[i].length = pg_mblen(x);
        lengths_and_offsets[i + 1].offset = lengths_and_offsets[i].offset +
            lengths_and_offsets[i].length;
        x += lengths_and_offsets[i].length;
    }
    lengths_and_offsets[i].length = 0;

    /* Initialize the "previous" row to 0..cols */
    curr_left = 1;
    d = min_d;
    for (i = 0; i < delta; i++)
    {
        prev[i] = d;
    }
    curr_right = m;
    for (; i < m; i++)
    {
        prev[i] = d;
        d += (ins_c + del_c);
        if (max_d >= 0 && d > max_d)
        {
            curr_right = i;
            break;
        }
    }

    /*
     * The following optimizations are used:
     * 1) The cells of the matrix actually store the minimal possible value
     * of the final distance (in the case of all possible matches). When
     * moving away from the diagonal which contains the last cell, the value
     * should be increased by ins_c + del_c; when moving back towards this
     * diagonal, the cell value should not be increased.
     * 2) The range of indexes of the previous row where the stored values
     * are not greater than max_d is [prev_left, prev_right], so only the
     * cells connected with this range need to be calculated.
     * For multibyte encodings we should increase the x and y pointers by
     * the results of the pg_mblen function. Also we should use the char_cmp
     * function for character comparison.
     */
    for (y = t, j = 1; j < n; y += y_char_len, j++)
    {
        int        *temp;

        y_char_len = pg_mblen(y);

        if (max_d >= 0)
        {
            prev_left = curr_left;
            prev_right = curr_right;
            curr_left = -1;
        }
        else
        {
            prev_left = 1;
        }

        if (delta >= 0)
            curr[0] = min_d + j * (ins_c + del_c);
        else
        {
            if (j <= -delta)
                curr[0] = min_d;
            else
                curr[0] = min_d + (j + delta) * (ins_c + del_c);
        }

        for (i = prev_left, x = s + lengths_and_offsets[i - 1].offset; i < m;
             x += lengths_and_offsets[i - 1].length, i++)
        {
            if (max_d >= 0)
            {
                d = max_d + 1;

                if (i <= prev_right)
                    d = Min(prev[i] + ((j + delta > i) ? (ins_c + del_c) : 0), d);

                if (i == 1 || i > prev_left)
                {
                    d = Min(curr[i - 1] + ((i - delta > j) ? (ins_c + del_c) : 0), d);
                    d = Min(prev[i - 1] + (char_cmp(x, lengths_and_offsets[i - 1].length,
                                                    y, y_char_len) ? 0 : sub_c), d);
                }

                curr[i] = d;

                if (curr_left == -1)
                    curr_left = i;
                curr_right = i;
                if (i > prev_right)
                    break;
            }
            else
            {
                d = Min(Min(prev[i] + ((j + delta > i) ? (ins_c + del_c) : 0),
                            curr[i - 1] + ((i - delta > j) ? (ins_c + del_c) : 0)),
                        prev[i - 1] + (char_cmp(x, lengths_and_offsets[i - 1].length,
                                                y, y_char_len) ? 0 : sub_c));
                curr[i] = d;
            }
        }

        if (curr_left == -1)
            break;

        temp = curr;
        curr = prev;
        prev = temp;
    }

    /*
     * Because the final value was swapped from the previous row to the
     * current row, that's where we'll find it.
     */
    d = prev[m - 1];

    /*
     * If the last cell wasn't filled, return max_d + 1; otherwise return
     * the final value.
     */
    if (max_d >= 0 && (curr_left == -1 || curr_right < m - 1))
        return max_d + 1;
    else
        return d;
}

I tested it with an american-english dictionary with 98569 words.

test=# select sum(levenshtein(word, 'qwerqwerqwer')) from words;
sum
---------
1074376
(1 row)

Time: 131,435 ms
test=# select sum(levenshtein_less_equal(word, 'qwerqwerqwer',100)) from
words;
sum
---------
1074376
(1 row)

Time: 221,078 ms
test=# select sum(levenshtein_less_equal(word, 'qwerqwerqwer',-1)) from
words;
sum
---------
1074376
(1 row)

Time: 254,819 ms

The function with a negative value of max_d didn't become faster than with
just a big value of max_d.

----
With best regards,
Alexander Korotkov.


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Alexander Korotkov <aekorotkov(at)gmail(dot)com>, Itagaki Takahiro <itagaki(dot)takahiro(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: multibyte character set in levenshtein function
Date: 2010-07-21 21:34:16
Message-ID: 1279747905-sup-6308@alvh.no-ip.org

Excerpts from Robert Haas's message of Wed Jul 21 14:25:47 -0400 2010:
> On Wed, Jul 21, 2010 at 7:40 AM, Alexander Korotkov
> <aekorotkov(at)gmail(dot)com> wrote:
> > On Wed, Jul 21, 2010 at 5:54 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

> > The same benefit can be achieved by replacing char *
> > with char * and a length.
> > I changed !m to m == 0 because Itagaki asked me to make it conform to the
> > coding style. Do you think there is no reason to fix the coding style in
> > existing code?
>
> Yeah, we usually try to avoid changing that sort of thing in existing
> code, unless there's a very good reason.

I think fixing a stylistic issue in code that's being edited for other
purposes is fine, and a good idea going forward. We wouldn't commit a
patch that would *only* fix those, because that would cause a problem
for backpatches for no benefit, but if the patch touches something else,
then a backpatch of another patch is going to need manual intervention
anyway.


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc: Itagaki Takahiro <itagaki(dot)takahiro(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: multibyte character set in levenshtein function
Date: 2010-07-21 21:59:20
Message-ID: AANLkTikvz=wp7r72Jd8iWrdAegawijzZ6ejm08WXEzpW@mail.gmail.com

On Wed, Jul 21, 2010 at 2:47 PM, Alexander Korotkov
<aekorotkov(at)gmail(dot)com> wrote:
> On Wed, Jul 21, 2010 at 10:25 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>>
>> *scratches head*  Aren't you just moving the same call to a different
>> place?
>
> So, where can you find this different place? :) In this patch
> null-terminated strings are not used at all.

I can't. You win. :-)

Actually, I wonder if there's enough performance improvement there
that we might think about extracting that part of the patch and applying
it separately. Then we could continue trying to figure out what to do
with the rest. Sometimes it's simpler to deal with one change at a
time.

> I tested it with american-english dictionary with 98569 words.
>
> test=# select sum(levenshtein(word, 'qwerqwerqwer')) from words;
>    sum
> ---------
>  1074376
> (1 row)
>
> Time: 131,435 ms
> test=# select sum(levenshtein_less_equal(word, 'qwerqwerqwer',100)) from
> words;
>    sum
> ---------
>  1074376
> (1 row)
>
> Time: 221,078 ms
> test=# select sum(levenshtein_less_equal(word, 'qwerqwerqwer',-1)) from
> words;
>    sum
> ---------
>  1074376
> (1 row)
>
> Time: 254,819 ms
>
> The function with negative value of max_d didn't become faster than with
> just big value of max_d.

Ah, I see. That's pretty compelling, I guess. Although it still
seems like a lot of code...

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company


From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Itagaki Takahiro <itagaki(dot)takahiro(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: multibyte character set in levenshtein function
Date: 2010-07-22 07:21:57
Message-ID: AANLkTikVH_9JAeEvnGlzZGcZPgrgo-9IUWXFF_ba2mp2@mail.gmail.com

On Thu, Jul 22, 2010 at 1:59 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

> Ah, I see. That's pretty compelling, I guess. Although it still
> seems like a lot of code...
>
I think there is a way to merge the single-byte and multi-byte versions of
the functions without loss of performance using macros and includes (like in
'backend/utils/adt/like.c'). Do you think it is acceptable in this case?

----
With best regards,
Alexander Korotkov.


From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Itagaki Takahiro <itagaki(dot)takahiro(at)gmail(dot)com>, alvherre(at)commandprompt(dot)com, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: multibyte character set in levenshtein function
Date: 2010-07-22 08:23:35
Message-ID: AANLkTilQ5bzoOcA_A6mRC0U8BOyXcmwsd4BzZc_0QwlK@mail.gmail.com

Such a version with macros and includes could look like this:

#ifdef MULTIBYTE
#define NEXT_X  (x += char_lens[i - 1])
#define NEXT_Y  (y += y_char_len)
#define CMP     (char_cmp(x, char_lens[i - 1], y, y_char_len))
#else
#define NEXT_X  (x++)
#define NEXT_Y  (y++)
#define CMP     (*x == *y)
#endif

static int
levenshtein_internal(text *s, text *t, int ins_c, int del_c, int sub_c)
{
    int         m,
                n,
                d;
    int        *prev;
    int        *curr;
    int         i,
                j;
    const char *x;
    const char *y;
#ifdef MULTIBYTE
    int        *char_lens;
    int         y_char_len;
#endif

    /*
     * We should calculate the number of characters for multibyte encodings.
     */
    m = pg_mbstrlen_with_len(VARDATA_ANY(s), VARSIZE_ANY_EXHDR(s));
    n = pg_mbstrlen_with_len(VARDATA_ANY(t), VARSIZE_ANY_EXHDR(t));

    /*
     * We can transform an empty s into t with n insertions, or a non-empty t
     * into an empty s with m deletions.
     */
    if (m == 0)
        return n * ins_c;
    if (n == 0)
        return m * del_c;

    /*
     * For security concerns, restrict excessive CPU+RAM usage. (This
     * implementation uses O(m) memory and has O(mn) complexity.)
     */
    if (m > MAX_LEVENSHTEIN_STRLEN ||
        n > MAX_LEVENSHTEIN_STRLEN)
        ereport(ERROR,
                (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
                 errmsg("argument exceeds the maximum length of %d bytes",
                        MAX_LEVENSHTEIN_STRLEN)));

    /* One more cell for the initialization column and row. */
    ++m;
    ++n;

    /*
     * Instead of building an (m+1)x(n+1) array, we'll use two different
     * arrays of size m+1 for storing accumulated values. At each step one
     * represents the "previous" row and one is the "current" row of the
     * notional large array. For multibyte encodings we'll also store an
     * array of the character lengths of the first string, in order to avoid
     * a great number of pg_mblen calls.
     */
#ifdef MULTIBYTE
    prev = (int *) palloc(3 * m * sizeof(int));
    curr = prev + m;
    char_lens = prev + 2 * m;
    for (i = 0, x = VARDATA_ANY(s); i < m - 1; i++)
    {
        char_lens[i] = pg_mblen(x);
        x += char_lens[i];
    }
    char_lens[i] = 0;
#else
    prev = (int *) palloc(2 * m * sizeof(int));
    curr = prev + m;
#endif

    /* Initialize the "previous" row to 0..cols */
    for (i = 0; i < m; i++)
        prev[i] = i * del_c;

    /*
     * For multibyte encodings we should increase the x and y pointers by the
     * results of the pg_mblen function (the NEXT_X and NEXT_Y macros), and
     * use the CMP macro for character comparison.
     */
    for (y = VARDATA_ANY(t), j = 1; j < n; NEXT_Y, j++)
    {
        int        *temp;

#ifdef MULTIBYTE
        y_char_len = pg_mblen(y);
#endif

        curr[0] = j * ins_c;

        for (x = VARDATA_ANY(s), i = 1; i < m; NEXT_X, i++)
        {
            int         ins;
            int         del;
            int         sub;

            ins = prev[i] + ins_c;
            del = curr[i - 1] + del_c;
            sub = prev[i - 1] + (CMP ? 0 : sub_c);
            curr[i] = Min(ins, del);
            curr[i] = Min(curr[i], sub);
        }

        temp = curr;
        curr = prev;
        prev = temp;
    }

    /*
     * Because the final value was swapped from the previous row to the
     * current row, that's where we'll find it.
     */
    d = prev[m - 1];

    return d;
}

What do you think about it?

----
With best regards,
Alexander Korotkov.


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc: Itagaki Takahiro <itagaki(dot)takahiro(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: multibyte character set in levenshtein function
Date: 2010-07-22 13:23:50
Message-ID: AANLkTinr1f+TJnLh9XW728VB11ZM31w8hRNG3uAC=4L8@mail.gmail.com

On Thu, Jul 22, 2010 at 3:21 AM, Alexander Korotkov
<aekorotkov(at)gmail(dot)com> wrote:
> On Thu, Jul 22, 2010 at 1:59 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>>
>> Ah, I see.  That's pretty compelling, I guess.  Although it still
>> seems like a lot of code...
>
> I think there is a way to merge the single-byte and multi-byte versions of
> the functions without loss of performance using macros and includes (like in
> 'backend/utils/adt/like.c'). Do you think it is acceptable in this case?

I'm not sure. I'd like to get a second opinion from someone else on
which way to go with this, but unfortunately 2 or 3 of the main
committers are on vacation at the moment.

Does anyone else have an opinion on what to do about this?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Itagaki Takahiro <itagaki(dot)takahiro(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: multibyte character set in levenshtein function
Date: 2010-07-23 02:35:57
Message-ID: 1279852309-sup-475@alvh.no-ip.org

Excerpts from Alexander Korotkov's message of Thu Jul 22 03:21:57 -0400 2010:
> On Thu, Jul 22, 2010 at 1:59 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>
> > Ah, I see. That's pretty compelling, I guess. Although it still
> > seems like a lot of code...
> >
> I think there is a way to merge the single-byte and multi-byte versions of
> the functions without loss of performance using macros and includes (like in
> 'backend/utils/adt/like.c'). Do you think it is acceptable in this case?

Using the trick of a single source file, similar to like_match.c, that
implements all the different behaviors, and including that file more than
once with different macro definitions, is probably a good idea.
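(A rough sketch of that pattern applied here; file and macro names are
hypothetical:)

    /* fuzzystrmatch.c -- include the shared body twice */

    /* single-byte version: generates levenshtein_internal() */
    #define LEVENSHTEIN_FUNC levenshtein_internal
    #include "levenshtein_body.c"
    #undef LEVENSHTEIN_FUNC

    /* multi-byte version: generates levenshtein_internal_mb() */
    #define MULTIBYTE
    #define LEVENSHTEIN_FUNC levenshtein_internal_mb
    #include "levenshtein_body.c"
    #undef LEVENSHTEIN_FUNC
    #undef MULTIBYTE

where levenshtein_body.c would contain the shared function body, branching on
MULTIBYTE with #ifdef as in Alexander's sketch upthread.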


From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Itagaki Takahiro <itagaki(dot)takahiro(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: multibyte character set in levenshtein function
Date: 2010-07-24 15:27:47
Message-ID: AANLkTi=m5aNd0a0-gw_9bvyeVapaSUEE8JfcsVonSksJ@mail.gmail.com

Here is a new version of my patch. There are the following changes:

1) I've merged the single-byte and multibyte versions of levenshtein_internal
and levenshtein_less_equal_internal using macros and includes.
2) I found that levenshtein takes a reasonable time even for long strings.
Here is an example with strings which contain random combinations of words.

test=# select count(*) from words;
count
-------
98569
(1 row)

test=# select * from words limit 10;
id |
word | next_id
-------+------------------------------------------------------------------------------------+---------
200 | Agnew diodes drilled composts Wassermann nonprofit adjusting
Chautauqua | 17370
4589 | Dhaka craziness savvies teenager ploughs Barents's unwed
zither | 70983
5418 | Eroses gauchos bowls smellier weeklies tormentors cornmeal
punctuality | 96685
6103 | G prompt boron overthrew cricking begonia snuffers
blazed | 34518
4707 | Djibouti Mari pencilling approves pointing's pashas magpie
rams | 71200
10125 | Lyle Edgardo livers gruelled capable cancels gnawing's
inhospitable | 28018
9615 | Leos deputy evener yogis assault palatals Octobers
surceased | 21878
15640 | Starr's arrests malapropism Koufax's aesthetes Howell rustier
Algeria | 19316
15776 | Sucre battle's misapprehension's predicating nutmegged
electrification bowl planks | 65739
16878 | Updike Forster ragout keenly exhalation grayed switch
guava's | 43013
(10 rows)

Time: 26,125 ms

Time: 137,061 ms
test=# select avg(length(word)) from words;
avg
---------------------
74.6052207083362923
(1 row)

Time: 96,415 ms
test=# select * from words where levenshtein(word, 'Dhaka craziness savvies
teeae luhs Barentss unwe zher')<=10;
id | word |
next_id
------+-----------------------------------------------------------------+---------
4589 | Dhaka craziness savvies teenager ploughs Barents's unwed zither |
70983
(1 row)

Time: 2957,426 ms
test=# select * from words where levenshtein_less_equal(word, 'Dhaka
craziness savvies teeae luhs Barentss unwe zher',10)<=10;
id | word |
next_id
------+-----------------------------------------------------------------+---------
4589 | Dhaka craziness savvies teenager ploughs Barents's unwed zither |
70983
(1 row)

Time: 84,755 ms

It takes O(max(n, m) * max_d / min(ins_c, del_c)) time in the worst case.
That's why I changed the limitation of this function to
max_d <= 255 * min(ins_c, del_c).
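(A hypothetical sketch of such a limit check; not necessarily the patch's
exact form:)

    /* hypothetical: cap max_d so the worst case stays bounded */
    if (max_d >= 0 && max_d > 255 * Min(ins_c, del_c))
        ereport(ERROR,
                (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
                 errmsg("maximum distance is too large")));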

3) I found that there is no need to store character offsets in
levenshtein_less_equal_internal_mb. Instead, the first position in the
string where the distance value wasn't greater than max_d can be stored.

4) I found that there is a theoretical maximum distance based on the string
lengths. It represents the case when no characters match. If the theoretical
maximum distance is less than the given maximum distance, we can use the
theoretical maximum distance. This improves the behavior of
levenshtein_less_equal with high max_d, but it's still slower than levenshtein:

test=# select * from words where levenshtein_less_equal(word, 'Dhaka
craziness savvies teeae luhs Barentss unwe zher',1000)<=10;
id | word |
next_id
------+-----------------------------------------------------------------+---------
4589 | Dhaka craziness savvies teenager ploughs Barents's unwed zither |
70983
(1 row)

Time: 4102,731 ms
test=# select * from words where levenshtein(word, 'Dhaka craziness savvies
teeae luhs Barentss unwe zher')<=10;
id | word |
next_id
------+-----------------------------------------------------------------+---------
4589 | Dhaka craziness savvies teenager ploughs Barents's unwed zither |
70983
(1 row)

Time: 2983,244 ms
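(As a side note, the theoretical maximum distance mentioned in point 4 can be
computed roughly as follows; a hypothetical sketch:)

    /*
     * If no characters match at all, we must substitute min(m, n) characters
     * and insert or delete the length difference; nothing can cost more.
     */
    int     max_theor_d;

    if (m > n)
        max_theor_d = n * sub_c + (m - n) * del_c;
    else
        max_theor_d = m * sub_c + (n - m) * ins_c;

    if (max_d < 0 || max_d > max_theor_d)
        max_d = max_theor_d;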

----
With best regards,
Alexander Korotkov.

Attachment Content-Type Size
fuzzystrmatch-0.5.tar.gz application/x-gzip 5.1 KB

From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Itagaki Takahiro <itagaki(dot)takahiro(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: multibyte character set in levenshtein function
Date: 2010-07-29 15:27:05
Message-ID: AANLkTik9WxjFbLx=nJquUpRsZKj+dZL4Cy+iopL-i9ar@mail.gmail.com

I forgot the attribution in the levenshtein.c file.

----
With best regards,
Alexander Korotkov.

Attachment Content-Type Size
fuzzystrmatch-0.5.1.tar.gz application/x-gzip 5.2 KB

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc: Itagaki Takahiro <itagaki(dot)takahiro(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: multibyte character set in levenshtein function
Date: 2010-07-29 20:16:19
Message-ID: AANLkTim+eW5RJNDhGb9Fcj2TwiVnich-k1Tvf=zzYj0=@mail.gmail.com

On Wed, Jul 21, 2010 at 5:59 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Wed, Jul 21, 2010 at 2:47 PM, Alexander Korotkov
> <aekorotkov(at)gmail(dot)com> wrote:
>> On Wed, Jul 21, 2010 at 10:25 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>>>
>>> *scratches head*  Aren't you just moving the same call to a different
>>> place?
>>
>> So, where can you find this different place? :) In this patch
>> null-terminated strings are not used at all.
>
> I can't.  You win.  :-)
>
> Actually, I wonder if there's enough performance improvement there
> that we might think about extracting that part of the patch and applying
> it separately.  Then we could continue trying to figure out what to do
> with the rest.  Sometimes it's simpler to deal with one change at a
> time.

I tested this today and the answer was a resounding yes. I ran
sum(levenshtein(t, 'foo')) over a dictionary file with about 2 million
words and got a speedup of around 15% just by eliminating the
text_to_cstring() calls. So I've committed that part of this patch.

I'll try to look at the rest of the patch when I get a chance, but I'm
wondering if it might make sense to split it into two patches -
specifically, one patch to handle multi-byte characters correctly, and
then a second patch for the less-than-or-equal-to functions. I think
that might simplify reviewing a bit.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc: Itagaki Takahiro <itagaki(dot)takahiro(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: multibyte character set in levenshtein function
Date: 2010-08-02 01:20:34
Message-ID: AANLkTik-BLr8tumY1eBxQM4SYDbhZRkoynRp3N+q07Cc@mail.gmail.com

On Fri, Jul 30, 2010 at 1:14 PM, Alexander Korotkov
<aekorotkov(at)gmail(dot)com> wrote:
> Ok, here is the patch for multi-byte characters.
> I changed the arguments of the levenshtein_internal function from text * to
> const char * and int. I think that it makes levenshtein_internal more
> reusable. For example, this function can be used for calculating the distance
> between two fragments of strings without needing to copy those fragments.
> Actually, I'm not fully satisfied with my solution for merging the single-byte
> and multi-byte versions of the function. There are "ifdef"s in the body of the
> function and non-context-free macros. This is because the multi-byte version
> needs to store the character lengths returned by pg_mblen, while the
> single-byte version doesn't need that. I tried a multi-byte version without
> storing the character lengths, but in that case the function is about 1.5
> times slower (the function needs to call pg_mblen n*m times).

I reviewed this code in a fair amount of detail today and ended up
rewriting it. In general terms, it's best to avoid changing things
that are not relevant to the central purpose of the patch. This patch
randomly adds a whole bunch of whitespace and removes a whole bunch of
comments which I find useful and do not want to see removed. It took
me about an hour to reverse out all of those changes, which then let
me get down to the business of actually analyzing the code. The
biggest problem I found is that, as coded, this is more than 50%
slower on UTF-8 databases, because the multi-byte path is really slow,
even if there are actually no multi-byte characters present.

The attached patch takes the approach of using a fast-path for just
the innermost loop when neither string contains any multi-byte
characters (regardless of whether or not the encoding allows
multi-byte characters). It's still slower than CVS HEAD, but for
strings containing no multi-byte characters it's only marginally, if
at all, slower than previous releases, because 9.0 doesn't contain
your fix to avoid the extra string copy; and it's faster than your
implementation even when multi-byte characters ARE present. It is
slightly slower on single-byte encoding databases (I tested
SQL_ASCII), but still faster than previous releases. It's also quite
a bit less code duplication.

Please review and let me know your thoughts.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

Attachment Content-Type Size
mb_levenshtein_rmh.patch application/octet-stream 5.0 KB

From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Itagaki Takahiro <itagaki(dot)takahiro(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: multibyte character set in levenshtein function
Date: 2010-08-02 10:57:18
Message-ID: AANLkTimpvXnGZdJjDf3iTKq8=FnQB00LK6tPAb1uJfBr@mail.gmail.com

On Mon, Aug 2, 2010 at 5:20 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

> I reviewed this code in a fair amount of detail today and ended up
> rewriting it. In general terms, it's best to avoid changing things
> that are not relevant to the central purpose of the patch. This patch
> randomly adds a whole bunch of whitespace and removes a whole bunch of
> comments which I find useful and do not want to see removed. It took
> me about an hour to reverse out all of those changes, which then let
> me get down to the business of actually analyzing the code.

I'm sorry. This is my first patch; I will be more careful next time. But I
think there is no unified opinion about some changes, like the replacement of
"!m" with "m == 0".

I think this line is not correct:
"if (m != s_bytes && n != t_bytes)"
The correct negation of the assumption that both strings are single-byte is
the assumption that at least one string is not single-byte. With this patch
the levenshtein function can calculate the distance incorrectly:
test=# select levenshtein('фw', 'ww');
levenshtein
-------------
2
(1 row)
This line should be rewritten as:
"if (m != s_bytes || n != t_bytes)"

> The attached patch takes the approach of using a fast-path for just
> the innermost loop when neither string contains any multi-byte
> characters (regardless of whether or not the encoding allows
> multi-byte characters). It's still slower than CVS HEAD, but for
> strings containing no multi-byte characters it's only marginally, if
> at all, slower than previous releases, because 9.0 doesn't contain
> your fix to avoid the extra string copy; and it's faster than your
> implementation even when multi-byte characters ARE present. It is
> slightly slower on single-byte encoding databases (I tested
> SQL_ASCII), but still faster than previous releases. It's also quite
> a bit less code duplication.
>

Can I look at the test which shows that your implementation is faster than
mine when multi-byte characters are present? My tests show the reverse. For
example, with the Russian dictionary from OpenOffice:

With my version of the patch:
test=# select sum(levenshtein(word, 'фывафыва')) from test;
sum
---------
1675281
(1 row)

Time: 277,190 ms

With your version of the patch:
test=# select sum(levenshtein(word, 'фывафыва')) from test;
sum
---------
1675281
(1 row)

Time: 398,011 ms

I think that the cause of the slowdown is the memcmp function. This
function is very fast for long regions of memory, but for a few bytes an
inline function is faster, because the compiler has more freedom for
optimization. After replacing memcmp with an inline function like the
following in your implementation:

static inline bool
char_cmp(const char *s, const char *d, int len)
{
    /* compare len bytes one by one; faster than memcmp for short lengths */
    while (len--)
    {
        if (*s++ != *d++)
            return false;
    }
    return true;
}

The code becomes much faster:

test=# select sum(levenshtein(word, 'фывафыва')) from test;
sum
---------
1675281
(1 row)

Time: 241,272 ms

----
With best regards,
Alexander Korotkov.


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc: Itagaki Takahiro <itagaki(dot)takahiro(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: multibyte character set in levenshtein function
Date: 2010-08-02 14:48:37
Message-ID: AANLkTim0ZZs8zp6DM7qowV4jwyYhy1Dy+_fN6ApEWBf5@mail.gmail.com

2010/8/2 Alexander Korotkov <aekorotkov(at)gmail(dot)com>:
> On Mon, Aug 2, 2010 at 5:20 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>> I reviewed this code in a fair amount of detail today and ended up
>> rewriting it.  In general terms, it's best to avoid changing things
>> that are not relevant to the central purpose of the patch.  This patch
>> randomly adds a whole bunch of whitespace and removes a whole bunch of
>> comments which I find useful and do not want to see removed.  It took
>> me about an hour to reverse out all of those changes, which then let
>> me get down to the business of actually analyzing the code.
> I'm sorry. This is my first patch; I will be more careful next time. But I
> think there is no unified opinion about some changes, like the replacement of
> "!m" with "m == 0".

Sure; if that were the only such change, I wouldn't have mentioned it.

> I think this line is not correct:
> "if (m != s_bytes && n != t_bytes)"

Good catch, fixed in the attached version.

>> The attached patch takes the approach of using a fast-path for just
>> the innermost loop when neither string contains any multi-byte
>> characters (regardless of whether or not the encoding allows
>> multi-byte characters).  It's still slower than CVS HEAD, but for
>> strings containing no multi-byte characters it's only marginally, if
>> at all, slower than previous releases, because 9.0 doesn't contain
>> your fix to avoid the extra string copy; and it's faster than your
>> implementation even when multi-byte characters ARE present.  It is
>> slightly slower on single-byte encoding databases (I tested
>> SQL_ASCII), but still faster than previous releases.  It's also quite
>> a bit less code duplication.
>
> Can I look at the test which shows that your implementation is faster than
> mine when multi-byte characters are present? My tests show the reverse. For
> example, with the Russian dictionary from OpenOffice:

Hmm, with the correction you mention above, I'm getting about the same
results with multi-byte characters (but faster with only single-byte
characters). I did this:

CREATE TABLE words (a text not null primary key);
COPY words from '/usr/share/dict/words';

SELECT SUM(levenshtein(a, 'foo')) from words;
SELECT SUM(levenshtein(a, 'Urbański')) FROM words;
SELECT SUM(levenshtein(a, 'ańs')) FROM words;

With the updated patch attached below, these ran about 102 ms, 222 ms,
129 ms. With your patch, I get about 134 ms, 222 ms, 130 ms. Perhaps
this is because I only have multi-byte characters in one of the
inputs, not both. I don't have a dictionary handy with multi-byte
words in it. Can you send me a file?

> I think that the cause of the slowdown is the memcmp function. This
> function is very fast for long regions of memory, but for a few bytes an
> inline function is faster, because the compiler has more freedom for
> optimization. After replacing memcmp with an inline function like the
> following in your implementation:

Yeah, that does seem to be slightly better. I've done a version of
this in the attached patch.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

Attachment Content-Type Size
mb_levenshtein_rmh-v2.patch application/octet-stream 5.5 KB

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc: Itagaki Takahiro <itagaki(dot)takahiro(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: multibyte character set in levenshtein function
Date: 2010-08-02 20:53:05
Message-ID: AANLkTimZcAjwCJyRgwKaM3Ms_0ipO8NteVCnfgKCf7RL@mail.gmail.com

2010/8/2 Alexander Korotkov <aekorotkov(at)gmail(dot)com>:
> The dump of the table with the Russian dictionary is attached.
>
> I use the following tests:
> SELECT SUM(levenshtein(a, 'foo')) from words;
> SELECT SUM(levenshtein(a, 'Urbański')) FROM words;
> SELECT SUM(levenshtein(a, 'ańs')) FROM words;
> SELECT SUM(levenshtein(a, 'foo')) from words2;
> SELECT SUM(levenshtein(a, 'дом')) FROM words2;
> SELECT SUM(levenshtein(a, 'компьютер')) FROM words2;
>
> With your last version of the patch the results are:
> 63ms 94ms 61ms 100ms 121ms 217ms.
>
> But I found another optimization. With this condition:
> if (x[x_char_len - 1] == y[y_char_len - 1] && x_char_len == y_char_len &&
>     (x_char_len == 1 || char_same(x, y, x_char_len - 1)))
> the results become:
> 58ms 96ms 63ms 102ms 107ms 171ms
>
> On single-byte characters the results almost didn't change, but they got
> better with multi-byte characters. Generally, this improvement is because
> the first bytes of different characters are frequently the same (for example,
> almost all Cyrillic characters start with the same byte, and I think that the
> situation is similar in other alphabets), while a match of the last bytes is
> just a coincidence.

Hey, that's really cool. It helps a lot here, too:

My previous patch, with your 6 test cases:
Time: 100.556 ms
Time: 205.254 ms
Time: 127.070 ms
Time: 77.734 ms
Time: 90.345 ms
Time: 166.954 ms

Your original patch, same 6 test cases:
Time: 131.489 ms
Time: 215.048 ms
Time: 125.471 ms
Time: 80.068 ms
Time: 87.110 ms
Time: 166.918 ms

My patch, with your proposed change to compare the last character
first, same 6 test cases:
Time: 96.489 ms
Time: 201.497 ms
Time: 121.876 ms
Time: 75.005 ms
Time: 80.219 ms
Time: 142.844 ms

Updated patch attached.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

Attachment Content-Type Size
mb_levenshtein_rmh-v3.patch application/octet-stream 5.8 KB

From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Itagaki Takahiro <itagaki(dot)takahiro(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: multibyte character set in levenshtein function
Date: 2010-08-02 21:07:31
Message-ID: AANLkTikc8b42bc59ucagwNzyTjHTaoW9eStZXuA=xae7@mail.gmail.com

Now I think the patch is as good as it can be. :)
I'm going to prepare the less-or-equal function in the same manner as this patch.

----
With best regards,
Alexander Korotkov.


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc: Itagaki Takahiro <itagaki(dot)takahiro(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: multibyte character set in levenshtein function
Date: 2010-08-02 23:23:38
Message-ID: AANLkTik9_8h=Rdu06xS+X40dRc2Mo9iU2+mpSfRqfXCy@mail.gmail.com

On Mon, Aug 2, 2010 at 5:07 PM, Alexander Korotkov <aekorotkov(at)gmail(dot)com> wrote:
> Now I think the patch is as good as it can be. :)

OK, committed.

> I'm going to prepare the less-or-equal function in the same manner as this patch.

Sounds good. Since we're now more than half-way through this
CommitFest and this patch has undergone quite a bit of change from
what we started out with, I'd like to ask that you submit the new
patch for the next CommitFest, to begin September 15th.

https://commitfest.postgresql.org/action/commitfest_view/open

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company


From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Itagaki Takahiro <itagaki(dot)takahiro(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: multibyte charater set in levenshtein function
Date: 2010-08-28 12:34:10
Message-ID: AANLkTinS40Fc-_dpEOMMbGBuHKjVDMDVpGOkdjYE1c=s@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Here is the patch which adds the levenshtein_less_equal function. I'm going
to add it to the current commitfest.

----
With best regards,
Alexander Korotkov.


Attachment Content-Type Size
levenshtein_less_equal-0.1.patch text/x-patch 16.4 KB

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc: Itagaki Takahiro <itagaki(dot)takahiro(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: multibyte charater set in levenshtein function
Date: 2010-08-28 17:03:34
Message-ID: 19325098-45F3-4EEF-89CC-13EF0AAE751F@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Aug 28, 2010, at 8:34 AM, Alexander Korotkov <aekorotkov(at)gmail(dot)com> wrote:
> Here is the patch which adds the levenshtein_less_equal function. I'm going to add it to the current commitfest.

Cool. Please submit some performance results comparing levenshtein in HEAD vs. levenshtein with this patch vs. levenshtein_less_equal. Perhaps the test cases we used previously would be a good place to start.

...Robert


From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Itagaki Takahiro <itagaki(dot)takahiro(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: multibyte charater set in levenshtein function
Date: 2010-08-28 18:33:40
Message-ID: AANLkTimVPvFcPzSsW5bUpDMk6uKY7KykxiA_c+0N5hhK@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

SELECT SUM(levenshtein(a, 'foo')) FROM words;
SELECT SUM(levenshtein(a, 'Urbański')) FROM words;
SELECT SUM(levenshtein(a, 'ańs')) FROM words;
SELECT SUM(levenshtein(a, 'foo')) FROM words2;
SELECT SUM(levenshtein(a, 'дом')) FROM words2;
SELECT SUM(levenshtein(a, 'компьютер')) FROM words2;

Before this patch the results were:
67ms 98ms 63ms 102ms 108ms 177ms
And after the patch:
65ms 100ms 65ms 104ms 111ms 182ms
It is a bit slower, but I think that is the price for having less code
duplication.
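
For what it's worth, the merged shape is roughly this (names and signatures
purely illustrative, not the committed ones): a single internal routine
takes the bound, with a negative value meaning "no bound", and the two
SQL-visible functions become thin wrappers around it; the bound bookkeeping
the shared routine must do even in the unbounded case is where the small
slowdown above comes from. A possible body for the internal routine is
sketched further down, after the small-distance tests.

static int  levenshtein_internal(const char *s, const char *t, int max_d);

int
levenshtein(const char *s, const char *t)
{
    return levenshtein_internal(s, t, -1);      /* -1: unbounded */
}

int
levenshtein_less_equal(const char *s, const char *t, int max_d)
{
    return levenshtein_internal(s, t, max_d);
}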

Now let's test levenshtein_less_equal performance.

test=# SELECT * FROM words WHERE levenshtein(a, 'extensize') <= 2;
a
-----------
expensive
extension
extensive
(3 rows)

Time: 98,083 ms

test=# SELECT * FROM words WHERE levenshtein_less_equal(a, 'extensize', 2) <= 2;
a
-----------
expensive
extension
extensive
(3 rows)

Time: 48,069 ms

test=# SELECT * FROM words2 WHERE levenshtein(a, 'клубничный') <= 2;
a
------------
кузничный
кубичный
клубочный
клубничный
(4 rows)

Time: 197,499 ms

test=# SELECT * FROM words2 WHERE levenshtein_less_equal(a, 'клубничный', 3) <= 2;
a
------------
кузничный
кубичный
клубочный
клубничный
(4 rows)

Time: 100,073 ms

It gives some speedup when searching for words within a small distance.
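
Concretely, here is one possible body for the levenshtein_internal sketched
above: a minimal single-byte version with unit costs (the patch additionally
handles multibyte characters and configurable costs). It tracks the minimum
of each row of the dynamic-programming matrix; row minima never decrease,
so once one exceeds max_d the final distance must exceed it as well and the
routine can return early, which is exactly what pays off at small distances.

#include <stdlib.h>
#include <string.h>

static int
levenshtein_internal(const char *s, const char *t, int max_d)
{
    int     m = (int) strlen(s);
    int     n = (int) strlen(t);
    int    *prev = malloc((n + 1) * sizeof(int));
    int    *curr = malloc((n + 1) * sizeof(int));
    int     i, j, result;

    /* Row 0: transforming the empty prefix of s costs j insertions. */
    for (j = 0; j <= n; j++)
        prev[j] = j;

    for (i = 1; i <= m; i++)
    {
        int     row_min;

        curr[0] = i;
        row_min = i;
        for (j = 1; j <= n; j++)
        {
            int     del = prev[j] + 1;
            int     ins = curr[j - 1] + 1;
            int     sub = prev[j - 1] + (s[i - 1] != t[j - 1]);

            curr[j] = del < ins ? del : ins;
            if (sub < curr[j])
                curr[j] = sub;
            if (curr[j] < row_min)
                row_min = curr[j];
        }

        /* Bound exceeded by the whole row: the answer can only grow. */
        if (max_d >= 0 && row_min > max_d)
        {
            free(prev);
            free(curr);
            return max_d + 1;   /* any value > max_d will do */
        }

        /* Swap the row buffers. */
        {
            int    *tmp = prev;

            prev = curr;
            curr = tmp;
        }
    }

    result = prev[n];
    free(prev);
    free(curr);
    return result;
}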

test=# SELECT sum(levenshtein(a, 'extensize')) from words;
sum
--------
835545
(1 row)

Time: 96,253 ms

test=# SELECT sum(levenshtein_less_equal(a, 'extensize', 20)) from words;
sum
--------
835545
(1 row)

Time: 98,438 ms

With big distances it gives similar performance, because in that case the
early exit almost never fires and essentially the same code path runs.
In order to test it with longer words, I created a table of random
combinations of words.

test=# create table phrases (a text primary key);
test=# insert into phrases
         select string_agg(y.a, ' ')
         from (select x.a, row_number() over ()
               from (select a from words order by random()) x) y
         group by (y.row_number / 10);

(This shuffles the dictionary and concatenates it into phrases of roughly
ten words each.)

test=# select * from phrases limit 10;

a
--------------------------------------------------------------------------------------------------------------
hosiery's spermatozoa Haleakala disco greenness beehive paranoids atrophy
unfair
Weinberg's all profanity's individualized bellowed perishables Romanov's
communicant's landlady's pauperized
Ito seasoned jawbreakers you'll hurling's showcasing anecdota cauliflower
intellectualism facilitate
infantryman's busheled designing lucid nutritionists Aesculapius's rifle
clefting exult Milosevic
foresight lurking capitulation's pantsuits traumatizes moonlighting
lancer's Longfellow rising unclasps
permutes centralest cavalryman Dwight granddaughter knight tallyho's sober
graduate locket's
knucklehead courtliest sapphires baize coniferous emolument antarctic
Laocoon's deadens unseemly
Tartars utopians Pekingese's Cartwright badge's Dollie's spryness Ajax's
Amber's bookkeeper
spouting concordant Indus carbonation cinnamon's stockbrokers Evita's
Canaries Waldorf's rampage
Amparo exertions Kuznetsk's divots humongous wolverine chugged laurels
Goliaths hazelnut
(10 rows)

test=# select * from phrases where levenshtein('kkkknucklehead courtliest
sapphires be coniferous emolument antarctic Laocoon''s deadens unseemly', a)
<= 10;

a
--------------------------------------------------------------------------------------------------
knucklehead courtliest sapphires baize coniferous emolument antarctic
Laocoon's deadens unseemly
(1 row)

Time: 581,850 ms

test=# select * from phrases where levenshtein_less_equal('kkkknucklehead
courtliest sapphires be coniferous emolument antarctic Laocoon''s deadens
unseemly', a, 10) <= 10;

a
--------------------------------------------------------------------------------------------------
knucklehead courtliest sapphires baize coniferous emolument antarctic
Laocoon's deadens unseemly
(1 row)

Time: 22,876 ms

It gives a great speedup for searches with small distances on relatively
long strings: the longer the strings, the more work the early exit avoids.

test=# select sum(levenshtein('kkkknucklehead courtliest sapphires be
coniferous emolument antarctic Laocoon''s deadens unseemly', a)) from
phrases;
sum
--------
794601
(1 row)

Time: 571,138 ms

test=# select sum(levenshtein_less_equal('kkkknucklehead courtliest
sapphires be coniferous emolument antarctic Laocoon''s deadens unseemly', a,
100)) from phrases;
sum
--------
794601
(1 row)

Time: 562,895 ms

Similar results appear for multi-byte strings.

test=# create table phrases2 (a text primary key);
test=# insert into phrases2 select string_agg(y.a, ' ') from (select x.a,
row_number() over () from (select a from words2 order by random()) x) y
group by (y.row_number / 10);
INSERT 0 14584

test=# select * from phrases2 limit 10;

a

-----------------------------------------------------------------------------------------------------------------------------------
борис спиннинг растопочный цифра выдохнуть иридий гнёздышко омоновский
базовый
пожжёте закосить насыщающий паратый продрых обеспложивание милливатт
бамбуковый отпекающий книжница
приложиться разный устремляющийся отучающийся абрикосовка протоколируются
сострадательный отрясший вывалить браманизм
наполниться агафоновна испольный подтаскивающий запруживавшийся трофимович
перетаскиваемый обручавший процентщица передов
вихрастый отволочённый дискотека пришей распасовывавший полиция возделавший
трехглавый битва загазованный
обовьетесь перехитривший инулин стелить недоброжелательность изотрёте
пятиалтынный жабовидный щелочно дефицитность
сиротиночка хлорбензол вгрызаться прокрашивать никарагуанец валентный
понесённый перестегивание воспретительный переименовываемый
таявший раскупорившийся передислоцируется юлианович праздничный лачужка
присыхать отбеливший фехтовальный удобряющий
слепнул зонт уластить удобрение тормозивший ушибся ошибся мкс
сейсмологический лесомелиорация
рабовладельцем неудачный самовоспламенение карбидный круглопильный кубинцем
подлакированный наблюдение исцарапавшийся издёргавший
(10 rows)

test=# select * from phrases2 where levenshtein('таяй раскупорившийся
передислоцируется юлианович праздничный лачужка присыхать опппливший
ффехтовальный уууудобряющий', a) <= 10;

a

------------------------------------------------------------------------------------------------------------------
----
таявший раскупорившийся передислоцируется юлианович праздничный лачужка
присыхать отбеливший фехтовальный удобряющий
(1 row)

Time: 1291,318 ms

test=# select * from phrases2 where levenshtein_less_equal('таяй
раскупорившийся передислоцируется юлианович праздничный лачужка присыхать
опппливший ффехтовальный уууудобряющий', a, 10) <= 10;

a

------------------------------------------------------------------------------------------------------------------
----
таявший раскупорившийся передислоцируется юлианович праздничный лачужка
присыхать отбеливший фехтовальный удобряющий
(1 row)

Time: 55,405 ms

test=# select sum(levenshtein_less_equal('таяй раскупорившийся
передислоцируется юлианович праздничный лачужка присыхать опппливший
ффехтовальный уууудобряющий', a, 200)) from phrases2;
sum
---------
1091878
(1 row)

Time: 674,734 ms

test=# select sum(levenshtein('таяй раскупорившийся передислоцируется
юлианович праздничный лачужка присыхать опппливший ффехтовальный
уууудобряющий', a)) from phrases2;
sum
---------
1091878
(1 row)

Time: 673,515 ms

----
With best regards,
Alexander Korotkov.


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc: Itagaki Takahiro <itagaki(dot)takahiro(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: multibyte charater set in levenshtein function
Date: 2010-09-02 02:58:13
Message-ID: AANLkTim8zXpy6_=+7OwUqOaxuvHsRUP++1-8BmvenpQ-@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

2010/8/28 Alexander Korotkov <aekorotkov(at)gmail(dot)com>:
> Now test for levenshtein_less_equal performance.

Nice results. I'll try to find time to look at this.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company