Re: multibyte charater set in levenshtein function

From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Itagaki Takahiro <itagaki(dot)takahiro(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: multibyte charater set in levenshtein function
Date: 2010-07-13 07:04:16
Message-ID: AANLkTilE4gCI4NnYhe3MxBurK9OiESYju8POczysrVcv@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi!

* levenshtein_internal() and levenshtein_less_equal_internal() are very
> similar. Can you merge the code? We can always use less_equal_internal()
> if the overhead is ignorable. Did you compare them?
>
With big value of max_d overhead is significant. Here is example on
american-english dictionary from Openoffice.

test=# select sum(levenshtein('qweqweqweqweqwe',word)) from words;
sum
---------
1386456
(1 row)

Time: 195,083 ms
test=# select sum(levenshtein_less_equal('qweqweqweqweqwe',word,100)) from
words;
sum
---------
1386456
(1 row)

Time: 317,821 ms

> * There are many "if (!multibyte_encoding)" in levenshtein_internal().
> How about split the function into two funcs for single-byte chars and
> multi-byte chars? (ex. levenshtein_internal_mb() ) Or, we can always
> use multi-byte version if the overhead is small.
>
The overhead of multi-byte version was about 4 times slower. But I have
rewritten my CHAR_CMP macro with inline function. And now it's only about
1.5 times slower.

In database with muti-byte encoding:
test=# select * from words where levenshtein('qweqweqwe',word)<=5;
id | word
-------+----------
69053 | peewee
69781 | pewee
81279 | sequence
88421 | sweetie
(4 rows)

Time: 136,742 ms

In database with single-byte encoding:
test2=# select * from words where levenshtein('qweqweqwe',word)<=5;
id | word
-------+----------
69053 | peewee
69781 | pewee
81279 | sequence
88421 | sweetie
(4 rows)

Time: 88,471 ms

Anyway I think that overhead is not ignorable. That's why I have splited
levenshtein_internal into levenshtein_internal and levenshtein_internal_mb,
and levenshtein_less_equal_internal into levenshtein_less_equal_internal and
levenshtein_less_equal_internal_mb.

> * I prefer a struct rather than an array. "4 * m" and "3 * m" look like
> magic
> numbers for me. Could you name the entries with definition of a struct?
> /*
> * For multibyte encoding we'll also store array of lengths of
> * characters and array with character offsets in first string
> * in order to avoid great number of pg_mblen calls.
> */
> prev = (int *) palloc(4 * m * sizeof(int));
>
I this line of code the memory is allocated for 4 arrays: prev, curr,
offsets, char_lens. So I have joined offsets and char_lens into struct. But
I can't join prev and curr because of this trick:
temp = curr;
curr = prev;
prev = temp;

* There are some compiler warnings. Avoid them if possible.
> fuzzystrmatch.c: In function ‘levenshtein_less_equal_internal’:
> fuzzystrmatch.c:428: warning: ‘char_lens’ may be used uninitialized in
> this function
> fuzzystrmatch.c:428: warning: ‘offsets’ may be used uninitialized in
> this function
> fuzzystrmatch.c:430: warning: ‘curr_right’ may be used uninitialized
> in this function
> fuzzystrmatch.c: In function ‘levenshtein_internal’:
> fuzzystrmatch.c:222: warning: ‘char_lens’ may be used uninitialized in
> this function
>
Fixed.

* Coding style: Use "if (m == 0)" instead of "if (!m)" when the type
> of 'm' is an integer.
>
Fixed.

> * Need to fix the caution in docs.
> http://developer.postgresql.org/pgdocs/postgres/fuzzystrmatch.html
> | Caution: At present, fuzzystrmatch does not work well with
> | multi-byte encodings (such as UTF-8).
> but now levenshtein supports multi-byte encoding! We should
> mention which function supports mbchars not to confuse users.
>
I've updated this notification. Also I've added documentation for
levenshtein_less_equal function.

* (Not an issue for the patch, but...)
> Could you rewrite PG_GETARG_TEXT_P, VARDATA, and VARSIZE to
> PG_GETARG_TEXT_PP, VARDATA_ANY, and VARSIZE_ANY_EXHDR?
> Unpacking versions make the core a bit faster.
>
Fixed.

With best regards,
Alexander Korotkov.

Attachment Content-Type Size
fuzzystrmatch-0.4.diff.gz application/x-gzip 5.2 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Rajanikant Chirmade 2010-07-13 07:10:35 Re: multibyte-character aware support for function "downcase_truncate_identifier()"
Previous Message Itagaki Takahiro 2010-07-13 06:50:54 Re: patch (for 9.1) string functions