Quick Links

Re: multibyte charater set in levenshtein function

From:	Itagaki Takahiro <itagaki(dot)takahiro(at)gmail(dot)com>
To:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: multibyte charater set in levenshtein function
Date:	2010-07-12 04:46:25
Message-ID:	AANLkTiksZIMvgthlIbu5S30OXA3Q2QnUNM0XUAr9ESQ_@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

Hi, I'm reviewing "Multibyte charater set in levenshtein function" patch.
https://commitfest.postgresql.org/action/patch_view?id=304

The main logic seems to be good, but I have some comments about
the coding style and refactoring.

* levenshtein_internal() and levenshtein_less_equal_internal() are very
similar. Can you merge the code? We can always use less_equal_internal()
if the overhead is ignorable. Did you compare them?

* There are many "if (!multibyte_encoding)" in levenshtein_internal().
How about split the function into two funcs for single-byte chars and
multi-byte chars? (ex. levenshtein_internal_mb() ) Or, we can always
use multi-byte version if the overhead is small.

* I prefer a struct rather than an array. "4 * m" and "3 * m" look like magic
numbers for me. Could you name the entries with definition of a struct?
/*
* For multibyte encoding we'll also store array of lengths of
* characters and array with character offsets in first string
* in order to avoid great number of pg_mblen calls.
*/
prev = (int *) palloc(4 * m * sizeof(int));

* There are some compiler warnings. Avoid them if possible.
fuzzystrmatch.c: In function ‘levenshtein_less_equal_internal’:
fuzzystrmatch.c:428: warning: ‘char_lens’ may be used uninitialized in
this function
fuzzystrmatch.c:428: warning: ‘offsets’ may be used uninitialized in
this function
fuzzystrmatch.c:430: warning: ‘curr_right’ may be used uninitialized
in this function
fuzzystrmatch.c: In function ‘levenshtein_internal’:
fuzzystrmatch.c:222: warning: ‘char_lens’ may be used uninitialized in
this function

* Coding style: Use "if (m == 0)" instead of "if (!m)" when the type
of 'm' is an integer.

* Need to fix the caution in docs.
http://developer.postgresql.org/pgdocs/postgres/fuzzystrmatch.html
| Caution: At present, fuzzystrmatch does not work well with
| multi-byte encodings (such as UTF-8).
but now levenshtein supports multi-byte encoding! We should
mention which function supports mbchars not to confuse users.

* (Not an issue for the patch, but...)
Could you rewrite PG_GETARG_TEXT_P, VARDATA, and VARSIZE to
PG_GETARG_TEXT_PP, VARDATA_ANY, and VARSIZE_ANY_EXHDR?
Unpacking versions make the core a bit faster.

--
Itagaki Takahiro

Responses

Re: multibyte charater set in levenshtein function at 2010-07-13 07:04:16 from Alexander Korotkov

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Tom Lane	2010-07-12 04:54:10	WIP patch: pass outer-relation Vars as parameters to indexscans
Previous Message	KaiGai Kohei	2010-07-12 04:09:08	Re: Reworks of DML permission checks