From: | Itagaki Takahiro <itagaki(dot)takahiro(at)gmail(dot)com> |
---|---|
To: | Alexander Korotkov <aekorotkov(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: multibyte charater set in levenshtein function |
Date: | 2010-07-12 04:46:25 |
Message-ID: | AANLkTiksZIMvgthlIbu5S30OXA3Q2QnUNM0XUAr9ESQ_@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
Hi, I'm reviewing "Multibyte charater set in levenshtein function" patch.
https://commitfest.postgresql.org/action/patch_view?id=304
The main logic seems to be good, but I have some comments about
the coding style and refactoring.
* levenshtein_internal() and levenshtein_less_equal_internal() are very
similar. Can you merge the code? We can always use less_equal_internal()
if the overhead is ignorable. Did you compare them?
* There are many "if (!multibyte_encoding)" in levenshtein_internal().
How about split the function into two funcs for single-byte chars and
multi-byte chars? (ex. levenshtein_internal_mb() ) Or, we can always
use multi-byte version if the overhead is small.
* I prefer a struct rather than an array. "4 * m" and "3 * m" look like magic
numbers for me. Could you name the entries with definition of a struct?
/*
* For multibyte encoding we'll also store array of lengths of
* characters and array with character offsets in first string
* in order to avoid great number of pg_mblen calls.
*/
prev = (int *) palloc(4 * m * sizeof(int));
* There are some compiler warnings. Avoid them if possible.
fuzzystrmatch.c: In function ‘levenshtein_less_equal_internal’:
fuzzystrmatch.c:428: warning: ‘char_lens’ may be used uninitialized in
this function
fuzzystrmatch.c:428: warning: ‘offsets’ may be used uninitialized in
this function
fuzzystrmatch.c:430: warning: ‘curr_right’ may be used uninitialized
in this function
fuzzystrmatch.c: In function ‘levenshtein_internal’:
fuzzystrmatch.c:222: warning: ‘char_lens’ may be used uninitialized in
this function
* Coding style: Use "if (m == 0)" instead of "if (!m)" when the type
of 'm' is an integer.
* Need to fix the caution in docs.
http://developer.postgresql.org/pgdocs/postgres/fuzzystrmatch.html
| Caution: At present, fuzzystrmatch does not work well with
| multi-byte encodings (such as UTF-8).
but now levenshtein supports multi-byte encoding! We should
mention which function supports mbchars not to confuse users.
* (Not an issue for the patch, but...)
Could you rewrite PG_GETARG_TEXT_P, VARDATA, and VARSIZE to
PG_GETARG_TEXT_PP, VARDATA_ANY, and VARSIZE_ANY_EXHDR?
Unpacking versions make the core a bit faster.
--
Itagaki Takahiro
From | Date | Subject | |
---|---|---|---|
Next Message | Tom Lane | 2010-07-12 04:54:10 | WIP patch: pass outer-relation Vars as parameters to indexscans |
Previous Message | KaiGai Kohei | 2010-07-12 04:09:08 | Re: Reworks of DML permission checks |