From: | Alexander Korotkov <aekorotkov(at)gmail(dot)com> |
---|---|
To: | Itagaki Takahiro <itagaki(dot)takahiro(at)gmail(dot)com> |
Cc: | PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: multibyte charater set in levenshtein function |
Date: | 2010-07-13 07:04:16 |
Message-ID: | AANLkTilE4gCI4NnYhe3MxBurK9OiESYju8POczysrVcv@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
Hi!
* levenshtein_internal() and levenshtein_less_equal_internal() are very
> similar. Can you merge the code? We can always use less_equal_internal()
> if the overhead is ignorable. Did you compare them?
>
With big value of max_d overhead is significant. Here is example on
american-english dictionary from Openoffice.
test=# select sum(levenshtein('qweqweqweqweqwe',word)) from words;
sum
---------
1386456
(1 row)
Time: 195,083 ms
test=# select sum(levenshtein_less_equal('qweqweqweqweqwe',word,100)) from
words;
sum
---------
1386456
(1 row)
Time: 317,821 ms
> * There are many "if (!multibyte_encoding)" in levenshtein_internal().
> How about split the function into two funcs for single-byte chars and
> multi-byte chars? (ex. levenshtein_internal_mb() ) Or, we can always
> use multi-byte version if the overhead is small.
>
The overhead of multi-byte version was about 4 times slower. But I have
rewritten my CHAR_CMP macro with inline function. And now it's only about
1.5 times slower.
In database with muti-byte encoding:
test=# select * from words where levenshtein('qweqweqwe',word)<=5;
id | word
-------+----------
69053 | peewee
69781 | pewee
81279 | sequence
88421 | sweetie
(4 rows)
Time: 136,742 ms
In database with single-byte encoding:
test2=# select * from words where levenshtein('qweqweqwe',word)<=5;
id | word
-------+----------
69053 | peewee
69781 | pewee
81279 | sequence
88421 | sweetie
(4 rows)
Time: 88,471 ms
Anyway I think that overhead is not ignorable. That's why I have splited
levenshtein_internal into levenshtein_internal and levenshtein_internal_mb,
and levenshtein_less_equal_internal into levenshtein_less_equal_internal and
levenshtein_less_equal_internal_mb.
> * I prefer a struct rather than an array. "4 * m" and "3 * m" look like
> magic
> numbers for me. Could you name the entries with definition of a struct?
> /*
> * For multibyte encoding we'll also store array of lengths of
> * characters and array with character offsets in first string
> * in order to avoid great number of pg_mblen calls.
> */
> prev = (int *) palloc(4 * m * sizeof(int));
>
I this line of code the memory is allocated for 4 arrays: prev, curr,
offsets, char_lens. So I have joined offsets and char_lens into struct. But
I can't join prev and curr because of this trick:
temp = curr;
curr = prev;
prev = temp;
* There are some compiler warnings. Avoid them if possible.
> fuzzystrmatch.c: In function ‘levenshtein_less_equal_internal’:
> fuzzystrmatch.c:428: warning: ‘char_lens’ may be used uninitialized in
> this function
> fuzzystrmatch.c:428: warning: ‘offsets’ may be used uninitialized in
> this function
> fuzzystrmatch.c:430: warning: ‘curr_right’ may be used uninitialized
> in this function
> fuzzystrmatch.c: In function ‘levenshtein_internal’:
> fuzzystrmatch.c:222: warning: ‘char_lens’ may be used uninitialized in
> this function
>
Fixed.
* Coding style: Use "if (m == 0)" instead of "if (!m)" when the type
> of 'm' is an integer.
>
Fixed.
> * Need to fix the caution in docs.
> http://developer.postgresql.org/pgdocs/postgres/fuzzystrmatch.html
> | Caution: At present, fuzzystrmatch does not work well with
> | multi-byte encodings (such as UTF-8).
> but now levenshtein supports multi-byte encoding! We should
> mention which function supports mbchars not to confuse users.
>
I've updated this notification. Also I've added documentation for
levenshtein_less_equal function.
* (Not an issue for the patch, but...)
> Could you rewrite PG_GETARG_TEXT_P, VARDATA, and VARSIZE to
> PG_GETARG_TEXT_PP, VARDATA_ANY, and VARSIZE_ANY_EXHDR?
> Unpacking versions make the core a bit faster.
>
Fixed.
With best regards,
Alexander Korotkov.
Attachment | Content-Type | Size |
---|---|---|
fuzzystrmatch-0.4.diff.gz | application/x-gzip | 5.2 KB |
From | Date | Subject | |
---|---|---|---|
Next Message | Rajanikant Chirmade | 2010-07-13 07:10:35 | Re: multibyte-character aware support for function "downcase_truncate_identifier()" |
Previous Message | Itagaki Takahiro | 2010-07-13 06:50:54 | Re: patch (for 9.1) string functions |