Re: multibyte charater set in levenshtein function

From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Itagaki Takahiro <itagaki(dot)takahiro(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: multibyte charater set in levenshtein function
Date: 2010-07-24 15:27:47
Message-ID: AANLkTi=m5aNd0a0-gw_9bvyeVapaSUEE8JfcsVonSksJ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Here is new version of my patch. There are following changes:

1) I've merged singlebyte and multibyte versions of levenshtein_internal and
levenshtein_less_equal_internal using macros and includes.
2) I found that levenshtein takes reasonable time even for long strings.
There is an example with strings which contain random combinations of words.

test=# select count(*) from words;
count
-------
98569
(1 row)

test=# select * from words limit 10;
id |
word | next_id
-------+------------------------------------------------------------------------------------+---------
200 | Agnew diodes drilled composts Wassermann nonprofit adjusting
Chautauqua | 17370
4589 | Dhaka craziness savvies teenager ploughs Barents's unwed
zither | 70983
5418 | Eroses gauchos bowls smellier weeklies tormentors cornmeal
punctuality | 96685
6103 | G prompt boron overthrew cricking begonia snuffers
blazed | 34518
4707 | Djibouti Mari pencilling approves pointing's pashas magpie
rams | 71200
10125 | Lyle Edgardo livers gruelled capable cancels gnawing's
inhospitable | 28018
9615 | Leos deputy evener yogis assault palatals Octobers
surceased | 21878
15640 | Starr's arrests malapropism Koufax's aesthetes Howell rustier
Algeria | 19316
15776 | Sucre battle's misapprehension's predicating nutmegged
electrification bowl planks | 65739
16878 | Updike Forster ragout keenly exhalation grayed switch
guava's | 43013
(10 rows)

Time: 26,125 ms

Time: 137,061 ms
test=# select avg(length(word)) from words;
avg
---------------------
74.6052207083362923
(1 row)

Time: 96,415 ms
test=# select * from words where levenshtein(word, 'Dhaka craziness savvies
teeae luhs Barentss unwe zher')<=10;
id | word |
next_id
------+-----------------------------------------------------------------+---------
4589 | Dhaka craziness savvies teenager ploughs Barents's unwed zither |
70983
(1 row)

Time: 2957,426 ms
test=# select * from words where levenshtein_less_equal(word, 'Dhaka
craziness savvies teeae luhs Barentss unwe zher',10)<=10;
id | word |
next_id
------+-----------------------------------------------------------------+---------
4589 | Dhaka craziness savvies teenager ploughs Barents's unwed zither |
70983
(1 row)

Time: 84,755 ms

It takes O(max(n, m) * max_d / min(ins_c, max_c)) in worst case. That's why
I changed limitation of this function to max_d <= 255 * min(ins_c, del_c)

3) I found that there is no need for storing character offsets in
levenshtein_less_equal_internal_mb. Instead of this first position in
string, where value of distance wasn't greater than max_d, can be stored.

4) I found that there is theoretical maximum distance based of string
lengths. It represents the case, when no characters are matching. If
theoretical maximum distance is less than given maximum distance we can use
theoretical maximum distance. It improves behavior of levenshtein_less_equal
with high max_d, but it's still slower than levenshtein:

test=# select * from words where levenshtein_less_equal(word, 'Dhaka
craziness savvies teeae luhs Barentss unwe zher',1000)<=10;
id | word |
next_id
------+-----------------------------------------------------------------+---------
4589 | Dhaka craziness savvies teenager ploughs Barents's unwed zither |
70983
(1 row)

Time: 4102,731 ms
test=# select * from words where levenshtein(word, 'Dhaka craziness savvies
teeae luhs Barentss unwe zher')<=10;
id | word |
next_id
------+-----------------------------------------------------------------+---------
4589 | Dhaka craziness savvies teenager ploughs Barents's unwed zither |
70983
(1 row)

Time: 2983,244 ms

----
With best regards,
Alexander Korotkov.

Attachment Content-Type Size
fuzzystrmatch-0.5.tar.gz application/x-gzip 5.1 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Oleg Bartunov 2010-07-24 16:22:02 Re: Incorrect FTS result with GIN index
Previous Message Pavel Stehule 2010-07-24 15:17:33 Re: patch (for 9.1) string functions