Lists: | pgsql-hackers |
---|
From: | Alexander Korotkov <aekorotkov(at)gmail(dot)com> |
---|---|
To: | pgsql-hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | multibyte charater set in levenshtein function |
Date: | 2010-05-10 15:35:02 |
Message-ID: | AANLkTinbhlfvWhT_sUOEQj1IHzJdGE-dUUZCXq-3QqYm@mail.gmail.com |
Hackers,
The current version of the levenshtein function in the fuzzystrmatch contrib
module doesn't work properly with multibyte character sets.
test=# select levenshtein('фыва','аыва');
levenshtein
-------------
2
(1 row)
My patch makes this function work properly with multibyte character sets.
test=# select levenshtein('фыва','аыва');
levenshtein
-------------
1
(1 row)
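The difference between the two results can be reproduced outside the database. The following is only a minimal Python sketch, not the patch itself: it assumes a UTF-8 encoding and uses a hypothetical helper named levenshtein with unit costs. Run over raw bytes, the two-byte first letter counts as two edits; run over characters, it counts as one.

```python
def levenshtein(a, b):
    """Edit distance between two sequences, with unit costs (illustrative helper)."""
    prev = list(range(len(b) + 1))           # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        cur = [i]                            # deleting the first i items of a
        for j, y in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution or match
        prev = cur
    return prev[-1]


s, t = "фыва", "аыва"

# Byte-wise comparison: each Cyrillic letter is two bytes in UTF-8 (assumed encoding).
byte_distance = levenshtein(s.encode("utf-8"), t.encode("utf-8"))

# Character-wise comparison: each letter is one unit.
char_distance = levenshtein(s, t)

print(byte_distance, char_distance)
```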
It also avoids the text_to_cstring call.
Regards,
Alexander Korotkov.
Attachment | Content-Type | Size |
---|---|---|
fuzzystrmatch.diff.gz | application/x-gzip | 2.3 KB |
From: | Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org> |
---|---|
To: | Alexander Korotkov <aekorotkov(at)gmail(dot)com> |
Cc: | pgsql-hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: multibyte charater set in levenshtein function |
Date: | 2010-05-12 19:04:18 |
Message-ID: | 1273690962-sup-2257@alvh.no-ip.org |
Excerpts from Alexander Korotkov's message of Mon May 10 11:35:02 -0400 2010:
> Hackers,
>
> The current version of the levenshtein function in the fuzzystrmatch contrib
> module doesn't work properly with multibyte character sets.
> My patch makes this function work properly with multibyte character sets.
Great. Please add it to the next commitfest:
http://commitfest.postgresql.org
On a quick look, I didn't like the way you separated the
"pg_database_encoding_max_length() > 1" cases. There seems to be too
much common code. Can that be refactored a bit better?
From: | Alexander Korotkov <aekorotkov(at)gmail(dot)com> |
---|---|
To: | Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org> |
Cc: | pgsql-hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: multibyte charater set in levenshtein function |
Date: | 2010-05-12 20:13:58 |
Message-ID: | AANLkTim9jUv4uWE3iL5ZDrZs18H1BQoA4U3RAAd7zBij@mail.gmail.com |
On Wed, May 12, 2010 at 11:04 PM, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org> wrote:
> On a quick look, I didn't like the way you separated the
> "pg_database_encoding_max_length() > 1" cases. There seems to be too
> much common code. Can that be refactored a bit better?
>
I did a little refactoring to avoid some of the duplicated code.
I'm not quite sure about my CHAR_CMP macro. Is it a good idea?
Attachment | Content-Type | Size |
---|---|---|
fuzzystrmatch-0.2.diff.gz | application/x-gzip | 2.4 KB |
From: | Alvaro Herrera <alvherre(at)commandprompt(dot)com> |
---|---|
To: | Alexander Korotkov <aekorotkov(at)gmail(dot)com> |
Cc: | pgsql-hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: multibyte charater set in levenshtein function |
Date: | 2010-05-13 02:03:13 |
Message-ID: | 20100513020311.GA7628@alvh.no-ip.org |
Alexander Korotkov wrote:
> On Wed, May 12, 2010 at 11:04 PM, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org> wrote:
>
> > On a quick look, I didn't like the way you separated the
> > "pg_database_encoding_max_length() > 1" cases. There seems to be too
> > much common code. Can that be refactored a bit better?
> >
> I did a little refactoring to avoid some of the duplicated code.
> I'm not quite sure about my CHAR_CMP macro. Is it a good idea?
Well, since it's only used in one place, why are you defining a macro at
all?
--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.
From: | Alexander Korotkov <aekorotkov(at)gmail(dot)com> |
---|---|
To: | Alvaro Herrera <alvherre(at)commandprompt(dot)com> |
Cc: | pgsql-hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: multibyte charater set in levenshtein function |
Date: | 2010-05-13 06:49:13 |
Message-ID: | AANLkTinjLfIASGJN_58kOVQ6kFWON0yXZngCY8kppfRa@mail.gmail.com |
On Thu, May 13, 2010 at 6:03 AM, Alvaro Herrera <alvherre(at)commandprompt(dot)com> wrote:
> Well, since it's only used in one place, why are you defining a macro at
> all?
>
To structure the code better. My question was about something else, though:
is memcmp a good choice for comparing very short sequences of bytes (from 1
to 4 bytes)?
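To make the question concrete, here is a rough Python sketch of what comparing two characters of 1 to 4 bytes involves. It is not the CHAR_CMP macro from the patch (which is not shown in this thread); it assumes UTF-8, and the names utf8_char_len, char_spans and chars_equal are hypothetical. The length check plays the role of a cheap pre-test, and the slice comparison stands in for memcmp.

```python
def utf8_char_len(lead: int) -> int:
    """Byte length of a UTF-8 character, derived from its lead byte."""
    if lead < 0x80:
        return 1
    if (lead & 0xE0) == 0xC0:
        return 2
    if (lead & 0xF0) == 0xE0:
        return 3
    if (lead & 0xF8) == 0xF0:
        return 4
    raise ValueError(f"invalid UTF-8 lead byte: {lead:#x}")


def char_spans(data: bytes) -> list:
    """Return (offset, length) for every character in a UTF-8 byte string."""
    spans, pos = [], 0
    while pos < len(data):
        n = utf8_char_len(data[pos])
        spans.append((pos, n))
        pos += n
    return spans


def chars_equal(a: bytes, a_span, b: bytes, b_span) -> bool:
    """Compare two characters: lengths first, then the bytes themselves."""
    (a_off, a_len), (b_off, b_len) = a_span, b_span
    if a_len != b_len:          # different widths can never match
        return False
    return a[a_off:a_off + a_len] == b[b_off:b_off + b_len]  # memcmp analogue


a = "фыва".encode("utf-8")
b = "аыва".encode("utf-8")
spans_a, spans_b = char_spans(a), char_spans(b)
first_equal = chars_equal(a, spans_a[0], b, spans_b[0])    # 'ф' vs 'а'
second_equal = chars_equal(a, spans_a[1], b, spans_b[1])   # 'ы' vs 'ы'
```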
From: | Alexander Korotkov <aekorotkov(at)gmail(dot)com> |
---|---|
To: | pgsql-hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: multibyte charater set in levenshtein function |
Date: | 2010-06-06 20:00:08 |
Message-ID: | AANLkTinEIfNbX4cZx5GfH2iHbyRIAcMfCsx6hlm5QIj7@mail.gmail.com |
Hello Hackers!
I have extended my patch by introducing a levenshtein_less_equal function.
This function takes an additional argument, max_d, and stops calculating once
the distance exceeds max_d. With low values of max_d, it works much faster
than the original function.
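The early stop can be sketched in Python as follows. This is an illustration of the idea only, not the C code in the patch: it assumes unit costs, and returning max_d + 1 for "too far" is a choice made for this sketch. The smallest value in a row of the dynamic-programming table is a lower bound on the final distance, so once it exceeds max_d the remaining rows need not be computed.

```python
def levenshtein_less_equal(source, target, max_d):
    """Bounded edit distance with unit costs (illustrative sketch).

    Returns the exact distance when it is <= max_d, otherwise max_d + 1.
    """
    # With unit costs the length difference is a lower bound on the distance.
    if abs(len(source) - len(target)) > max_d:
        return max_d + 1

    prev = list(range(len(target) + 1))
    for i, s_ch in enumerate(source, start=1):
        cur = [i]
        for j, t_ch in enumerate(target, start=1):
            cur.append(min(prev[j] + 1,                    # deletion
                           cur[j - 1] + 1,                 # insertion
                           prev[j - 1] + (s_ch != t_ch)))  # substitution or match
        # Every later cell is at least min(cur), so stop once it exceeds max_d.
        if min(cur) > max_d:
            return max_d + 1
        prev = cur
    return min(prev[-1], max_d + 1)


# A made-up word list; only the first five words appear in this thread.
words = ["consistent", "insistent", "consistency", "coexistent",
         "consistence", "database"]
matches = [(w, levenshtein_less_equal(w, "consistent", 2)) for w in words]
matches = [(w, d) for w, d in matches if d <= 2]
```

The saving grows as max_d shrinks, because most candidate words are rejected after only a few rows.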
An example of using the original levenshtein function:
test=# select word, levenshtein(word, 'consistent') as dist from words where
levenshtein(word, 'consistent') <= 2 order by dist;
    word     | dist
-------------+------
 consistent  |    0
 insistent   |    2
 consistency |    2
 coexistent  |    2
 consistence |    2
(5 rows)
test=# explain analyze select word, levenshtein(word, 'consistent') as dist
from words where levenshtein(word, 'consistent') <= 2 order by dist;
                                                  QUERY PLAN
---------------------------------------------------------------------------------------------------------------
 Sort (cost=2779.13..2830.38 rows=20502 width=8) (actual time=203.652..203.658 rows=5 loops=1)
   Sort Key: (levenshtein(word, 'consistent'::text))
   Sort Method: quicksort Memory: 25kB
   -> Seq Scan on words (cost=0.00..1310.83 rows=20502 width=8) (actual time=19.019..203.601 rows=5 loops=1)
         Filter: (levenshtein(word, 'consistent'::text) <= 2)
 Total runtime: 203.723 ms
(6 rows)
Example of levenshtein_less_equal usage in this case:
test=# select word, levenshtein_less_equal(word, 'consistent', 2) as dist
from words where levenshtein_less_equal(word, 'consistent', 2) <= 2 order by
dist;
    word     | dist
-------------+------
 consistent  |    0
 insistent   |    2
 consistency |    2
 coexistent  |    2
 consistence |    2
test=# explain analyze select word, levenshtein_less_equal(word,
'consistent', 2) as dist from words where levenshtein_less_equal(word,
'consistent', 2) <= 2 order by dist;
                                                 QUERY PLAN
-------------------------------------------------------------------------------------------------------------
 Sort (cost=2779.13..2830.38 rows=20502 width=8) (actual time=42.198..42.203 rows=5 loops=1)
   Sort Key: (levenshtein_less_equal(word, 'consistent'::text, 2))
   Sort Method: quicksort Memory: 25kB
   -> Seq Scan on words (cost=0.00..1310.83 rows=20502 width=8) (actual time=5.391..42.143 rows=5 loops=1)
         Filter: (levenshtein_less_equal(word, 'consistent'::text, 2) <= 2)
 Total runtime: 42.292 ms
(6 rows)
In the example above levenshtein_less_equal works about 5 times faster.
With best regards,
Alexander Korotkov.
Attachment | Content-Type | Size |
---|---|---|
fuzzystrmatch-0.3.diff.gz | application/x-gzip | 4.4 KB |