Re: multibyte character set in levenshtein function

From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Itagaki Takahiro <itagaki(dot)takahiro(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: multibyte character set in levenshtein function
Date: 2010-08-02 10:57:18
Message-ID: AANLkTimpvXnGZdJjDf3iTKq8=FnQB00LK6tPAb1uJfBr@mail.gmail.com
Lists: pgsql-hackers

On Mon, Aug 2, 2010 at 5:20 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

> I reviewed this code in a fair amount of detail today and ended up
> rewriting it. In general terms, it's best to avoid changing things
> that are not relevant to the central purpose of the patch. This patch
> randomly adds a whole bunch of whitespace and removes a whole bunch of
> comments which I find useful and do not want to see removed. It took
> me about an hour to reverse out all of those changes, which then let
> me get down to the business of actually analyzing the code.

I'm sorry. This is my first patch; I will be more careful next time. But I
think there is no unified opinion about some of these changes, like replacing
"!m" with "m == 0".

I think this line is not correct:
"if (m != s_bytes && n != t_bytes)"
The correct negation of the assumption that both strings are single-byte is
the assumption that at least one string is not single-byte. With this patch
the levenshtein function can calculate the distance incorrectly:

test=# select levenshtein('фw', 'ww');
 levenshtein
-------------
           2
(1 row)

This line should be rewritten as:
"if (m != s_bytes || n != t_bytes)"

> The attached patch takes the approach of using a fast-path for just
> the innermost loop when neither string contains any multi-byte
> characters (regardless of whether or not the encoding allows
> multi-byte characters). It's still slower than CVS HEAD, but for
> strings containing no multi-byte characters it's only marginally, if
> at all, slower than previous releases, because 9.0 doesn't contain
> your fix to avoid the extra string copy; and it's faster than your
> implementation even when multi-byte characters ARE present. It is
> slightly slower on single-byte encoding databases (I tested
> SQL_ASCII), but still faster than previous releases. It's also quite
> a bit less code duplication.
>

Can I look at the test which shows that your implementation is faster than
mine when multi-byte characters are present? My tests show the reverse. For
example, with the Russian dictionary from OpenOffice:

With my version of patch:
test=# select sum(levenshtein(word, 'фывафыва')) from test;
sum
---------
1675281
(1 row)

Time: 277,190 ms

With your version of patch:
test=# select sum(levenshtein(word, 'фывафыва')) from test;
sum
---------
1675281
(1 row)

Time: 398,011 ms

I think that the cause of the slowdown is the memcmp function. This function
is very fast for long stretches of memory, but for a few bytes an inline
function is faster, because the compiler has more freedom for optimization.
After replacing memcmp with an inline function like the following in your
implementation:

static inline bool
char_cmp(const char *s, const char *d, int len)
{
	while (len--)
	{
		if (*s++ != *d++)
			return false;
	}
	return true;
}

the code becomes much faster:

test=# select sum(levenshtein(word, 'фывафыва')) from test;
sum
---------
1675281
(1 row)

Time: 241,272 ms

----
With best regards,
Alexander Korotkov.
