Re: multibyte charater set in levenshtein function

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc: Itagaki Takahiro <itagaki(dot)takahiro(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: multibyte charater set in levenshtein function
Date: 2010-08-02 14:48:37
Message-ID: AANLkTim0ZZs8zp6DM7qowV4jwyYhy1Dy+_fN6ApEWBf5@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

2010/8/2 Alexander Korotkov <aekorotkov(at)gmail(dot)com>:
> On Mon, Aug 2, 2010 at 5:20 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>> I reviewed this code in a fair amount of detail today and ended up
>> rewriting it.  In general terms, it's best to avoid changing things
>> that are not relevant to the central purpose of the patch.  This patch
>> randomly adds a whole bunch of whitespace and removes a whole bunch of
>> comments which I find useful and do not want to see removed.  It took
>> me about an hour to reverse out all of those changes, which then let
>> me get down to the business of actually analyzing the code.
> I'm sorry. This is my first patch, I will be more accurate next time. But, I
> think there is no unified opinion about some changes like replacement "!m"
> with "m==0".

Sure; if that were the only such change, I wouldn't have mentioned it.

> I think this line is not correct:
> "if (m != s_bytes && n != t_bytes)"

Good catch, fixed in the attached version.

>> The attached patch takes the approach of using a fast-path for just
>> the innermost loop when neither string contains any multi-byte
>> characters (regardless of whether or not the encoding allows
>> multi-byte characters).  It's still slower than CVS HEAD, but for
>> strings containing no multi-byte characters it's only marginally, if
>> at all, slower than previous releases, because 9.0 doesn't contain
>> your fix to avoid the extra string copy; and it's faster than your
>> implementation even when multi-byte characters ARE present.  It is
>> slightly slower on single-byte encoding databases (I tested
>> SQL_ASCII), but still faster than previous releases.  It's also quite
>> a bit less code duplication.
>
> Can I look at the test which shows that your implementation is faster than
> my when multi-byte characters are present? My tests show reverse. For
> example, with russian dictionary of openoffice:

Hmm, with the correction you mention above, I'm getting about the same
results with multi-byte characters (but faster with only single-byte
characters). I did this:

CREATE TABLE words (a text not null primary key);
COPY words from '/usr/share/dict/words';

SELECT SUM(levenshtein(a, 'foo')) from words;
SELECT SUM(levenshtein(a, 'Urbański')) FROM words;
SELECT SUM(levenshtein(a, 'ańs')) FROM words;

With the updated patch attached below, these ran about 102 ms, 222 ms,
129 ms. With your patch, I get about 134 ms, 222 ms, 130 ms. Perhaps
this is because I only have multi-byte characters in one of the
inputs, not both. I don't have a dictionary handy with multi-byte
words in it. Can you send me a file?

> I think that the cause of slow down slowdown is memcmp function. This
> function is very fast for long parts of memory, but of few bytes inline
> function is faster, because compiler have more freedom for optimization.
> After replacing memcmp with inline function like following in your
> implementation:

Yeah, that does seem to be slightly better. I've done a version of
this in the attached patch.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

Attachment Content-Type Size
mb_levenshtein_rmh-v2.patch application/octet-stream 5.5 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Marc Cousin 2010-08-02 15:27:07 Re: lock_timeout GUC patch - Review
Previous Message Yeb Havinga 2010-08-02 14:47:29 Re: patch for check constraints using multiple inheritance