Re: multibyte charater set in levenshtein function

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc: Itagaki Takahiro <itagaki(dot)takahiro(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: multibyte charater set in levenshtein function
Date: 2010-08-02 01:20:34
Message-ID: AANLkTik-BLr8tumY1eBxQM4SYDbhZRkoynRp3N+q07Cc@mail.gmail.com
Lists: pgsql-hackers

On Fri, Jul 30, 2010 at 1:14 PM, Alexander Korotkov
<aekorotkov(at)gmail(dot)com> wrote:
> Ok, here is the patch for multi-byte characters.
> I changed the arguments of the levenshtein_internal function from text * to
> const char * and int. I think that makes levenshtein_internal more reusable.
> For example, the function can be used to calculate the distance between two
> fragments of strings without needing to copy those fragments.
> Actually, I'm not fully satisfied with my solution for merging the single-byte
> and multi-byte versions of the function. There are "ifdef"s in the body of
> the function and non-context-free macros. This is because the multi-byte
> version needs to store the character lengths returned by pg_mblen, while the
> single-byte version doesn't need that. I tried a multi-byte version without
> storing character lengths, but in that case the function is about 1.5 times
> slower (it needs to call pg_mblen n*m times).
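
[The length-caching idea described above can be sketched roughly as follows. This is a simplified, UTF-8-only illustration, not the patch itself: utf8_char_len is a hypothetical stand-in for PostgreSQL's encoding-aware pg_mblen.]

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical stand-in for PostgreSQL's pg_mblen(): return the byte
 * length of the UTF-8 character starting at s.  (The real pg_mblen()
 * dispatches on the database encoding; this sketch assumes UTF-8.)
 */
static int
utf8_char_len(const char *s)
{
	unsigned char c = (unsigned char) *s;

	if (c < 0x80)
		return 1;
	else if (c < 0xE0)
		return 2;
	else if (c < 0xF0)
		return 3;
	else
		return 4;
}

/*
 * Walk the string once, recording each character's byte length, so that
 * a distance computation can step through characters without re-decoding
 * them n*m times.  Returns the number of characters; lens must have room
 * for one entry per character.
 */
static int
cache_char_lengths(const char *s, int byte_len, int *lens)
{
	int			nchars = 0;
	int			pos = 0;

	while (pos < byte_len)
	{
		int			w = utf8_char_len(s + pos);

		lens[nchars++] = w;
		pos += w;
	}
	return nchars;
}
```

[The one-pass cache trades O(n + m) extra memory for avoiding the n*m pg_mblen calls the mail mentions.]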

I reviewed this code in a fair amount of detail today and ended up
rewriting it. In general terms, it's best to avoid changing things
that are not relevant to the central purpose of the patch. This patch
randomly adds a whole bunch of whitespace and removes a whole bunch of
comments which I find useful and do not want to see removed. It took
me about an hour to reverse out all of those changes, which then let
me get down to the business of actually analyzing the code. The
biggest problem I found is that, as coded, this is more than 50%
slower on UTF-8 databases, because the multi-byte path is really slow,
even if there are actually no multi-byte characters present.

The attached patch takes the approach of using a fast-path for just
the innermost loop when neither string contains any multi-byte
characters (regardless of whether or not the encoding allows
multi-byte characters). It's still slower than CVS HEAD, but for
strings containing no multi-byte characters it's only marginally, if
at all, slower than previous releases, because 9.0 doesn't contain
your fix to avoid the extra string copy; and it's faster than your
implementation even when multi-byte characters ARE present. It is
slightly slower on single-byte encoding databases (I tested
SQL_ASCII), but still faster than previous releases. It also has quite
a bit less code duplication.
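
[For readers without the attachment, the fast-path shape described above might look roughly like this. A hedged sketch under stated assumptions, not the patch itself: is_single_byte and levenshtein_ascii are illustrative names, and the real code special-cases only the innermost loop rather than the whole function.]

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* True if s contains no multi-byte characters, i.e. every byte is ASCII. */
static int
is_single_byte(const char *s, int len)
{
	int			i;

	for (i = 0; i < len; i++)
		if ((unsigned char) s[i] & 0x80)
			return 0;
	return 1;
}

static int
min3(int a, int b, int c)
{
	int			m = (a < b) ? a : b;

	return (m < c) ? m : c;
}

/*
 * Byte-at-a-time Levenshtein distance: valid whenever neither string
 * contains a multi-byte character, regardless of the database encoding.
 * Uses the standard two-row dynamic program.
 */
static int
levenshtein_ascii(const char *s, int m, const char *t, int n)
{
	int		   *prev = malloc((n + 1) * sizeof(int));
	int		   *curr = malloc((n + 1) * sizeof(int));
	int			i, j, d;

	for (j = 0; j <= n; j++)
		prev[j] = j;

	for (i = 1; i <= m; i++)
	{
		curr[0] = i;
		for (j = 1; j <= n; j++)
		{
			int			cost = (s[i - 1] == t[j - 1]) ? 0 : 1;

			curr[j] = min3(prev[j] + 1,			/* deletion */
						   curr[j - 1] + 1,		/* insertion */
						   prev[j - 1] + cost); /* substitution */
		}
		memcpy(prev, curr, (n + 1) * sizeof(int));
	}
	d = prev[n];
	free(prev);
	free(curr);
	return d;
}
```

[A caller would test both strings with is_single_byte once up front and only fall back to the pg_mblen-aware loop when a multi-byte character is actually present, which is why the fast path costs almost nothing on pure-ASCII data even in UTF-8 databases.]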

Please review and let me know your thoughts.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

Attachment Content-Type Size
mb_levenshtein_rmh.patch application/octet-stream 5.0 KB
