Re: levenshtein_less_equal (was: multibyte charater set in levenshtein function)

Lists: pgsql-hackers
From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc: Itagaki Takahiro <itagaki(dot)takahiro(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: levenshtein_less_equal (was: multibyte charater set in levenshtein function)
Date: 2010-09-24 20:01:24
Message-ID: AANLkTimfB+ebz_FC6Uvc=aSkkiGjMvDKzX2zdbS2CUUM@mail.gmail.com

On Sat, Aug 28, 2010 at 8:34 AM, Alexander Korotkov
<aekorotkov(at)gmail(dot)com> wrote:
> Here is the patch which adds levenshtein_less_equal function. I'm going to
> add it to current commitfest.

There are some minor stylistic issues with this patch - e.g. lines
ending in whitespace, cuddled elses - but those don't look too
terribly difficult to fix. I'm more concerned about the fact that I
don't really understand the algorithm you're using. Actually, I
didn't really understand the original algorithm either until I went
and read up on it, and I just adjusted the comments to make it a bit
more clear what it's doing. That caused some minor merge conflicts
with your patch, so I'm attaching a rebased version that applies
cleanly over my changes.

Can you explain a bit more what algorithm this is using? It seems
like in the max_d >= 0 case the cells of the notional array have a
meaning which is completely different from what they mean otherwise,
and it's not clear to me from reading the comments what that meaning
is. I took a look at that font of all human knowledge, Wikipedia:

http://en.wikipedia.org/wiki/Levenshtein_distance

Their suggestion for handling this case is:

"If we are only interested in the distance if it is smaller than a
threshold k, then it suffices to compute a diagonal stripe of width
2k+1 in the matrix. In this way, the algorithm can be run in O(kl)
time, where l is the length of the shortest string."

It seems like that may be similar to what you're doing here but I
don't think that's exactly it. I don't think that exact thing would
work in our case anyhow because we've got configurable costs.
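
For reference, the plain algorithm under discussion fills the full
(m+1) x (n+1) matrix row by row. A minimal standalone sketch with unit
costs only (the real fuzzystrmatch code also takes configurable
ins/del/sub costs, which this omits; the function name is made up):

```c
#include <stdlib.h>
#include <string.h>

/* Full-matrix Levenshtein distance, unit costs, two rolling rows. */
static int
levenshtein_sketch(const char *s, const char *t)
{
	int			m = (int) strlen(s);
	int			n = (int) strlen(t);
	int		   *prev = malloc((n + 1) * sizeof(int));
	int		   *curr = malloc((n + 1) * sizeof(int));
	int			i, j, result;

	/* row 0: transforming the empty prefix of s into t[0..j] costs j */
	for (j = 0; j <= n; j++)
		prev[j] = j;

	for (i = 1; i <= m; i++)
	{
		int		   *tmp;

		curr[0] = i;			/* delete all i characters of s's prefix */
		for (j = 1; j <= n; j++)
		{
			int			del = prev[j] + 1;
			int			ins = curr[j - 1] + 1;
			int			sub = prev[j - 1] + (s[i - 1] == t[j - 1] ? 0 : 1);
			int			best = del < ins ? del : ins;

			curr[j] = best < sub ? best : sub;
		}
		tmp = prev;
		prev = curr;
		curr = tmp;
	}
	result = prev[n];
	free(prev);
	free(curr);
	return result;
}
```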

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

Attachment Content-Type Size
levenshtein_less_equal-0.2.patch application/octet-stream 15.0 KB

From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Itagaki Takahiro <itagaki(dot)takahiro(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: levenshtein_less_equal (was: multibyte charater set in levenshtein function)
Date: 2010-09-27 12:42:52
Message-ID: AANLkTi=bzNBFhCVc+D=L5BP+BjHKYqd9PATBA5QMQQHF@mail.gmail.com
Lists: pgsql-hackers

I'll try to illustrate different approaches in examples.

      k  i  t  t  e  n
   0  1  2  3  4  5  6
s  1  1  2  3  4  5  6
i  2  2  1  2  3  4  5
t  3  3  2  1  2  3  4
t  4  4  3  2  1  2  3
i  5  5  4  3  2  2  3
n  6  6  5  4  3  3  2
g  7  7  6  5  4  4  3

The approach mentioned in Wikipedia limits the filling of the matrix to
the diagonals that lie within distance k of the main diagonal (the
diagonal which contains the top-left cell). All other cells are
guaranteed to have values greater than k. This approach can be extended
to the case of costs different from 1; in that case the limiting
diagonals will be closer to the main diagonal, in proportion to the
costs. Here is an example of such a limited matrix with k = 3.

      k  i  t  t  e  n
   0  1  2  3
s  1  1  2  3  4
i  2  2  1  2  3  4
t  3  3  2  1  2  3  4
t     4  3  2  1  2  3
i        4  3  2  2  3
n           4  3  3  2
g              4  4  3
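
The stripe idea can be sketched like this (unit costs only;
`levenshtein_stripe` is a hypothetical name, not the patch's function):
cells outside the band |i - j| <= k are never computed and are simply
treated as already exceeding k.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Levenshtein distance, unit costs, computed only inside the stripe
 * |i - j| <= k.  Returns k + 1 whenever the true distance exceeds k.
 */
static int
levenshtein_stripe(const char *s, const char *t, int k)
{
	int			m = (int) strlen(s);
	int			n = (int) strlen(t);
	int			big = k + 1;	/* stands in for "greater than k" */
	int		   *prev,
			   *curr,
			   *tmp;
	int			i, j, result;

	/* the length difference alone forces at least |m - n| ins/del steps */
	if (abs(m - n) > k)
		return big;

	prev = malloc((n + 1) * sizeof(int));
	curr = malloc((n + 1) * sizeof(int));

	for (j = 0; j <= n; j++)
		prev[j] = (j <= k) ? j : big;

	for (i = 1; i <= m; i++)
	{
		int			lo = (i - k < 1) ? 1 : i - k;
		int			hi = (i + k > n) ? n : i + k;

		for (j = 0; j <= n; j++)
			curr[j] = big;		/* everything outside the band */
		if (i <= k)
			curr[0] = i;

		for (j = lo; j <= hi; j++)
		{
			int			del = prev[j] + 1;
			int			ins = curr[j - 1] + 1;
			int			sub = prev[j - 1] + (s[i - 1] == t[j - 1] ? 0 : 1);
			int			best = del < ins ? del : ins;

			best = best < sub ? best : sub;
			curr[j] = (best > big) ? big : best;
		}
		tmp = prev;
		prev = curr;
		curr = tmp;
	}
	result = (prev[n] > k) ? big : prev[n];
	free(prev);
	free(curr);
	return result;
}
```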

The first idea of my approach is to use the actual cell values to limit
the filling of the matrix. For each row, the range of columns whose cell
values are not greater than k is stored, and in the next row only those
cells are calculated which use values of the previous row's cells in the
stored range. Here is an example of this approach with k = 3. Slightly
fewer cells are filled, but the calculation is more complicated than in
the previous approach.

      k  i  t  t  e  n
   0  1  2  3
s  1  1  2  3
i  2  2  1  2  3
t  3  3  2  1  2  3
t        3  2  1  2  3
i           3  2  2  3
n              3  3  2
g                    3
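
A rough sketch of this first idea, again with unit costs and a
hypothetical function name: each row remembers the column range whose
values are still <= k, and the next row is computed only where it can
draw on that range (extending to the right while an insertion chain can
still stay <= k).

```c
#include <stdlib.h>
#include <string.h>

/*
 * Levenshtein distance, unit costs, where each row stores the column
 * range whose values are <= k and the next row is computed only where
 * it can use that range.  Returns k + 1 when the distance exceeds k.
 */
static int
levenshtein_row_limited(const char *s, const char *t, int k)
{
	int			m = (int) strlen(s);
	int			n = (int) strlen(t);
	int			big = k + 1;
	int		   *prev = malloc((n + 1) * sizeof(int));
	int		   *curr = malloc((n + 1) * sizeof(int));
	int		   *tmp;
	int			i, j, lo, hi, result;

	lo = -1;
	hi = -1;
	for (j = 0; j <= n; j++)
	{
		prev[j] = (j <= k) ? j : big;
		if (prev[j] <= k)
		{
			if (lo < 0)
				lo = j;
			hi = j;
		}
	}

	for (i = 1; i <= m && lo >= 0; i++)
	{
		int			limit = (hi + 1 < n) ? hi + 1 : n;
		int			new_lo = -1,
					new_hi = -1;

		for (j = 0; j <= n; j++)
			curr[j] = big;
		for (j = lo; j <= n; j++)
		{
			int			best;

			/* right of the useful range, only an insertion chain helps */
			if (j > limit && curr[j - 1] >= big)
				break;
			if (j == 0)
				best = i;
			else
			{
				int			del = prev[j] + 1;
				int			ins = curr[j - 1] + 1;
				int			sub = prev[j - 1] +
					(s[i - 1] == t[j - 1] ? 0 : 1);

				best = del < ins ? del : ins;
				best = best < sub ? best : sub;
			}
			curr[j] = (best > big) ? big : best;
			if (curr[j] <= k)
			{
				if (new_lo < 0)
					new_lo = j;
				new_hi = j;
			}
		}
		lo = new_lo;
		hi = new_hi;
		tmp = prev;
		prev = curr;
		curr = tmp;
	}
	result = (lo >= 0 && prev[n] <= k) ? prev[n] : big;
	free(prev);
	free(curr);
	return result;
}
```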

The second idea is to make the values in the matrix as large as
possible. I analyzed what exactly the matrix is in this case. It is the
sum of the original matrix, which represents the distances between
prefixes of the strings, and a matrix which represents the cost of the
insertions or deletions needed to move to the diagonal containing the
bottom-right cell. Here is an example of the second matrix summand:

      k  i  t  t  e  n
   1  2  3  4  5  6  7
s  0  1  2  3  4  5  6
i  1  0  1  2  3  4  5
t  2  1  0  1  2  3  4
t  3  2  1  0  1  2  3
i  4  3  2  1  0  1  2
n  5  4  3  2  1  0  1
g  6  5  4  3  2  1  0

And an example of the resulting matrix:

      k  i  t  t  e  n
   1  3  5  7  9  11 13
s  1  2  4  6  8  10 12
i  3  2  2  4  6  8  10
t  5  4  2  2  4  6  8
t  7  6  4  2  2  4  6
i  9  8  6  4  2  3  5
n  11 10 8  6  4  3  3
g  13 12 10 8  6  5  3

The resulting matrix preserves an important property of the original
matrix: a cell's value is always greater than or equal to the values
used in its calculation. That's why we can apply the same
filling-limitation idea to this matrix; and because the values in this
matrix are greater, it lets us avoid filling a bigger part of the
matrix. Here is an example of limited filling for this matrix:

      k  i  t  t  e  n
   1  3
s  1  2
i     2  2
t        2  2
t           2  2
i              2  3
n                 3  3
g                    3
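
The decomposition is easy to check mechanically: with unit costs, the
transformed cell is just the original cell plus |(m - i) - (n - j)|,
the cheapest conceivable completion to the bottom-right corner, which
is 0 there, so the final distance is unchanged. A small sketch (a
hypothetical helper, not patch code) that counts how many cells stay
<= k under each view, for the matrices above:

```c
#include <stdlib.h>
#include <string.h>

/* cheapest conceivable completion from cell (i, j): the length gap */
static int
residual_cost(int m, int n, int i, int j)
{
	int			r = (m - i) - (n - j);

	return (r < 0) ? -r : r;
}

/*
 * Count the cells whose value is <= k, either in the plain
 * prefix-distance matrix (transformed = 0) or after adding the minimal
 * completion cost (transformed = 1).  Shows how much more pruning the
 * transformed view allows.
 */
static int
count_cells_le(const char *s, const char *t, int k, int transformed)
{
	int			m = (int) strlen(s);
	int			n = (int) strlen(t);
	int		   *dp = malloc((m + 1) * (n + 1) * sizeof(int));
	int			i, j, count = 0;

	for (j = 0; j <= n; j++)
		dp[j] = j;
	for (i = 1; i <= m; i++)
	{
		dp[i * (n + 1)] = i;
		for (j = 1; j <= n; j++)
		{
			int			del = dp[(i - 1) * (n + 1) + j] + 1;
			int			ins = dp[i * (n + 1) + j - 1] + 1;
			int			sub = dp[(i - 1) * (n + 1) + j - 1] +
				(s[i - 1] == t[j - 1] ? 0 : 1);
			int			best = del < ins ? del : ins;

			dp[i * (n + 1) + j] = best < sub ? best : sub;
		}
	}

	for (i = 0; i <= m; i++)
		for (j = 0; j <= n; j++)
		{
			int			v = dp[i * (n + 1) + j];

			if (transformed)
				v += residual_cost(m, n, i, j);
			if (v <= k)
				count++;
		}
	free(dp);
	return count;
}
```

For "sitting" versus "kitten" with k = 3 this counts 32 cells <= k in
the plain matrix but only 16 in the transformed one, matching the
example matrices above.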

----
With best regards,
Alexander Korotkov.


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc: Itagaki Takahiro <itagaki(dot)takahiro(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: levenshtein_less_equal (was: multibyte charater set in levenshtein function)
Date: 2010-09-28 02:52:40
Message-ID: AANLkTinin8HjBeJyMdZCaGRTr8C1O4OawLJnQcrg9-W4@mail.gmail.com
Lists: pgsql-hackers

On Mon, Sep 27, 2010 at 8:42 AM, Alexander Korotkov
<aekorotkov(at)gmail(dot)com> wrote:
> The second idea is to make values in matrix possible greater. I analyze what
> exactly is matrix in this case. It is sum of original matrix, which
> represent distances between prefixes of strings, and matrix, which represent
> cost of insertions or deletions for moving to diagonal, which containing
> bottom right cell. There is an example of second matrix summand:
>
>       k  i  t  t  e  n
>    1  2  3  4  5  6  7
> s  0  1  2  3  4  5  6
> i  1  0  1  2  3  4  5
> t  2  1  0  1  2  3  4
> t  3  2  1  0  1  2  3
> i  4  3  2  1  0  1  2
> n  5  4  3  2  1  0  1
> g  6  5  4  3  2  1  0
>
> And an example of resulting matrix:
>
>       k  i  t  t  e  n
>    1  3  5  7  9  11 13
> s  1  2  4  6  8  10 12
> i  3  2  2  4  6  8  10
> t  5  4  2  2  4  6  8
> t  7  6  4  2  2  4  6
> i  9  8  6  4  2  3  5
> n  11 10 8  6  4  3  3
> g  13 12 10 8  6  5  3
>
> The resulting matrix saves important property of original matrix, that cell
> value always greater or equal than values, which are used for it's
> calculation.

Hmm. So if I understand correctly (and it's possible that I don't),
the idea here is that for each cell in the matrix, we store the sum of
the following two quantities:

1. The cost of transforming an initial substring of s into an initial
substring of t (as before), and
2. The smallest imaginable cost of transforming the remaining portion
of s into the remaining portion of t, based on the difference in the
string lengths.

Thus, any cell whose value is already > max_d can't be relevant to the
final result.

The code seems a bit complex, though. In particular, I'm wondering if
we couldn't simplify things by leaving the meaning of the cells in the
matrix unchanged and just using this calculation to adjust the lower
and upper bounds for each scan. Perhaps instead of
curr/prev_left/right we could call them min_i and max_i, and after
each inner loop from min_i to max_i we could try to increment min_i
and decrement max_i based on whether cur[min/max_i] + the smallest
possible residual cost > max_d. Then you've got to also increment
max_i once for each value of j. Does that make any sense or am I all
wet?
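
One way to read that suggestion (a hedged sketch, not the patch's
code): keep the cells' original prefix-distance meaning and, after
computing each row, shrink the live column range by pruning any cell
whose value plus the cheapest possible completion already exceeds
max_d. Unit costs assumed; all names are hypothetical.

```c
#include <stdlib.h>
#include <string.h>

/* cheapest conceivable completion from cell (i, j): the length gap */
static int
residual_cost(int m, int n, int i, int j)
{
	int			r = (m - i) - (n - j);

	return (r < 0) ? -r : r;
}

/*
 * Bounded Levenshtein distance, unit costs.  Cells keep their usual
 * meaning; after each row the live column range [min_j, max_j] drops
 * cells whose value plus the cheapest completion exceeds max_d.
 * Returns max_d + 1 when the true distance is larger than max_d.
 */
static int
levenshtein_bounded(const char *s, const char *t, int max_d)
{
	int			m = (int) strlen(s);
	int			n = (int) strlen(t);
	int			big = max_d + 1;
	int		   *prev = malloc((n + 1) * sizeof(int));
	int		   *curr = malloc((n + 1) * sizeof(int));
	int		   *tmp;
	int			i, j, min_j, max_j, result;

	min_j = -1;
	max_j = -1;
	for (j = 0; j <= n; j++)
	{
		prev[j] = (j + residual_cost(m, n, 0, j) <= max_d) ? j : big;
		if (prev[j] < big)
		{
			if (min_j < 0)
				min_j = j;
			max_j = j;
		}
	}

	for (i = 1; i <= m && min_j >= 0; i++)
	{
		int			limit = (max_j + 1 < n) ? max_j + 1 : n;
		int			new_min = -1,
					new_max = -1;

		for (j = 0; j <= n; j++)
			curr[j] = big;
		for (j = min_j; j <= n; j++)
		{
			int			best;

			if (j > limit && curr[j - 1] >= big)
				break;			/* nothing useful further right */
			if (j == 0)
				best = i;
			else
			{
				int			del = prev[j] + 1;
				int			ins = curr[j - 1] + 1;
				int			sub = prev[j - 1] +
					(s[i - 1] == t[j - 1] ? 0 : 1);

				best = del < ins ? del : ins;
				best = best < sub ? best : sub;
			}
			/* prune: even the cheapest completion busts the limit */
			if (best + residual_cost(m, n, i, j) > max_d)
				best = big;
			curr[j] = (best > big) ? big : best;
			if (curr[j] < big)
			{
				if (new_min < 0)
					new_min = j;
				new_max = j;
			}
		}
		min_j = new_min;
		max_j = new_max;
		tmp = prev;
		prev = curr;
		curr = tmp;
	}
	result = (min_j >= 0 && prev[n] <= max_d) ? prev[n] : big;
	free(prev);
	free(curr);
	return result;
}
```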

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company


From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Itagaki Takahiro <itagaki(dot)takahiro(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: levenshtein_less_equal (was: multibyte charater set in levenshtein function)
Date: 2010-10-04 20:19:18
Message-ID: AANLkTinL406w209Nh5MSxjyAVVh3UTGWknpL_VQAgWMk@mail.gmail.com
Lists: pgsql-hackers

I've reworked the patch with your suggestion. In this version I found a little
slowdown in comparison with the previous version:

SELECT * FROM words WHERE levenshtein_less_equal(a, 'extensize', 2) <= 2;
48,069 ms => 57,875 ms
SELECT * FROM words2 WHERE levenshtein_less_equal(a, 'клубничный', 3) <= 2;
100,073 ms => 113,975 ms
select * from phrases where levenshtein_less_equal('kkkknucklehead
courtliest sapphires be coniferous emolument antarctic Laocoon''s deadens
unseemly', a, 10) <= 10;
22,876 ms => 24,721 ms
test=# select * from phrases2 where levenshtein_less_equal('таяй
раскупорившийся передислоцируется юлианович праздничный лачужка присыхать
опппливший ффехтовальный уууудобряющий', a, 10) <= 10;
55,405 ms => 57,760 ms

I think it is caused by the multiplication operation for each bound
movement. Probably this slowdown is ignorable, or there is some way to
achieve the same performance.

----
With best regards,
Alexander Korotkov.

Attachment Content-Type Size
levenshtein_less_equal-0.3.patch text/x-patch 15.2 KB

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc: Itagaki Takahiro <itagaki(dot)takahiro(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: levenshtein_less_equal (was: multibyte charater set in levenshtein function)
Date: 2010-10-08 01:52:26
Message-ID: AANLkTikZsCsFnzjzhggbV9brfPrn+nYnZo1D=2i3VktH@mail.gmail.com
Lists: pgsql-hackers

2010/10/4 Alexander Korotkov <aekorotkov(at)gmail(dot)com>:
> I've reworked patch with your suggestion. In this version I found a little
> slowdown in comparison with previous version:
> SELECT * FROM words WHERE levenshtein_less_equal(a, 'extensize', 2) <= 2;
> 48,069 ms => 57,875 ms
> SELECT * FROM words2 WHERE levenshtein_less_equal(a, 'клубничный', 3) <= 2;
> 100,073 ms => 113,975 ms
> select * from phrases where levenshtein_less_equal('kkkknucklehead
> courtliest   sapphires be coniferous emolument antarctic Laocoon''s deadens
> unseemly', a, 10) <= 10;
> 22,876 ms => 24,721 ms
> test=# select * from phrases2 where levenshtein_less_equal('таяй
> раскупорившийся передислоцируется юлианович праздничный лачужка присыхать
> опппливший ффехтовальный уууудобряющий', a, 10) <= 10;
> 55,405 ms => 57,760 ms
> I think it is caused by multiplication operation for each bound
> movement. Probably, this slowdown is ignorable or there is some way
> to achieve the same performance.

This patch doesn't apply cleanly. It also seems to revert some recent
commits to fuzzystrmatch.c. Can you please send a corrected version?

[rhaas pgsql]$ patch -p1 < ~/Downloads/levenshtein_less_equal-0.3.patch
patching file contrib/fuzzystrmatch/fuzzystrmatch.c
Reversed (or previously applied) patch detected! Assume -R? [n]
Apply anyway? [n] y
Hunk #1 FAILED at 5.
Hunk #8 FAILED at 317.
Hunk #9 succeeded at 543 (offset 10 lines).
Hunk #10 succeeded at 567 (offset 10 lines).
Hunk #11 succeeded at 578 (offset 10 lines).
2 out of 11 hunks FAILED -- saving rejects to file
contrib/fuzzystrmatch/fuzzystrmatch.c.rej
patching file contrib/fuzzystrmatch/fuzzystrmatch.sql.in
patching file doc/src/sgml/fuzzystrmatch.sgml

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company


From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Itagaki Takahiro <itagaki(dot)takahiro(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: levenshtein_less_equal (was: multibyte charater set in levenshtein function)
Date: 2010-10-08 16:51:40
Message-ID: AANLkTi=r6E5eRMoo7f9YEf88pKfGHyxM99bpFYM9Fr0Q@mail.gmail.com
Lists: pgsql-hackers

Sorry, I'm pretty unconversant in git. Now, it should be ok.

----
With best regards,
Alexander Korotkov.

Attachment Content-Type Size
levenshtein_less_equal-0.3.1.patch text/x-patch 15.3 KB

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc: Itagaki Takahiro <itagaki(dot)takahiro(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: levenshtein_less_equal (was: multibyte charater set in levenshtein function)
Date: 2010-10-13 05:49:27
Message-ID: AANLkTim-eBc+FtOYZ+zRfi9t_m6gHsxWYUR6fX=qvu3W@mail.gmail.com
Lists: pgsql-hackers

On Fri, Oct 8, 2010 at 12:51 PM, Alexander Korotkov
<aekorotkov(at)gmail(dot)com> wrote:
> Sorry, I'm pretty unconversant in git. Now, it should be ok.

I spent some time hacking on this. It doesn't appear to be too easy
to get levenshtein_less_equal() working without slowing down plain old
levenshtein() by about 6%. The slowdown was similar on the 0.2
version of the patch, the 0.3.1 version of the patch, and another
version I put together that's a bit different from either. The test
query was:

SELECT * FROM words WHERE levenshtein(a, 'extensize') <= 2;

In the version I was experimenting with, I had a large chunk of code
that looked like this:

if (max_d >= 0)
{
/* Do stuff */
}

Removing that block of code buys back most of the performance loss
that occurs in the case where max_d < 0. I have to conclude that
sticking more stuff into the main loop defeats some optimization
opportunities that would otherwise be available, even in the case
where the branch isn't taken.

So I'm tempted to employ the previously-discussed sledgehammer of
compiling levenshtein_internal twice, with and without support for
max_d. A patch taking this approach is attached. I'm not totally
sold on this approach, so I'm open to constructive suggestions.
Comments?
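
The "compile it twice" idea might look roughly like this (a sketch
only: a constant flag plus constant folding stands in here for the
literal #ifdef/double-inclusion arrangement, the function names are
made up, and the fixed buffer size is an assumption of the sketch):

```c
#include <string.h>

/*
 * One loop body expanded twice: with_limit is a compile-time constant,
 * so the plain copy carries no max_d bookkeeping after constant
 * folding.  Assumes strlen(t) <= 64 to keep the sketch allocation-free.
 */
#define MAKE_LEVENSHTEIN(name, with_limit)							\
static int															\
name(const char *s, const char *t, int max_d)						\
{																	\
	int			m = (int) strlen(s);								\
	int			n = (int) strlen(t);								\
	int			row[65];											\
	int			i, j;												\
																	\
	for (j = 0; j <= n; j++)										\
		row[j] = j;													\
	for (i = 1; i <= m; i++)										\
	{																\
		int			diag = row[0];	/* dp(i-1, j-1) */				\
		int			row_min = i;									\
																	\
		row[0] = i;													\
		for (j = 1; j <= n; j++)									\
		{															\
			int			del = row[j] + 1;							\
			int			ins = row[j - 1] + 1;						\
			int			sub = diag + (s[i - 1] == t[j - 1] ? 0 : 1);\
			int			v = del < ins ? del : ins;					\
																	\
			v = v < sub ? v : sub;									\
			diag = row[j];											\
			row[j] = v;												\
			if (with_limit && v < row_min)							\
				row_min = v;										\
		}															\
		if (with_limit && row_min > max_d)							\
			return max_d + 1;	/* whole row exceeds the limit */	\
	}																\
	return (with_limit && row[n] > max_d) ? max_d + 1 : row[n];		\
}

MAKE_LEVENSHTEIN(lev_plain, 0)
MAKE_LEVENSHTEIN(lev_bounded, 1)
```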

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment Content-Type Size
levenshtein_less_equal-0.4.patch application/octet-stream 23.8 KB

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Alexander Korotkov <aekorotkov(at)gmail(dot)com>, Itagaki Takahiro <itagaki(dot)takahiro(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: levenshtein_less_equal (was: multibyte charater set in levenshtein function)
Date: 2010-10-13 13:32:36
Message-ID: 24409.1286976756@sss.pgh.pa.us
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> I spent some time hacking on this. It doesn't appear to be too easy
> to get levenshtein_less_equal() working without slowing down plain old
> levenshtein() by about 6%.

Is that really enough slowdown to be worth contorting the code to avoid?
I've never heard of an application where the speed of this function was
the bottleneck.

regards, tom lane


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Alexander Korotkov <aekorotkov(at)gmail(dot)com>, Itagaki Takahiro <itagaki(dot)takahiro(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: levenshtein_less_equal (was: multibyte charater set in levenshtein function)
Date: 2010-10-13 14:18:01
Message-ID: 1286979419-sup-9868@alvh.no-ip.org
Lists: pgsql-hackers

Excerpts from Tom Lane's message of mié oct 13 10:32:36 -0300 2010:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> > I spent some time hacking on this. It doesn't appear to be too easy
> > to get levenshtein_less_equal() working without slowing down plain old
> > levenshtein() by about 6%.
>
> Is that really enough slowdown to be worth contorting the code to avoid?
> I've never heard of an application where the speed of this function was
> the bottleneck.

What if it's used in an expression index on a large table?

--
Álvaro Herrera <alvherre(at)commandprompt(dot)com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Alexander Korotkov <aekorotkov(at)gmail(dot)com>, Itagaki Takahiro <itagaki(dot)takahiro(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: levenshtein_less_equal (was: multibyte charater set in levenshtein function)
Date: 2010-10-13 14:51:59
Message-ID: 26040.1286981519@sss.pgh.pa.us
Lists: pgsql-hackers

Alvaro Herrera <alvherre(at)commandprompt(dot)com> writes:
> Excerpts from Tom Lane's message of mié oct 13 10:32:36 -0300 2010:
>> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>>> I spent some time hacking on this. It doesn't appear to be too easy
>>> to get levenshtein_less_equal() working without slowing down plain old
>>> levenshtein() by about 6%.
>>
>> Is that really enough slowdown to be worth contorting the code to avoid?
>> I've never heard of an application where the speed of this function was
>> the bottleneck.

> What if it's used in an expression index on a large table?

So? Expression indexes don't result in multiple evaluations of the
function. If anything, that context would probably be even less
sensitive to the function runtime than non-index use.

But the main point is that 6% performance penalty in a non-core function
is well below my threshold of pain. If it were important enough to care
about that kind of performance difference, it'd be in core. I'd rather
see us keeping the code simple, short, and maintainable.

regards, tom lane


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Alexander Korotkov <aekorotkov(at)gmail(dot)com>, Itagaki Takahiro <itagaki(dot)takahiro(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: levenshtein_less_equal (was: multibyte charater set in levenshtein function)
Date: 2010-10-13 15:22:17
Message-ID: AANLkTimPZMQ=yMuZazrH8v3VP5opo4Ywu5sFxFtMFMo2@mail.gmail.com
Lists: pgsql-hackers

On Wed, Oct 13, 2010 at 10:51 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Alvaro Herrera <alvherre(at)commandprompt(dot)com> writes:
>> Excerpts from Tom Lane's message of mié oct 13 10:32:36 -0300 2010:
>>> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>>>> I spent some time hacking on this.  It doesn't appear to be too easy
>>>> to get levenshtein_less_equal() working without slowing down plain old
>>>> levenshtein() by about 6%.
>>>
>>> Is that really enough slowdown to be worth contorting the code to avoid?
>>> I've never heard of an application where the speed of this function was
>>> the bottleneck.
>
>> What if it's used in an expression index on a large table?
>
> So?  Expression indexes don't result in multiple evaluations of the
> function.  If anything, that context would probably be even less
> sensitive to the function runtime than non-index use.
>
> But the main point is that 6% performance penalty in a non-core function
> is well below my threshold of pain.  If it were important enough to care
> about that kind of performance difference, it'd be in core.  I'd rather
> see us keeping the code simple, short, and maintainable.

Well, then you have to wonder whether it's worth having the
less-than-or-equal-to version around at all. That's only about 2x
faster on the same test case. I do think it's likely that people who
call this function will call it many times, however - e.g. trying to
find the closest matches from a dictionary for a given input string -
so the worry about performance doesn't seem totally out of place.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Alexander Korotkov <aekorotkov(at)gmail(dot)com>, Itagaki Takahiro <itagaki(dot)takahiro(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: levenshtein_less_equal (was: multibyte charater set in levenshtein function)
Date: 2010-10-13 15:42:28
Message-ID: 27008.1286984548@sss.pgh.pa.us
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> On Wed, Oct 13, 2010 at 10:51 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> But the main point is that 6% performance penalty in a non-core function
>> is well below my threshold of pain.

> Well, then you have to wonder whether it's worth having the
> less-than-or-equal-to version around at all. That's only about 2x
> faster on the same test case.

"Same" test case? I thought they did different things?

> I do think it's likely that people who
> call this function will call it many times, however - e.g. trying to
> find the closest matches from a dictionary for a given input string,
> so the worry about performance doesn't seem totally out of place.

No doubt, but the actual function runtime is only one component of the
cost of applying it to a lot of dictionary entries --- I would think
that the table read costs are the larger component anyway.

regards, tom lane


From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Itagaki Takahiro <itagaki(dot)takahiro(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: levenshtein_less_equal (was: multibyte charater set in levenshtein function)
Date: 2010-10-13 15:53:35
Message-ID: AANLkTik5AkOahj3GL6ssZCOaRWgSmcc=Pp3kj4vdnZyb@mail.gmail.com
Lists: pgsql-hackers

> No doubt, but the actual function runtime is only one component of the
> cost of applying it to a lot of dictionary entries --- I would think
> that the table read costs are the larger component anyway.

The data domain can be not only a dictionary but also something like
article titles, URLs and so on. On such relatively long strings (about
100 characters and more) this component will be significant (especially
if most of the table is lying in cache). In this case the search for
near strings can be accelerated by more than 10 times. I think that this
use case justifies the presence of a separate levenshtein_less_equal
function.

----
With best regards,
Alexander Korotkov.


From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Itagaki Takahiro <itagaki(dot)takahiro(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: levenshtein_less_equal (was: multibyte charater set in levenshtein function)
Date: 2010-10-13 15:58:32
Message-ID: AANLkTinnNo6EL46=Q-Bu86o0aLg8Ux7ayNB6uspqE1LT@mail.gmail.com
Lists: pgsql-hackers

>
> In this case search of near strings can be accelerated in more than 10
> times.

I mean that this component of the function runtime can be accelerated by
more than 10 times. Of course, the actual overall speedup depends on many
other factors, but I believe that it can also be significant.

----
With best regards,
Alexander Korotkov.


From: Andres Freund <andres(at)anarazel(dot)de>
To: pgsql-hackers(at)postgresql(dot)org
Cc: Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Robert Haas <robertmhaas(at)gmail(dot)com>, Alexander Korotkov <aekorotkov(at)gmail(dot)com>, Itagaki Takahiro <itagaki(dot)takahiro(at)gmail(dot)com>
Subject: Re: levenshtein_less_equal (was: multibyte charater set in levenshtein function)
Date: 2010-10-13 16:23:22
Message-ID: 201010131823.23841.andres@anarazel.de
Lists: pgsql-hackers

On Wednesday 13 October 2010 16:18:01 Alvaro Herrera wrote:
> Excerpts from Tom Lane's message of mié oct 13 10:32:36 -0300 2010:
> > Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> > > I spent some time hacking on this. It doesn't appear to be too easy
> > > to get levenshtein_less_equal() working without slowing down plain old
> > > levenshtein() by about 6%.
> >
> > Is that really enough slowdown to be worth contorting the code to avoid?
> > I've never heard of an application where the speed of this function was
> > the bottleneck.
>
> What if it's used in an expression index on a large table?
It's hard to use it in a sensible expression index, given that you use it
to calculate the difference between two strings.

What's more important is that it's used for sorting the results of a
query - where it's more important that it's fast.

Andres


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Alexander Korotkov <aekorotkov(at)gmail(dot)com>, Itagaki Takahiro <itagaki(dot)takahiro(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: levenshtein_less_equal (was: multibyte charater set in levenshtein function)
Date: 2010-10-13 16:30:25
Message-ID: AANLkTinHk2zwTUCsOFrN-MbNdVbB-x-U0ymtifLGPMJt@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Oct 13, 2010 at 11:42 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> On Wed, Oct 13, 2010 at 10:51 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>>> But the main point is that 6% performance penalty in a non-core function
>>> is well below my threshold of pain.
>
>> Well, then you have to wonder whether it's worth having the
>> less-than-or-equal-to version around at all.  That's only about 2x
>> faster on the same test case.
>
> "Same" test case?  I thought they did different things?

levenshtein_less_equal(a, b, max_d) returns the same value as
levenshtein(a, b) if levenshtein(a, b) <= max_d. Otherwise it returns
max_d + 1. So it's the same test case with a small distance bound (2)
applied. As Alexander says, the benefit of levenshtein_less_equal
grows pretty rapidly when long strings are involved, so it seems
worth having, but I'm not sure I agree that the slowdown to the basic
function is negligible. It is not really all that much #ifdef hackery
to avoid it.
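
For readers following along, that contract simply says the bounded
function must agree with min(levenshtein(a, b), max_d + 1). A sketch
with unit costs and hypothetical names:

```c
#include <string.h>

/* plain distance (unit costs); fixed buffer assumes strlen(t) <= 64 */
static int
lev(const char *s, const char *t)
{
	int			m = (int) strlen(s);
	int			n = (int) strlen(t);
	int			row[65];
	int			i, j;

	for (j = 0; j <= n; j++)
		row[j] = j;
	for (i = 1; i <= m; i++)
	{
		int			diag = row[0];	/* dp(i-1, j-1) */

		row[0] = i;
		for (j = 1; j <= n; j++)
		{
			int			del = row[j] + 1;
			int			ins = row[j - 1] + 1;
			int			sub = diag + (s[i - 1] == t[j - 1] ? 0 : 1);
			int			v = del < ins ? del : ins;

			v = v < sub ? v : sub;
			diag = row[j];
			row[j] = v;
		}
	}
	return row[n];
}

/* the contract: exact distance when <= max_d, else max_d + 1 */
static int
less_equal_contract(const char *s, const char *t, int max_d)
{
	int			d = lev(s, t);

	return (d <= max_d) ? d : max_d + 1;
}
```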

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company