strpos() && KMP

Lists: pgsql-patches
From: Pavel Ajtkulov <ajtkulov(at)acm(dot)org>
To: pgsql-patches(at)postgresql(dot)org
Subject: strpos() && KMP
Date: 2007-08-01 20:18:22
Message-ID: 1167412014.20070802011822@acm.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-patches

Hello,

this patch allow to use Knuth-Morrison-Pratt algorithm for strpos() function
(see Cormen et al. Introduction to Algorithms, MIT Press, 2001).

It also works with multibyte wchar.

In worst case current brute force strpos() takes O(n * m) (n && m is length of strings)
time (example: 'aaa...aaab' search in 'aaa...aaa').
KMP algo always takes O(n + m) time.
To check this someone need to create a table with one text attribute, and insert several thousands
record 'aa..aa'(for example, with lenght = 1000) . After execute "select count(*) from test where
strpos(a, 'aaa....aab') > 0;" on current and modified version.

Also, I advise to use "select .. where strpos(att, 'word') > 0;" instead "select .. where attr like '%word%'"
(strpos must be faster than regex).

In general, this belongs to artificial expressions. In natural language KMP is equal (execution time)
current strpos() nearly.

----
Ajtkulov Pavel
ajtkulov(at)acm(dot)org

P. S. Sorry for prime English.

Attachment Content-Type Size
varlena.c.patch application/octet-stream 5.4 KB

From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Pavel Ajtkulov <ajtkulov(at)acm(dot)org>
Cc: pgsql-patches(at)postgresql(dot)org
Subject: Re: strpos() && KMP
Date: 2007-08-01 22:41:31
Message-ID: 46B10C1B.3060802@dunslane.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-patches

Pavel Ajtkulov wrote:
> Also, I advise to use "select .. where strpos(att, 'word') > 0;" instead "select .. where attr like '%word%'"
> (strpos must be faster than regex).
>
>

The LIKE code does not use the regex engine. See
src/backend/utils/adt/like.c and like_match.c - which recently got an
efficiency makeover, especially for MB charsets (and most especially for
UTF8).

cheers

andrew


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Pavel Ajtkulov <ajtkulov(at)acm(dot)org>
Cc: pgsql-patches(at)postgresql(dot)org
Subject: Re: strpos() && KMP
Date: 2007-08-01 23:12:35
Message-ID: 26247.1186009955@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-patches

Pavel Ajtkulov <ajtkulov(at)acm(dot)org> writes:
> this patch allow to use Knuth-Morrison-Pratt algorithm for strpos() function
> (see Cormen et al. Introduction to Algorithms, MIT Press, 2001).

This seems like a lot of complexity added to fix a non-problem. We've
had no complaints about the speed of those functions (at least not since
we changed the original O(N^2) method :-().

regards, tom lane


From: Decibel! <decibel(at)decibel(dot)org>
To: Pavel Ajtkulov <ajtkulov(at)acm(dot)org>
Cc: pgsql-patches(at)postgresql(dot)org
Subject: Re: strpos() && KMP
Date: 2007-08-03 23:41:03
Message-ID: 20070803234102.GX25704@nasby.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-patches

On Thu, Aug 02, 2007 at 01:18:22AM +0500, Pavel Ajtkulov wrote:
> Hello,
>
> this patch allow to use Knuth-Morrison-Pratt algorithm for strpos() function
> (see Cormen et al. Introduction to Algorithms, MIT Press, 2001).
>
> It also works with multibyte wchar.
>
> In worst case current brute force strpos() takes O(n * m) (n && m is length of strings)
> time (example: 'aaa...aaab' search in 'aaa...aaa').
> KMP algo always takes O(n + m) time.
> To check this someone need to create a table with one text attribute, and insert several thousands
> record 'aa..aa'(for example, with lenght = 1000) . After execute "select count(*) from test where
> strpos(a, 'aaa....aab') > 0;" on current and modified version.
>
> Also, I advise to use "select .. where strpos(att, 'word') > 0;" instead "select .. where attr like '%word%'"
> (strpos must be faster than regex).
>
> In general, this belongs to artificial expressions. In natural language KMP is equal (execution time)
> current strpos() nearly.

Do you have any performance test results for this?
--
Decibel!, aka Jim Nasby decibel(at)decibel(dot)org
EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)


From: Pavel Ajtkulov <ajtkulov(at)acm(dot)org>
To: decibel(at)decibel(dot)org
Cc: pgsql-patches(at)postgresql(dot)org
Subject: Re: strpos() && KMP
Date: 2007-08-07 19:19:53
Message-ID: 1985342014.20070808001953@acm.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-patches


> Do you have any performance test results for this?

I describe the worst case in first message (search 'aaa..aab' in
'aa..aa', "complete" N^2). It works some msec instead of several sec
(in current version).

Patch intends for artificial language (for example DNA, or
language with small alphabet, or regular language) only.

In natural language, KMP(and other search algo) haven't notable
advantages (+-5% time execution).

----
Ajtkulov Pavel
ajtkulov(at)acm(dot)org


From: Decibel! <decibel(at)decibel(dot)org>
To: Pavel Ajtkulov <ajtkulov(at)acm(dot)org>
Cc: pgsql-patches(at)postgresql(dot)org
Subject: Re: strpos() && KMP
Date: 2007-08-07 21:45:38
Message-ID: 20070807214538.GS25704@nasby.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-patches

On Wed, Aug 08, 2007 at 12:19:53AM +0500, Pavel Ajtkulov wrote:
>
> > Do you have any performance test results for this?
>
> I describe the worst case in first message (search 'aaa..aab' in
> 'aa..aa', "complete" N^2). It works some msec instead of several sec
> (in current version).

You describe the test case but don't provide any results. Providing the
test case so others can duplicate your results is good, but you should
provide actual results as well. You should also make sure to test any
cases where performance would actually degrade so we can see what that
looks like (though I don't know if that's an issue in this case).
--
Decibel!, aka Jim Nasby decibel(at)decibel(dot)org
EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Pavel Ajtkulov <ajtkulov(at)acm(dot)org>
Cc: decibel(at)decibel(dot)org, pgsql-patches(at)postgresql(dot)org
Subject: Re: strpos() && KMP
Date: 2007-08-07 22:02:48
Message-ID: 29290.1186524168@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-patches

Pavel Ajtkulov <ajtkulov(at)acm(dot)org> writes:
> Patch intends for artificial language (for example DNA, or
> language with small alphabet, or regular language) only.
> In natural language, KMP(and other search algo) haven't notable
> advantages (+-5% time execution).

I wonder why you didn't propose Boyer-Moore instead, as that would have
some advantage for natural language text as well.

The difficulty with B-M is the need for a table indexed by character
code, which at first glance looks impractical for wchars. But it seems
to me that we could use "wchar % 256" as the table index, meaning that
wchars with the same trailing byte share the same table entry. That
would lose some efficiency compared to an exact implementation, but the
limited table size would outweigh that except in the most pathological
cases.

regards, tom lane


From: Pavel Ajtkulov <ajtkulov(at)acm(dot)org>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-patches(at)postgresql(dot)org
Subject: Re: strpos() && KMP
Date: 2007-08-08 18:51:13
Message-ID: 656957725.20070808235113@acm.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-patches

Tom Lane writes:

> I wonder why you didn't propose Boyer-Moore instead, as that would have
> some advantage for natural language text as well.

> The difficulty with B-M is the need for a table indexed by character
> code, which at first glance looks impractical for wchars. But it seems
> to me that we could use "wchar % 256" as the table index, meaning that
> wchars with the same trailing byte share the same table entry. That
> would lose some efficiency compared to an exact implementation, but the
> limited table size would outweigh that except in the most pathological
> cases.

hash table?

The main difficulty with BM is verification and understanding "good
suffix shift" (the second part of BM) (I don't understand it entirely).

----
Ajtkulov Pavel
ajtkulov(at)acm(dot)org


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Pavel Ajtkulov <ajtkulov(at)acm(dot)org>
Cc: pgsql-patches(at)postgresql(dot)org
Subject: Re: strpos() && KMP
Date: 2007-08-09 02:24:20
Message-ID: 12142.1186626260@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-patches

Pavel Ajtkulov <ajtkulov(at)acm(dot)org> writes:
> Tom Lane writes:
>> The difficulty with B-M is the need for a table indexed by character
>> code, which at first glance looks impractical for wchars. But it seems
>> to me that we could use "wchar % 256" as the table index, meaning that
>> wchars with the same trailing byte share the same table entry.

> hash table?

I'd think the cost of hashing would render it impractical. Most of the
papers I've seen on this topic worry about getting single instructions
out of the search loop --- a hash lookup will cost lots more than that.
Moreover, you'd lose the guarantee of not-worse-than-linear time,
because hash lookup can be pathologically bad if you get a lot of hash
collisions.

> The main difficulty with BM is verification and understanding "good
> suffix shift" (the second part of BM) (I don't understand it entirely).

Yeah, there seem to be a bunch of variants of BM (many of them not
guaranteed linear, which I'm sure we don't want) and the earliest
papers had bugs. But a good implementation would be a lot easier
sell because it would show benefits for a much wider set of use-cases
than KMP.

regards, tom lane


From: Pavel Ajtkulov <ajtkulov(at)acm(dot)org>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-patches(at)postgresql(dot)org
Subject: Re: strpos() && KMP
Date: 2007-08-10 20:27:49
Message-ID: 16410532421.20070811012749@acm.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-patches

Tom Lane writes:

>> hash table?

> I'd think the cost of hashing would render it impractical. Most of the
> papers I've seen on this topic worry about getting single instructions
> out of the search loop --- a hash lookup will cost lots more than that.
> Moreover, you'd lose the guarantee of not-worse-than-linear time,
> because hash lookup can be pathologically bad if you get a lot of hash
> collisions.

compute max_wchar, min_wchar. If (d = max_wchar - min_wchar) < k (for
example, k = 1000), then we use index table (wchar -> wchar -
min_wchar). Else we use hash table. Number of collisions would be a
few (because hash table needs for pattern characters only. Characters
located serially, hash function = whchar % const).

>> The main difficulty with BM is verification and understanding "good
>> suffix shift" (the second part of BM) (I don't understand it entirely).

> Yeah, there seem to be a bunch of variants of BM (many of them not
> guaranteed linear, which I'm sure we don't want) and the earliest
> papers had bugs. But a good implementation would be a lot easier
> sell because it would show benefits for a much wider set of use-cases
> than KMP.

Is there requirement for some string mathching algorithms/data
structure(suffix array/tree) in PG? or "We've
had no complaints about the speed of those functions".

----
Ajtkulov Pavel
ajtkulov(at)acm(dot)org


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Pavel Ajtkulov <ajtkulov(at)acm(dot)org>
Cc: pgsql-patches(at)postgresql(dot)org
Subject: Re: strpos() && KMP
Date: 2007-08-11 05:15:12
Message-ID: 22661.1186809312@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-patches

Pavel Ajtkulov <ajtkulov(at)acm(dot)org> writes:
> Tom Lane writes:
>> Moreover, you'd lose the guarantee of not-worse-than-linear time,
>> because hash lookup can be pathologically bad if you get a lot of hash
>> collisions.

> compute max_wchar, min_wchar. If (d = max_wchar - min_wchar) < k (for
> example, k = 1000), then we use index table (wchar -> wchar -
> min_wchar). Else we use hash table. Number of collisions would be a
> few (because hash table needs for pattern characters only.

I think you missed my point: there's a significant difference between
"guaranteed good performance" and "probabilistically good performance".
Even when the probably-good algorithm wins for typical cases, there's a
strong argument to be made for guarantees. The problem you set out to
solve really is that an algorithm that's all right in everyday cases
will suck in certain uncommon cases --- so why do you want to fix it
by just moving around the cases in which it fails to do well?

regards, tom lane


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Pavel Ajtkulov <ajtkulov(at)acm(dot)org>
Cc: pgsql-patches(at)postgresql(dot)org
Subject: Re: strpos() && KMP
Date: 2007-09-26 08:48:11
Message-ID: 200709260848.l8Q8mBv16140@momjian.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-patches


Added to TODO:

> * Implement Boyer-Moore searching in strpos()
>
> http://archives.postgresql.org/pgsql-patches/2007-08/msg00012.php

---------------------------------------------------------------------------

Pavel Ajtkulov wrote:
> Hello,
>
> this patch allow to use Knuth-Morrison-Pratt algorithm for strpos() function
> (see Cormen et al. Introduction to Algorithms, MIT Press, 2001).
>
> It also works with multibyte wchar.
>
> In worst case current brute force strpos() takes O(n * m) (n && m is length of strings)
> time (example: 'aaa...aaab' search in 'aaa...aaa').
> KMP algo always takes O(n + m) time.
> To check this someone need to create a table with one text attribute, and insert several thousands
> record 'aa..aa'(for example, with lenght = 1000) . After execute "select count(*) from test where
> strpos(a, 'aaa....aab') > 0;" on current and modified version.
>
> Also, I advise to use "select .. where strpos(att, 'word') > 0;" instead "select .. where attr like '%word%'"
> (strpos must be faster than regex).
>
> In general, this belongs to artificial expressions. In natural language KMP is equal (execution time)
> current strpos() nearly.
>
>
> ----
> Ajtkulov Pavel
> ajtkulov(at)acm(dot)org
>
> P. S. Sorry for prime English.

[ Attachment, skipping... ]

>
> ---------------------------(end of broadcast)---------------------------
> TIP 5: don't forget to increase your free space map settings

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +