Re: [PATCHES] TODO item: Implement Boyer-Moore searching (First time hacker)

Lists: pgsql-hackerspgsql-patches
From: "David Rowley" <dgrowley(at)gmail(dot)com>
To: <pgsql-hackers(at)postgresql(dot)org>
Subject: TODO item: Implement Boyer-Moore searching (First time hacker)
Date: 2008-08-30 23:21:11
Message-ID: D558A7BC159740A1BEF36EE8997B64DD@amd64
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

Reference: Bruce Momjian writes: ->
http://archives.postgresql.org/pgsql-committers/2007-09/msg00402.php

Other references: Boyer Moore?? ->
http://www.cs.utexas.edu/users/moore/best-ideas/string-searching/fstrpos-exa
mple.html

As a first time hacker of postgresql I've decided to attempt something that
does not go too deep in to the heart of the software.

Before I make an attempt at writing a patch, I've decided to start
benchmarking outside of Postgres. On reading the link underneath the Boyer
Moore item in the in the new TODO list wiki, I learned that someone else had
a shot at speeding up strpos around this time last year using KMP algotithm.
On looking a little deeper into Boyer Moore's algorithm it's quite obvious
that there is some overhead involved with preprocessing the search key
(needle). My target has been to not have any trade offs with this possibly
to be patch. I didn't want smaller searches to suffer.

I'd like to receive some feedback on the various functions in the attached
before I go ahead and work on this any more. I've written 8 different
versions of the function. Version 8 seems to be the fastest out of them,
however I've not tested with multi byte chars.

Please note that these are just some initial ideas. For example startpos is
not yet implemented, but there is very little overhead to that anyway.

For comparision sake I mocked up the current postgresql 8.3's strpos() from
varlena.c. I've no idea how accuratly this will be to the real version.
Probably find out once I write the patch. I replaced strncmp with my own
version as I was having problems with my compilers optimizer, it optimized
too much and I was unable to benchmark. I felt running without and
optimization was unfair as strncmp is. Perhaps if someone else wants to
benchmark with -O2 on gcc they should first replace pg_strncmp with strncmp.

Currently I have a single file which can be compiled itself that contains
benchmarks for the functions I've tested so far. All of these implement a
stripped out version of the Boyer Moore algotithm. I've done away with one
of the character maps in a bid to reduce the overhead involved creating
them.

The position calculation will likely need changed for multi byte characters.
I'm suspecting I'm not meant to be doing length calculations on pointers.
Can someone confirm?

To keep this email short I've put most of the information required in as
comments into the source of the program.

I've uploaded some benchmark results using the attached program. The
benchmarking was done on windows.

Please see http://www.unixbeast.com/~fat/test1.html there are 10 tests in
total, in each I attempt to exploit weaknesses, some with my functions and
some with the existing function.

My method for version 8 of the function probably looks quite unusual as it
does some simple cost based optimisation

I look forward to receiving feedback on this.

David.

Attachment Content-Type Size
BMStrpos.tar application/x-tar 38.4 KB

From: Peter Eisentraut <peter_e(at)gmx(dot)net>
To: David Rowley <dgrowley(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: TODO item: Implement Boyer-Moore searching (First time hacker)
Date: 2008-09-02 11:59:37
Message-ID: 48BD2AA9.7070509@gmx.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

David Rowley wrote:
> Reference: Bruce Momjian writes: ->
> http://archives.postgresql.org/pgsql-committers/2007-09/msg00402.php
>
> Other references: Boyer Moore?? ->
> http://www.cs.utexas.edu/users/moore/best-ideas/string-searching/fstrpos-example.html

> I look forward to receiving feedback on this.

Please send a patch in diff -c format. And don't put a single patch
file in a tar file.


From: "David Rowley" <dgrowley(at)gmail(dot)com>
To: "'Peter Eisentraut'" <peter_e(at)gmx(dot)net>, <pgsql-patches(at)postgresql(dot)org>
Subject: Re: [HACKERS] TODO item: Implement Boyer-Moore searching (First time hacker)
Date: 2008-09-02 23:19:26
Message-ID: CC00B041B12F40749EE43D49D99A705D@amd64
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

I wrote:
> I look forward to receiving feedback on this.

Peter Eisentraut wrote:
> Please send a patch in diff -c format. And don't put a single patch
> file in a tar file.

Thanks for the pointer. I'm quite new to this.

I've done some more revisions to the patch. This has mostly just involved
tuning the skip table size based on the size of the search. This has
basically involved lots of benchmarks with different strings and calculating
the best size of table to use. The reason for this is to maintain fast
searches for smaller strings. The overhead of initialising a 256 element
array would probably out weigh the cost of the search if this were not done.
The size of the skip table increases with longer strings, or rather the size
that is utilised.

Performance:
For smaller searches performance of the patch and existing version are very
similar. The patched version starts to out perform the existing version when
the needle and haystack become larger.

The patch wins hands down with searches that leads the existing function in
to dead ends, for example:

SELECT STRPOS('A AA AAA AAAA AAAAA AAAAAA AAAAAAA','AAAAAAA');

When searching for very small strings, say just a single character in a
sentence the existing function marginally beats the patched version.

Outside of Postgres I've done benchmarks where I've searched for every
combination of the search string in the search string. Like:

test | t |
test | te |
test | tes |
test | test |
test | e |
test | es |
test | est |
test | s |
test | st |
test | t |

I felt this was fair for both versions. The patched version beat the
unpatched version. The test I carried out was a string of 934 characters.

I can upload more benchmarks if required.

I'm quite happy with the patch now so I'm going to submit it (in diff -c
format this time :)

David.

Attachment Content-Type Size
boyer-moore_strpos_v1.1.patch application/octet-stream 7.1 KB

From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: David Rowley <dgrowley(at)gmail(dot)com>
Cc: 'Peter Eisentraut' <peter_e(at)gmx(dot)net>, pgsql-patches(at)postgresql(dot)org
Subject: Re: [HACKERS] TODO item: Implement Boyer-Moore searching (First time hacker)
Date: 2008-09-04 22:31:15
Message-ID: 48C061B3.1090709@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

After reading the wikipedia article on Boyer-Moore search algorithm, it
looks to me like this patch actually implements the simpler
Boyer-Moore-Horspool algorithm that only uses one lookup table. That's
probably fine, as it ought to be faster on small needles and haystacks
because it requires less effort to build the lookup tables, even though
the worst-case performance is worse. It should still be faster than what
we have now.

The skip table really should be constructed only once in
text_position_start and stored in TextPositionState. That would make a
big difference to the performance of those functions that call
text_position_next repeatedly: replace_text, split_text and text_to_array.

David Rowley wrote:
> I've done some more revisions to the patch. This has mostly just involved
> tuning the skip table size based on the size of the search. This has
> basically involved lots of benchmarks with different strings and calculating
> the best size of table to use. The reason for this is to maintain fast
> searches for smaller strings. The overhead of initialising a 256 element
> array would probably out weigh the cost of the search if this were not done.
> The size of the skip table increases with longer strings, or rather the size
> that is utilised.
>
> Performance:
> For smaller searches performance of the patch and existing version are very
> similar. The patched version starts to out perform the existing version when
> the needle and haystack become larger.
>
> The patch wins hands down with searches that leads the existing function in
> to dead ends, for example:
>
> SELECT STRPOS('A AA AAA AAAA AAAAA AAAAAA AAAAAAA','AAAAAAA');
>
> When searching for very small strings, say just a single character in a
> sentence the existing function marginally beats the patched version.
>
> Outside of Postgres I've done benchmarks where I've searched for every
> combination of the search string in the search string. Like:
>
> test | t |
> test | te |
> test | tes |
> test | test |
> test | e |
> test | es |
> test | est |
> test | s |
> test | st |
> test | t |
>
>
> I felt this was fair for both versions. The patched version beat the
> unpatched version. The test I carried out was a string of 934 characters.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: David Rowley <dgrowley(at)gmail(dot)com>, "'Peter Eisentraut'" <peter_e(at)gmx(dot)net>, pgsql-patches(at)postgresql(dot)org
Subject: Re: [HACKERS] TODO item: Implement Boyer-Moore searching (First time hacker)
Date: 2008-09-05 02:04:57
Message-ID: 11787.1220580297@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com> writes:
> After reading the wikipedia article on Boyer-Moore search algorithm, it
> looks to me like this patch actually implements the simpler
> Boyer-Moore-Horspool algorithm that only uses one lookup table. That's
> probably fine, as it ought to be faster on small needles and haystacks
> because it requires less effort to build the lookup tables, even though
> the worst-case performance is worse. It should still be faster than what
> we have now.

Hmm. B-M-H has worst case search speed O(M*N) (where M = length of
pattern, N = length of search string); whereas full B-M is O(N).
Maybe we should build the second table when M is large? Although
realistically that is probably gilding the lily, since frankly there
haven't been many real-world complaints about the speed of these
functions anyway ...

> The skip table really should be constructed only once in
> text_position_start and stored in TextPositionState. That would make a
> big difference to the performance of those functions that call
> text_position_next repeatedly: replace_text, split_text and text_to_array.

+1. The heuristic about big a skip table to make may need some
adjustment as well, since it seems to be considering only a single
search.

regards, tom lane


From: "David Rowley" <dgrowley(at)gmail(dot)com>
To: "'Heikki Linnakangas'" <heikki(dot)linnakangas(at)enterprisedb(dot)com>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "'Peter Eisentraut'" <peter_e(at)gmx(dot)net>, <pgsql-patches(at)postgresql(dot)org>
Subject: Re: [HACKERS] TODO item: Implement Boyer-Moore searching (First time hacker)
Date: 2008-09-05 17:02:54
Message-ID: 2F2A1443F6064162BECA6BB37625222E@amd64
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

Heikki Linnakangas wrote:
> After reading the wikipedia article on Boyer-Moore search algorithm, it
> looks to me like this patch actually implements the simpler
> Boyer-Moore-Horspool algorithm that only uses one lookup table. That's
> probably fine, as it ought to be faster on small needles and haystacks
> because it requires less effort to build the lookup tables, even though
> the worst-case performance is worse. It should still be faster than what
> we have now.

Yes, correct, I really didn't want to slow down smaller searches. This
method seemed to fit the bill better. What I didn't realise is that this
method also had a name.

Tom Lane wrote:
> Hmm. B-M-H has worst case search speed O(M*N) (where M = length of
> pattern, N = length of search string); whereas full B-M is O(N). Maybe we
> should build the second table when M is large?

I'll look into this. If it pays off for longer searches I'll submit a patch.
I won't have the time until after the 15th, so perhaps that's in November's
commit fest?

> The skip table really should be constructed only once in
> text_position_start and stored in TextPositionState. That would make a
> big difference to the performance of those functions that call
> text_position_next repeatedly: replace_text, split_text and text_to_array.

Of course you are right. That will help for replace and the like. I'll
update the patch tonight.

Thanks both for the feedback.

David.


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "David Rowley" <dgrowley(at)gmail(dot)com>
Cc: "'Heikki Linnakangas'" <heikki(dot)linnakangas(at)enterprisedb(dot)com>, "'Peter Eisentraut'" <peter_e(at)gmx(dot)net>, pgsql-patches(at)postgresql(dot)org
Subject: Re: [HACKERS] TODO item: Implement Boyer-Moore searching (First time hacker)
Date: 2008-09-05 17:12:17
Message-ID: 27136.1220634737@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

"David Rowley" <dgrowley(at)gmail(dot)com> writes:
> Tom Lane wrote:
>> Hmm. B-M-H has worst case search speed O(M*N) (where M = length of
>> pattern, N = length of search string); whereas full B-M is O(N). Maybe we
>> should build the second table when M is large?

> I'll look into this. If it pays off for longer searches I'll submit a patch.
> I won't have the time until after the 15th, so perhaps that's in November's
> commit fest?

Yes. Let's get B-M-H in during this fest and then you can look into
whether a follow-on patch is worthwhile.

regards, tom lane


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: David Rowley <dgrowley(at)gmail(dot)com>, 'Peter Eisentraut' <peter_e(at)gmx(dot)net>, pgsql-patches(at)postgresql(dot)org
Subject: Re: [HACKERS] TODO item: Implement Boyer-Moore searching (First time hacker)
Date: 2008-09-05 17:25:56
Message-ID: 48C16BA4.2060407@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

Tom Lane wrote:
> "David Rowley" <dgrowley(at)gmail(dot)com> writes:
>> Tom Lane wrote:
>>> Hmm. B-M-H has worst case search speed O(M*N) (where M = length of
>>> pattern, N = length of search string); whereas full B-M is O(N). Maybe we
>>> should build the second table when M is large?
>
>> I'll look into this. If it pays off for longer searches I'll submit a patch.
>> I won't have the time until after the 15th, so perhaps that's in November's
>> commit fest?
>
> Yes. Let's get B-M-H in during this fest and then you can look into
> whether a follow-on patch is worthwhile.

Agreed.

Also, it would be nice to use B-M(-H) for LIKE as well. That's a
different codepath, but probably more widely used than textpos and
frients. I haven't looked how feasible it would be to do, though.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: David Rowley <dgrowley(at)gmail(dot)com>, "'Peter Eisentraut'" <peter_e(at)gmx(dot)net>, pgsql-patches(at)postgresql(dot)org
Subject: Re: [HACKERS] TODO item: Implement Boyer-Moore searching (First time hacker)
Date: 2008-09-05 17:29:29
Message-ID: 27645.1220635769@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com> writes:
> Also, it would be nice to use B-M(-H) for LIKE as well.

Right offhand, that seems impossible, at least in patterns with %.
Or were you thinking of trying to separate out the fixed substrings
of a pattern and search for them with BMH?

Anyway, it's not material for this patch, since it'd involve pretty
fundamental redesign of the LIKE code.

regards, tom lane


From: "David Rowley" <dgrowley(at)gmail(dot)com>
To: "'Tom Lane'" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "'Heikki Linnakangas'" <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: "'Peter Eisentraut'" <peter_e(at)gmx(dot)net>, <pgsql-patches(at)postgresql(dot)org>
Subject: Re: [HACKERS] TODO item: Implement Boyer-Moore searching (First time hacker)
Date: 2008-09-05 20:13:58
Message-ID: C848A7ECB1E346388B9BD7F7966C0DDC@amd64
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

Heikki Linnakangas wrote:
> The skip table really should be constructed only once in
> text_position_start and stored in TextPositionState. That would make a
> big difference to the performance of those functions that call
> text_position_next repeatedly: replace_text, split_text and text_to_array.

I Wrote:
> Of course you are right. That will help for replace and the like. I'll
> update the patch tonight.

I've made and attached the changes Heikki recommended.
Also updated benchmark spreadsheet. Here ->
http://www.unixbeast.com/~fat/8.3_test_v1.2.xls

Previously there was an error with "Test 6", the test that benchmarked
replace(). I kept this one so not to affect the summary result in the new
sheet. I then added the sheet "Replace Test" to show more accurate results.
I had failed to notice that the optimizer was helping me out more than I
wanted it to.

My tested replace() script runs in 91% of the time than the 8.3 version.
I've not tested with the CVS head.

Now that the skip table is a member of TextPositionState, I was not quite
sure if I should #define a size for it. It would certainly look neater, only
the code that defines the skip table size in text_position_start assumes 256
elements.

Any thoughts on this?

David.

Attachment Content-Type Size
boyer-moore-strpos_v1.2.patch application/octet-stream 9.0 KB

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "David Rowley" <dgrowley(at)gmail(dot)com>
Cc: "'Heikki Linnakangas'" <heikki(dot)linnakangas(at)enterprisedb(dot)com>, "'Peter Eisentraut'" <peter_e(at)gmx(dot)net>, pgsql-patches(at)postgresql(dot)org
Subject: Re: [HACKERS] TODO item: Implement Boyer-Moore searching (First time hacker)
Date: 2008-09-06 22:21:01
Message-ID: 26072.1220739661@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

"David Rowley" <dgrowley(at)gmail(dot)com> writes:
> I've made and attached the changes Heikki recommended.

I looked this over a bit and was immediately confused by one thing:
the introductory comment says that the skip table size ought to be based
on the length of the haystack, which makes sense to me, but the code is
actually initializing it on the basis of len2, ie, the length of the
needle. Isn't that a bug? Was the same bug present in the tests you
made to determine the best table sizes?

Stylistic issues:

I really dislike having two copies of the heuristic size-setting code.
I think it's worth introducing a second test of use_wchar in order to
arrange text_position_setup like this:

... existing code ...

choose skiptablesize

initialize skip table (this loop doesn't depend on char width)

if (!state->use_wchar)
process char needle
else
process wide-char needle

Also it might be worth having local-variable copies of skiptablesize and
len2 for this initialization loop:

for (ai = 0; ai <= state->skiptablesize; ai++)
state->skiptable[ai] = state->len2;

I'm not at all sure whether a C compiler will think that it's allowed to
avoid fetching these values again on each iteration, seeing that the
loop is storing into other integer members of *state. Given that this
loop is exactly the overhead we're trying to control by adjusting
skiptablesize, making it as tight as possible seems worthwhile.

> Now that the skip table is a member of TextPositionState, I was not quite
> sure if I should #define a size for it. It would certainly look neater, only
> the code that defines the skip table size in text_position_start assumes 256
> elements.

I tend to agree that a #define is pointless there, since there's no easy
way to make the relevant code do anything with it. It would be
worthwhile to point out adjacent to the code what the max length is,
though. I had gotten as far as rewriting your introductory comment like
this

/*
* Prepare the skip table for Boyer-Moore-Horspool searching. First we
* must determine how big a skip table to use. The declaration of
* TextPositionState allows up to 256 elements, but with haystack lengths
* of only a few characters we don't really want to have to initialize so
* many elements --- it would take too long in comparison to the actual
* search time. So we choose a useful skip table size based on the
* haystack length.

before noticing that what I was writing wasn't what the code actually
did. Another point that doesn't seem to be included in your comments is
that the skip table size *must* be a power of 2 because we use bit
masking. (In fact state->skiptablesize might better be named skiptablemask
since it's really 1 less than the table size.)

BTW, try to avoid introducing so much random vertical space. IMHO it
does little for readability and much to make routines too long to fit in
one screenful. Generally the PG code doesn't use double blank lines
anywhere inside a function, nor blank lines before a right brace line.
pg_indent will clean this up to some extent, but it would be better if
the code looked more like the project style in the first place.

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "David Rowley" <dgrowley(at)gmail(dot)com>, "'Heikki Linnakangas'" <heikki(dot)linnakangas(at)enterprisedb(dot)com>, "'Peter Eisentraut'" <peter_e(at)gmx(dot)net>, pgsql-patches(at)postgresql(dot)org
Subject: Re: [HACKERS] TODO item: Implement Boyer-Moore searching (First time hacker)
Date: 2008-09-06 22:50:47
Message-ID: 26522.1220741447@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

I wrote:
> I looked this over a bit and was immediately confused by one thing:
> the introductory comment says that the skip table size ought to be based
> on the length of the haystack, which makes sense to me, but the code is
> actually initializing it on the basis of len2, ie, the length of the
> needle. Isn't that a bug? Was the same bug present in the tests you
> made to determine the best table sizes?

BTW, to the extent that you feel like testing a different idea,
I would suggest:

* don't bother initializing the skiptable when len1 < len2

* otherwise, choose its size based on len1 - len2, not just len1 or
len2. This is (one less than) the maximum number of search loop
consultations of the skip table that can happen, so it seems like a
plausible number, and better than either length alone.

regards, tom lane


From: "David Rowley" <dgrowley(at)gmail(dot)com>
To: "'Tom Lane'" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "'Heikki Linnakangas'" <heikki(dot)linnakangas(at)enterprisedb(dot)com>, "'Peter Eisentraut'" <peter_e(at)gmx(dot)net>, <pgsql-patches(at)postgresql(dot)org>
Subject: Re: [HACKERS] TODO item: Implement Boyer-Moore searching (First time hacker)
Date: 2008-09-06 23:25:07
Message-ID: 9CD35557A45746D183E72D71802749D4@amd64
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

Tom Lane wrote:
> I looked this over a bit and was immediately confused by one thing:
> the introductory comment says that the skip table size ought to be based
> on the length of the haystack, which makes sense to me, but the code is
> actually initializing it on the basis of len2, ie, the length of the
> needle. Isn't that a bug? Was the same bug present in the tests you
> made to determine the best table sizes?

Yes Bug. That's what I get for making last minute changes. It didn't affect
the benchmark, that was done before moving the code into postgresql for
testing. The function I tested with had an extra parameter to set the skip
table size. Each possible size was tested and the best time was saved along
with the best size.

> BTW, to the extent that you feel like testing a different idea,
> I would suggest:

> * don't bother initializing the skiptable when len1 < len2

Good plan.

> * otherwise, choose its size based on len1 - len2, not just len1 or
> len2. This is (one less than) the maximum number of search loop
> consultations of the skip table that can happen, so it seems like a
> plausible number, and better than either length alone.

That seems like a better idea. I had considered len1 * len2, giving that's
the worst case for BMH. Of course the lengths would need to be raised quite
a bit. I'll go with len1 - len2

I'll make the above changes and then run my benchmark again to see if the
skip table size logic should be updated.

I'll also benchmark and update the benchmark spreadsheet to see what's
changed.

David.


From: "David Rowley" <dgrowley(at)gmail(dot)com>
To: "'Tom Lane'" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "'Heikki Linnakangas'" <heikki(dot)linnakangas(at)enterprisedb(dot)com>, "'Peter Eisentraut'" <peter_e(at)gmx(dot)net>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PATCHES] TODO item: Implement Boyer-Moore searching (First time hacker)
Date: 2008-09-07 02:03:57
Message-ID: 2B14F8402C8E4A3C9E240E3DAF31BA25@amd64
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

<<<Moved from pgsql-patches>>>

Tom Lane wrote:
> I wrote:
> > I looked this over a bit and was immediately confused by one thing:
> > the introductory comment says that the skip table size ought to be based
> > on the length of the haystack, which makes sense to me, but the code is
> > actually initializing it on the basis of len2, ie, the length of the
> > needle. Isn't that a bug? Was the same bug present in the tests you
> > made to determine the best table sizes?
>
> BTW, to the extent that you feel like testing a different idea,
> I would suggest:
>
> * don't bother initializing the skiptable when len1 < len2
>
> * otherwise, choose its size based on len1 - len2, not just len1 or
> len2. This is (one less than) the maximum number of search loop
> consultations of the skip table that can happen, so it seems like a
> plausible number, and better than either length alone.

I've made the discussed changes. Also updated the benchmark results.
http://www.unixbeast.com/~fat/8.3_test_v1.3.xls

On re-benchmarking to determine the best skip table size, the changes to the
logic didn't seem to affect the results enough to have to change the
thresholds. But I'm sure it will perform a little better in cases like when
both the needle and haystack are very long and close to being the same
length. Having a big skip table in this case is a little bit of a waste as
the search won't get to make much use of it.

Attachment Content-Type Size
boyer-moore-strpos_v1.3.patch application/octet-stream 7.9 KB

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "David Rowley" <dgrowley(at)gmail(dot)com>
Cc: "'Heikki Linnakangas'" <heikki(dot)linnakangas(at)enterprisedb(dot)com>, "'Peter Eisentraut'" <peter_e(at)gmx(dot)net>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [PATCHES] TODO item: Implement Boyer-Moore searching (First time hacker)
Date: 2008-09-07 04:26:47
Message-ID: 6183.1220761607@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

"David Rowley" <dgrowley(at)gmail(dot)com> writes:
> I've made the discussed changes. Also updated the benchmark results.
> http://www.unixbeast.com/~fat/8.3_test_v1.3.xls

Applied with revisions; mostly cosmetic except for one point. I
realized after studying the code a bit more that B-M cannot possibly win
for a single-character pattern (needle), since the skip distance must
always be 1 in that case. The fact that it seemed to keep up at that
length has to be because the original coding included a strncmp call
inside the innermost loop, which likely prevents the compiler from
optimizing that loop really tightly. But the strncmp wasn't doing
anything anyway for the case of pattern length = 1. So what I committed
special-cases pattern length 1 to be a naive search with a *very* tight
inner loop. I think it's worth troubling over this case because a
common usage is split_to_array and suchlike with single-character
delimiters.

regards, tom lane


From: "David Rowley" <dgrowley(at)gmail(dot)com>
To: "'Tom Lane'" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "'Heikki Linnakangas'" <heikki(dot)linnakangas(at)enterprisedb(dot)com>, "'Peter Eisentraut'" <peter_e(at)gmx(dot)net>
Cc: <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PATCHES] TODO item: Implement Boyer-Moore searching (First time hacker)
Date: 2008-09-07 11:04:35
Message-ID: E737D69A2FDE4F79832CDAF8FC6C1DEC@amd64
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

Tom Lane Wrote:
> "David Rowley" <dgrowley(at)gmail(dot)com> writes:
> > I've made the discussed changes. Also updated the benchmark results.
> > http://www.unixbeast.com/~fat/8.3_test_v1.3.xls

> Applied with revisions; mostly cosmetic except for one point. I
> realized after studying the code a bit more that B-M cannot possibly win
> for a single-character pattern (needle), since the skip distance must
> always be 1 in that case. The fact that it seemed to keep up at that
> length has to be because the original coding included a strncmp call
> inside the innermost loop, which likely prevents the compiler from
> optimizing that loop really tightly. But the strncmp wasn't doing
> anything anyway for the case of pattern length = 1. So what I committed
> special-cases pattern length 1 to be a naive search with a *very* tight
> inner loop. I think it's worth troubling over this case because a
> common usage is split_to_array and suchlike with single-character
> delimiters.

I had thought about this, but failed to think about string_to_array.
Probably worth the extra code for that.

Well simple enough patch. I've learned quite a bit about the postgresql
source even just for working in varlena.c. I've got a very long way to go
though.

Thanks for all the reviews and suggestions.

David.


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: David Rowley <dgrowley(at)gmail(dot)com>, 'Peter Eisentraut' <peter_e(at)gmx(dot)net>, pgsql-patches(at)postgresql(dot)org
Subject: Re: [HACKERS] TODO item: Implement Boyer-Moore searching (First time hacker)
Date: 2008-09-08 06:45:18
Message-ID: 48C4C9FE.20905@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

Tom Lane wrote:
> Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com> writes:
>> Also, it would be nice to use B-M(-H) for LIKE as well.
>
> Right offhand, that seems impossible, at least in patterns with %.
> Or were you thinking of trying to separate out the fixed substrings
> of a pattern and search for them with BMH?

Yep, something like that. Even if it only handled the special case of
'%foobar%', that would be nice, because that's a pretty common special case.

> Anyway, it's not material for this patch, since it'd involve pretty
> fundamental redesign of the LIKE code.

Yep.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: David Rowley <dgrowley(at)gmail(dot)com>
Cc: 'Tom Lane' <tgl(at)sss(dot)pgh(dot)pa(dot)us>, 'Peter Eisentraut' <peter_e(at)gmx(dot)net>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [PATCHES] TODO item: Implement Boyer-Moore searching (First time hacker)
Date: 2008-09-08 07:49:36
Message-ID: 48C4D910.90401@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

David Rowley wrote:
> Thanks for all the reviews and suggestions.

David, could you re-run the performance tests you ran earlier? I'm
curious to know what impact switching to the simpler loop for 1-byte
pattern had.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: "David Rowley" <dgrowley(at)gmail(dot)com>
To: "'Heikki Linnakangas'" <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: "'Tom Lane'" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "'Peter Eisentraut'" <peter_e(at)gmx(dot)net>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PATCHES] TODO item: Implement Boyer-Moore searching (First time hacker)
Date: 2008-09-08 19:15:02
Message-ID: 17EF6EF77BD7405E86761C688A73BAB4@amd64
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

Heikki Linnakangas wrote:
> David Rowley wrote:
> Thanks for all the reviews and suggestions.

> David, could you re-run the performance tests you ran earlier? I'm
> curious to know what impact switching to the simpler loop for 1-byte
> pattern had.

Sure: http://www.unixbeast.com/~fat/8.3_test_v1.4.xls

I tested with 8.3.3, and applied the patch updated by tom.

The thing that surprised me was that string_to_array didn't perform as well.
I expected single character searches to perform a little better. I can't
think why it would be slower now.

The strpos() test for the single char needle is taking 1.2% longer. I guess
that makes a little sense as we're now doing a new more length tests in this
case, but then we've eliminated the final single strncmp() for once it finds
that match. The strncmp would have check for nulls, check it's not exceeded
the compare length and check the chars actually match. I figured this would
have made up for that 1.2%. Seems not.

David.

-----Original Message-----
From: Heikki Linnakangas [mailto:heikki(dot)linnakangas(at)enterprisedb(dot)com]
Sent: 08 September 2008 08:50
To: David Rowley
Cc: 'Tom Lane'; 'Peter Eisentraut'; pgsql-hackers(at)postgresql(dot)org
Subject: Re: [PATCHES] [HACKERS] TODO item: Implement Boyer-Moore searching
(First time hacker)

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: "David Rowley" <dgrowley(at)gmail(dot)com>
To: "'Heikki Linnakangas'" <heikki(dot)linnakangas(at)enterprisedb(dot)com>, "'Tom Lane'" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "'Peter Eisentraut'" <peter_e(at)gmx(dot)net>, <pgsql-patches(at)postgresql(dot)org>
Subject: Re: [HACKERS] TODO item: Implement Boyer-Moore searching (First time hacker)
Date: 2008-09-08 19:19:35
Message-ID: 0D0F8DC5EE1341A199B5BDA73F870BA7@amd64
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

Heikki Linnakangas Wrote:
> Tom Lane wrote:
> > Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com> writes:
> >> Also, it would be nice to use B-M(-H) for LIKE as well.
> >
> > Right offhand, that seems impossible, at least in patterns with %.
> > Or were you thinking of trying to separate out the fixed substrings
> > of a pattern and search for them with BMH?

> Yep, something like that. Even if it only handled the special case of
> '%foobar%', that would be nice, because that's a pretty common special
> case.

It would be a quick test to check for % at either end. But we'd also need to
ensure there were none in the middle. That might slow it down. I guess while
looping over the inner chars checking for more %'s we could be building a
skiptable. We'd have to abort it if we found any %'s of course

I think with out giving it much thought _'s could be handled by BMH, these
could just be given a skip distance of 2. Only having a lossy skip table
throws that out the window without having a special check for _

David.


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: David Rowley <dgrowley(at)gmail(dot)com>
Cc: 'Tom Lane' <tgl(at)sss(dot)pgh(dot)pa(dot)us>, 'Peter Eisentraut' <peter_e(at)gmx(dot)net>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [PATCHES] TODO item: Implement Boyer-Moore searching (First time hacker)
Date: 2008-09-11 07:20:50
Message-ID: 48C8C6D2.2010809@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

David Rowley wrote:
> The thing that surprised me was that string_to_array didn't perform as well.
> I expected single character searches to perform a little better. I can't
> think why it would be slower now.

Yes, that's strange. I tried to reproduce that here, with a CVS snapshot
before the patch, and after. With quick testing with psql and \timing
and the same query you had in that spreadsheet, I couldn't see that kind
of performance degradation. Oprofile suggests that, on the contrary,
slightly less time is spent in text_position_next() after the patch, and
slightly more in text_position_setup(). Together they account for ~10%
of CPU time in both tests, so a small difference there would be
insignificant anyway in that test case.

I think we're fine.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com