Re: tsearch parser inefficiency if text includes urls or emails - new version

Lists: pgsql-hackers
From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: <andres(at)anarazel(dot)de>
Cc: <greg(at)2ndquadrant(dot)com>,<pgsql-hackers(at)postgresql(dot)org>, <oleg(at)sai(dot)msu(dot)su>, <teodor(at)sigaev(dot)ru>
Subject: Re: tsearch parser inefficiency if text includes urls or emails - new version
Date: 2009-12-08 15:23:11
Message-ID: 4B1E1AFF020000250002D1E1@gw.wicourts.gov
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

I wrote:

> Frankly, I'd be amazed if there was a performance regression,

OK, I'm amazed. While it apparently helps some cases dramatically
(Andres had a case where run time was reduced by 93.2%), I found a
pretty routine case where run time was increased by 3.1%. I tweaked
the code and got that down to a 2.5% run time increase. I'm having
troubles getting it any lower than that. And yes, this is real, not
noise -- the slowest unpatched time for this test is faster than the
fastest time with any version of the patch. :-(

Andres, could you provide more information on the test which showed
the dramatic improvement? In particular, info on OS, CPU, character
set, encoding scheme, and what kind of data was used for the test.

I'll do some more testing and try to figure out how the patch is
slowing things down and post with details.

-Kevin


From: Andres Freund <andres(at)anarazel(dot)de>
To: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: greg(at)2ndquadrant(dot)com, pgsql-hackers(at)postgresql(dot)org, oleg(at)sai(dot)msu(dot)su, teodor(at)sigaev(dot)ru
Subject: Re: tsearch parser inefficiency if text includes urls or emails - new version
Date: 2009-12-08 15:26:11
Message-ID: 200912081626.11709.andres@anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tuesday 08 December 2009 16:23:11 Kevin Grittner wrote:
> I wrote:
> > Frankly, I'd be amazed if there was a performance regression,
>
> OK, I'm amazed. While it apparently helps some cases dramatically
> (Andres had a case where run time was reduced by 93.2%), I found a
> pretty routine case where run time was increased by 3.1%. I tweaked
> the code and got that down to a 2.5% run time increase. I'm having
> troubles getting it any lower than that. And yes, this is real, not
> noise -- the slowest unpatched time for this test is faster than the
> fastest time with any version of the patch. :-(
>
> Andres, could you provide more information on the test which showed
> the dramatic improvement? In particular, info on OS, CPU, character
> set, encoding scheme, and what kind of data was used for the test.
>
> I'll do some more testing and try to figure out how the patch is
> slowing things down and post with details.
Could you show your testcase? I dont see why it could get slower?

I tested with various data, the one benefiting most was some changelog where
each entry was signed by an email.

OS: Debian Sid, Core2 Duo, UTF-8, and I tried both C and de_DE.UTF8.

Andres


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "Andres Freund" <andres(at)anarazel(dot)de>
Cc: <greg(at)2ndquadrant(dot)com>,<pgsql-hackers(at)postgresql(dot)org>, <oleg(at)sai(dot)msu(dot)su>, <teodor(at)sigaev(dot)ru>
Subject: Re: tsearch parser inefficiency if text includes urls or emails - new version
Date: 2009-12-08 16:15:36
Message-ID: 4B1E2748020000250002D1EF@gw.wicourts.gov
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Andres Freund <andres(at)anarazel(dot)de> wrote:

> Could you show your testcase?

OK. I was going to try to check other platforms first, and package
up the information better, but here goes.

I created 10000 lines with random IP-based URLs for a test. The
first few lines are:

create table t1 (c1 int not null primary key, c2 text);
insert into t1 values (2,
'http://255.102.51.212/*/quick/brown/fox?jumps&over&*&lazy&dog.html
http://204.56.222.143/*/quick/brown/fox?jumps&over&*&lazy&dog.html
http://138.183.168.227/*/quick/brown/fox?jumps&over&*&lazy&dog.html

Actually, the special character was initially the word "the", but I
wanted to see if having non-ASCII characters in the value made any
difference. It didn't.

Unfortunately, I was testing at home last night and forgot to bring
the exact test query with me, but it was this or something close to
it:

\timing
select to_tsvector(c2)
from t1, (select generate_series(1,200)) x where c1 = 2;

I was running on Ubuntu 9.10, an AMD dual core CPU (don't have the
model number handy), UTF-8, en_US.UTF8.

> I dont see why it could get slower?

I don't either. The best I can tell, following the pointer from
orig to any of its elements seems to be way more expensive than I
would ever have guessed. The only thing that seemed to improve the
speed was minimizing that by using a local variable to capture any
element referenced more than once. (Although, there is overlap
between the timings for the original patch and the one which seemed
a slight improvement; I would need to do more testing to really rule
out noise and have complete confidence that my changes actually are
an improvement on the original patch.)

Perhaps it is some quirk of using 32 bit pointers on the 64 bit AMD
CPU? (I'm looking forward to testing this today on a 64 bit build
on an Intel CPU.)

-Kevin


From: Andres Freund <andres(at)anarazel(dot)de>
To: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: greg(at)2ndquadrant(dot)com, pgsql-hackers(at)postgresql(dot)org, oleg(at)sai(dot)msu(dot)su, teodor(at)sigaev(dot)ru
Subject: Re: tsearch parser inefficiency if text includes urls or emails - new version
Date: 2009-12-08 16:17:19
Message-ID: 200912081717.19531.andres@anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tuesday 08 December 2009 17:15:36 Kevin Grittner wrote:
> Andres Freund <andres(at)anarazel(dot)de> wrote:
> > Could you show your testcase?
Will hopefully look into this later.

> > I dont see why it could get slower?
> I don't either. The best I can tell, following the pointer from
> orig to any of its elements seems to be way more expensive than I
> would ever have guessed. The only thing that seemed to improve the
> speed was minimizing that by using a local variable to capture any
> element referenced more than once. (Although, there is overlap
> between the timings for the original patch and the one which seemed
> a slight improvement; I would need to do more testing to really rule
> out noise and have complete confidence that my changes actually are
> an improvement on the original patch.)
>
> Perhaps it is some quirk of using 32 bit pointers on the 64 bit AMD
> CPU? (I'm looking forward to testing this today on a 64 bit build
> on an Intel CPU.)
Did you test that with a optimized build?

Andres


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "Andres Freund" <andres(at)anarazel(dot)de>
Cc: <greg(at)2ndquadrant(dot)com>,<pgsql-hackers(at)postgresql(dot)org>, <oleg(at)sai(dot)msu(dot)su>, <teodor(at)sigaev(dot)ru>
Subject: Re: tsearch parser inefficiency if text includes urls or emails - new version
Date: 2009-12-08 16:40:31
Message-ID: 4B1E2D1F020000250002D1FC@gw.wicourts.gov
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Andres Freund <andres(at)anarazel(dot)de> wrote:

> Did you test that with a optimized build?

After running a series of tests with --enable-cassert to confirm
that nothing squawked, I disabled that before doing any
performance testing. Going from memory, I used --enable-debug
--with-libxml and --prefix. I didn't explicitly use or disable any
compiler optimizations.

Possibly relevant is that I was using Greg Smith's peg tool (as best
I could):

http://github.com/gregs1104/peg/

I'm not aware of it doing anything to the compile options, but I
didn't look for that, either. I guess I should poke around that
some more to be sure. After having peg do the checkout and set up
the directories, did my own ./configure and make runs so that I
could be sure of my configure settings.

Are there any settings which you feel I should be using which
PostgreSQL doesn't set up as you would recommend during the
./configure run?

-Kevin


From: Greg Smith <greg(at)2ndquadrant(dot)com>
To: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers(at)postgresql(dot)org, oleg(at)sai(dot)msu(dot)su, teodor(at)sigaev(dot)ru
Subject: Re: tsearch parser inefficiency if text includes urls or emails - new version
Date: 2009-12-08 19:09:22
Message-ID: 4B1EA462.40307@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Kevin Grittner wrote:
> After running a series of tests with --enable-cassert to confirm
> that nothing squawked, I disabled that before doing any
> performance testing. Going from memory, I used --enable-debug
> --with-libxml and --prefix. I didn't explicitly use or disable any
> compiler optimizations.
>
> Possibly relevant is that I was using Greg Smith's peg tool (as best
> I could):
>
> http://github.com/gregs1104/peg/
>
> I'm not aware of it doing anything to the compile options, but I
> didn't look for that, either.
Now that you ask, of course I just spotted a bug in there such that the
documented behavior for the PGDEBUG feature doesn't actually work. If
you were using that to turn off asserts, that may not have worked as
expected. Don't know what you did there exactly. Fix now pushed to the
repo.

I general, when doing performance testing, now matter how careful I've
been I try to do a:

show debug_assertions;

To confirm I'm not running with those on.

Andres, are using any optimization flags when you're testing?

Would have preferred that the first mention of my new project didn't
involve a bug report, but that's software I guess. For everyone here
who's not on the reviewers mailing list: peg is a scripting tool I've
put together recently that makes it easier to setup the environment
(source code, binaries, database) for testing PostgreSQL in a
development content. It takes care of grabbing the latest source,
building, starting the database, all that boring stuff. I tried to make
it automate the most tedious parts of both development and patch
review. Documentation and the program itself are at the git repo Kevin
mentioned.

--
Greg Smith 2ndQuadrant Baltimore, MD
PostgreSQL Training, Services and Support
greg(at)2ndQuadrant(dot)com www.2ndQuadrant.com


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "Greg Smith" <greg(at)2ndquadrant(dot)com>
Cc: "Andres Freund" <andres(at)anarazel(dot)de>, <pgsql-hackers(at)postgresql(dot)org>,<oleg(at)sai(dot)msu(dot)su>, <teodor(at)sigaev(dot)ru>
Subject: Re: tsearch parser inefficiency if text includes urls or emails - new version
Date: 2009-12-08 19:20:32
Message-ID: 4B1E52A0020000250002D216@gw.wicourts.gov
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Greg Smith <greg(at)2ndquadrant(dot)com> wrote:

> Now that you ask, of course I just spotted a bug in there such
> that the documented behavior for the PGDEBUG feature doesn't
> actually work. If you were using that to turn off asserts, that
> may not have worked as expected. Don't know what you did there
> exactly. Fix now pushed to the repo.

Thanks. I was going to reply to your original message with my
experiences (and probably still will), but it seemed like it might
be relevant here. I did check pg_config results before doing
anything, and saw that the debug and cassert weren't set, so that's
why I did explicit configure and make commands. Even without that
covered, peg was a nice convenience -- I can say that it saved me
more time already than it took to install and read the docs. Nice
work!

Anyway, I'm not sure whether your reply directly answers the point
I was raising -- peg doesn't do anything with the compiler
optimization flags under the covers, does it?

-Kevin


From: Andres Freund <andres(at)anarazel(dot)de>
To: Greg Smith <greg(at)2ndquadrant(dot)com>
Cc: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, pgsql-hackers(at)postgresql(dot)org, oleg(at)sai(dot)msu(dot)su, teodor(at)sigaev(dot)ru
Subject: Re: tsearch parser inefficiency if text includes urls or emails - new version
Date: 2009-12-08 19:25:17
Message-ID: 200912082025.17505.andres@anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi,

On Tuesday 08 December 2009 20:09:22 Greg Smith wrote:
> Andres, are using any optimization flags when you're testing?
I was testing with and without debug/cassert - and did not get adverse results
in both...

Unfortunately it looks like I wont get to test today, but only tomorrow
morning its 9pm and I am not yet fully finished with work....

Andres


From: Greg Smith <greg(at)2ndquadrant(dot)com>
To: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers(at)postgresql(dot)org, oleg(at)sai(dot)msu(dot)su, teodor(at)sigaev(dot)ru
Subject: Re: tsearch parser inefficiency if text includes urls or emails - new version
Date: 2009-12-08 19:40:06
Message-ID: 4B1EAB96.5050903@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Kevin Grittner wrote:
> Anyway, I'm not sure whether your reply directly answers the point
> I was raising -- peg doesn't do anything with the compiler
> optimization flags under the covers, does it?
>
Not really. It does this:

PGDEBUG="--enable-cassert --enable-debug"
./configure --prefix=$PGINST/$PGPROJECT --enable-depend
--enable-thread-safety $PGDEBUG

Which are pretty standard options. The idea is that you'd use the
default normally, then just set PGDEBUG=" " for a non-debug build--or to
otherwise change the configure flags but still get things done
automatically for you. If it's set before the script starts, it doesn't
change it.

I did try to design things so that you could do any step in the
automated series manually and not have that screw things up. There's
hundreds of lines of code in there just for things like figuring out
whether configure has been run or not yet when it decides you need to
build, so it can try to do the right thing in either case. My hope was
that anyone who tried peg out would find it a net positive time savings
after a single use, glad to hear I accomplished that in your case.

--
Greg Smith 2ndQuadrant Baltimore, MD
PostgreSQL Training, Services and Support
greg(at)2ndQuadrant(dot)com www.2ndQuadrant.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Greg Smith <greg(at)2ndquadrant(dot)com>
Cc: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers(at)postgresql(dot)org, oleg(at)sai(dot)msu(dot)su, teodor(at)sigaev(dot)ru
Subject: Re: tsearch parser inefficiency if text includes urls or emails - new version
Date: 2009-12-08 19:43:12
Message-ID: 27017.1260301392@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Greg Smith <greg(at)2ndquadrant(dot)com> writes:
> Kevin Grittner wrote:
>> Anyway, I'm not sure whether your reply directly answers the point
>> I was raising -- peg doesn't do anything with the compiler
>> optimization flags under the covers, does it?
>>
> Not really. It does this:

> PGDEBUG="--enable-cassert --enable-debug"
> ./configure --prefix=$PGINST/$PGPROJECT --enable-depend
> --enable-thread-safety $PGDEBUG

--enable-cassert might have a noticeable performance impact.
We usually try to not have that on when doing performance testing.

regards, tom lane


From: Greg Smith <greg(at)2ndquadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers(at)postgresql(dot)org, oleg(at)sai(dot)msu(dot)su, teodor(at)sigaev(dot)ru
Subject: Re: tsearch parser inefficiency if text includes urls or emails - new version
Date: 2009-12-08 19:50:05
Message-ID: 4B1EADED.1030202@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Tom Lane wrote:
> --enable-cassert might have a noticeable performance impact.
> We usually try to not have that on when doing performance testing.
>
All covered in the tool's documentation, and it looks like Kevin did the
right thing during his tests by checking the pg_config output to confirm
the right flags were used. I think we can rule out simple pilot error
here and work under the assumption Kevin found a mild performance
regression under some circumstances with the patch. Hopefully Andres
will be able to replicate the problem, and it sounds like Kevin might be
able to provide more information about it tonight too.

--
Greg Smith 2ndQuadrant Baltimore, MD
PostgreSQL Training, Services and Support
greg(at)2ndQuadrant(dot)com www.2ndQuadrant.com


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "Greg Smith" <greg(at)2ndquadrant(dot)com>,"Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Andres Freund" <andres(at)anarazel(dot)de>, <pgsql-hackers(at)postgresql(dot)org>,<oleg(at)sai(dot)msu(dot)su>, <teodor(at)sigaev(dot)ru>
Subject: Re: tsearch parser inefficiency if text includes urls or emails - new version
Date: 2009-12-08 19:55:12
Message-ID: 4B1E5AC0020000250002D236@gw.wicourts.gov
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> --enable-cassert might have a noticeable performance impact.
> We usually try to not have that on when doing performance testing.

Right. I turned it on for some initial tests to confirm that we had
no assertion failures; then turned it off for the performance
testing. I did leave --enable-debug on during performance testing.
My understanding is that debug info doesn't affect performance, but
I guess it never hurts to try an empirical test to confirm.

-Kevin


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "Andres Freund" <andres(at)anarazel(dot)de>
Cc: <greg(at)2ndquadrant(dot)com>,<pgsql-hackers(at)postgresql(dot)org>, <oleg(at)sai(dot)msu(dot)su>, <teodor(at)sigaev(dot)ru>
Subject: Re: tsearch parser inefficiency if text includes urls or emails - new version
Date: 2009-12-08 22:07:04
Message-ID: 4B1E79A8020000250002D24A@gw.wicourts.gov
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

I wrote:

> Perhaps it is some quirk of using 32 bit pointers on the 64 bit
> AMD CPU? (I'm looking forward to testing this today on a 64 bit
> build on an Intel CPU.)

The exact same test on 64 bit OS (SuSE Enterprise Server) on Intel
gave very different results. With 10 runs each of 200 iterations of
parsing the 10000 URLs, the patch Andres submitted ran 0.4% faster
than HEAD, and my attempt to improve on it ran 0.6% slower than
HEAD. I'll try to run the numbers to get the percentage chance that
a random distribution would have generated a spread as large as
either of those; but I think it's safe to say that the submitted
patch doesn't hurt there and that my attempt to improve on it was
misdirected. :-/

I would like to independently confirm the dramatic improvement
reported by Andres. Could I get a short snippet from the log which
was used for that, along with an indication of the size of the text
parsed in that test? (Since the old code looks like it might have
O(N^2) performance in some situations, while the patch changes that
to O(N), I might not be testing with a big enough N.)

-Kevin


From: Andres Freund <andres(at)anarazel(dot)de>
To: pgsql-hackers(at)postgresql(dot)org
Cc: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>, greg(at)2ndquadrant(dot)com, oleg(at)sai(dot)msu(dot)su, teodor(at)sigaev(dot)ru
Subject: Re: tsearch parser inefficiency if text includes urls or emails - new version
Date: 2009-12-09 18:06:45
Message-ID: 200912091906.46404.andres@anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tuesday 08 December 2009 17:15:36 Kevin Grittner wrote:
> Andres Freund <andres(at)anarazel(dot)de> wrote:
> > Could you show your testcase?
>
> OK. I was going to try to check other platforms first, and package
> up the information better, but here goes.
>
> I created 10000 lines with random IP-based URLs for a test. The
> first few lines are:
>
> create table t1 (c1 int not null primary key, c2 text);
> insert into t1 values (2,
> 'http://255.102.51.212/*/quick/brown/fox?jumps&over&*&lazy&dog.html
> http://204.56.222.143/*/quick/brown/fox?jumps&over&*&lazy&dog.html
> http://138.183.168.227/*/quick/brown/fox?jumps&over&*&lazy&dog.html
I think you see no real benefit, because your strings are rather short - the
documents I scanned when noticing the issue where rather long.
If your strings are short, they and the copy will fit into cpu cache anyway, so
copying them around/converting them to some other string format is not that
expensive compared to the rest of the work done.

Also after each copying step for large strings the complete cache is filled
with unrelated information (namely the end of the string). So every charwise
access will need to wait for a memory access.

A rather extreme/contrived example:

postgres=# SELECT 1 FROM to_tsvector(array_to_string(ARRAY(SELECT
'andres(at)anarazel(dot)de http://www.postgresql.org/'::text FROM generate_series(1,
1) g(i)), ' - '));
?column?
----------
1
(1 row)

Time: 3.740 ms
postgres=# SELECT 1 FROM to_tsvector(array_to_string(ARRAY(SELECT
'andres(at)anarazel(dot)de http://www.postgresql.org/'::text FROM generate_series(1,
1000) g(i)), ' - '));
?column?
----------
1
(1 row)

Time: 115.027 ms

postgres=# SELECT 1 FROM to_tsvector(array_to_string(ARRAY(SELECT
'andres(at)anarazel(dot)de http://www.postgresql.org/'::text FROM generate_series(1,
10000) g(i)), ' - '));
?column?
----------
1
(1 row)

Time: 24355.339 ms
postgres=# SELECT 1 FROM to_tsvector(array_to_string(ARRAY(SELECT
'andres(at)anarazel(dot)de http://www.postgresql.org/'::text FROM generate_series(1,
20000) g(i)), ' - '));
?column?
----------
1
(1 row)

Time: 47276.739 ms

One easily can see the quadratic complexity here. The quadratic complexity
lies in the length/amount of emails/urls of the strings, not in the number of
to_tsvector calls!

After the patch:

postgres=# SELECT 1 FROM to_tsvector(array_to_string(ARRAY(SELECT
'andres(at)anarazel(dot)de http://www.postgresql.org/'::text FROM generate_series(1,
20000) g(i)), ' - '));
?column?
----------
1
(1 row)

Time: 168.384 ms

I could not reproduce the slowdown you mentioned:

Without patch:
postgres=# SELECT to_tsvector('andres(at)anarazel(dot)de
http://www.postgresql.org/'||g.i::text) FROM generate_series(1, 100000) g(i)
ORDER BY g.i LIMIT 1;
to_tsvector
-------------------------------------------------------------------------------
'/1':4 'andres(at)anarazel(dot)de':1 'www.postgresql.org':3
'www.postgresql.org/1':2
(1 row)
Time: 1109.833 ms

With patch:
postgres=# SELECT to_tsvector('andres(at)anarazel(dot)de
http://www.postgresql.org/'||g.i::text) FROM generate_series(1, 100000) g(i)
ORDER BY g.i LIMIT 1;
to_tsvector
-------------------------------------------------------------------------------
'/1':4 'andres(at)anarazel(dot)de':1 'www.postgresql.org':3
'www.postgresql.org/1':2
(1 row)

Time: 1036.689 ms

So on the hardware I tried its even a little bit faster for small strings
(Older Xeon32bit, Core2 Duo, 64bit, Nehalem based Xeon 64bit).

I could not reproduce any difference with strings not involving urls or emails:

Without patch:

postgres=# SELECT to_tsvector('live hard, love fast, die young'||g.i::text)
FROM generate_series(1, 100000) g(i) ORDER BY g.i LIMIT 1;
to_tsvector
--------------------------------------------------------
'die':5 'fast':4 'hard':2 'live':1 'love':3 'young1':6
(1 row)

Time: 988.426 ms

With patch:

postgres=# SELECT to_tsvector('live hard, love fast, die young'||g.i::text)
FROM generate_series(1, 100000) g(i) ORDER BY g.i LIMIT 1;
to_tsvector
--------------------------------------------------------
'die':5 'fast':4 'hard':2 'live':1 'love':3 'young1':6
(1 row)

Time: 975.339 ms

So at least in my testing I do see no danger in the patch ;-)

Andres

PS: I averaged all the time result over multiple runs where it was relevant.


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "Andres Freund" <andres(at)anarazel(dot)de>, <pgsql-hackers(at)postgresql(dot)org>
Cc: <greg(at)2ndquadrant(dot)com>,<oleg(at)sai(dot)msu(dot)su>, <teodor(at)sigaev(dot)ru>
Subject: Re: tsearch parser inefficiency if text includes urls or emails - new version
Date: 2009-12-10 17:01:05
Message-ID: 4B20D4F1020000250002D2F1@gw.wicourts.gov
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Andres Freund <andres(at)anarazel(dot)de> wrote:

> I think you see no real benefit, because your strings are rather
> short - the documents I scanned when noticing the issue where
> rather long.

The document I used in the test which showed the regression was
672,585 characters, containing 10,000 URLs.

> A rather extreme/contrived example:

> postgres=# SELECT 1 FROM to_tsvector(array_to_string(ARRAY(SELECT
> 'andres(at)anarazel(dot)de http://www.postgresql.org/'::text FROM
> generate_series(1,
> 20000) g(i)), ' - '));

The most extreme of your examples uses a 979,996 character string,
which is less than 50% larger than my test. I am, however, able to
see the performance difference for this particular example, so I now
have something to work with. I'm seeing some odd behavior in terms
of when there is what sort of difference. Once I can categorize it
better, I'll follow up.

Thanks for the sample which shows the difference.

-Kevin


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "Andres Freund" <andres(at)anarazel(dot)de>, <pgsql-hackers(at)postgresql(dot)org>
Cc: <greg(at)2ndquadrant(dot)com>,<oleg(at)sai(dot)msu(dot)su>, <teodor(at)sigaev(dot)ru>
Subject: Re: tsearch parser inefficiency if text includes urls or emails - new version
Date: 2009-12-10 18:10:24
Message-ID: 4B20E530020000250002D305@gw.wicourts.gov
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

I wrote:

> Thanks for the sample which shows the difference.

Ah, once I got on the right track, there is no problem seeing
dramatic improvements with the patch. It changes some nasty O(N^2)
cases to O(N). In particular, the fixes affect parsing of large
strings encoded with multi-byte character encodings and containing
email addresses or URLs with a non-IP-address host component. It
strikes me as odd that URLs without a slash following the host
portion, or with an IP address, are treated so differently in the
parser, but if we want to address that, it's a matter for another
patch.

I'm inclined to think that the minimal differences found in some of
my tests probably have more to do with happenstance of code
alignment than the particulars of the patch.

I did find one significant (although easily solved) problem. In the
patch, the recursive setup of usewide, pgwstr, and wstr are not
conditioned by #ifdef USE_WIDE_UPPER_LOWER in the non-patched
version. Unless there's a good reason for that, the #ifdef should
be added.

Less critical, but worth fixing one way or the other, TParserClose
does not drop breadcrumbs conditioned on #ifdef WPARSER_TRACE, but
TParserCopyClose does. I think this should be consistent.

Finally, there's that spelling error in the comment for
TParserCopyInit. Please fix.

If a patch is produced with fixes for these three things, I'd say
it'll be ready for committer. I'm marking it as Waiting on Author
for fixes to these three items.

Sorry for the delay in review. I hope there's still time to get
this committed in this CF.

-Kevin


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "Andres Freund" <andres(at)anarazel(dot)de>, <pgsql-hackers(at)postgresql(dot)org>, "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: <greg(at)2ndquadrant(dot)com>,<oleg(at)sai(dot)msu(dot)su>, <teodor(at)sigaev(dot)ru>
Subject: Re: tsearch parser inefficiency if text includes urls or emails - new version
Date: 2009-12-10 19:05:13
Message-ID: 4B20F209020000250002D317@gw.wicourts.gov
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

I wrote:

> I did find one significant (although easily solved) problem. In
> the patch, the recursive setup of usewide, pgwstr, and wstr are
> not conditioned by #ifdef USE_WIDE_UPPER_LOWER in the non-patched
> version. Unless there's a good reason for that, the #ifdef should
> be added.

That should read:

I did find one significant (although easily solved) problem. In
the patch, the recursive setup of usewide, pgwstr, and wstr are
not conditioned by #ifdef USE_WIDE_UPPER_LOWER as they are in the
non-patched version. Unless there's a good reason for that, the
#ifdef should be added.

-Kevin


From: Andres Freund <andres(at)anarazel(dot)de>
To: pgsql-hackers(at)postgresql(dot)org
Cc: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>, greg(at)2ndquadrant(dot)com, oleg(at)sai(dot)msu(dot)su, teodor(at)sigaev(dot)ru
Subject: Re: tsearch parser inefficiency if text includes urls or emails - new version
Date: 2009-12-10 19:05:15
Message-ID: 200912102005.16560.andres@anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thursday 10 December 2009 19:10:24 Kevin Grittner wrote:
> I wrote:
> > Thanks for the sample which shows the difference.
>
> Ah, once I got on the right track, there is no problem seeing
> dramatic improvements with the patch. It changes some nasty O(N^2)
> cases to O(N). In particular, the fixes affect parsing of large
> strings encoded with multi-byte character encodings and containing
> email addresses or URLs with a non-IP-address host component. It
> strikes me as odd that URLs without a slash following the host
> portion, or with an IP address, are treated so differently in the
> parser, but if we want to address that, it's a matter for another
> patch.
Same here. Generally I do have to say that I dont find that parser very nice -
and actually I think its not very efficient as well because it searches a big
part of the search space all the time.
I think a generated parser would be more efficient and way much easier to
read...

> I'm inclined to think that the minimal differences found in some of
> my tests probably have more to do with happenstance of code
> alignment than the particulars of the patch.
Yes, I think that as well.

> I did find one significant (although easily solved) problem. In the
> patch, the recursive setup of usewide, pgwstr, and wstr are not
> conditioned by #ifdef USE_WIDE_UPPER_LOWER in the non-patched
> version. Unless there's a good reason for that, the #ifdef should
> be added.
Uh. Sorry. Fixed.

> Less critical, but worth fixing one way or the other, TParserClose
> does not drop breadcrumbs conditioned on #ifdef WPARSER_TRACE, but
> TParserCopyClose does. I think this should be consistent.
Actually there was *some* thinking behind that: The end of parsing is quite
obvious (the call returns, and its so verbose you will never do more than one
call anyway) - but knowing where the copy ends is quite important to
understand the parse flow.
I added a log there because in the end that line is not going to make any
difference ;-)

> Sorry for the delay in review. I hope there's still time to get
> this committed in this CF.
Thanks for your reviewing!

Actually I dont mind very much if it gets delayed or not. Its trivial enough
that it shouldnt cause much work/conflicts/whatever next round and I am running
patched versions anyway, so ...

Andres

Attachment Content-Type Size
0001-Fix-TSearch-inefficiency-because-of-repeated-copying.patch text/x-patch 3.9 KB

From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "Andres Freund" <andres(at)anarazel(dot)de>, <pgsql-hackers(at)postgresql(dot)org>
Cc: <greg(at)2ndquadrant(dot)com>,<oleg(at)sai(dot)msu(dot)su>, <teodor(at)sigaev(dot)ru>
Subject: Re: tsearch parser inefficiency if text includes urls or emails - new version
Date: 2009-12-10 19:29:18
Message-ID: 4B20F7AE020000250002D330@gw.wicourts.gov
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Andres Freund <andres(at)anarazel(dot)de> wrote:

> [concerns addressed in new patch version]

Looks good to me. I'm marking this Ready for Committer.

-Kevin


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: "Andres Freund" <andres(at)anarazel(dot)de>, pgsql-hackers(at)postgresql(dot)org, greg(at)2ndquadrant(dot)com, oleg(at)sai(dot)msu(dot)su, teodor(at)sigaev(dot)ru
Subject: Re: tsearch parser inefficiency if text includes urls or emails - new version
Date: 2009-12-15 20:39:58
Message-ID: 13881.1260909598@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

"Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov> writes:
> Andres Freund <andres(at)anarazel(dot)de> wrote:
>> [concerns addressed in new patch version]

> Looks good to me. I'm marking this Ready for Committer.

Applied with minor editorialization --- improving the comments mostly.

regards, tom lane