Re: tsearch parser inefficiency if text includes urls or emails - new version

Lists: pgsql-hackers
From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: <andres(at)anarazel(dot)de>,<pgsql-hackers(at)postgresql(dot)org>
Cc: <oleg(at)sai(dot)msu(dot)su>,<teodor(at)sigaev(dot)ru>
Subject: Re: tsearch parser inefficiency if text includes urls or emails - new version
Date: 2009-11-14 14:33:00
Message-ID: 4AFE6B3C020000250002C86C@gw.wicourts.gov

Andres Freund wrote:
> On Saturday 14 November 2009 01:03:33 Kevin Grittner wrote:
>> It is in context format, applies cleanly, and passes "make check".
> Unfortunately the latter is not saying much - I had a bug there and
> it was not found by the regression tests. Perhaps I should take a
> stab and add at least some more...

Sounds like a good idea. The one thing to avoid is anything with a
long enough run time to annoy those that run it many times a day.

>> It is in context format, applies cleanly, and passes "make check".
>> Next I read through the code, and have the same question that
>> Andres posed 12 days ago. His patch massively reduces the cost of
>> the parser recursively calling itself for some cases, and it seems
>> like the least invasive way to modify the parser to solve this
>> performance problem; but it does raise the question of why a state
>> machine like this should recursively call itself when it hits
>> certain states.
> I was wondering about that as well. I am not completely sure but to
> me it looks like it's just done to reduce the number of rules and
> states.

I'm assuming that's the reason, but didn't dig deep enough to be sure.
I suspect to be really sure, I'd have to set it up without the
recursion and see what breaks. I can't imagine it would be anything
which couldn't be fixed by adding enough states; but perhaps they ran
into something where these types would require so many new states that
the recursion seemed like the lesser of evils.

> I have to say that that code is not exactly clear and well
> documented...

Yeah. I was happy with the level of documentation that you added with
your new code, but what was there before is mighty thin. If you
gleaned enough information while working on it to feel comfortable
adding documentation anywhere else, that would be a good thing.

So far the only vote is to proceed with the mitigation, which was my
preference, and apparently yours -- so I guess we're at 3 to 0 in
favor of that. I'll mark the patch as "Waiting on Author" so you can
add any comments and regression tests you feel are appropriate.

By the way, I found one typo in the comments -- it should be useful,
not usefull.

I liked what I saw so far, but need to spend more time desk-checking
for correctness, testing to confirm that it doesn't change results,
and confirming the performance improvement.

-Kevin


From: Andres Freund <andres(at)anarazel(dot)de>
To: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: pgsql-hackers(at)postgresql(dot)org, oleg(at)sai(dot)msu(dot)su, teodor(at)sigaev(dot)ru
Subject: Re: tsearch parser inefficiency if text includes urls or emails - new version
Date: 2009-11-25 13:18:46
Message-ID: 200911251418.47137.andres@anarazel.de

On Saturday 14 November 2009 15:33:00 Kevin Grittner wrote:
> Andres Freund wrote:
>
> > On Saturday 14 November 2009 01:03:33 Kevin Grittner wrote:
> >> It is in context format, applies cleanly, and passes "make check".
> >
> > Unfortunately the latter is not saying much - I had a bug there and
> > it was not found by the regression tests. Perhaps I should take a
> > stab and add at least some more...
>
> Sounds like a good idea. The one thing to avoid is anything with a
> long enough run time to annoy those that run it many times a day.
Hm. There actually are tests exercising the part where I had a bug...
Strange.
It was a bug involving uninitialized data so probably the regression tests
were just "lucky".

> >> It is in context format, applies cleanly, and passes "make check".
> >> Next I read through the code, and have the same question that
> >> Andres posed 12 days ago. His patch massively reduces the cost of
> >> the parser recursively calling itself for some cases, and it seems
> >> like the least invasive way to modify the parser to solve this
> >> performance problem; but it does raise the question of why a state
> >> machine like this should recursively call itself when it hits
> >> certain states.
> > I was wondering about that as well. I am not completely sure but to
> > me it looks like it's just done to reduce the number of rules and
> > states.
> I'm assuming that's the reason, but didn't dig deep enough to be sure.
> I suspect to be really sure, I'd have to set it up without the
> recursion and see what breaks. I can't imagine it would be anything
> which couldn't be fixed by adding enough states; but perhaps they ran
> into something where these types would require so many new states that
> the recursion seemed like the lesser of evils.
This is similar to my understanding...

> > I have to say that that code is not exactly clear and well
> > documented...
> Yeah. I was happy with the level of documentation that you added with
> your new code, but what was there before is mighty thin. If you
> gleaned enough information while working on it to feel comfortable
> adding documentation anywhere else, that would be a good thing.
It definitely would be a good thing. But that would definitely be a separate
patch. But I fear my current level of knowledge is insufficient and also I am
not sure if I can make myself interested enough in that part.

> So far the only vote is to proceed with the mitigation, which was my
> preference, and apparently yours -- so I guess we're at 3 to 0 in
> favor of that. I'll mark the patch as "Waiting on Author" so you can
> add any comments and regression tests you feel are appropriate.
>
> By the way, I found one typo in the comments -- it should be useful,
> not usefull.
Ok, will update.

> I liked what I saw so far, but need to spend more time desk-checking
> for correctness, testing to confirm that it doesn't change results,
> and confirming the performance improvement.
Thanks again for your reviewing!

Andres


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "Andres Freund" <andres(at)anarazel(dot)de>
Cc: <pgsql-hackers(at)postgresql(dot)org>,<oleg(at)sai(dot)msu(dot)su>, <teodor(at)sigaev(dot)ru>
Subject: Re: tsearch parser inefficiency if text includes urls or emails - new version
Date: 2009-11-25 17:43:43
Message-ID: 4B0D186F020000250002CD0C@gw.wicourts.gov

Andres Freund <andres(at)anarazel(dot)de> wrote:
> On Saturday 14 November 2009 15:33:00 Kevin Grittner wrote:
>> Andres Freund wrote:

>> > I had a bug there and it was not found by the regression tests.
>> > Perhaps I should take a stab and add at least some more...
>>
>> Sounds like a good idea.

> Hm. There actually are tests exercising the part where I had a
> bug... Strange. It was a bug involving uninitialized data so
> probably the regression tests were just "lucky".

OK. I won't be looking for extra tests.

>> > I have to say that that code is not exactly clear and well
>> > documented...
>> Yeah. I was happy with the level of documentation that you added
>> with your new code, but what was there before is mighty thin. If
>> you gleaned enough information while working on it to feel
>> comfortable adding documentation anywhere else, that would be a
>> good thing.
> It definitely would be a good thing. But that would definitely be
> a separate patch. But I fear my current level of knowledge is
> insufficient and also I am not sure if I can make myself interested
> enough in that part.

Fair enough. I won't be looking for new comments for the old code.

>> By the way, I found one typo in the comments -- it should be
>> useful, not usefull.
> Ok, will update.

Given how trivial that is, I'm putting this back in "Needs Review"
status, and resuming my review work. Barring surprises, I should wrap
this up whenever I can free up two or three hours.

-Kevin


From: Greg Smith <greg(at)2ndquadrant(dot)com>
To: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers(at)postgresql(dot)org, oleg(at)sai(dot)msu(dot)su, teodor(at)sigaev(dot)ru
Subject: Re: tsearch parser inefficiency if text includes urls or emails - new version
Date: 2009-12-07 01:12:43
Message-ID: 4B1C568B.1030709@2ndquadrant.com

After getting off to a good start, it looks like this patch is now stuck
waiting for a second review pass from Kevin right now, with no open
items for Andres to correct. Since the only issues on the table seem to
be that of code aesthetics and long-term planning for this style of
implementation rather than specific functional bits, I'm leaning toward
saying this one is ready to have a committer look at it. Any comments
from Kevin or Andres about where this is at?

--
Greg Smith 2ndQuadrant Baltimore, MD
PostgreSQL Training, Services and Support
greg(at)2ndQuadrant(dot)com www.2ndQuadrant.com


From: Andres Freund <andres(at)anarazel(dot)de>
To: Greg Smith <greg(at)2ndquadrant(dot)com>
Cc: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, pgsql-hackers(at)postgresql(dot)org, oleg(at)sai(dot)msu(dot)su, teodor(at)sigaev(dot)ru
Subject: Re: tsearch parser inefficiency if text includes urls or emails - new version
Date: 2009-12-07 11:03:38
Message-ID: 200912071203.38308.andres@anarazel.de

On Monday 07 December 2009 02:12:43 Greg Smith wrote:
> After getting off to a good start, it looks like this patch is now stuck
> waiting for a second review pass from Kevin right now, with no open
> items for Andres to correct. Since the only issues on the table seem to
> be that of code aesthetics and long-term planning for this style of
> implementation rather than specific functional bits, I'm leaning toward
> saying this one is ready to have a committer look at it. Any comments
> from Kevin or Andres about where this is at?
I think it should be ready - the only known thing it needs is a
s/usefull/useful/.

I will take another look but I doubt I will see anything new.

Andres


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "Greg Smith" <greg(at)2ndquadrant(dot)com>
Cc: "Andres Freund" <andres(at)anarazel(dot)de>, <pgsql-hackers(at)postgresql(dot)org>,<oleg(at)sai(dot)msu(dot)su>, <teodor(at)sigaev(dot)ru>
Subject: Re: tsearch parser inefficiency if text includes urls or emails - new version
Date: 2009-12-07 15:32:15
Message-ID: 4B1CCB9F020000250002D10B@gw.wicourts.gov

Greg Smith <greg(at)2ndquadrant(dot)com> wrote:

> After getting off to a good start, it looks like this patch is now
> stuck waiting for a second review pass from Kevin right now, with
> no open items for Andres to correct. Since the only issues on the
> table seem to be that of code aesthetics and long-term planning
> for this style of implementation rather than specific functional
> bits, I'm leaning toward saying this one is ready to have a
> committer look at it. Any comments from Kevin or Andres about
> where this is at?

Yeah, really the only thing I found to complain about was one
misspelled word in a comment. I am currently the hold-up, due to
fighting off a bout of some virus and having other "real world"
issues impinge. The only thing left to do, besides correcting the
spelling, is to confirm the author's performance improvements and
confirm that there is no degradation in a non-targeted situation.

Frankly, I'd be amazed if there was a performance regression,
because all it really does is pass a pointer to a new spot in an
existing input buffer rather than allocating new space and copying
the input from the desired spot to the end of the buffer. I can't
think of any situations where calculating the new address should be
slower than calculating the new address and copying from there to
the end of the buffer.

-Kevin
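
The difference described in the last paragraph above can be sketched
roughly as follows. This is only an illustration under stated
assumptions: the SubParser struct and the sub_parser_copy,
sub_parser_view and sub_parser_free functions are hypothetical names
invented here, not identifiers from the patch or from the PostgreSQL
parser source. The first constructor allocates new space and copies the
input from the desired spot to the end of the buffer; the second only
computes the new address inside the existing buffer.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a nested parser's view of its input.
 * Not PostgreSQL's actual parser structure. */
typedef struct
{
    const char *buf;      /* first byte this sub-parser may read */
    size_t      len;      /* number of bytes available from buf */
    int         owns_buf; /* nonzero if buf was allocated by the copy variant */
} SubParser;

/* Copying variant: allocate and copy the tail of the input.
 * Cost is proportional to (len - offset) on every call. */
static SubParser
sub_parser_copy(const char *input, size_t len, size_t offset)
{
    SubParser   p;
    char       *copy;

    assert(offset <= len);
    copy = malloc(len - offset + 1);
    if (copy == NULL)
        abort();
    memcpy(copy, input + offset, len - offset);
    copy[len - offset] = '\0';

    p.buf = copy;
    p.len = len - offset;
    p.owns_buf = 1;
    return p;
}

/* Pointer variant: reuse the caller's buffer and just compute the
 * new start address. Cost is constant, whatever the input length. */
static SubParser
sub_parser_view(const char *input, size_t len, size_t offset)
{
    SubParser   p;

    assert(offset <= len);
    p.buf = input + offset;
    p.len = len - offset;
    p.owns_buf = 0;
    return p;
}

/* Release a sub-parser; only the copying variant has memory to free. */
static void
sub_parser_free(SubParser *p)
{
    if (p->owns_buf)
        free((void *) p->buf);
    p->buf = NULL;
    p->len = 0;
    p->owns_buf = 0;
}
```

Both variants expose the same bytes to the nested parser, so results
should not change; only the setup cost differs. If a nested parse is
started at many positions in a long input, the copies in the first
variant can add up to work proportional to the square of the input
length, while the second variant stays constant per call. Note that
the pointer variant assumes the original buffer outlives the nested
parser.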