Re: tsearch parser inefficiency if text includes urls or emails - new version

From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: <andres(at)anarazel(dot)de>,<pgsql-hackers(at)postgresql(dot)org>
Cc: <oleg(at)sai(dot)msu(dot)su>,<teodor(at)sigaev(dot)ru>
Subject: Re: tsearch parser inefficiency if text includes urls or emails - new version
Date: 2009-11-14 14:33:00
Message-ID: 4AFE6B3C020000250002C86C@gw.wicourts.gov
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Andres Freund wrote:
On Saturday 14 November 2009 01:03:33 Kevin Grittner wrote:
>> It is in context format, applies cleanly, and passes "make check".
> Unfortunately the latter is not saying much - I had a bug there and
> it was not found by the regression tests. Perhaps I should take a
> stab and add at least some more...

Sounds like a good idea. The one thing to avoid is anything with a
long enough run time to annoy those that run it many times a day.

>> It is in context format, applies cleanly, and passes "make check".
>> Next I read through the code, and have the same question that
>> Andres posed 12 days ago. His patch massively reduces the cost of
>> the parser recursively calling itself for some cases, and it seems
>> like the least invasive way to modify the parser to solve this
>> performance problem; but it does beg the question of why a state
>> machine like this should recursively call itself when it hits
>> certain states.
> I was wondering about that as well. I am not completely sure but to
> me it looks like its just done to reduce the amount of rules and
> states.

I'm assuming that's the reason, but didn't dig deep enough to be sure.
I suspect to be really sure, I'd have to set it up without the
recursion and see what breaks. I can't imagine it would be anything
which couldn't be fixed by adding enough states; but perhaps they ran
into something where these types would require so many new states that
the recursion seemed like the lesser of evils.

> I have to say that that code is not exactly clear and well
> documented...

Yeah. I was happy with the level of documentation that you added with
your new code, but what was there before is mighty thin. If you
gleaned enough information while working on it to feel comfortable
adding documentation anywhere else, that would be a good thing.

So far the only vote is to proceed with the mitigation, which was my
preference, and apparently yours -- so I guess we're at 3 to 0 in
favor of that. I'll mark the patch as "Waiting on Author" so you can
add any comments and regression tests you feel are appropriate.

By the way, I found one typo in the comments -- it should by useful,
not usefull.

I liked what I saw so far, but need to spend more time desk-checking
for correctness, testing to confirm that it doesn't change results,
and confirming the performance improvement.

-Kevin

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Kevin Grittner 2009-11-14 15:12:23 Re: [patch] pg_ctl init extension
Previous Message Magnus Hagander 2009-11-14 14:31:47 Re: Patch committers