Re: tsearch parser inefficiency if text includes urls or emails - new version

From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: <pgsql-hackers(at)postgresql(dot)org>
Cc: "Andres Freund" <andres(at)anarazel(dot)de>, "Oleg Bartunov" <oleg(at)sai(dot)msu(dot)su>,<teodor(at)sigaev(dot)ru>
Subject: Re: tsearch parser inefficiency if text includes urls or emails - new version
Date: 2009-11-14 00:03:33
Message-ID: 4AFD9F75020000250002C84B@gw.wicourts.gov
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Andres Freund <andres(at)anarazel(dot)de> wrote:
> On Sunday 01 November 2009 16:19:43 Andres Freund wrote:
>> While playing around/evaluating tsearch I notices that to_tsvector
>> is obscenely slow for some files. After some profiling I found that
>> this is due using a seperate TSParser in p_ishost/p_isURLPath in
>> wparser_def.c. If a multibyte encoding is in use TParserInit copies
>> the whole remaining input and converts it to wchar_t or pg_wchar -
>> for every email or protocol prefixed url in the the document. Which
>> obviously is bad.
>>
>> I solved the issue by having a seperate TParserCopyInit/
>> TParserCopyClose which reuses the the already converted strings of
>> the original TParser - only at different offsets.
>>
>> Another approach would be to get rid of the separate parser
>> invocations - requiring a bunch of additional states. This seemed
>> more complex to me, so I wanted to get some feedback first.

>> Without patch:

>> Time: 5835.676 ms

>> With patch:

>> Time: 395.341 ms

> As nobody commented here is a corrected (stupid thinko) and cleaned
> up version.

I've taken this one for review, and have taken a preliminary look.

It is in context format, applies cleanly, and passes "make check".
Next I read through the code, and have the same question that Andres
posed 12 days ago. His patch massively reduces the cost of the parser
recursively calling itself for some cases, and it seems like the least
invasive way to modify the parser to solve this performance problem;
but it does beg the question of why a state machine like this should
recursively call itself when it hits certain states.

The patch is mitigating the penalty for what seems to me to be an
existing ugly kludge. Is it acceptable to address the problem in this
way, or should the much more invasive work be done to eliminate the
kludge? (Note: I personally would much rather see the performance
penalty addressed this way, and a TODO added for the more invasive
work, than to leave this alone for the next release if there's nobody
willing to tackle the problem at a more fundamental level.)

If nobody has a contrary opinion, I'll proceed with the review of this
patch and add something to the TODO page for the more invasive work.

-Kevin

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Alvaro Herrera 2009-11-14 00:05:14 Re: tsearch parser inefficiency if text includes urls or emails - new version
Previous Message Andrew Gierth 2009-11-13 23:49:11 Aggregate ORDER BY docs patch, first attempt