Re: Very bad FTS performance with the Polish config

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Wojciech Knapik <webmaster(at)wolniartysci(dot)pl>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Very bad FTS performance with the Polish config
Date: 2009-11-19 15:51:13
Message-ID: 15251.1258645873@sss.pgh.pa.us
Lists: pgsql-hackers

Wojciech Knapik <webmaster(at)wolniartysci(dot)pl> writes:
> Tom Lane wrote:
>> I tried to duplicate this test, but got no further than here:
>> ERROR: syntax error
>> CONTEXT: line 174 of configuration file "/home/tgl/testversion/share/postgresql/tsearch_data/polish.affix": " L E C > -C,GEM #zalec (15a)

> Here are the files I used (polish.affix, polish.dict already generated):
> http://wolniartysci.pl/pl.tar.gz

Your files were the same as mine. I eventually figured out that the problem
was that I was using the C locale, in which some of those letters aren't
classified as letters. (I wonder whether the tsearch config file parsers could
be made less sensitive to this by avoiding t_isalpha tests.) In the pl_PL.utf8
locale I could see that the example is indeed much slower. Oleg is right that
the fundamental difference is that this Polish configuration is using
an ispell dictionary where the simple English configuration is not.
But, just for the record, here's what an oprofile profile looks like:

samples  %        image name  symbol name
7480     20.9477  postgres    RS_execute
5370     15.0386  postgres    pg_utf_mblen
4138     11.5884  postgres    pg_mblen
3756     10.5187  postgres    mb_strchr
2880      8.0654  postgres    FindWord
2754      7.7126  postgres    CheckAffix
1576      4.4136  postgres    NormalizeSubWord
 966      2.7053  postgres    FindAffixes
 896      2.5092  postgres    TParserGet
 742      2.0780  postgres    AllocSetAlloc
 420      1.1762  postgres    AllocSetFree
 396      1.1090  postgres    addHLParsedLex
 384      1.0754  postgres    LexizeExec

So about 55% of the time is going into affix pattern matching.
I wonder whether that couldn't be made faster. A lot of the cycles
are spent on coping with variable-length characters --- perhaps the
ispell code should convert to wchar representation before doing this?
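
To illustrate the idea (a rough sketch only, not the actual ispell code): the
first function below looks a character up in a set by walking UTF-8 bytes, so
it has to compute a character length at every step, which is roughly the kind
of work the mblen and mb_strchr entries in the profile represent. The second
path decodes the set to fixed-width code points once, after which a lookup is
a plain array scan. All names here (utf8_len, utf8_decode, mb_contains,
utf8_to_wide, wide_contains) are hypothetical, and the code assumes
well-formed UTF-8 input.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: byte length of the UTF-8 character starting at s.
 * Assumes well-formed input. */
int utf8_len(const unsigned char *s)
{
    if (*s < 0x80) return 1;
    if ((*s & 0xE0) == 0xC0) return 2;
    if ((*s & 0xF0) == 0xE0) return 3;
    return 4;
}

/* Hypothetical helper: decode one well-formed UTF-8 character of the given
 * byte length into a code point. */
uint32_t utf8_decode(const unsigned char *s, int len)
{
    static const unsigned char lead_mask[5] = {0, 0x7F, 0x1F, 0x0F, 0x07};
    uint32_t cp = s[0] & lead_mask[len];

    for (int i = 1; i < len; i++)
        cp = (cp << 6) | (s[i] & 0x3F);
    return cp;
}

/* Variable-length lookup: every step of the scan needs a length computation
 * before the characters can be compared. Returns 1 if found, else 0. */
int mb_contains(const char *set, const char *ch)
{
    assert(set != NULL && ch != NULL);

    const unsigned char *p = (const unsigned char *) set;
    int clen = utf8_len((const unsigned char *) ch);

    while (*p)
    {
        int plen = utf8_len(p);

        if (plen == clen && memcmp(p, ch, (size_t) clen) == 0)
            return 1;
        p += plen;
    }
    return 0;
}

/* One-time conversion to fixed-width code points. Writes at most cap
 * entries to out and returns the number of characters converted. */
size_t utf8_to_wide(const char *s, uint32_t *out, size_t cap)
{
    assert(s != NULL && out != NULL);

    const unsigned char *p = (const unsigned char *) s;
    size_t n = 0;

    while (*p && n < cap)
    {
        int len = utf8_len(p);

        out[n++] = utf8_decode(p, len);
        p += len;
    }
    return n;
}

/* Fixed-width lookup: a plain array scan with no per-character length
 * computation. Returns 1 if found, else 0. */
int wide_contains(const uint32_t *set, size_t n, uint32_t c)
{
    for (size_t i = 0; i < n; i++)
        if (set[i] == c)
            return 1;
    return 0;
}
```

The conversion costs one pass per string, so it would only pay off if each
converted string is then matched against many times, as affix patterns
presumably are.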

regards, tom lane
