Unaccent performance

From: Thom Brown <thom(at)linux(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Unaccent performance
Date: 2013-06-21 18:04:04
Message-ID: CAA-aLv41dV78AxFWZgXdYdi9-zJVbhVyx6DijGfqeJ-87UGjdw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

The unaccent extension is great, especially with its customisability, but
it's not always easy to recommend. I witnessed a customer using no less
than 56 nested replace functions in an SQL function. I looked to see how
much this can be mitigated by unaccent. It turns out that not all the
characters they were replacing can be replaced by unaccent, either because
they replace more than 1 character at a time, or the character they're
replacing, for some reason, isn't processed by unaccent, even with a custom
rules file.

So there were 20 characters I could identify that they were replacing. I
made a custom rules file and compared its performance to the
difficult-to-manage set of nested replace calls.

CREATE OR REPLACE FUNCTION public.myunaccent(sometext text)
RETURNS text
LANGUAGE sql
IMMUTABLE
AS $function$
SELECT
replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(sometext,'ą','a'),'Ą','A'),'ă','a'),'Ă','A'),'ā','a'),'Ā','A'),'æ','a'),'å','a'),'ä','a'),'ã','a'),'â','a'),'á','a'),'à','a'),'Æ','A'),'Å','A'),'Ä','A'),'Ã','A'),'Â','A'),'Á','A'),'À','A')
;
$function$

postgres=# SELECT myunaccent(sometext::text) FROM (SELECT
'ÀÁÂÃÄÅÆàáâãäåæĀāĂ㥹' sometext FROM generate_series(1,1000000)) x OFFSET
999999 LIMIT 1;
myunaccent
----------------------
AAAAAAAaaaaaaaAaAaAa
(1 row)

Time: 726.282 ms
postgres=# SELECT unaccent(sometext::text) FROM (SELECT
'ÀÁÂÃÄÅÆàáâãäåæĀāĂ㥹' sometext FROM generate_series(1,1000000)) x OFFSET
999999 LIMIT 1;
unaccent
----------------------
AAAAAAAaaaaaaaAaAaAa
(1 row)

Time: 3305.252 ms

The timings are actually pretty much the same even if I introduce 187
nested replace calls for every line in the unaccent.rules file for 187
characters. But the same character set with unaccent increases to 7418.526
ms with the same type of query as above. That's 10 times more expensive.

Is there a way to boost the performance to make its adoption more palatable?

--
Thom

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2013-06-21 18:11:47 Re: Request for Patch Feedback: Lag & Lead Window Functions Can Ignore Nulls
Previous Message David Fetter 2013-06-21 17:48:35 Review [was Re: MD5 aggregate]