Re: Allow multi-byte characters as escape in SIMILAR TO and SUBSTRING

From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Jeff Davis <pgsql(at)j-davis(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Allow multi-byte characters as escape in SIMILAR TO and SUBSTRING
Date: 2014-08-25 14:41:19
Message-ID: 53FB4B0F.9080502@vmware.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 07/12/2014 05:16 AM, Jeff Davis wrote:
> On Fri, 2014-07-11 at 11:51 -0400, Tom Lane wrote:
>> Jeff Davis <pgsql(at)j-davis(dot)com> writes:
>>> Attached is a small patch to $SUBJECT.
>>> In master, only single-byte characters are allowed as an escape. Of
>>> course, with the patch it must still be a single character, but it may
>>> be multi-byte.
>>
>> I'm concerned about the performance cost of this patch. Have you done
>> any measurements about what kind of overhead you are putting on the
>> inner loop of similar_escape?
>
> I didn't consider this very performance critical, because this is
> looping through the pattern, which I wouldn't expect to be a long
> string. On my machine using en_US.UTF-8, the difference is imperceptible
> for a SIMILAR TO ... ESCAPE query.
>
> I was able to see about a 2% increase in runtime when using the
> similar_escape function directly. I made a 10M tuple table and did:
>
> explain analyze
> select
> similar_escape('ΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣ','#') from t;
>
> which was the worst reasonable case I could think of.

Actually, that gets optimized to a constant in the planner:

postgres=# explain verbose select
similar_escape('ΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣ','#')
from t;
QUERY PLAN

--------------------------------------------------------------------------------
----------
Seq Scan on public.t (cost=0.00..144247.85 rows=9999985 width=0)
Output:
'^(?:ΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣ
)$'::text
Planning time: 0.033 ms
(3 rows)

With a working test case:

create table t (pattern text);
insert into t select
'ΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣ' from
generate_series(1, 1000000);
vacuum t;

explain (analyze) select similar_escape(pattern,'#') from t;

your patch seems to be about 2x-3x as slow as unpatched master. So this
needs some optimization. A couple of ideas:

1. If the escape string is in fact a single-byte character, you can
proceed with the loop just as it is today, without the pg_mblen calls.

2. Since pg_mblen() will always return an integer between 1-6, it would
probably be faster to replace the memcpy() and memcmp() calls with
simple for-loops iterating byte-by-byte.

In very brief testing, with the 1. change above, the performance with
this patch is back to what it's without the patch. See attached.

- Heikki

Attachment Content-Type Size
similar-escape-2.patch text/x-diff 3.7 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Heikki Linnakangas 2014-08-25 14:51:51 Re: Allow multi-byte characters as escape in SIMILAR TO and SUBSTRING
Previous Message Petr Jelinek 2014-08-25 14:15:07 Re: Built-in binning functions