Re: LIKE optimization in UTF-8 and locale-C

From: ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>
To: pgsql-hackers(at)postgresql(dot)org, pgsql-patches(at)postgresql(dot)org
Subject: LIKE optimization in UTF-8 and locale-C
Date: 2007-03-22 05:41:35
Message-ID: 20070322143734.62FA.ITAGAKI.TAKAHIRO@oss.ntt.co.jp
Lists: pgsql-hackers pgsql-patches

Hello,

I found that LIKE operators are slower on multi-byte encoding databases
than on single-byte encoding ones. The difference comes from
MatchText() vs. MBMatchText().

We've had an optimization for single-byte encodings, selected by the
pg_database_encoding_max_length() == 1 test. I propose to extend it to cover
the UTF-8 with locale-C case. In UTF-8, none of the trailing bytes of a
multi-byte character can be confused with a first byte, so in that case we can
use the single-byte functions and byte-wise comparison. With the attached
patch, the performance of UTF-8 LIKE operators is pushed up close to that of
the single-byte encodings.
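
For illustration only (a minimal sketch of that property, not part of the
patch): the bytes that can start a UTF-8 character and the bytes that can only
continue one come from disjoint ranges, so an ASCII pattern byte such as '%',
'_' or '\' can never be mistaken for the middle of a multi-byte character.

static int
utf8_is_first_byte(unsigned char b)
{
    return b < 0x80 || (b & 0xC0) == 0xC0;  /* ASCII or 11xxxxxx lead byte */
}

static int
utf8_is_trailing_byte(unsigned char b)
{
    return (b & 0xC0) == 0x80;              /* 10xxxxxx continuation byte */
}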

Databases initialized with locale-C are widely used in Japan, because the
Japanese locales are broken on almost all platforms. Japanese users can
choose EUC-JP or UTF-8 as the server encoding, but I think UTF-8 will be
used more and more in the future.

---- test ----

initdb --no-locale --encoding=<encoding>

[HEAD]
SQL_ASCII : 7171 ms / 7203 ms / 7187 ms
LATIN1 : 7172 ms / 7156 ms / 7141 ms
UTF8 : 16235 ms / 16281 ms / 16281 ms
EUC_JP : 17454 ms / 17453 ms / 17438 ms

[with patch]
SQL_ASCII : 7062 ms / 7125 ms / 7125 ms
LATIN1 : 7047 ms / 7063 ms / 7047 ms
UTF8 : 7188 ms / 7234 ms / 7235 ms
EUC_JP : 17468 ms / 17453 ms / 17453 ms

CREATE OR REPLACE FUNCTION test() RETURNS INTEGER AS $$
DECLARE
cnt integer;
BEGIN
FOR i IN 1..1000 LOOP
SELECT count(*) INTO cnt FROM item WHERE i_title LIKE '%BABABABABARIBA%' LIMIT 50;
END LOOP;
RETURN cnt;
END;
$$ LANGUAGE plpgsql;

SELECT count(*) FROM item; -- borrowed from DBT-1 (TPC-W)
count
-------
10000
(1 row)

---- patch ----

Index: src/backend/utils/adt/like.c
===================================================================
--- src/backend/utils/adt/like.c (head)
+++ src/backend/utils/adt/like.c (working copy)
@@ -21,6 +21,7 @@

#include "mb/pg_wchar.h"
#include "utils/builtins.h"
+#include "utils/pg_locale.h"


#define LIKE_TRUE 1
@@ -119,6 +120,13 @@


/*
+ * true iff match functions for single-byte characters are available.
+ */
+#define sb_match_available() \
+ (pg_database_encoding_max_length() == 1 || \
+ (lc_collate_is_c() && GetDatabaseEncoding() == PG_UTF8))
+
+/*
* interface routines called by the function manager
*/

@@ -138,7 +146,7 @@
p = VARDATA(pat);
plen = (VARSIZE(pat) - VARHDRSZ);

- if (pg_database_encoding_max_length() == 1)
+ if (sb_match_available())
result = (MatchText(s, slen, p, plen) == LIKE_TRUE);
else
result = (MBMatchText(s, slen, p, plen) == LIKE_TRUE);
@@ -162,7 +170,7 @@
p = VARDATA(pat);
plen = (VARSIZE(pat) - VARHDRSZ);

- if (pg_database_encoding_max_length() == 1)
+ if (sb_match_available())
result = (MatchText(s, slen, p, plen) != LIKE_TRUE);
else
result = (MBMatchText(s, slen, p, plen) != LIKE_TRUE);
@@ -186,7 +194,7 @@
p = VARDATA(pat);
plen = (VARSIZE(pat) - VARHDRSZ);

- if (pg_database_encoding_max_length() == 1)
+ if (sb_match_available())
result = (MatchText(s, slen, p, plen) == LIKE_TRUE);
else
result = (MBMatchText(s, slen, p, plen) == LIKE_TRUE);
@@ -210,7 +218,7 @@
p = VARDATA(pat);
plen = (VARSIZE(pat) - VARHDRSZ);

- if (pg_database_encoding_max_length() == 1)
+ if (sb_match_available())
result = (MatchText(s, slen, p, plen) != LIKE_TRUE);
else
result = (MBMatchText(s, slen, p, plen) != LIKE_TRUE);
@@ -275,7 +283,7 @@
int slen,
plen;

- if (pg_database_encoding_max_length() == 1)
+ if (sb_match_available())
{
s = NameStr(*str);
slen = strlen(s);
@@ -316,7 +324,7 @@
int slen,
plen;

- if (pg_database_encoding_max_length() == 1)
+ if (sb_match_available())
{
s = NameStr(*str);
slen = strlen(s);
@@ -357,7 +365,7 @@
int slen,
plen;

- if (pg_database_encoding_max_length() == 1)
+ if (sb_match_available())
{
s = VARDATA(str);
slen = (VARSIZE(str) - VARHDRSZ);
@@ -393,7 +401,7 @@
int slen,
plen;

- if (pg_database_encoding_max_length() == 1)
+ if (sb_match_available())
{
s = VARDATA(str);
slen = (VARSIZE(str) - VARHDRSZ);
@@ -429,7 +437,7 @@
text *esc = PG_GETARG_TEXT_P(1);
text *result;

- if (pg_database_encoding_max_length() == 1)
+ if (sb_match_available())
result = do_like_escape(pat, esc);
else
result = MB_do_like_escape(pat, esc);

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>
Cc: pgsql-hackers(at)postgresql(dot)org, pgsql-patches(at)postgresql(dot)org
Subject: Re: LIKE optimization in UTF-8 and locale-C
Date: 2007-03-22 15:08:08
Message-ID: 3049.1174576088@sss.pgh.pa.us
Lists: pgsql-hackers pgsql-patches

ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp> writes:
> I found LIKE operators are slower on multi-byte encoding databases
> than single-byte encoding ones. It comes from difference between
> MatchText() and MBMatchText().

> We've had an optimization for single-byte encodings using
> pg_database_encoding_max_length() == 1 test. I'll propose to extend it
> in UTF-8 with locale-C case.

If this works for UTF8, won't it work for all the backend-legal
encodings?

regards, tom lane


From: Hannu Krosing <hannu(at)skype(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, pgsql-hackers(at)postgresql(dot)org, pgsql-patches(at)postgresql(dot)org
Subject: Re: LIKE optimization in UTF-8 and locale-C
Date: 2007-03-22 20:11:09
Message-ID: 1174594269.3826.6.camel@localhost.localdomain
Lists: pgsql-hackers pgsql-patches

On a fine day, Thu, 2007-03-22 at 11:08, Tom Lane wrote:
> ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp> writes:
> > I found LIKE operators are slower on multi-byte encoding databases
> > than single-byte encoding ones. It comes from difference between
> > MatchText() and MBMatchText().
>
> > We've had an optimization for single-byte encodings using
> > pg_database_encoding_max_length() == 1 test. I'll propose to extend it
> > in UTF-8 with locale-C case.
>
> If this works for UTF8, won't it work for all the backend-legal
> encodings?

I guess it works well for % but not for _ , the latter has to know, how
many bytes the current (multibyte) character covers.

The length is still easy to find out for UTF8 encoding, so it may be
feasible to write UTF8MatchText() that is still faster than
MBMatchText().
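
For example (just a sketch; the backend already has pg_utf_mblen() for this),
the width can be read directly off the first byte:

static int
utf8_char_len(unsigned char first_byte)
{
    if (first_byte < 0x80)
        return 1;               /* plain ASCII */
    if ((first_byte & 0xE0) == 0xC0)
        return 2;               /* 110xxxxx */
    if ((first_byte & 0xF0) == 0xE0)
        return 3;               /* 1110xxxx */
    return 4;                   /* 11110xxx, assuming valid UTF-8 input */
}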

--
----------------
Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me: callto:hkrosing
Get Skype for free: http://www.skype.com


From: ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Hannu Krosing <hannu(at)skype(dot)net>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: LIKE optimization in UTF-8 and locale-C
Date: 2007-03-23 03:25:25
Message-ID: 20070323104410.635A.ITAGAKI.TAKAHIRO@oss.ntt.co.jp
Lists: pgsql-hackers pgsql-patches

Hannu Krosing <hannu(at)skype(dot)net> wrote:

> > > We've had an optimization for single-byte encodings using
> > > pg_database_encoding_max_length() == 1 test. I'll propose to extend it
> > > in UTF-8 with locale-C case.
> >
> > If this works for UTF8, won't it work for all the backend-legal
> > encodings?
>
> I guess it works well for % but not for _ , the latter has to know, how
> many bytes the current (multibyte) character covers.

Yes, % is not used in trailing bytes in any encoding, but _ is
used in some of them. I think we can use the optimization for all
of the server encodings except JOHAB.

Also, I noticed that locale settings are not used in LIKE matching,
so the following should be enough to check whether the byte-wise matching
functions can be used -- or am I missing something?

#define sb_match_available() (GetDatabaseEncoding() != PG_JOHAB)

Multi-byte encodings supported as server encodings:

              | % 0x25 | _ 0x5f | \ 0x5c |
--------------+--------+--------+--------+
EUC_JP        | unused | unused | unused |
EUC_CN        | unused | unused | unused |
EUC_KR        | unused | unused | unused |
EUC_TW        | unused | unused | unused |
JOHAB         | unused | *used* | *used* |
UTF8          | unused | unused | unused |
MULE_INTERNAL | unused | unused | unused |

Just for reference, encodings supported only as client encodings:

        | % 0x25 | _ 0x5f | \ 0x5c |
--------+--------+--------+--------+
SJIS    | unused | *used* | *used* |
BIG5    | unused | *used* | *used* |
GBK     | unused | *used* | *used* |
UHC     | unused | unused | unused |
GB18030 | unused | *used* | *used* |

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center


From: Dennis Bjorklund <db(at)zigo(dot)dhs(dot)org>
To: ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Hannu Krosing <hannu(at)skype(dot)net>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: LIKE optimization in UTF-8 and locale-C
Date: 2007-03-23 05:17:26
Message-ID: 460362E6.2040208@zigo.dhs.org
Lists: pgsql-hackers pgsql-patches

ITAGAKI Takahiro wrote:

>> I guess it works well for % but not for _ , the latter has to know, how
>> many bytes the current (multibyte) character covers.
>
> Yes, % is not used in trailing bytes for all encodings, but _ is
> used in some of them. I think we can use the optimization for all
> of the server encodings except JOHAB.

The problem with the LIKE pattern _ is that it has to know how long the
single character is that it should pass over. Say you have a UTF-8 string
with 2 characters encoded in 3 bytes ('ÖA'), where the first character
is 2 bytes:

0xC3 0x96 'A'

and now you want to match that with the LIKE pattern:

'_A'

How would that work in the C locale?
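
To make the bytes concrete, here is a toy standalone illustration of the two
interpretations (not PostgreSQL code):

#include <stdio.h>

/* The data 'ÖA' is the byte sequence {0xC3, 0x96, 'A'}.  A byte-wise '_'
 * consumes only 0xC3 and then compares 'A' against 0x96, so the match
 * fails; a character-aware '_' skips the whole two-byte character. */
static int
utf8_len(unsigned char b)
{
    if (b < 0x80) return 1;
    if ((b & 0xE0) == 0xC0) return 2;
    if ((b & 0xF0) == 0xE0) return 3;
    return 4;
}

int
main(void)
{
    const unsigned char data[] = {0xC3, 0x96, 'A'};

    int bytewise = (data[1] == 'A');                   /* '_' ate one byte */
    int charwise = (data[utf8_len(data[0])] == 'A');   /* '_' ate one char */

    printf("byte-wise: %d, character-wise: %d\n", bytewise, charwise);  /* 0, 1 */
    return 0;
}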

Maybe one should simply write a special version of LIKE for the UTF-8
encoding, since it's probably the most used encoding today. But I don't
think you can simply use the C-locale code and expect it to work for UTF-8.

/Dennis


From: ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>
To: Dennis Bjorklund <db(at)zigo(dot)dhs(dot)org>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: LIKE optimization in UTF-8 and locale-C
Date: 2007-03-23 05:45:47
Message-ID: 20070323142444.6368.ITAGAKI.TAKAHIRO@oss.ntt.co.jp
Lists: pgsql-hackers pgsql-patches


Dennis Bjorklund <db(at)zigo(dot)dhs(dot)org> wrote:

> The problem with the like pattern _ is that it has to know how long the
> single caracter is that it should pass over. Say you have a UTF-8 string
> with 2 characters encoded in 3 bytes ('ÖA'). Where the first character
> is 2 bytes:
>
> 0xC3 0x96 'A'
>
> and now you want to match that with the LIKE pattern:
>
> '_A'

Thanks, it all made sense to me. My proposal was completely wrong.
The optimization of MBMatchText() seems to be the right way...

> Maybe one should simply write a special version of LIKE for the UTF-8
> encoding since it's probably the most used encoding today. But I don't
> think you can use the C locale and that it would work for UTF-8.

Then again, the present LIKE matching is not locale-aware: we treat multi-byte
characters properly, but always perform a char-by-char comparison.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center


From: Andrew - Supernews <andrew+nonews(at)supernews(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: LIKE optimization in UTF-8 and locale-C
Date: 2007-03-23 06:00:20
Message-ID: slrnf06r7k.7me.andrew+nonews@atlantis.supernews.net
Lists: pgsql-hackers pgsql-patches

On 2007-03-22, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp> writes:
>> I found LIKE operators are slower on multi-byte encoding databases
>> than single-byte encoding ones. It comes from difference between
>> MatchText() and MBMatchText().
>
>> We've had an optimization for single-byte encodings using
>> pg_database_encoding_max_length() == 1 test. I'll propose to extend it
>> in UTF-8 with locale-C case.
>
> If this works for UTF8, won't it work for all the backend-legal
> encodings?

It works for UTF8 only because UTF8 has special properties which are not
shared by, for example, EUC_*. Specifically, in UTF8 the octet sequence
for a multibyte character will never appear as a subsequence of the octet
sequence of a string of other multibyte characters. i.e. given a string
of two two-octet characters AB, the second octet of A plus the first octet
of B is not a valid UTF8 character (and likewise for longer characters).

(And while I haven't tested it, it looks like the patch posted doesn't
account properly for the use of _, so it needs a bit more work.)

--
Andrew, Supernews
http://www.supernews.com - individual and corporate NNTP services


From: Andrew - Supernews <andrew+nonews(at)supernews(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: LIKE optimization in UTF-8 and locale-C
Date: 2007-03-23 06:10:39
Message-ID: slrnf06rqv.7me.andrew+nonews@atlantis.supernews.net
Lists: pgsql-hackers pgsql-patches

On 2007-03-23, ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp> wrote:
> Thanks, it all made sense to me. My proposal was completely wrong.

Actually, I think your proposal is fundamentally correct, merely incomplete.

Doing octet-based rather than character-based matching of strings is a
_design goal_ of UTF8. Treating UTF8 like any other multibyte charset and
converting everything to wide-chars is, in my opinion, always going to
result in suboptimal performance.

--
Andrew, Supernews
http://www.supernews.com - individual and corporate NNTP services


From: Hannu Krosing <hannu(at)skype(dot)net>
To: andrew(at)supernews(dot)com
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: LIKE optimization in UTF-8 and locale-C
Date: 2007-03-25 18:18:19
Message-ID: 1174846699.3344.8.camel@localhost.localdomain
Lists: pgsql-hackers pgsql-patches

On a fine day, Fri, 2007-03-23 at 06:10, Andrew - Supernews wrote:
> On 2007-03-23, ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp> wrote:
> > Thanks, it all made sense to me. My proposal was completely wrong.
>
> Actually, I think your proposal is fundamentally correct, merely incomplete.
>
> Doing octet-based rather than character-based matching of strings is a
> _design goal_ of UTF8. Treating UTF8 like any other multibyte charset and
> converting everything to wide-chars is, in my opinion, always going to
> result in suboptimal performance.

Yes, that was what I meant by proposing a UTF8-specific UTF8MatchText(),
which would not convert everything to wide chars, but instead do
byte-by-byte comparison and just be aware of the UTF-8 encoding, where it is
easy to know how wide (how many bytes/octets) each encoded character is.

--
----------------
Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me: callto:hkrosing
Get Skype for free: http://www.skype.com


From: ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>
To: pgsql-patches(at)postgresql(dot)org, andrew+nonews(at)supernews(dot)com
Subject: Multibyte LIKE optimization
Date: 2007-03-30 08:40:08
Message-ID: 20070330142456.77F9.ITAGAKI.TAKAHIRO@oss.ntt.co.jp
Lists: pgsql-hackers pgsql-patches

Andrew - Supernews <andrew+nonews(at)supernews(dot)com> wrote:

> Actually, I think your proposal is fundamentally correct, merely incomplete.

Yeah, I fixed the patch to handle '_' correctly.

> Doing octet-based rather than character-based matching of strings is a
> _design goal_ of UTF8.

I think all "safe ASCII-supersets" encodings are comparable by bytes,
not only UTF-8: all of their multi-byte characters consist of bytes
greater than 127. I updated the patch on that assumption. It normally uses
octet-based matching, and character-based matching only at '_'.
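
In outline (a rough standalone sketch of the approach, using UTF-8 as the
example encoding, heavily simplified and with no escape handling -- not the
attached patch itself): literal bytes are compared byte-wise, and only the
wildcards step over whole characters, so the text pointer stays
character-aligned whenever a wildcard is examined.

static int
utf8_char_len_sketch(unsigned char b)
{
    if (b < 0x80) return 1;
    if ((b & 0xE0) == 0xC0) return 2;
    if ((b & 0xF0) == 0xE0) return 3;
    return 4;
}

static int
match_sketch(const unsigned char *t, int tlen,
             const unsigned char *p, int plen)
{
    int     l;

    while (tlen > 0 && plen > 0)
    {
        if (*p == '_')
        {
            /* '_' consumes exactly one character of the text */
            l = utf8_char_len_sketch(*t);
            t += l; tlen -= l;
            p++;    plen--;
        }
        else if (*p == '%')
        {
            /* '%' tries every character-aligned suffix of the text */
            p++; plen--;
            for (;;)
            {
                if (match_sketch(t, tlen, p, plen))
                    return 1;
                if (tlen <= 0)
                    return 0;
                l = utf8_char_len_sketch(*t);
                t += l; tlen -= l;
            }
        }
        else if (*t == *p)
        {
            /* ordinary bytes, including trailing bytes of MB characters */
            t++; tlen--;
            p++; plen--;
        }
        else
            return 0;
    }
    while (plen > 0 && *p == '%')       /* trailing %s match the empty string */
    {
        p++; plen--;
    }
    return (tlen == 0 && plen == 0);
}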

There was a 30%+ performance win for selections using multi-byte LIKE '%foo%'.

 encoding  | HEAD    | patched
-----------+---------+-----------------
 SQL_ASCII | 7094ms  | 7062ms
 LATIN1    | 7083ms  | 7078ms
 UTF8      | 17974ms | 11635ms (64.7%)
 EUC_JP    | 17032ms | 12109ms (71.1%)

If this patch is acceptable, please drop JOHAB encoding from server encodings
before it is applied. Trailing bytes of JOHAB can be less than 128.
http://archives.postgresql.org/pgsql-hackers/2007-03/msg01475.php

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center

Attachment Content-Type Size
mbtextmatch.patch application/octet-stream 6.7 KB

From: ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>
To: "Andrew - Supernews" <andrew(at)supernews(dot)net>, pgsql-patches(at)postgresql(dot)org
Subject: UTF8MatchText
Date: 2007-04-02 04:56:04
Message-ID: 20070402133445.DDF8.ITAGAKI.TAKAHIRO@oss.ntt.co.jp
Lists: pgsql-hackers pgsql-patches

"Andrew - Supernews" <andrew(at)supernews(dot)net> wrote:

> ITAGAKI> I think all "safe ASCII-supersets" encodings are comparable
> ITAGAKI> by bytes, not only UTF-8.
>
> This is false, particularly for EUC.

Umm, I see. I updated the optimization to be used only for UTF8 case.
I also added some inlining hints that are useful on my machine (Pentium 4).

x1000 of LIKE '%foo%' on 10000-row tables [ms]
 encoding  |  HEAD |    P1 |    P2 |    P3
-----------+-------+-------+-------+-------
 SQL_ASCII |  7094 |  7120 |  7063 |  7031
 LATIN1    |  7083 |  7130 |  7057 |  7031
 UTF8      | 17974 | 10859 | 10839 |  9682
 EUC_JP    | 17032 | 17557 | 17599 | 15240

- P1: UTF8MatchText()
- P2: P1 + __inline__ GenericMatchText()
- P3: P2 + __inline__ wchareq()
(The attached patch is P3.)

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center

Attachment Content-Type Size
utf8matchtext.patch application/octet-stream 17.4 KB

From: Bruce Momjian <bruce(at)momjian(dot)us>
To: ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>
Cc: Andrew - Supernews <andrew(at)supernews(dot)net>, pgsql-patches(at)postgresql(dot)org
Subject: Re: UTF8MatchText
Date: 2007-04-02 22:54:43
Message-ID: 200704022254.l32Msh914608@momjian.us
Lists: pgsql-hackers pgsql-patches


I assume this replaces all your earlier multi-byte LIKE patches.

Your patch has been added to the PostgreSQL unapplied patches list at:

http://momjian.postgresql.org/cgi-bin/pgpatches

It will be applied as soon as one of the PostgreSQL committers reviews
and approves it.

---------------------------------------------------------------------------

ITAGAKI Takahiro wrote:
> "Andrew - Supernews" <andrew(at)supernews(dot)net> wrote:
>
> > ITAGAKI> I think all "safe ASCII-supersets" encodings are comparable
> > ITAGAKI> by bytes, not only UTF-8.
> >
> > This is false, particularly for EUC.
>
> Umm, I see. I updated the optimization to be used only for UTF8 case.
> I also added some inlining hints that are useful on my machine (Pentium 4).
>
>
> x1000 of LIKE '%foo% on 10000 rows tables [ms]
> encoding | HEAD | P1 | P2 | P3
> -----------+-------+-------+-------+-------
> SQL_ASCII | 7094 | 7120 | 7063 | 7031
> LATIN1 | 7083 | 7130 | 7057 | 7031
> UTF8 | 17974 | 10859 | 10839 | 9682
> EUC_JP | 17032 | 17557 | 17599 | 15240
>
> - P1: UTF8MatchText()
> - P2: P1 + __inline__ GenericMatchText()
> - P3: P2 + __inline__ wchareq()
> (The attached patch is P3.)
>
> Regards,
> ---
> ITAGAKI Takahiro
> NTT Open Source Software Center
>

[ Attachment, skipping... ]

>
> ---------------------------(end of broadcast)---------------------------
> TIP 7: You can help support the PostgreSQL project by donating at
>
> http://www.postgresql.org/about/donate

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>
Cc: Andrew - Supernews <andrew(at)supernews(dot)net>, pgsql-patches(at)postgresql(dot)org
Subject: Re: UTF8MatchText
Date: 2007-04-07 04:19:28
Message-ID: 200704070419.l374JSg19198@momjian.us
Lists: pgsql-hackers pgsql-patches


I do not understand this patch. You have defined two functions,
UTF8MatchText() and UTF8MatchTextIC(), and the difference between them
is that one calls CHAREQ and the other calls ICHAREQ, but just above
those two functions you define the macros identically:

#define CHAREQ(p1, p2) wchareq(p1, p2)
#define ICHAREQ(p1, p2) wchareq(p1, p2)

Why are there two functions? Also, can't you use one function and just
pass a boolean to indicate whether case it be ignored?

---------------------------------------------------------------------------

ITAGAKI Takahiro wrote:
> "Andrew - Supernews" <andrew(at)supernews(dot)net> wrote:
>
> > ITAGAKI> I think all "safe ASCII-supersets" encodings are comparable
> > ITAGAKI> by bytes, not only UTF-8.
> >
> > This is false, particularly for EUC.
>
> Umm, I see. I updated the optimization to be used only for UTF8 case.
> I also added some inlining hints that are useful on my machine (Pentium 4).
>
>
> x1000 of LIKE '%foo% on 10000 rows tables [ms]
> encoding | HEAD | P1 | P2 | P3
> -----------+-------+-------+-------+-------
> SQL_ASCII | 7094 | 7120 | 7063 | 7031
> LATIN1 | 7083 | 7130 | 7057 | 7031
> UTF8 | 17974 | 10859 | 10839 | 9682
> EUC_JP | 17032 | 17557 | 17599 | 15240
>
> - P1: UTF8MatchText()
> - P2: P1 + __inline__ GenericMatchText()
> - P3: P2 + __inline__ wchareq()
> (The attached patch is P3.)
>
> Regards,
> ---
> ITAGAKI Takahiro
> NTT Open Source Software Center
>

[ Attachment, skipping... ]

>
> ---------------------------(end of broadcast)---------------------------
> TIP 7: You can help support the PostgreSQL project by donating at
>
> http://www.postgresql.org/about/donate

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, Andrew - Supernews <andrew(at)supernews(dot)net>, pgsql-patches(at)postgresql(dot)org
Subject: Re: UTF8MatchText
Date: 2007-04-07 04:23:53
Message-ID: 200704070423.l374NrX19805@momjian.us
Lists: pgsql-hackers pgsql-patches

Bruce Momjian wrote:
>
> I do not understand this patch. You have defined two functions,
> UTF8MatchText() and UTF8MatchTextIC(), and the difference between them
> is that one calls CHAREQ and the other calls ICHAREQ, but just above
> those two functions you define the macros identically:
>
> #define CHAREQ(p1, p2) wchareq(p1, p2)
> #define ICHAREQ(p1, p2) wchareq(p1, p2)
>
> Why are there two functions? Also, can't you use one function and just
> pass a boolean to indicate whether case it be ignored?

Sorry, typo:

Why are there two functions? Also, can't you use one function and just
pass a boolean to indicate whether case should be ignored?
------

>
> ---------------------------------------------------------------------------
>
> ITAGAKI Takahiro wrote:
> > "Andrew - Supernews" <andrew(at)supernews(dot)net> wrote:
> >
> > > ITAGAKI> I think all "safe ASCII-supersets" encodings are comparable
> > > ITAGAKI> by bytes, not only UTF-8.
> > >
> > > This is false, particularly for EUC.
> >
> > Umm, I see. I updated the optimization to be used only for UTF8 case.
> > I also added some inlining hints that are useful on my machine (Pentium 4).
> >
> >
> > x1000 of LIKE '%foo% on 10000 rows tables [ms]
> > encoding | HEAD | P1 | P2 | P3
> > -----------+-------+-------+-------+-------
> > SQL_ASCII | 7094 | 7120 | 7063 | 7031
> > LATIN1 | 7083 | 7130 | 7057 | 7031
> > UTF8 | 17974 | 10859 | 10839 | 9682
> > EUC_JP | 17032 | 17557 | 17599 | 15240
> >
> > - P1: UTF8MatchText()
> > - P2: P1 + __inline__ GenericMatchText()
> > - P3: P2 + __inline__ wchareq()
> > (The attached patch is P3.)
> >
> > Regards,
> > ---
> > ITAGAKI Takahiro
> > NTT Open Source Software Center
> >
>
> [ Attachment, skipping... ]
>
> >
> > ---------------------------(end of broadcast)---------------------------
> > TIP 7: You can help support the PostgreSQL project by donating at
> >
> > http://www.postgresql.org/about/donate
>
> --
> Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
> EnterpriseDB http://www.enterprisedb.com
>
> + If your life is a hard drive, Christ can be your backup. +
>
> ---------------------------(end of broadcast)---------------------------
> TIP 5: don't forget to increase your free space map settings

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +


From: ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>
To: Bruce Momjian <bruce(at)momjian(dot)us>, pgsql-patches(at)postgresql(dot)org
Subject: Re: UTF8MatchText
Date: 2007-04-09 02:48:44
Message-ID: 20070409111732.8786.ITAGAKI.TAKAHIRO@oss.ntt.co.jp
Lists: pgsql-hackers pgsql-patches

Bruce Momjian <bruce(at)momjian(dot)us> wrote:

> > I do not understand this patch. You have defined two functions,
> > UTF8MatchText() and UTF8MatchTextIC(), and the difference between them
> > is that one calls CHAREQ and the other calls ICHAREQ, but just above
> > those two functions you define the macros identically:
>
> Why are there two functions? Also, can't you use one function and just
> pass a boolean to indicate whether case should be ignored?

The same is true of MBMatchText() and MBMatchTextIC().
Now, I'll split the patch into two changes.

1. DropMBMatchTextIC.patch
Drop MBMatchTextIC() and use MBMatchText() instead.

2. UTF8MatchText.patch
Add UTF8MatchText() as a specialized version of MBMatchText().

As future work, it might be good to investigate rewriting
"col ILIKE 'pattern'" to "lower(col) LIKE lower('pattern')" in the planner, so that
we can avoid calling lower() for a constant pattern on the right-hand side and
can use functional indexes on lower(col). I think we will never need MBMatchTextIC()
in the future unless we move to a wide-character server encoding :)

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center

Attachment Content-Type Size
DropMBMatchTextIC.patch application/octet-stream 17.1 KB
UTF8MatchText.patch application/octet-stream 3.9 KB

From: Bruce Momjian <bruce(at)momjian(dot)us>
To: ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>
Cc: Andrew - Supernews <andrew(at)supernews(dot)net>, pgsql-patches(at)postgresql(dot)org
Subject: Re: UTF8MatchText
Date: 2007-04-09 20:16:10
Message-ID: 200704092016.l39KGAi00878@momjian.us
Lists: pgsql-hackers pgsql-patches


Patch removed, updated version submitted.

---------------------------------------------------------------------------

ITAGAKI Takahiro wrote:
> "Andrew - Supernews" <andrew(at)supernews(dot)net> wrote:
>
> > ITAGAKI> I think all "safe ASCII-supersets" encodings are comparable
> > ITAGAKI> by bytes, not only UTF-8.
> >
> > This is false, particularly for EUC.
>
> Umm, I see. I updated the optimization to be used only for UTF8 case.
> I also added some inlining hints that are useful on my machine (Pentium 4).
>
>
> x1000 of LIKE '%foo% on 10000 rows tables [ms]
> encoding | HEAD | P1 | P2 | P3
> -----------+-------+-------+-------+-------
> SQL_ASCII | 7094 | 7120 | 7063 | 7031
> LATIN1 | 7083 | 7130 | 7057 | 7031
> UTF8 | 17974 | 10859 | 10839 | 9682
> EUC_JP | 17032 | 17557 | 17599 | 15240
>
> - P1: UTF8MatchText()
> - P2: P1 + __inline__ GenericMatchText()
> - P3: P2 + __inline__ wchareq()
> (The attached patch is P3.)
>
> Regards,
> ---
> ITAGAKI Takahiro
> NTT Open Source Software Center
>

[ Attachment, skipping... ]

>
> ---------------------------(end of broadcast)---------------------------
> TIP 7: You can help support the PostgreSQL project by donating at
>
> http://www.postgresql.org/about/donate

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>
Cc: pgsql-patches(at)postgresql(dot)org
Subject: Re: UTF8MatchText
Date: 2007-04-10 01:07:49
Message-ID: 200704100107.l3A17nL16507@momjian.us
Lists: pgsql-hackers pgsql-patches


Your patch has been added to the PostgreSQL unapplied patches list at:

http://momjian.postgresql.org/cgi-bin/pgpatches

It will be applied as soon as one of the PostgreSQL committers reviews
and approves it.

---------------------------------------------------------------------------

ITAGAKI Takahiro wrote:
> Bruce Momjian <bruce(at)momjian(dot)us> wrote:
>
> > > I do not understand this patch. You have defined two functions,
> > > UTF8MatchText() and UTF8MatchTextIC(), and the difference between them
> > > is that one calls CHAREQ and the other calls ICHAREQ, but just above
> > > those two functions you define the macros identically:
> >
> > Why are there two functions? Also, can't you use one function and just
> > pass a boolean to indicate whether case should be ignored?
>
> The same is true of MBMatchText() and MBMatchTextIC().
> Now, I'll split the patch into two changes.
>
> 1. DropMBMatchTextIC.patch
> Drop MBMatchTextIC() and use MBMatchText() instead.
>
> 2. UTF8MatchText.patch
> Add UTF8MatchText() as a specialized version of MBMatchText().
>
>
> As a future work, it might be good to research the performance of rewriting
> "col ILIKE 'pattern'" to "lower(col) LIKE lower('pattern')" in planner so that
> we can avoid to call lower() for constant pattern in the right-hand side and
> can use functional indexes (lower(col)). I think we never need MBMatchTextIC()
> in the future unless we move to wide-character server encoding :)
>
> Regards,
> ---
> ITAGAKI Takahiro
> NTT Open Source Software Center
>

[ Attachment, skipping... ]

[ Attachment, skipping... ]

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, pgsql-patches(at)postgresql(dot)org
Subject: Re: UTF8MatchText
Date: 2007-05-17 14:20:50
Message-ID: 464C64C2.6000804@dunslane.net
Lists: pgsql-hackers pgsql-patches


Itagaki,

I find this still fairly unclean. It certainly took me some time to get
my head around what's going on.

ISTM we should generate all these match functions from one body of code
plus some #define magic.
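
Something like the following (a rough, self-contained sketch with made-up
names and deliberately simplified matching -- no '%' or escape handling) is
the sort of thing I have in mind; each expansion differs only in how it
advances over one character:

#define DEFINE_MATCH(NAME, NEXTCHAR) \
static int \
NAME(const unsigned char *t, int tlen, const unsigned char *p, int plen) \
{ \
    while (tlen > 0 && plen > 0) \
    { \
        if (*p == '_') \
            NEXTCHAR(t, tlen);          /* skip exactly one character */ \
        else if (*t == *p) \
            t++, tlen--; \
        else \
            return 0; \
        p++, plen--; \
    } \
    return (tlen == 0 && plen == 0); \
}

/* single-byte encodings: one character == one byte */
#define SB_NEXTCHAR(t, n)  ((t)++, (n)--)

/* UTF-8: the character length can be read off the first byte */
#define UTF8_NEXTCHAR(t, n) \
    do { \
        int nc_len = (*(t) < 0x80) ? 1 : \
                     ((*(t) & 0xE0) == 0xC0) ? 2 : \
                     ((*(t) & 0xF0) == 0xE0) ? 3 : 4; \
        (t) += nc_len; (n) -= nc_len; \
    } while (0)

DEFINE_MATCH(SBMatchSketch, SB_NEXTCHAR)
DEFINE_MATCH(UTF8MatchSketch, UTF8_NEXTCHAR)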

As I understand it, we have three possible encoding switches: Single
Byte, UTF8 and other Multi Byte Charsets, and two possible case
settings: case Sensitive and Case Insensitive. That would make for a
total of six functions, but in the case of both UTF8 and other MBCS we
don't need a special Case Insensitive function - instead we downcase
both the string and the pattern and then use the Case Sensitive
function. That leaves a total of four functions.

What is not clear to me is why the UTF8 optimisation works, and why it
doesn't apply to other MBCS. At the very least we need a comment on that.

I also find the existing function naming convention somewhat annoying -
having foo() and MB_foo() is less than clear. I'd rather have SB_foo()
and MB_foo(). That's not your fault, of course.

If you supply me with some explanation on the UTF8 optimisation issue,
I'll prepare a revised patch along these lines.

cheers

andrew

ITAGAKI Takahiro wrote:
> Bruce Momjian <bruce(at)momjian(dot)us> wrote:
>
>
>>> I do not understand this patch. You have defined two functions,
>>> UTF8MatchText() and UTF8MatchTextIC(), and the difference between them
>>> is that one calls CHAREQ and the other calls ICHAREQ, but just above
>>> those two functions you define the macros identically:
>>>
>> Why are there two functions? Also, can't you use one function and just
>> pass a boolean to indicate whether case should be ignored?
>>
>
> The same is true of MBMatchText() and MBMatchTextIC().
> Now, I'll split the patch into two changes.
>
> 1. DropMBMatchTextIC.patch
> Drop MBMatchTextIC() and use MBMatchText() instead.
>
> 2. UTF8MatchText.patch
> Add UTF8MatchText() as a specialized version of MBMatchText().
>
>
> As a future work, it might be good to research the performance of rewriting
> "col ILIKE 'pattern'" to "lower(col) LIKE lower('pattern')" in planner so that
> we can avoid to call lower() for constant pattern in the right-hand side and
> can use functional indexes (lower(col)). I think we never need MBMatchTextIC()
> in the future unless we move to wide-character server encoding :)
>
> Regards,
> ---
> ITAGAKI Takahiro
> NTT Open Source Software Center
>
>
> ------------------------------------------------------------------------
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 9: In versions below 8.0, the planner will ignore your desire to
> choose an index scan if your joining column's datatypes do not
> match
>


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, pgsql-patches(at)postgresql(dot)org
Subject: Re: UTF8MatchText
Date: 2007-05-17 16:56:55
Message-ID: 464C8957.6070204@dunslane.net
Lists: pgsql-hackers pgsql-patches


I wrote:
>
>
> ISTM we should generate all these match functions from one body of
> code plus some #define magic.
>
> As I understand it, we have three possible encoding switches: Single
> Byte, UTF8 and other Multi Byte Charsets, and two possible case
> settings: case Sensitive and Case Insensitive. That would make for a
> total of six functions, but in the case of both UTF8 and other MBCS we
> don't need a special Case Insensitive function - instead we downcase
> both the string and the pattern and then use the Case Sensitive
> function. That leaves a total of four functions.
>
> What is not clear to me is why the UTF8 optimisation work, and why it
> doesn't apply to other MBCS. At the very least we need a comment on that.
>
> I also find the existing function naming convention somewhat annoying
> - having foo() and MB_foo() is less than clear. I'd rather have
> SB_foo() and MB_foo(). That's not your fault, of course.
>
> If you supply me with some explanation on the UTF8 optimisation issue,
> I'll prepare a revised patch along these lines.
>
>

Ok, I have studied some more and I think I understand what's going on.
AIUI, we are switching from some expensive char-wise comparisons to
cheap byte-wise comparisons in the UTF8 case because we know that in
UTF8 the magic characters ('_', '%' and '\') aren't a part of any other
character sequence. Is that putting it too mildly? Do we need stronger
conditions than that? If it's correct, are there other MBCS for which
this is true?

cheers

andrew


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, Bruce Momjian <bruce(at)momjian(dot)us>, pgsql-patches(at)postgresql(dot)org
Subject: Re: UTF8MatchText
Date: 2007-05-17 17:33:08
Message-ID: 3999.1179423188@sss.pgh.pa.us
Lists: pgsql-hackers pgsql-patches

Andrew Dunstan <andrew(at)dunslane(dot)net> writes:
> Ok, I have studied some more and I think I understand what's going on.
> AIUI, we are switching from some expensive char-wise comparisons to
> cheap byte-wise comparisons in the UTF8 case because we know that in
> UTF8 the magic characters ('_', '%' and '\') aren't a part of any other
> character sequence. Is that putting it too mildly? Do we need stronger
> conditions than that? If it's correct, are there other MBCS for which
> this is true?

I don't think this is a correct analysis. If it were correct then we
could use the optimization for all backend charsets because none of them
allow MB characters to contain non-high-bit-set bytes. But it was
stated somewhere upthread that that doesn't actually work. Clearly
it's a necessary property that we not falsely detect the magic pattern
characters, but that's not sufficient.

I think the real issue is that UTF8 has disjoint representations for
first-bytes and not-first-bytes of MB characters, and thus it is
impossible to make a false match in which an MB pattern character is
matched to the end of one data character plus the start of another.
In character sets without that property, we have to use the slow way to
ensure we don't make out-of-sync matches.
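
For concreteness (structural byte ranges only, ignoring validity details and
EUC_JP's SS2/SS3 forms):

/* UTF8:   a multi-byte character starts with a 11xxxxxx byte and continues
 *         with 10xxxxxx bytes -- the two sets are disjoint.
 * EUC_JP: both bytes of an ordinary two-byte character lie in 0xA1-0xFE,
 *         so the trailing byte of one character followed by the first byte
 *         of the next can itself look like the start of a character. */
static int utf8_starts_mb(unsigned char b)   { return (b & 0xC0) == 0xC0; }
static int utf8_trails_mb(unsigned char b)   { return (b & 0xC0) == 0x80; }
static int eucjp_starts_mb(unsigned char b)  { return b >= 0xA1 && b <= 0xFE; }
static int eucjp_trails_mb(unsigned char b)  { return b >= 0xA1 && b <= 0xFE; }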

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, Bruce Momjian <bruce(at)momjian(dot)us>, pgsql-patches(at)postgresql(dot)org
Subject: Re: UTF8MatchText
Date: 2007-05-17 17:39:41
Message-ID: 4081.1179423581@sss.pgh.pa.us
Lists: pgsql-hackers pgsql-patches

Wait a second ... I just thought of a counterexample that destroys the
entire concept. Consider the pattern 'A__B', which clearly is supposed
to match strings of four *characters*. With the proposed patch in
place, it would match strings of four *bytes*. Which is not the correct
behavior.

regards, tom lane


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, Bruce Momjian <bruce(at)momjian(dot)us>, pgsql-patches(at)postgresql(dot)org
Subject: Re: UTF8MatchText
Date: 2007-05-17 17:48:10
Message-ID: 464C955A.6050402@dunslane.net
Lists: pgsql-hackers pgsql-patches

Tom Lane wrote:
> UTF8 has disjoint representations for
> first-bytes and not-first-bytes of MB characters, and thus it is
> impossible to make a false match in which an MB pattern character is
> matched to the end of one data character plus the start of another.
> In character sets without that property, we have to use the slow way to
> ensure we don't make out-of-sync matches.
>
>
>

Thanks. I will include this info in the comments.

cheers

andrew


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, Bruce Momjian <bruce(at)momjian(dot)us>, pgsql-patches(at)postgresql(dot)org
Subject: Re: UTF8MatchText
Date: 2007-05-17 18:06:08
Message-ID: 464C9990.5010000@dunslane.net
Lists: pgsql-hackers pgsql-patches

Tom Lane wrote:
> Wait a second ... I just thought of a counterexample that destroys the
> entire concept. Consider the pattern 'A__B', which clearly is supposed
> to match strings of four *characters*. With the proposed patch in
> place, it would match strings of four *bytes*. Which is not the correct
> behavior.
>
>

From what I can see the code is quite careful about when it calls
NextByte vs NextChar, and after _ it calls NextChar.

So I'll test for this, but I think it's safe.

cheers

andrew


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, Bruce Momjian <bruce(at)momjian(dot)us>, pgsql-patches(at)postgresql(dot)org
Subject: Re: UTF8MatchText
Date: 2007-05-17 18:16:51
Message-ID: 4800.1179425811@sss.pgh.pa.us
Lists: pgsql-hackers pgsql-patches

Andrew Dunstan <andrew(at)dunslane(dot)net> writes:
> Tom Lane wrote:
>> Wait a second ... I just thought of a counterexample that destroys the
>> entire concept. Consider the pattern 'A__B', which clearly is supposed
>> to match strings of four *characters*. With the proposed patch in
>> place, it would match strings of four *bytes*. Which is not the correct
>> behavior.

> From what I can see the code is quite careful about when it calls
> NextByte vs NextChar, and after _ it calls NextChar.

Except that the entire point of this patch is to dumb down NextChar to
be the same as NextByte for UTF8 strings.

regards, tom lane


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, Bruce Momjian <bruce(at)momjian(dot)us>, pgsql-patches(at)postgresql(dot)org
Subject: Re: UTF8MatchText
Date: 2007-05-17 18:36:50
Message-ID: 464CA0C2.4010700@dunslane.net
Lists: pgsql-hackers pgsql-patches

Tom Lane wrote:
> Andrew Dunstan <andrew(at)dunslane(dot)net> writes:
>
>> Tom Lane wrote:
>>
>>> Wait a second ... I just thought of a counterexample that destroys the
>>> entire concept. Consider the pattern 'A__B', which clearly is supposed
>>> to match strings of four *characters*. With the proposed patch in
>>> place, it would match strings of four *bytes*. Which is not the correct
>>> behavior.
>>>
>
>
>> From what I can see the code is quite careful about when it calls
>> NextByte vs NextChar, and after _ it calls NextChar.
>>
>
> Except that the entire point of this patch is to dumb down NextChar to
> be the same as NextByte for UTF8 strings.
>
>
>

That's not what I see in (what I think is) the latest submission, which
includes this snippet:

+ /* Set up for utf8 characters */
+ #define CHAREQ(p1, p2) wchareq(p1, p2)
+ #define NextChar(p, plen) \
+ do { int __l = pg_utf_mblen(p); (p) +=__l; (plen) -=__l; } while (0)
+
+ /*
+ * UTF8MatchText -- specialized version of MBMatchText for UTF8
+ */
+ static int
+ UTF8MatchText(char *t, int tlen, char *p, int plen)

Am I looking at the wrong thing? This is from around April 9th I think.

cheers

andrew


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, Bruce Momjian <bruce(at)momjian(dot)us>, pgsql-patches(at)postgresql(dot)org
Subject: Re: UTF8MatchText
Date: 2007-05-17 19:48:47
Message-ID: 12299.1179431327@sss.pgh.pa.us
Lists: pgsql-hackers pgsql-patches

Andrew Dunstan <andrew(at)dunslane(dot)net> writes:
> Tom Lane wrote:
>> Except that the entire point of this patch is to dumb down NextChar to
>> be the same as NextByte for UTF8 strings.

> That's not what I see in (what I think is) the latest submission, which
> includes this snippet:

[ scratches head... ] OK, then I think I totally missed what this patch
is trying to accomplish; because this code looks just the same as the
existing multibyte-character operations. Where does the performance
improvement come from?

regards, tom lane


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, Bruce Momjian <bruce(at)momjian(dot)us>, pgsql-patches(at)postgresql(dot)org
Subject: Re: UTF8MatchText
Date: 2007-05-17 19:57:25
Message-ID: 464CB3A5.9020600@dunslane.net
Lists: pgsql-hackers pgsql-patches

Tom Lane wrote:
> Andrew Dunstan <andrew(at)dunslane(dot)net> writes:
>
>> Tom Lane wrote:
>>
>>> Except that the entire point of this patch is to dumb down NextChar to
>>> be the same as NextByte for UTF8 strings.
>>>
>
>
>> That's not what I see in (what I think is) the latest submission, which
>> includes this snippet:
>>
>
> [ scratches head... ] OK, then I think I totally missed what this patch
> is trying to accomplish; because this code looks just the same as the
> existing multibyte-character operations. Where does the performance
> improvement come from?
>
>
>

That's what bothered me. The trouble is that we have so much code that
looks *almost* identical.

From my WIP patch, here's where the difference appears to be - note
that UTF8 branch has two NextByte calls at the bottom, unlike the other
branch:

#ifdef UTF8_OPT
/*
* UTF8 is optimised to do byte at a time matching in most cases,
* thus saving expensive calls to NextChar.
*
* UTF8 has disjoint representations for first-bytes and
* not-first-bytes of MB characters, and thus it is
* impossible to make a false match in which an MB pattern
* character is matched to the end of one data character
* plus the start of another.
* In character sets without that property, we have to use the
* slow way to ensure we don't make out-of-sync matches.
*/
else if (*p == '_')
{
NextChar(t, tlen);
NextByte(p, plen);
continue;
}
else if (!BYTEEQ(t, p))
{
/*
* Not the single-character wildcard and no explicit match? Then
* time to quit...
*/
return LIKE_FALSE;
}

NextByte(t, tlen);
NextByte(p, plen);
#else
/*
* Branch for non-utf8 multi-byte charsets and also for single-byte
* charsets which don't gain any benefit from the above optimisation.
*/

else if ((*p != '_') && !CHAREQ(t, p))
{
/*
* Not the single-character wildcard and no explicit match? Then
* time to quit...
*/
return LIKE_FALSE;
}

NextChar(t, tlen);
NextChar(p, plen);

#endif /* UTF8_OPT */

cheers

andrew


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, Bruce Momjian <bruce(at)momjian(dot)us>, pgsql-patches(at)postgresql(dot)org
Subject: Re: UTF8MatchText
Date: 2007-05-17 21:18:35
Message-ID: 13130.1179436715@sss.pgh.pa.us
Lists: pgsql-hackers pgsql-patches

Andrew Dunstan <andrew(at)dunslane(dot)net> writes:
> From my WIP patch, here's where the difference appears to be - note
> that UTF8 branch has two NextByte calls at the bottom, unlike the other
> branch:

Oh, I see: NextChar is still "real" but the patch is willing to have t
and p pointing into the middle of an MB character. That's a bit
risky. I think it works but it's making at least the following
undocumented assumptions:

* At a pattern backslash, it applies CHAREQ() but then advances
byte-by-byte over the matched characters (implicitly assuming that none
of these bytes will look like the magic characters). While that works
for backend-safe encodings, it seems a bit strange; you've already paid
the price of determining the character length once, not to mention
matching the bytes of the characters once, and then throw that knowledge
away. I think BYTEEQ would make more sense in the backslash path.

* At pattern % or _, it's critical that we do NextChar not NextByte
on the data side. Else t is pointing into the middle of an MB sequence
when p isn't, and we have various out-of-sync conditions to worry about,
notably possibly calling NextChar when t is not pointing at the first
byte of the character, which will result in a wrong answer about the
character length.

* We *must* do NextChar not NextByte for _ since we have to match it to
exactly one logical character, not byte. You could imagine trying to do
% a byte at a time (and indeed that's what I'd been thinking it did)
but that gets you out of sync which breaks the _ case.

So the actual optimization here is that we do bytewise comparison and
advancing, but only when we are either at the start of a character
(on both sides, and the pattern char is not wildcard) or we are in the
middle of a character (on both sides) and we've already proven that both
sides matched for the previous byte(s) of the character.

On the strength of this closer reading, I would say that the patch isn't
relying on UTF8's first-byte-vs-not-first-byte property after all.
All that it's relying on is that no MB character is a prefix of another
one, which seems like a necessary property for any sane encoding; plus
that characters are considered equal only if they're bytewise equal.
So are we sure it doesn't work for non-UTF8 encodings? Maybe that
earlier conclusion was based on a misunderstanding of what the patch
really does.

regards, tom lane


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, Bruce Momjian <bruce(at)momjian(dot)us>, pgsql-patches(at)postgresql(dot)org
Subject: Re: UTF8MatchText
Date: 2007-05-17 22:24:59
Message-ID: 464CD63B.7000609@dunslane.net
Lists: pgsql-hackers pgsql-patches

Tom Lane wrote:
>
> * At a pattern backslash, it applies CHAREQ() but then advances
> byte-by-byte over the matched characters (implicitly assuming that none
> of these bytes will look like the magic characters). While that works
> for backend-safe encodings, it seems a bit strange; you've already paid
> the price of determining the character length once, not to mention
> matching the bytes of the characters once, and then throw that knowledge
> away. I think BYTEEQ would make more sense in the backslash path.
>

Probably, although the use of CHAREQ is in the present code.

Is it legal to follow escape by anything other than _ % or escape?

>
> So the actual optimization here is that we do bytewise comparison and
> advancing, but only when we are either at the start of a character
> (on both sides, and the pattern char is not wildcard) or we are in the
> middle of a character (on both sides) and we've already proven that both
> sides matched for the previous byte(s) of the character.
>

I think that's correct.

> On the strength of this closer reading, I would say that the patch isn't
> relying on UTF8's first-byte-vs-not-first-byte property after all.
> All that it's relying on is that no MB character is a prefix of another
> one, which seems like a necessary property for any sane encoding; plus
> that characters are considered equal only if they're bytewise equal.
> So are we sure it doesn't work for non-UTF8 encodings? Maybe that
> earlier conclusion was based on a misunderstanding of what the patch
> really does.
>
>
>

Indeed.

One more thing - I'm thinking of rolling up the bytea matching routines
as well as the text routines to eliminate all the duplication of logic.
I can do it by a little type casting from bytea* to text* and back
again, or if that's not acceptable by some preprocessor magic. I think
the casting is likely to be safe enough in this case - I don't think a
null byte will hurt us anywhere in this code - and presumably the
varlena stuff is all the same. Does that sound reasonable?
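
Roughly what I have in mind, as a sketch only (this would live in like.c next
to MatchText(); the name is made up and the real entry points would of course
keep their fmgr-style signatures):

static bool
bytea_like_sketch(bytea *str, bytea *pat)
{
    /* bytea and text share the varlena layout, so the byte-wise matcher
     * can be reused directly on the payload */
    return MatchText(VARDATA(str), VARSIZE(str) - VARHDRSZ,
                     VARDATA(pat), VARSIZE(pat) - VARHDRSZ) == LIKE_TRUE;
}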

cheers

andrew


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, Bruce Momjian <bruce(at)momjian(dot)us>, pgsql-patches(at)postgresql(dot)org
Subject: Re: UTF8MatchText
Date: 2007-05-17 23:45:50
Message-ID: 15910.1179445550@sss.pgh.pa.us
Lists: pgsql-hackers pgsql-patches

Andrew Dunstan <andrew(at)dunslane(dot)net> writes:
> Is it legal to follow escape by anything other than _ % or escape?

Certainly, but once you've compared the first byte you can handle any
remaining bytes via the main loop. And in fact the code is already
depending on being able to do that --- the use of CHAREQ rather than
BYTEEQ is just wasted cycles.

> One more thing - I'm thinking of rolling up the bytea matching routines
> as well as the text routines to eliminate all the duplication of logic.

+1, I think, but I wonder why we had the duplication in the first
place. Is there any likelihood that bytea's semantics should diverge
from the text case?

regards, tom lane


From: ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Andrew Dunstan <andrew(at)dunslane(dot)net>, Bruce Momjian <bruce(at)momjian(dot)us>, pgsql-patches(at)postgresql(dot)org
Subject: Re: UTF8MatchText
Date: 2007-05-18 02:07:20
Message-ID: 20070518105529.91CB.ITAGAKI.TAKAHIRO@oss.ntt.co.jp
Lists: pgsql-hackers pgsql-patches


Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> On the strength of this closer reading, I would say that the patch isn't
> relying on UTF8's first-byte-vs-not-first-byte property after all.
> All that it's relying on is that no MB character is a prefix of another
> one, which seems like a necessary property for any sane encoding; plus
> that characters are considered equal only if they're bytewise equal.
> So are we sure it doesn't work for non-UTF8 encodings? Maybe that
> earlier conclusion was based on a misunderstanding of what the patch
> really does.

Yes, I only used the 'disjoint representations for first-bytes and
not-first-bytes of MB characters' feature of UTF8. Other encodings
allow both [AB] and [BA] as MB character patterns. UTF8Match() does
not cope with those encodings; if we have '[AB][AB]' in a table and
search it with LIKE '%[BA]%', we would judge that they match, by mistake.

I had also misunderstood this before, and Dennis Bjorklund pointed it out:
http://archives.postgresql.org/pgsql-hackers/2007-03/msg01377.php

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>
Cc: Andrew Dunstan <andrew(at)dunslane(dot)net>, Bruce Momjian <bruce(at)momjian(dot)us>, pgsql-patches(at)postgresql(dot)org
Subject: Re: UTF8MatchText
Date: 2007-05-18 02:35:06
Message-ID: 17562.1179455706@sss.pgh.pa.us
Lists: pgsql-hackers pgsql-patches

ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp> writes:
> Yes, I only used the 'disjoint representations for first-bytes and
> not-first-bytes of MB characters' feature in UTF8. Other encodings
> allows both [AB] and [BA] for MB character patterns. UTF8Match() does
> not cope with those encodings; If we have '[AB][AB]' in a table and
> search it with LIKE '%[BA]%', we judge that they are matched by mistake.

AFAICS, the patch does *not* make that mistake because % will not
advance over a fractional character.

regards, tom lane


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, Bruce Momjian <bruce(at)momjian(dot)us>, pgsql-patches(at)postgresql(dot)org
Subject: Re: UTF8MatchText
Date: 2007-05-18 03:06:05
Message-ID: 464D181D.9010307@dunslane.net
Lists: pgsql-hackers pgsql-patches

Tom Lane wrote:
> ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp> writes:
>
>> Yes, I only used the 'disjoint representations for first-bytes and
>> not-first-bytes of MB characters' feature in UTF8. Other encodings
>> allows both [AB] and [BA] for MB character patterns. UTF8Match() does
>> not cope with those encodings; If we have '[AB][AB]' in a table and
>> search it with LIKE '%[BA]%', we judge that they are matched by mistake.
>>
>
> AFAICS, the patch does *not* make that mistake because % will not
> advance over a fractional character.
>
>

Yeah, I think that's right.

Attached is my current WIP patch. If we decide that this optimisation
can in fact be applied to all backend encodings, that will be easily
incorporated and will simplify the code further. Note that all the
common code in the MatchText and do_like_escape functions has been
factored out, and the bytea functions just call the single-byte text
versions; AFAICS the effect will be identical to having the specialised
versions. (I'm always happy when code volume can be reduced.)
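
One common C idiom for that kind of factoring, without paying for
per-character indirection, is to put the shared loop in a file that is
#included more than once under different #defines. Whether the WIP patch
does exactly that is not shown here, so treat the fragment below purely as
an illustration of the technique; the file name, the macro names, and the
deliberately simplified prefix-compare loop are all made up:

/*
 * match_template.c: hypothetical file, meant to be #included rather
 * than compiled on its own.  The including file defines MATCH_FN (the
 * name the function should get) and CMPEQ (how two characters compare).
 */
static int
MATCH_FN(const char *t, int tlen, const char *p, int plen)
{
    while (tlen > 0 && plen > 0)
    {
        if (!CMPEQ(t, p))
            return 0;
        t++, tlen--;
        p++, plen--;
    }
    return plen == 0;       /* matched if the whole pattern was consumed */
}

The includer then instantiates as many flavours as it needs:

#include <ctype.h>

#define MATCH_FN SB_PrefixMatch                 /* case-sensitive flavour */
#define CMPEQ(a, b) (*(a) == *(b))
#include "match_template.c"
#undef MATCH_FN
#undef CMPEQ

#define MATCH_FN SB_PrefixMatchIC               /* case-insensitive flavour */
#define CMPEQ(a, b) \
    (tolower((unsigned char) *(a)) == tolower((unsigned char) *(b)))
#include "match_template.c"
#undef MATCH_FN
#undef CMPEQ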

cheers

andrew

Attachment Content-Type Size
utf8.patch text/x-patch 24.7 KB

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, Bruce Momjian <bruce(at)momjian(dot)us>, pgsql-patches(at)postgresql(dot)org
Subject: Re: UTF8MatchText
Date: 2007-05-18 03:31:22
Message-ID: 18822.1179459082@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

Andrew Dunstan <andrew(at)dunslane(dot)net> writes:
> Attached is my current WIP patch.

A few quick eyeball comments:

> ! static __inline__ int

Under *no* circumstances use __inline__, as it will certainly break
every non-gcc compiler. Use "inline", which we #define appropriately
at need.

> ! * UTF8 has disjoint representations for first-bytes and
> ! * not-first-bytes of MB characters, and thus it is
> ! * impossible to make a false match in which an MB pattern
> ! * character is matched to the end of one data character
> ! * plus the start of another.
> ! * In character sets without that property, we have to use the
> ! * slow way to ensure we don't make out-of-sync matches.

I thought we'd concluded that this explanation is pseudo-science?

> ! * Branch for non-utf8 multi-byte charsets and also for single-byte
> ! * charsets which don't gain any benefir from the above optimisation.

spellcheck...

regards, tom lane


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, Bruce Momjian <bruce(at)momjian(dot)us>, pgsql-patches(at)postgresql(dot)org
Subject: Re: UTF8MatchText
Date: 2007-05-18 03:42:45
Message-ID: 464D20B5.2030104@dunslane.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

Tom Lane wrote:
> Under *no* circumstances use __inline__, as it will certainly break
> every non-gcc compiler. Use "inline", which we #define appropriately
> at need.
>

OK. (This was from the upstream patch.)
>
> I thought we'd concluded that this explanation is pseudo-science?
>
>
[...]
>
> spellcheck...
>
>
>

Right. I'm waiting on a consensus about the UTF8-ness of the solution
before I adjust comments.

cheers

andrew


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, Bruce Momjian <bruce(at)momjian(dot)us>, pgsql-patches(at)postgresql(dot)org
Subject: Re: UTF8MatchText
Date: 2007-05-18 15:43:49
Message-ID: 464DC9B5.8050803@dunslane.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

Tom Lane wrote:
> ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp> writes:
>
>> Yes, I only used the 'disjoint representations for first-bytes and
>> not-first-bytes of MB characters' feature in UTF8. Other encodings
>> allow both [AB] and [BA] as byte patterns for MB characters. UTF8Match() does
>> not cope with those encodings; if we have '[AB][AB]' in a table and
>> search it with LIKE '%[BA]%', we would wrongly judge that they match.
>>
>
> AFAICS, the patch does *not* make that mistake because % will not
> advance over a fractional character.
>
>
>

Unless I hear differently, my present intention is to apply the
suggested improvement universally. I'll wait a day or two before
completing the patch.

cheers

andrew


From: Dennis Bjorklund <db(at)zigo(dot)dhs(dot)org>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Andrew Dunstan <andrew(at)dunslane(dot)net>, ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, Bruce Momjian <bruce(at)momjian(dot)us>, pgsql-patches(at)postgresql(dot)org
Subject: Re: UTF8MatchText
Date: 2007-05-20 07:44:54
Message-ID: 464FFC76.9060308@zigo.dhs.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

Tom Lane wrote:
> You could imagine trying to do
> % a byte at a time (and indeed that's what I'd been thinking it did)
> but that gets you out of sync which breaks the _ case.

This is only a problem when you have a pattern like '%_', and we could
detect that and go byte by byte when it's not the case. Right now we check
(*p == '\\') || (*p == '_') in each iteration when we scan over characters
for '%'; we could do that check once and have different loops for the two
cases.
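
In code, the shape of that suggestion might look roughly like the fragment
below. This is only a sketch, not the patch itself: NextByte just means
stepping a single byte (p++, plen--), the LIKE_* constants and NextChar are
the names already used in like.c, and a '_' appearing later in the pattern
(say '%a_') would still need the character-wise treatment, which is exactly
the sort of corner case under discussion.

if (*p == '%')
{
    /* step over the '%', folding in any immediately adjacent '%'s */
    do
    {
        NextByte(p, plen);
    } while (plen > 0 && *p == '%');

    if (plen <= 0)
        return LIKE_TRUE;   /* pattern ends with '%': it matches the rest */

    if (*p == '_' || *p == '\\')
    {
        /* '_' (or an escape) follows: the scan must stay character-aligned */
        while (tlen > 0)
        {
            int     matched = MatchText(t, tlen, p, plen);

            if (matched != LIKE_FALSE)
                return matched;             /* TRUE or ABORT */
            NextChar(t, tlen);
        }
    }
    else
    {
        /* a literal follows: the test above is made once, not per iteration */
        while (tlen > 0)
        {
            if (*t == *p)                   /* only recurse on a possible hit */
            {
                int     matched = MatchText(t, tlen, p, plen);

                if (matched != LIKE_FALSE)
                    return matched;         /* TRUE or ABORT */
            }
            NextByte(t, tlen);
        }
    }
    return LIKE_ABORT;      /* text exhausted under '%': no use retrying */
}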

Apart from this part, which I think can be optimized, I don't see anything
wrong with the idea behind the patch. Making the '%' case fast might be an
important optimization for a lot of use cases; it's not uncommon for '%' to
match a bigger part of the string than the rest of the pattern does.

It's easy to make a mistake when one is used to thinking in terms of simple
fixed-size characters like ASCII. Strange that such a simple topic can be so
difficult to reason about... :-)

/Dennis


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Dennis Bjorklund <db(at)zigo(dot)dhs(dot)org>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, Bruce Momjian <bruce(at)momjian(dot)us>, pgsql-patches(at)postgresql(dot)org
Subject: Re: UTF8MatchText
Date: 2007-05-20 13:28:29
Message-ID: 46504CFD.5040505@dunslane.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

Dennis Bjorklund wrote:
> Tom Lane wrote:
>> You could imagine trying to do
>> % a byte at a time (and indeed that's what I'd been thinking it did)
>> but that gets you out of sync which breaks the _ case.
>
> This is only a problem when you have a pattern like '%_', and we could
> detect that and go byte by byte when it's not the case. Right now we
> check (*p == '\\') || (*p == '_') in each iteration when we scan over
> characters for '%'; we could do that check once and have different
> loops for the two cases.
>
> Apart from this part, which I think can be optimized, I don't see
> anything wrong with the idea behind the patch. Making the '%' case
> fast might be an important optimization for a lot of use cases; it's
> not uncommon for '%' to match a bigger part of the string than the
> rest of the pattern does.
>

Are you sure? The big remaining char-matching bottleneck will surely be
in the code that scans for a place to start matching a %. But that's
exactly where we can't use byte matching for cases where the charset
might include AB and BA as characters - the pattern might contain %BA
and the string AB. However, this isn't a danger for UTF8, which leads me
to think that we do indeed need a special case for UTF8, but for a
different improvement from that proposed in the original patch. I'll
post an updated patch shortly.
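
A toy demonstration of that overlap, using a hypothetical two-byte charset
rather than any real one: the data below is the two-character string
[AB][AB], the pattern body after the '%' is the character [BA], and a
purely byte-wise scan locks onto a fragment straddling the character
boundary.

#include <stdio.h>
#include <string.h>

int
main(void)
{
    const char *data = "ABAB";  /* two characters: [AB][AB]           */
    const char *frag = "BA";    /* the pattern body following the '%' */

    /* a byte-wise scan finds "BA" at offset 1, i.e. mid-character */
    printf("found at byte offset %ld\n", (long) (strstr(data, frag) - data));
    return 0;
}

This prints offset 1, which is the false match. UTF-8's disjoint lead and
continuation bytes make that situation impossible, which is why the
byte-wise %-scan is safe there but not in the general multi-byte case.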

cheers

andrew


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Dennis Bjorklund <db(at)zigo(dot)dhs(dot)org>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, Bruce Momjian <bruce(at)momjian(dot)us>, pgsql-patches(at)postgresql(dot)org
Subject: Re: UTF8MatchText
Date: 2007-05-20 14:11:16
Message-ID: 46505704.8040203@dunslane.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

I wrote:
>
>
>>
>> This is only a problem when you have a pattern like '%_', and we could
>> detect that and go byte by byte when it's not the case. Right now we
>> check (*p == '\\') || (*p == '_') in each iteration when we scan over
>> characters for '%'; we could do that check once and have different
>> loops for the two cases.
>>
>> Apart from this part, which I think can be optimized, I don't see
>> anything wrong with the idea behind the patch. Making the '%' case
>> fast might be an important optimization for a lot of use cases; it's
>> not uncommon for '%' to match a bigger part of the string than the
>> rest of the pattern does.
>>
>
>
> Are you sure? The big remaining char-matching bottleneck will surely
> be in the code that scans for a place to start matching a %. But
> that's exactly where we can't use byte matching for cases where the
> charset might include AB and BA as characters - the pattern might
> contain %BA and the string AB. However, this isn't a danger for UTF8,
> which leads me to think that we do indeed need a special case for
> UTF8, but for a different improvement from that proposed in the
> original patch. I'll post an updated patch shortly.
>

Here is a patch that implements this. Please analyse for possible breakage.

cheers

andrew


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Dennis Bjorklund <db(at)zigo(dot)dhs(dot)org>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, Bruce Momjian <bruce(at)momjian(dot)us>, pgsql-patches(at)postgresql(dot)org
Subject: Re: UTF8MatchText
Date: 2007-05-20 14:21:58
Message-ID: 46505986.1020005@dunslane.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches


Oops, patch attached this time.

Andrew Dunstan wrote:
>
>
> I wrote:
>>
>>
>>>
>>> This is only a problem when you have a pattern like '%_', and we could
>>> detect that and go byte by byte when it's not the case. Right now we
>>> check (*p == '\\') || (*p == '_') in each iteration when we scan over
>>> characters for '%'; we could do that check once and have different
>>> loops for the two cases.
>>>
>>> Apart from this part, which I think can be optimized, I don't see
>>> anything wrong with the idea behind the patch. Making the '%' case
>>> fast might be an important optimization for a lot of use cases; it's
>>> not uncommon for '%' to match a bigger part of the string than the
>>> rest of the pattern does.
>>>
>>
>>
>> Are you sure? The big remaining char-matching bottleneck will surely
>> be in the code that scans for a place to start matching a %. But
>> that's exactly where we can't use byte matching for cases where the
>> charset might include AB and BA as characters - the pattern might
>> contain %BA and the string AB. However, this isn't a danger for UTF8,
>> which leads me to think that we do indeed need a special case for
>> UTF8, but for a different improvement from that proposed in the
>> original patch. I'll post an updated patch shortly.
>>
>
> Here is a patch that implements this. Please analyse for possible
> breakage.
>
> cheers
>
> andrew
>
>

Attachment Content-Type Size
utf8.patch text/x-patch 25.8 KB

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: Dennis Bjorklund <db(at)zigo(dot)dhs(dot)org>, ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, Bruce Momjian <bruce(at)momjian(dot)us>, pgsql-patches(at)postgresql(dot)org
Subject: Re: UTF8MatchText
Date: 2007-05-20 16:58:05
Message-ID: 2132.1179680285@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

Andrew Dunstan <andrew(at)dunslane(dot)net> writes:
> Are you sure? The big remaining char-matching bottleneck will surely
> be in the code that scans for a place to start matching a %. But
> that's exactly where we can't use byte matching for cases where the
> charset might include AB and BA as characters - the pattern might
> contain %BA and the string AB. However, this isn't a danger for UTF8,
> which leads me to think that we do indeed need a special case for
> UTF8, but for a different improvement from that proposed in the
> original patch. I'll post an updated patch shortly.

> Here is a patch that implements this. Please analyse for possible
> breakage.

On the strength of this analysis, shouldn't we drop the separate
UTF8 match function and just use SB_MatchText for UTF8?

It strikes me that we may be overcomplicating matters in another way
too. If you believe that the %-scan code is now the bottleneck, that
is, the key loop is where we have pattern '%foo' and we are trying to
match 'f' to each successive data position, then you should be bothered
that SB_MatchTextIC is applying tolower() to 'f' again for each data
character. Worst-case we could have O(N^2) applications of tolower()
during a match. I think there's a fair case to be made that we should
get rid of SB_MatchTextIC and implement *all* the case-insensitive
variants by means of an initial lower() call. This would leave us with
just two match functions and allow considerable unification of the setup
logic.
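
A minimal sketch of that idea, assuming a single-byte encoding:
str_tolower_copy is a made-up helper and MatchText stands in for the
existing case-sensitive matcher; the real thing would go through the
locale-aware lower() machinery rather than raw tolower().

#include <ctype.h>
#include <stdlib.h>

/* the existing case-sensitive matcher, defined elsewhere */
extern int MatchText(char *t, int tlen, char *p, int plen);

/* hypothetical helper: fold a counted byte string to lower case, once */
static char *
str_tolower_copy(const char *s, int len)
{
    char   *r = malloc(len);
    int     i;

    for (i = 0; i < len; i++)
        r[i] = (char) tolower((unsigned char) s[i]);
    return r;
}

/*
 * Case-insensitive match by folding both inputs up front, so tolower()
 * runs O(N) times in total instead of once per %-restart position.
 */
static int
match_text_ci(const char *s, int slen, const char *p, int plen)
{
    char   *ls = str_tolower_copy(s, slen);
    char   *lp = str_tolower_copy(p, plen);
    int     result = MatchText(ls, slen, lp, plen);

    free(ls);
    free(lp);
    return result;
}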

regards, tom lane


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Dennis Bjorklund <db(at)zigo(dot)dhs(dot)org>, ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, Bruce Momjian <bruce(at)momjian(dot)us>, pgsql-patches(at)postgresql(dot)org
Subject: Re: UTF8MatchText
Date: 2007-05-20 19:45:54
Message-ID: 4650A572.2090109@dunslane.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

Tom Lane wrote:
> On the strength of this analysis, shouldn't we drop the separate
> UTF8 match function and just use SB_MatchText for UTF8?
>

Possibly - IIRC I looked at that and there was some reason I didn't, but
I'll look again.

> It strikes me that we may be overcomplicating matters in another way
> too. If you believe that the %-scan code is now the bottleneck, that
> is, the key loop is where we have pattern '%foo' and we are trying to
> match 'f' to each successive data position, then you should be bothered
> that SB_MatchTextIC is applying tolower() to 'f' again for each data
> character. Worst-case we could have O(N^2) applications of tolower()
> during a match. I think there's a fair case to be made that we should
> get rid of SB_MatchTextIC and implement *all* the case-insensitive
> variants by means of an initial lower() call. This would leave us with
> just two match functions and allow considerable unification of the setup
> logic.
>
>
>

Yeah, quite possibly. I'm also wondering if we are wasting effort
downcasing what will in most cases be the same pattern over and over
again. Maybe we need to look at memoizing that somehow, or at least test
to see if that would be a gain.

We're getting quite a long way from the original patch :-)

cheers

andrew


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: Dennis Bjorklund <db(at)zigo(dot)dhs(dot)org>, ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, Bruce Momjian <bruce(at)momjian(dot)us>, pgsql-patches(at)postgresql(dot)org
Subject: Re: UTF8MatchText
Date: 2007-05-20 21:50:44
Message-ID: 17386.1179697844@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

Andrew Dunstan <andrew(at)dunslane(dot)net> writes:
> Yeah, quite possibly. I'm also wondering if we are wasting effort
> downcasing what will in most cases be the same pattern over and over
> again. Maybe we need to look at memoizing that somehow, or at least test
> to see if that would be a gain.

Someone (Itagaki-san IIRC) suggested that we ought to convert
"x ILIKE y" into "lower(x) LIKE lower(y)" at some fairly early
stage, definitely before constant-folding in the planner. That
would take care of that issue without any run-time mechanism,
and would open opportunities for making use of an index on lower(x).

I recall thinking at the time that there were some potential downsides,
but right at the moment I'm darned if I can see any --- especially
if we're going to make ILIKE do this uniformly at runtime anyway.

regards, tom lane


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Dennis Bjorklund <db(at)zigo(dot)dhs(dot)org>, ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, Bruce Momjian <bruce(at)momjian(dot)us>, pgsql-patches(at)postgresql(dot)org
Subject: Re: UTF8MatchText
Date: 2007-05-20 21:58:32
Message-ID: 4650C488.4060202@dunslane.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

Tom Lane wrote:
> Andrew Dunstan <andrew(at)dunslane(dot)net> writes:
>
>> Yeah, quite possibly. I'm also wondering if we are wasting effort
>> downcasing what will in most cases be the same pattern over and over
>> again. Maybe we need to look at memoizing that somehow, or at least test
>> to see if that would be a gain.
>>
>
> Someone (Itagaki-san IIRC) suggested that we ought to convert
> "x ILIKE y" into "lower(x) LIKE lower(y)" at some fairly early
> stage, definitely before constant-folding in the planner. That
> would take care of that issue without any run-time mechanism,
> and would open opportunities for making use of an index on lower(x).
>
> I recall thinking at the time that there were some potential downsides,
> but right at the moment I'm darned if I can see any --- especially
> if we're going to make ILIKE do this uniformly at runtime anyway.
>
>
>
Sounds like a TODO item. I'm already concerned a bit about scope creep
for this item.

cheers

andrew


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: Dennis Bjorklund <db(at)zigo(dot)dhs(dot)org>, ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, Bruce Momjian <bruce(at)momjian(dot)us>, pgsql-patches(at)postgresql(dot)org
Subject: Re: UTF8MatchText
Date: 2007-05-20 22:02:10
Message-ID: 17534.1179698530@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

Andrew Dunstan <andrew(at)dunslane(dot)net> writes:
> Tom Lane wrote:
>> Andrew Dunstan <andrew(at)dunslane(dot)net> writes:
>>> Yeah, quite possibly. I'm also wondering if we are wasting effort
>>> downcasing what will in most cases be the same pattern over and over
>>> again. Maybe we need to look at memoizing that somehow, or at least test
>>> to see if that would be a gain.
>>
>> Someone (Itagaki-san IIRC) suggested that we ought to convert
>> "x ILIKE y" into "lower(x) LIKE lower(y)" at some fairly early
>> stage, definitely before constant-folding in the planner.
>>
> Sounds like a TODO item. I'm already concerned a bit about scope creep
> for this item.

Agreed, I don't want to tackle this right now --- I'm just suggesting
it's probably a better answer than memoizing at runtime.

regards, tom lane


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Dennis Bjorklund <db(at)zigo(dot)dhs(dot)org>, ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, Bruce Momjian <bruce(at)momjian(dot)us>, pgsql-patches(at)postgresql(dot)org
Subject: Re: UTF8MatchText
Date: 2007-05-20 22:11:13
Message-ID: 4650C781.8050400@dunslane.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

Tom Lane wrote:
>
> On the strength of this analysis, shouldn't we drop the separate
> UTF8 match function and just use SB_MatchText for UTF8?
>

We still call NextChar() after "_", and I think we probably need to,
don't we? If so we can't just marry the cases.

cheers

andrew


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: Dennis Bjorklund <db(at)zigo(dot)dhs(dot)org>, ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, Bruce Momjian <bruce(at)momjian(dot)us>, pgsql-patches(at)postgresql(dot)org
Subject: Re: UTF8MatchText
Date: 2007-05-20 22:23:50
Message-ID: 17738.1179699830@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

Andrew Dunstan <andrew(at)dunslane(dot)net> writes:
> Tom Lane wrote:
>> On the strength of this analysis, shouldn't we drop the separate
>> UTF8 match function and just use SB_MatchText for UTF8?

> We still call NextChar() after "_", and I think we probably need to,
> don't we? If so we can't just marry the cases.

Doh, you're right ... but on third thought, what happens with a pattern
containing "%_"? If % tries to advance bytewise then we'll be trying to
apply NextChar in the middle of a data character, and bad things ensue.

I think we need to go back to the scheme with SB_ and MB_ variants and
no special case for UTF8.

regards, tom lane


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Dennis Bjorklund <db(at)zigo(dot)dhs(dot)org>, ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, Bruce Momjian <bruce(at)momjian(dot)us>, pgsql-patches(at)postgresql(dot)org
Subject: Re: UTF8MatchText
Date: 2007-05-20 22:43:37
Message-ID: 4650CF19.1040202@dunslane.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

Tom Lane wrote:
> Andrew Dunstan <andrew(at)dunslane(dot)net> writes:
>
>> Tom Lane wrote:
>>
>>> On the strength of this analysis, shouldn't we drop the separate
>>> UTF8 match function and just use SB_MatchText for UTF8?
>>>
>
>
>> We still call NextChar() after "_", and I think we probably need to,
>> don't we? If so we can't just marry the cases.
>>
>
> Doh, you're right ... but on third thought, what happens with a pattern
> containing "%_"? If % tries to advance bytewise then we'll be trying to
> apply NextChar in the middle of a data character, and bad things ensue.
>
> I think we need to go back to the scheme with SB_ and MB_ variants and
> no special case for UTF8.
>
>
>

My head is spinning with all these variants. I'll look at it tomorrow.

cheers

andrew


From: db(at)zigo(dot)dhs(dot)org
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Andrew Dunstan" <andrew(at)dunslane(dot)net>, "Dennis Bjorklund" <db(at)zigo(dot)dhs(dot)org>, "ITAGAKI Takahiro" <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, "Bruce Momjian" <bruce(at)momjian(dot)us>, pgsql-patches(at)postgresql(dot)org
Subject: Re: UTF8MatchText
Date: 2007-05-21 05:58:04
Message-ID: 57644.192.121.104.48.1179727084.squirrel@zigo.dhs.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

> Doh, you're right ... but on third thought, what happens with a pattern
> containing "%_"? If % tries to advance bytewise then we'll be trying to
> apply NextChar in the middle of a data character, and bad things ensue.

Right, when you have '_' after a '%' you need to make sure the '%'
advances by full characters. In my suggestion, the test for whether '_'
(or '\') comes after the '%' is done once, and it selects which of the
two loops to use: the one that does byte stepping or the one that uses
NextChar.

It's difficult to know for sure that we have thought about all the corner
cases. I hope the gain is worth the effort... :-)

/Dennis


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: db(at)zigo(dot)dhs(dot)org
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, Bruce Momjian <bruce(at)momjian(dot)us>, pgsql-patches(at)postgresql(dot)org
Subject: Re: UTF8MatchText
Date: 2007-05-21 13:34:23
Message-ID: 46519FDF.5070302@dunslane.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

db(at)zigo(dot)dhs(dot)org wrote:
>> Doh, you're right ... but on third thought, what happens with a pattern
>> containing "%_"? If % tries to advance bytewise then we'll be trying to
>> apply NextChar in the middle of a data character, and bad things ensue.
>>
>
> Right, when you have '_' after a '%' you need to make sure the '%'
> advances by full characters. In my suggestion, the test for whether '_'
> (or '\') comes after the '%' is done once, and it selects which of the
> two loops to use: the one that does byte stepping or the one that uses
> NextChar.
>
> It's difficult to know for sure that we have thought about all the corner
> cases. I hope the gain is worth the effort... :-)
>
>
>

Yes, I came to the same conclusion about how to restructure the code.

The current code contains this:

while (tlen > 0)
{
    /*
     * Optimization to prevent most recursion: don't recurse
     * unless first pattern char might match this text char.
     */
    if (CHAREQ(t, p) || (*p == '\\') || (*p == '_'))
    {
        int     matched = MatchText(t, tlen, p, plen);

        if (matched != LIKE_FALSE)
            return matched;     /* TRUE or ABORT */
    }

    NextChar(t, tlen);
}

The code appears to date from v 1.23 of like.c way back in 2001. I'm not
sure I agree with the comment, though. In the first place, the invariant
tests should not be in the loop, I think, and I'll hoist them out as
Dennis suggests. But why are we doing that CHAREQ? If it succeeds we'll
just do it again when we recurse, I think.

cheers

andrew


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: db(at)zigo(dot)dhs(dot)org, ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, Bruce Momjian <bruce(at)momjian(dot)us>, pgsql-patches(at)postgresql(dot)org
Subject: Re: UTF8MatchText
Date: 2007-05-21 13:43:47
Message-ID: 10140.1179755027@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

Andrew Dunstan <andrew(at)dunslane(dot)net> writes:
> But why are we doing that CHAREQ?

To avoid the cost of the recursive call, just like it says.

> If it succeeds we'll
> just do it again when we recurse, I think.

If you move the other two cases then you could advance t and p before
entering the recursion.
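
Sketched out, the %-scan for a literal tail could then look something like
this (single-byte case only; it assumes runs of '%' have already been
collapsed and that the '_' and escape cases were split into their own
branch, as discussed above):

/* '%' is followed by an ordinary literal character */
char    firstpat = *p;

while (tlen > 0)
{
    if (*t == firstpat)
    {
        /*
         * The first byte is already known to match, so advance both
         * pointers before recursing; the recursive call then starts at
         * the second pattern byte instead of re-testing this one.
         */
        int     matched = MatchText(t + 1, tlen - 1, p + 1, plen - 1);

        if (matched != LIKE_FALSE)
            return matched;     /* TRUE or ABORT */
    }
    t++;                        /* byte-wise scan for the next possible start */
    tlen--;
}
return LIKE_ABORT;              /* text exhausted: no later start can succeed */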

regards, tom lane


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: db(at)zigo(dot)dhs(dot)org, ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, Bruce Momjian <bruce(at)momjian(dot)us>, pgsql-patches(at)postgresql(dot)org
Subject: Re: UTF8MatchText
Date: 2007-05-21 16:59:33
Message-ID: 4651CFF5.6010709@dunslane.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

Tom Lane wrote:
> Andrew Dunstan <andrew(at)dunslane(dot)net> writes:
>
>> But why are we doing that CHAREQ?
>>
>
> To avoid the cost of the recursive call, just like it says.
>
>
>> If it succeeds we'll
>> just do it again when we recurse, I think.
>>
>
> If you move the other two cases then you could advance t and p before
> entering the recursion.
>
>
>

Yeah. Since I have removed the "_" case, I believe it's now safe to use
BYTEEQ/NextByte there, and since they are sufficiently cheap it's not
worth worrying about.

Attached is a patch version that I think draws together all the threads
of discussion so far. It's in fact quite a lot simpler than the existing
code, with no special UTF8 case - this should improve LIKE/ILIKE
processing for all charsets.

More eyeballs please for nasty corner cases.

cheers

andrew

Attachment Content-Type Size
like.patch text/x-patch 25.5 KB