make_greater_string() does not return a string in some cases

Lists: pgsql-bugs
From: Tatsuhito Kasahara <kasahara(dot)tatsuhito(at)oss(dot)ntt(dot)co(dot)jp>
To: pgsql-bugs(at)postgresql(dot)org
Subject: make_greater_string() does not return a string in some cases
Date: 2010-06-21 10:02:03
Message-ID: 4C1F389B.1010606@oss.ntt.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs

Hi !

make_greater_string() does not return a string when some UTF8 strings
set to str_const.
# Especially UTF8 strings which contains 'BF' in last byte.

Because make_greater_string() only try incrementing the last byte of
the string, and not try same test for upper bytes.

Therefore, some queries which contains "LIKE '<contains 'BF' in last byte>%'"
can not perform (Btree's) index-scan.
# Or may be nearly full-index-scan.

# See follwing example.
===============================================================================
'西' (Japanese Letter) : 0xE8A5BF

[client : UTF8 ⇔ server : EUC_JP]
=# EXPLAIN ANALYZE SELECT * FROM test2 WHERE name LIKE '西%';
QUERY PLAN
------------------------------------------------------------------------------------------------------------------
Index Scan using test2_name on test2 (cost=0.00..8.28 rows=1 width=3) (actual time=0.077..0.078 rows=1 loops=1)
Index Cond: ((name >= '西'::text) AND (name < '誠'::text)) <-- Index-scan is chosen
Filter: (name ~~ '西%'::text)
Total runtime: 0.110 ms
(4 rows)

[client : UTF8 ⇔ server : UTF8]
=# EXPLAIN ANALYZE SELECT * FROM test2 WHERE name LIKE '西%';
QUERY PLAN
----------------------------------------------------------------------------------------------------
Seq Scan on test2 (cost=0.00..1693.01 rows=1 width=4) (actual time=22.598..22.599 rows=1 loops=1)
Filter: (name ~~ '西%'::text) <-- Seq-scan is chosen !
Total runtime: 22.626 ms
(3 rows)
===============================================================================

Attached patch solve above problem.

Best regards,

--
NTT OSS Center
Tatsuhito Kasahara

Attachment Content-Type Size
make_greater_string.patch text/plain 1.2 KB

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Tatsuhito Kasahara <kasahara(dot)tatsuhito(at)oss(dot)ntt(dot)co(dot)jp>
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: make_greater_string() does not return a string in some cases
Date: 2010-06-21 16:10:05
Message-ID: 13689.1277136605@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs

Tatsuhito Kasahara <kasahara(dot)tatsuhito(at)oss(dot)ntt(dot)co(dot)jp> writes:
> make_greater_string() does not return a string when some UTF8 strings
> set to str_const.
> # Especially UTF8 strings which contains 'BF' in last byte.

The patch you propose for this is really untenable: it will re-introduce
many corner cases that we got rid of years ago, for example cases
wherein pg_verifymbstr and pg_mbcliplen index off the end of the string
because they think the last character occupies more bytes than are
there. It's intentional that the existing code doesn't mess with the
first byte of a multibyte character (which is the one that determines
the character length, in all encodings of interest).

Another problem is that if the last character is several bytes long,
this coding would cause us to iterate through potentially many millions
of character values before giving up and truncating off the last
character. In a large number of cases that's just wasted time because
there is no chance of getting a larger string without incrementing some
character further to the left. So there's a tradeoff that limits how
many values we should consider for each character position --- choosing
to consider at most 255 is a bit arbitrary, but "all of them" isn't
going to work.

I don't think that the set of cases that could be improved this way is
large enough to justify trying to find solutions to these problems.

regards, tom lane


From: Tatsuhito Kasahara <kasahara(dot)tatsuhito(at)oss(dot)ntt(dot)co(dot)jp>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: make_greater_string() does not return a string in some cases
Date: 2010-06-23 09:42:46
Message-ID: 4C21D716.70808@oss.ntt.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs

Tom Lane wrote:
> The patch you propose for this is really untenable: it will re-introduce
> many corner cases that we got rid of years ago, for example cases
> wherein pg_verifymbstr and pg_mbcliplen index off the end of the string
> because they think the last character occupies more bytes than are
> there.

> Another problem is that if the last character is several bytes long,
> this coding would cause us to iterate through potentially many millions
> of character values before giving up and truncating off the last
> character.
Hmm... OK, I see your points.

I have another idea.

1. We prepare new operators ( <,<=,>,=>,= ) for text and bytea.
2. In make_greater_string(), if
multi-byte-string was set and
using locale-C and
could not find greater string,
returns bytea which has greater byte-code of last-character.

User will get the following result.

=======================================================================================================
-- 西 : 0xe8a5bf
=# EXPLAIN ANALYZE SELECT * FROM test WHERE name LIKE '西%';
QUERY PLAN
----------------------------------------------------------------------------------------------------------------
Index Scan using test_name on test (cost=0.00..8.28 rows=1 width=4) (actual time=0.022..0.024 rows=1 loops=1)
Index Cond: ((name >= '西'::text) AND (name < '\\xe8a5c0'::bytea))
Filter: (name ~~ '西%'::text)
Total runtime: 0.053 ms
(4 rows)
=======================================================================================================

Is the idea reasonable ?

Best regards,

--
NTT OSS Center
Tatsuhito Kasahara


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Tatsuhito Kasahara <kasahara(dot)tatsuhito(at)oss(dot)ntt(dot)co(dot)jp>
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: make_greater_string() does not return a string in some cases
Date: 2010-06-23 22:23:39
Message-ID: 29535.1277331819@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs

Tatsuhito Kasahara <kasahara(dot)tatsuhito(at)oss(dot)ntt(dot)co(dot)jp> writes:
> I have another idea.

> 1. We prepare new operators ( <,<=,>,=>,= ) for text and bytea.
> 2. In make_greater_string(), if
> multi-byte-string was set and
> using locale-C and
> could not find greater string,
> returns bytea which has greater byte-code of last-character.

> Is the idea reasonable ?

Maybe, but it only works for text_pattern_ops indexes not normal ones.
Not sure if people will be happy with maintaining a special index just
to cover this corner case.

I'm not convinced that there's enough of a problem here to be worth
sweating over. If we're not able to generate a "greater" string with
the current rules, the odds are that the pattern is so close to the end
of the index range that a one-sided test is not going to make much
difference compared to a two-sided one.

regards, tom lane