Re: make_greater_string() does not return a string in some cases

Lists: pgsql-bugs pgsql-hackers
From: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)oss(dot)ntt(dot)co(dot)jp>
To: pgsql-bugs(at)postgresql(dot)org
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: make_greater_string() does not return a string in some cases
Date: 2011-07-08 09:21:16
Message-ID: 20110708.182116.44187733.horiguchi.kyotaro@oss.ntt.co.jp
Lists: pgsql-bugs pgsql-hackers

Hello, may I continue with this topic?

This glitch is hard to ignore for those of us using CJK (Chinese,
Japanese, and Korean) characters in our databases.

For Japanese under standard usage, about a hundred characters out of
seven thousand cause make_greater_string() to fail.

This is not frequent, but it is not rare enough to ignore.

I think this glitch occurs because deriving the `next character' is
fundamentally a matter internal to each encoding, yet
make_greater_string() currently handles all encodings together with a
method extended from the one-byte ASCII approach.

So I think it is reasonable for the encoding info table (struct
pg_wchar_tbl) to hold the function that does this.

How about this idea?

The points needed to realize this are as follows:

- pg_wchar_tbl in pg_wchar.c gains a new element `charinc' that holds
  a function to increment a character of that encoding.

- By default, the value of charinc is a `generic' increment function
  that does what make_greater_string() does in the current
  implementation.

- make_greater_string() now uses the charinc for the database encoding
  to increment characters, instead of code written directly into it.

- Give UTF-8 a special increment function.
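
To make the idea concrete, here is a minimal sketch of what a
UTF-8-specific incrementer could look like. The name and signature
below are only for illustration and are not necessarily what the
attached patch uses; the sketch simply bumps the code point and
refuses when the byte length would change.

#include <stdbool.h>

static bool
utf8_charinc_sketch(unsigned char *charptr, int len)
{
    unsigned int cp;

    /* decode the (assumed valid) UTF-8 sequence of 'len' bytes */
    switch (len)
    {
        case 1: cp = charptr[0]; break;
        case 2: cp = ((charptr[0] & 0x1F) << 6) | (charptr[1] & 0x3F);
                break;
        case 3: cp = ((charptr[0] & 0x0F) << 12) |
                     ((charptr[1] & 0x3F) << 6) | (charptr[2] & 0x3F);
                break;
        case 4: cp = ((charptr[0] & 0x07) << 18) |
                     ((charptr[1] & 0x3F) << 12) |
                     ((charptr[2] & 0x3F) << 6) | (charptr[3] & 0x3F);
                break;
        default: return false;
    }

    cp++;                                /* the next code point */
    if (cp >= 0xD800 && cp <= 0xDFFF)
        cp = 0xE000;                     /* skip the surrogate range */

    /* give up if the increment would change the encoded length */
    if ((len == 1 && cp > 0x7F) || (len == 2 && cp > 0x7FF) ||
        (len == 3 && cp > 0xFFFF) || (len == 4 && cp > 0x10FFFF))
        return false;

    /* re-encode in place */
    switch (len)
    {
        case 1: charptr[0] = cp; break;
        case 2: charptr[0] = 0xC0 | (cp >> 6);
                charptr[1] = 0x80 | (cp & 0x3F); break;
        case 3: charptr[0] = 0xE0 | (cp >> 12);
                charptr[1] = 0x80 | ((cp >> 6) & 0x3F);
                charptr[2] = 0x80 | (cp & 0x3F); break;
        case 4: charptr[0] = 0xF0 | (cp >> 18);
                charptr[1] = 0x80 | ((cp >> 12) & 0x3F);
                charptr[2] = 0x80 | ((cp >> 6) & 0x3F);
                charptr[3] = 0x80 | (cp & 0x3F); break;
    }
    return true;
}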

As a consequence of this modification, make_greater_string() becomes
somewhat simpler, since the code that handled bare bytes in the string
disappears. And incrementing a character with knowledge of the
encoding can be straightforward, lightweight, and backtrack-free, with
fewer glitches than the generic method.

# But the process for BYTEAOID remains there, disappointingly.

Some glitches still remain, but I think it would be overdoing it to
perform a conversion that changes the byte length of the character.
Only 5 code points out of 17 thousand (with the current method,
roughly all BMP characters) remain unhandled, and none of them are
Japanese characters :-)

The attached patch is a sample implementation of this idea.

What do you think about this patch?

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachment Content-Type Size
unknown_filename text/plain 16.2 KB

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)oss(dot)ntt(dot)co(dot)jp>
Cc: pgsql-bugs(at)postgresql(dot)org, pgsql-hackers(at)postgresql(dot)org
Subject: Re: make_greater_string() does not return a string in some cases
Date: 2011-07-09 03:28:32
Message-ID: CA+TgmoYAoGSnk1zMEeq_RWB+D5OjfPX6zmXFetoSZfW70S1p6Q@mail.gmail.com
Lists: pgsql-bugs pgsql-hackers

On Fri, Jul 8, 2011 at 5:21 AM, Kyotaro HORIGUCHI
<horiguchi(dot)kyotaro(at)oss(dot)ntt(dot)co(dot)jp> wrote:
> The points needed to realize this are as follows:

Please add your patch to the next CommitFest.

https://commitfest.postgresql.org/action/commitfest_view/open

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)oss(dot)ntt(dot)co(dot)jp>
To: robertmhaas(at)gmail(dot)com
Cc: pgsql-bugs(at)postgresql(dot)org, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [HACKERS] make_greater_string() does not return a string in some cases
Date: 2011-07-11 06:55:31
Message-ID: 20110711.155531.15100243.horiguchi.kyotaro@oss.ntt.co.jp
Lists: pgsql-bugs pgsql-hackers

Thanks for your suggestion, I'll do so.

At Fri, 8 Jul 2011 23:28:32 -0400, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> Please add your patch to the next CommitFest.
>
> https://commitfest.postgresql.org/action/commitfest_view/open
--
Kyotaro Horiguchi
NTT Open Source Software Center


From: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)oss(dot)ntt(dot)co(dot)jp>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: make_greater_string() does not return a string in some cases
Date: 2011-07-12 07:12:31
Message-ID: 20110712.161231.138791411.horiguchi.kyotaro@oss.ntt.co.jp
Lists: pgsql-bugs pgsql-hackers

This is an update of the patch for the next CommitFest (2011/09).

Please ignore this message.

1 Additional Feature - EUC-JP incrementer
2 Bug fixes - bytea incrementer, libpq compilation.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachment Content-Type Size
unknown_filename text/plain 19.3 KB

From: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)oss(dot)ntt(dot)co(dot)jp>
To: pgsql-hackers(at)postgresql(dot)org
Subject: [v9.2] make_greater_string() does not return a string in some cases
Date: 2011-09-14 02:13:20
Message-ID: 20110914.111320.18404009.horiguchi.kyotaro@oss.ntt.co.jp
Lists: pgsql-bugs pgsql-hackers

This is a rebased patch of `Allow encoding specific character
incrementer' (https://commitfest.postgresql.org/action/patch_view?id=602).

In addition to the patch, an increment sanity check program for the
new functions pg_generic_charinc and pg_utf8_increment is attached.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachment Content-Type Size
charinc.patch text/x-patch 19.0 KB

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)oss(dot)ntt(dot)co(dot)jp>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: [v9.2] make_greater_string() does not return a string in some cases
Date: 2011-09-22 00:35:02
Message-ID: CA+TgmoYO3UbyCekzDTdVWTG2goEQtr7H_f=9mous9u_uH_StSA@mail.gmail.com
Lists: pgsql-bugs pgsql-hackers

On Tue, Sep 13, 2011 at 10:13 PM, Kyotaro HORIGUCHI
<horiguchi(dot)kyotaro(at)oss(dot)ntt(dot)co(dot)jp> wrote:
> This is a rebased patch of `Allow encoding specific character
> incrementer' (https://commitfest.postgresql.org/action/patch_view?id=602).
>
> In addition to the patch, an increment sanity check program for the
> new functions pg_generic_charinc and pg_utf8_increment is attached.

I took a look at this patch ... and the thread ... and the previous
thread with the same subject line:

http://archives.postgresql.org/pgsql-bugs/2010-06/msg00303.php

As I understand it, the issue here is that when we try to generate
suitable > and < quals for a LIKE expression, we need to find a string
which is greater than the prefix we're searching for, and the existing
algorithm sometimes fails. But there seems to be some dispute over
how likely this is to occur. Tom argues that the case is so rare that
we shouldn't worry about it:

http://archives.postgresql.org/pgsql-bugs/2010-06/msg00336.php

...while Kyotaro Horiguchi clearly feels otherwise, citing the
statistic that about 100 out of 7000 Japanese characters fail to work
properly:

http://archives.postgresql.org/pgsql-bugs/2011-07/msg00064.php

That statistic seems to justify some action, but what? Ideas:

1. Adopt the patch as proposed, or something like it.
2. Instead of installing encoding-specific character incrementing
functions, we could try to come up with a more reliable generic
algorithm. Not sure exactly what, though.
3. Come up with some way to avoid needing to do this in the first place.

One random idea I have is - instead of generating > and < clauses,
could we define a "prefix match" operator - i.e. a ### b iff substr(a,
1, length(b)) = b? We'd need to do something about the selectivity,
but I don't see why that would be a problem.
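
Just to pin down the semantics I mean, a byte-wise reading of that
definition would be something like the sketch below; the function name
is only for illustration, and it ignores the collation questions
discussed elsewhere in the thread.

#include <stdbool.h>
#include <string.h>

/* a ### b iff b is an initial substring of a (byte-wise) */
static bool
prefix_match(const char *a, const char *b)
{
    return strncmp(a, b, strlen(b)) == 0;
}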

Thoughts?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)oss(dot)ntt(dot)co(dot)jp>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [v9.2] make_greater_string() does not return a string in some cases
Date: 2011-09-22 01:49:27
Message-ID: 6425.1316656167@sss.pgh.pa.us
Lists: pgsql-bugs pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> As I understand it, the issue here is that when we try to generate
> suitable > and < quals for a LIKE expression, we need to find a string
> which is greater than the prefix we're searching for, and the existing
> algorithm sometimes fails. But there seems to be some dispute over
> how likely this is to occur. Tom argues that the case is so rare that
> we shouldn't worry about it:
> http://archives.postgresql.org/pgsql-bugs/2010-06/msg00336.php

Well, I didn't like the amount of overhead that was implied by that
proposal. I don't have a great deal of difficulty with the idea of
encoding-specific character incrementers, especially since they could
presumably be more efficient than the current brute force approach
wherein we call pg_verifymbstr on the entire string after mangling
just the last byte.

The main risk that I foresee with the proposed approach is that if you
have, say, a 4-byte final character, you could iterate through a *whole
lot* (millions) of larger encoded characters, with absolutely no hope of
making a greater string that way when the determining difference occurs
at some earlier character. And then when you do run out, you could
waste just as much time at the immediately preceding character, etc etc.
The existing algorithm is a compromise between thoroughness of
investigation of different values of the last character versus speed of
falling back to change the preceding character instead. I'd be the
first to say that incrementing only the last byte is a very
quick-and-dirty heuristic for making that happen. But I think it would
be unwise to allow the thing to consider more than a few hundred values
for a character position before giving up. Possibly the
encoding-specific incrementer could be designed to run through all legal
values of the last byte, then start skipping larger and larger ranges
--- maybe just move directly to incrementing the first byte.

Aside from that issue, the submitted patch seems to need quite a lot of
detail work; for instance, I noted an unconstrained memcpy into a 4-byte
local buffer, as well as lots and lots of violations of PG house style.
That's certainly all fixable but somebody will have to go through it.

> One random idea I have is - instead of generating > and < clauses,
> could we define a "prefix match" operator - i.e. a ### b iff substr(a,
> 1, length(b)) = b? We'd need to do something about the selectivity,
> but I don't see why that would be a problem.

The problem is that you'd need to make that a btree-indexable operator.

regards, tom lane


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)oss(dot)ntt(dot)co(dot)jp>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [v9.2] make_greater_string() does not return a string in some cases
Date: 2011-09-22 03:04:02
Message-ID: CA+TgmoY9+ZUHNju7ERi7QM36FjiXzfT38S63564L_14qf3hgYg@mail.gmail.com
Lists: pgsql-bugs pgsql-hackers

On Wed, Sep 21, 2011 at 9:49 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> The main risk that I foresee with the proposed approach is that if you
> have, say, a 4-byte final character, you could iterate through a *whole
> lot* (millions) of larger encoded characters, with absolutely no hope of
> making a greater string that way when the determining difference occurs
> at some earlier character.  And then when you do run out, you could
> waste just as much time at the immediately preceding character, etc etc.
> The existing algorithm is a compromise between thoroughness of
> investigation of different values of the last character versus speed of
> falling back to change the preceding character instead.  I'd be the
> first to say that incrementing only the last byte is a very
> quick-and-dirty heuristic for making that happen.  But I think it would
> be unwise to allow the thing to consider more than a few hundred values
> for a character position before giving up.  Possibly the
> encoding-specific incrementer could be designed to run through all legal
> values of the last byte, then start skipping larger and larger ranges
> --- maybe just move directly to incrementing the first byte.

I'm a bit perplexed as to why we can't find a non-stochastic way of doing this.

> Aside from that issue, the submitted patch seems to need quite a lot of
> detail work; for instance, I noted an unconstrained memcpy into a 4-byte
> local buffer, as well as lots and lots of violations of PG house style.
> That's certainly all fixable but somebody will have to go through it.

I noticed that, too, but figured we should agree on the basic approach
first. In the first instance, "someone" will hopefully be the patch
author. (Help from anyone else is also welcome, of course... we have
almost 40 patches left to crawl through and, at the moment, very few
reviewers.)

>> One random idea I have is - instead of generating > and < clauses,
>> could we define a "prefix match" operator - i.e. a ### b iff substr(a,
>> 1, length(b)) = b?  We'd need to do something about the selectivity,
>> but I don't see why that would be a problem.
>
> The problem is that you'd need to make that a btree-indexable operator.

Well, right. Without that, there's not much point. But do you think
that's prohibitively difficult?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)oss(dot)ntt(dot)co(dot)jp>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [v9.2] make_greater_string() does not return a string in some cases
Date: 2011-09-22 04:24:04
Message-ID: 8755.1316665444@sss.pgh.pa.us
Lists: pgsql-bugs pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> I'm a bit perplexed as to why we can't find a non-stochastic way of doing this.

Because the behavior of common collation algorithms is so wacky as to
approach stochasticity. In particular, as soon as your string contains
a mix of letter and non-letter characters, "dictionary" algorithms tend
to behave really strangely. We went around on this quite a few times
before settling on the current approach; we kept thinking exactly what
you're thinking, namely that we ought to be able to predict more than
nothing about what the collation algorithm would consider larger.

An example of the weirdness is that in en_US collation, we have

postgres=# select '/root/' < '/root0';
?column?
----------
t
(1 row)

postgres=# select '/root/t' < '/root0';
?column?
----------
f
(1 row)

It was cases like this that forced us to give up on using non-C-locale
indexes for LIKE optimization, because the derived greater-than and
less-than conditions have to be 100% correct for that. What we're still
using make_greater_string for in non-C locales is to generate estimates
about the selectivity of LIKE prefix conditions; for that, some amount
of error is tolerable and so we just live with effects like the above.
(We have to deal with the locale's sort order, whatever the heck it is,
because that's what the pg_statistic histogram is computed in.)

Now, having said that, I'm starting to wonder again why it's worth our
trouble to fool with encoding-specific incrementers. The exactness of
the estimates seems unlikely to be improved very much by doing this.

>>> One random idea I have is - instead of generating > and < clauses,
>>> could we define a "prefix match" operator - i.e. a ### b iff substr(a,
>>> 1, length(b)) = b? We'd need to do something about the selectivity,
>>> but I don't see why that would be a problem.

>> The problem is that you'd need to make that a btree-indexable operator.

> Well, right. Without that, there's not much point. But do you think
> that's prohibitively difficult?

The problem is that you'd just be shifting all these same issues into
the btree index machinery, which is not any better equipped to cope with
them, and would not be a good place to be adding overhead.

regards, tom lane


From: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)oss(dot)ntt(dot)co(dot)jp>
To:
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: [v9.2] make_greater_string() does not return a string in some cases
Date: 2011-09-22 11:30:29
Message-ID: 20110922.203029.185577421.horiguchi.kyotaro@oss.ntt.co.jp
Lists: pgsql-bugs pgsql-hackers

Thank you for your understanding of that point.

At Wed, 21 Sep 2011 20:35:02 -0400, Robert Haas <robertmhaas(at)gmail(dot)com> wrote
> ...while Kyotaro Horiguchi clearly feels otherwise, citing the
> statistic that about 100 out of 7000 Japanese characters fail to work
> properly:
>
> http://archives.postgresql.org/pgsql-bugs/2011-07/msg00064.php
>
> That statistic seems to justify some action, but what? Ideas:

In addition to those figures - which are based on all the characters
defined in JIS X 0208, the set traditionally used (though it is
becoming history now) for information exchange in Japan - if we narrow
the scope to the commonly used characters (`Jouyou-Kanji' in Japanese,
which high school graduates in Japan are expected to know), 35 out of
2100 fail.

# On the other hand, widening the scope to JIS X 0213, which is
# roughly compatible with Unicode and defines more than 12K
# characters, I have not counted, but the additional 5K characters can
# be assumed to be less likely to fail than the characters in JIS X
# 0208.

> 1. Adopt the patch as proposed, or something like it.
> 2. Instead of installing encoding-specific character incrementing
> functions, we could try to come up with a more reliable generic
> algorithm. Not sure exactly what, though.
> 3. Come up with some way to avoid needing to do this in the first place.
>
> One random idea I have is - instead of generating > and < clauses,
> could we define a "prefix match" operator - i.e. a ### b iff substr(a,
> 1, length(b)) = b? We'd need to do something about the selectivity,
> but I don't see why that would be a problem.
>
> Thoughts?

I am a newbie to PostgreSQL, but from a general point of view, I think
the most radical and clean way to fix this behavior is to give indexes
a built-in prefix-matching capability for strings, setting aside
possible overheads I am not aware of. This could cover all the
failures this patch leaves unfixed, assuming that the `greater string'
does not need to be a `valid string' merely for searching the btree.

Another idea I can think of is to add a new operator that means
"test whether the string value is smaller than the `greater string' of
the parameter". This operator could also defer making the `greater
string' until just before searching the btree, summing up histogram
entries, or comparing with column values. If the assumption above
holds, the "making greater string" operation could be done regardless
of the character encoding. This seems to have a smaller impact than a
"prefix match" operator.

# But, hmm, the more I investigate, the smaller the difference seems
# to be... Anyway, it is beyond my knowledge for now; I need more
# study.

On the other hand, if no further encoding-specific `character
increment functions' are expected, the modification of pg_wchar_table
can be cancelled and make_greater_string can select the `character
increment function' with a 'switch (GetDatabaseEncoding()) { case
PG_UTF8: ... }'. This also gets rid of the pg_generic_charinc tweak
for libpq.
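
A minimal sketch of that alternative, just to show the shape; the
helper name and the signatures below are my guesses, while
pg_utf8_increment and pg_generic_charinc are the function names from
the patch.

#include "postgres.h"
#include "mb/pg_wchar.h"

typedef bool (*character_incrementer) (unsigned char *charptr, int len);

extern bool pg_utf8_increment(unsigned char *charptr, int len);
extern bool pg_generic_charinc(unsigned char *charptr, int len);

/* Select the incrementer from the database encoding, leaving
 * pg_wchar_table untouched. */
static character_incrementer
database_charinc(void)
{
    switch (GetDatabaseEncoding())
    {
        case PG_UTF8:
            return pg_utf8_increment;
        default:
            return pg_generic_charinc;
    }
}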

At Wed, 21 Sep 2011 21:49:27 -0400, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote
> detail work; for instance, I noted an unconstrained memcpy into a 4-byte
> local buffer, as well as lots and lots of violations of PG house style.
> That's certainly all fixable but somebody will have to go through it.

Sorry for the incorrect style of the patch. I will fix it up.

Regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)oss(dot)ntt(dot)co(dot)jp>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [v9.2] make_greater_string() does not return a string in some cases
Date: 2011-09-22 12:49:57
Message-ID: CA+Tgmoax-SHNgHe77cJZGsqgsB+Z=n_jzQZ5h0RG1+NcWGHkBg@mail.gmail.com
Lists: pgsql-bugs pgsql-hackers

On Thu, Sep 22, 2011 at 12:24 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> I'm a bit perplexed as to why we can't find a non-stochastic way of doing this.
>
> [ collations suck ]

Ugh.

> Now, having said that, I'm starting to wonder again why it's worth our
> trouble to fool with encoding-specific incrementers.  The exactness of
> the estimates seems unlikely to be improved very much by doing this.

Well, so the problem is that the frequency with which the algorithm
fails altogether seems to be disturbingly high for certain kinds of
characters. I agree it might not be that important to get the
absolutely best next string, but it does seem important not to fail
outright. Kyotaro Horiguchi gives the example of UTF-8 characters
ending with 0xbf.

>>>> One random idea I have is - instead of generating > and < clauses,
>>>> could we define a "prefix match" operator - i.e. a ### b iff substr(a,
>>>> 1, length(b)) = b?  We'd need to do something about the selectivity,
>>>> but I don't see why that would be a problem.
>
>>> The problem is that you'd need to make that a btree-indexable operator.
>
>> Well, right.  Without that, there's not much point.  But do you think
>> that's prohibitively difficult?
>
> The problem is that you'd just be shifting all these same issues into
> the btree index machinery, which is not any better equipped to cope with
> them, and would not be a good place to be adding overhead.

My thought was that it would avoid the need to do any character
incrementing at all. You could just start scanning forward as if the
operator were >= and then stop when you hit the first string that
doesn't have the same initial substring.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Greg Stark <stark(at)mit(dot)edu>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)oss(dot)ntt(dot)co(dot)jp>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [v9.2] make_greater_string() does not return a string in some cases
Date: 2011-09-22 12:59:16
Message-ID: CAM-w4HMUJox19HkuJJK-vWNJLe0RTQK+P0O6NtfB_mF78zE5Rw@mail.gmail.com
Lists: pgsql-bugs pgsql-hackers

On Thu, Sep 22, 2011 at 1:49 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> My thought was that it would avoid the need to do any character
> incrementing at all.  You could just start scanning forward as if the
> operator were >= and then stop when you hit the first string that
> doesn't have the same initial substring.

But the whole problem is that not all the strings with the initial
substring are in a contiguous block. The best we can hope for is that
they're fairly dense within a block without too many non-matching
strings. The example with / shows how that can happen.

If you're looking for foo/% and you start with foo/ you'll find:

foo/
foo0
foo/0
foo1
foo/1
...

Even just case-insensitive collations don't put all the strings with a
common prefix in a contiguous block. If you're searching for foo%
you'll find:

foo
Foobar
foobar

--
greg


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Greg Stark <stark(at)mit(dot)edu>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)oss(dot)ntt(dot)co(dot)jp>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [v9.2] make_greater_string() does not return a string in some cases
Date: 2011-09-22 13:27:21
Message-ID: CA+TgmoZsvfs1N0-9-GsiOQxBNyPG5XxAMCVD=wOqd30G6dH9XA@mail.gmail.com
Lists: pgsql-bugs pgsql-hackers

On Thu, Sep 22, 2011 at 8:59 AM, Greg Stark <stark(at)mit(dot)edu> wrote:
> On Thu, Sep 22, 2011 at 1:49 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>> My thought was that it would avoid the need to do any character
>> incrementing at all.  You could just start scanning forward as if the
>> operator were >= and then stop when you hit the first string that
>> doesn't have the same initial substring.
>
> But the whole problem is that not all the strings with the initial
> substring are in a contiguous block. The best we can hope for is that
> they're fairly dense within a block without too many non-matching
> strings. The example with / shows how that can happen.
>
> If you're looking for foo/% and you start with foo/ you'll find:
>
> foo/
> foo0
> foo/0
> foo1
> foo/1
> ...
>
> Even just case-insensitive collations don't put all the strings with a
> common prefix in a contiguous block. If you're searching for foo%
> you'll find:
>
> foo
> Foobar
> foobar

If that were true for the sorts of indexes we're using for LIKE
queries, the existing approach wouldn't work either. All we're doing
is translating:

a LIKE 'foo/%'

to

a ~>=~ 'foo/' AND a ~<~ 'foo0'

...where ~>=~ and ~<~ are just text-pattern-ops versions of >= and <
that ignore the normal collation rules and just compare bytes.

In general, if we wanted to get rid of text_pattern_ops and make all
of this work with arbitrary indexes, yeah, that would be very
difficult.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Greg Stark <stark(at)mit(dot)edu>, Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)oss(dot)ntt(dot)co(dot)jp>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [v9.2] make_greater_string() does not return a string in some cases
Date: 2011-09-22 13:51:12
Message-ID: 17571.1316699472@sss.pgh.pa.us
Lists: pgsql-bugs pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> On Thu, Sep 22, 2011 at 8:59 AM, Greg Stark <stark(at)mit(dot)edu> wrote:
>> But the whole problem is that not all the strings with the initial
>> substring are in a contiguous block.

> If that were true for the sorts of indexes we're using for LIKE
> queries, the existing approach wouldn't work either.

Right. Since it's not a problem for the sorts of indexes with which we
can use LIKE, moving knowledge of LIKE into the btree machinery doesn't
buy us a darn thing, except more complexity in a place where we can ill
afford it. The essential problem here is "when can you stop scanning,
given a pattern with this prefix?", and btree doesn't know any more
about that than make_greater_string does; it would in fact have to use
make_greater_string or something isomorphic to it.

regards, tom lane


From: Greg Stark <stark(at)mit(dot)edu>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)oss(dot)ntt(dot)co(dot)jp>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [v9.2] make_greater_string() does not return a string in some cases
Date: 2011-09-22 14:28:53
Message-ID: CAM-w4HNjwku97Es_ge6=_NDfcBEhDOgCGrHaZeOQtm7pcTd5tg@mail.gmail.com
Lists: pgsql-bugs pgsql-hackers

On Thu, Sep 22, 2011 at 2:51 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> The essential problem here is "when can you stop scanning,
> given a pattern with this prefix?", and btree doesn't know any more
> about that than make_greater_string does; it would in fact have to use
> make_greater_string or something isomorphic to it.

Hm, as long as btree_pattern_ops is the only opclass that behaves this
way that's more or less true. But Robert's right that if btree just
stops when it finds something that doesn't match it doesn't need to
hard code any knowledge of what the "next" value would be. If there
were any other op classes that had this abstract property of always
putting strings with common prefixes in a contiguous block then it
would continue to work without having to know where to find the
boundaries of that contiguous block.

Just as an example, if you had a pattern_ops opclass that sorted the
string assuming it was in some other encoding like, say, EBCDIC, then
make_greater_string would have to learn about it but Robert's model
would just work.

This isn't entirely facetious. Sorting by EBCDIC ordering would be
silly, but I vaguely recall there being some examples that wouldn't be
silly. And perhaps some collations could actually be marked as
acceptable even if they don't sort in pure ASCII ordering and
make_greater_string doesn't actually know about them.


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)oss(dot)ntt(dot)co(dot)jp>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [v9.2] make_greater_string() does not return a string in some cases
Date: 2011-09-22 14:36:28
Message-ID: 18434.1316702188@sss.pgh.pa.us
Lists: pgsql-bugs pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> On Thu, Sep 22, 2011 at 12:24 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> Now, having said that, I'm starting to wonder again why it's worth our
>> trouble to fool with encoding-specific incrementers. The exactness of
>> the estimates seems unlikely to be improved very much by doing this.

> Well, so the problem is that the frequency with which the algorithm
> fails altogether seems to be disturbingly high for certain kinds of
> characters. I agree it might not be that important to get the
> absolutely best next string, but it does seem important not to fail
> outright. Kyotaro Horiguchi gives the example of UTF-8 characters
> ending with 0xbf.

[ thinks for a bit ... ] Yeah, it's certainly true that such a
character might be relatively small in the overall sort order. The
assumption underlying what we're doing now is that dropping the last
character and incrementing the next-to-last one instead isn't terribly
catastrophic from an estimation accuracy standpoint. I can see that
there are cases where that would fail to be true, but I'm not exactly
convinced that they're worse than all the other cases where we'll get
a poor estimate.

Anyway, I won't stand in the way of the patch as long as it's modified
to limit the number of values considered for any one character position
to something reasonably small.

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Greg Stark <stark(at)mit(dot)edu>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)oss(dot)ntt(dot)co(dot)jp>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [v9.2] make_greater_string() does not return a string in some cases
Date: 2011-09-22 14:44:05
Message-ID: 18585.1316702645@sss.pgh.pa.us
Lists: pgsql-bugs pgsql-hackers

Greg Stark <stark(at)mit(dot)edu> writes:
> On Thu, Sep 22, 2011 at 2:51 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> The essential problem here is "when can you stop scanning,
>> given a pattern with this prefix?", and btree doesn't know any more
>> about that than make_greater_string does; it would in fact have to use
>> make_greater_string or something isomorphic to it.

> Hm, as long as btree_pattern_ops is the only opclass that behaves this
> way that's more or less true. But Robert's right that if btree just
> stops when it finds something that doesn't match it doesn't need to
> hard code any knowledge of what the "next" value would be.

But you've added mechanism (and hence cycles) to btree searches,
and *you haven't actually gained anything*. If the feature is
restricted to only work for sort orderings in which common-prefix
strings are contiguous, then it doesn't do anything we can't do
just as well with the existing mechanism. Moreover, you'll still
need make_greater_string because of the problem of trying to extract
LIKE selectivity estimates from locale-dependent pg_statistic data.

regards, tom lane


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)oss(dot)ntt(dot)co(dot)jp>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [v9.2] make_greater_string() does not return a string in some cases
Date: 2011-09-22 15:26:56
Message-ID: CA+Tgmoame4LOO=uF2G=sNMQgSDwXcwfrsg+gNbT64r=rTux-0Q@mail.gmail.com
Lists: pgsql-bugs pgsql-hackers

On Thu, Sep 22, 2011 at 10:36 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Anyway, I won't stand in the way of the patch as long as it's modified
> to limit the number of values considered for any one character position
> to something reasonably small.

One thing I was thinking about is that it would be useful to have some
metric for judging how well any given algorithm that we might pick
here actually works. For example, if we were to try all possible
three character strings in some encoding and run make_greater_string()
on each one of them, we could then measure the failure percentage. Or
if that's too many cases to crank through then we could limit it some
way - but the point is, without some kind of test harness here, we
have no way of measuring the trade-off between spending more CPU time
and improving accuracy. Maybe you have a better feeling for what's
reasonable there than I do, but I'm not prepared to take a stab in the
dark without benefit of some real measurements.
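
Just to make the shape of such a harness concrete, here is a rough
standalone sketch (my own illustration, not a finished proposal) that
counts how many UTF-8 code points cannot be bumped to another valid
character of the same length by incrementing only the last byte;
collation order and character popularity are ignored here.

#include <stdio.h>

/* Encode a Unicode code point as UTF-8; returns the byte length. */
static int
utf8_encode(unsigned int cp, unsigned char *buf)
{
    if (cp <= 0x7F)   { buf[0] = cp; return 1; }
    if (cp <= 0x7FF)  { buf[0] = 0xC0 | (cp >> 6);
                        buf[1] = 0x80 | (cp & 0x3F); return 2; }
    if (cp <= 0xFFFF) { buf[0] = 0xE0 | (cp >> 12);
                        buf[1] = 0x80 | ((cp >> 6) & 0x3F);
                        buf[2] = 0x80 | (cp & 0x3F); return 3; }
    buf[0] = 0xF0 | (cp >> 18);
    buf[1] = 0x80 | ((cp >> 12) & 0x3F);
    buf[2] = 0x80 | ((cp >> 6) & 0x3F);
    buf[3] = 0x80 | (cp & 0x3F);
    return 4;
}

int
main(void)
{
    unsigned long tried = 0, failed = 0;
    unsigned int  cp;

    for (cp = 0x01; cp <= 0x10FFFF; cp++)
    {
        unsigned char buf[4];
        int           len;

        if (cp >= 0xD800 && cp <= 0xDFFF)
            continue;            /* surrogates are not legal in UTF-8 */
        len = utf8_encode(cp, buf);
        tried++;

        /*
         * The last byte of a multibyte character must be a continuation
         * byte (0x80-0xBF), so last-byte bumping fails once it is
         * already 0xBF; a single-byte character fails only at 0x7F,
         * since no lone byte above it is a complete character.
         */
        if ((len == 1 && buf[0] == 0x7F) ||
            (len > 1 && buf[len - 1] == 0xBF))
            failed++;
    }
    printf("%lu of %lu code points cannot be bumped in the last byte\n",
           failed, tried);
    return 0;
}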

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)oss(dot)ntt(dot)co(dot)jp>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [v9.2] make_greater_string() does not return a string in some cases
Date: 2011-09-22 15:46:43
Message-ID: 21348.1316706403@sss.pgh.pa.us
Lists: pgsql-bugs pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> One thing I was thinking about is that it would be useful to have some
> metric for judging how well any given algorithm that we might pick
> here actually works.

Well, the metric that we were indirectly using earlier was the
number of characters in a given locale for which the algorithm
fails to find a greater one (excluding whichever character is "last",
I guess, or you could just recognize there's always at least one).

> For example, if we were to try all possible
> three character strings in some encoding and run make_greater_string()
> on each one of them, we could then measure the failure percentage. Or
> if that's too many cases to crank through then we could limit it some
> way -

Even in UTF8 there's only a couple million assigned code points, so for
test purposes anyway it doesn't seem like we couldn't crank through them
all. Also, in many cases you could probably figure it out by analysis
instead of brute-force testing every case.

A more reasonable objection might be that a whole lot of those code
points are things nobody cares about, and so we need to weight the
results somehow by the actual popularity of the character. Not sure
how to take that into account.

Another issue here is that we need to consider not just whether we find
a greater character, but "how much greater" it is. This would apply to
my suggestion of incrementing the top byte without considering
lower-order bytes --- we'd be skipping quite a lot of code space for
each increment, and it's conceivable that that would be quite hurtful in
some cases. Not sure how to account for that either. An extreme
example here is an "incrementer" that just immediately returns the last
character in the sort order for any lesser input.

regards, tom lane


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)oss(dot)ntt(dot)co(dot)jp>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [v9.2] make_greater_string() does not return a string in some cases
Date: 2011-09-22 16:09:45
Message-ID: CA+TgmoaDz_Ri7NcLebzbb_EzTxEGy_mwwsenLPrh9JaTVSL1kA@mail.gmail.com
Lists: pgsql-bugs pgsql-hackers

On Thu, Sep 22, 2011 at 11:46 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Well, the metric that we were indirectly using earlier was the
> number of characters in a given locale for which the algorithm
> fails to find a greater one (excluding whichever character is "last",
> I guess, or you could just recognize there's always at least one).

What about characters that sort differently in sequence than individually?

>> For example, if we were to try all possible
>> three character strings in some encoding and run make_greater_string()
>> on each one of them, we could then measure the failure percentage.  Or
>> if that's too many cases to crank through then we could limit it some
>> way -
>
> Even in UTF8 there's only a couple million assigned code points, so for
> test purposes anyway it doesn't seem like we couldn't crank through them
> all.  Also, in many cases you could probably figure it out by analysis
> instead of brute-force testing every case.
>
> A more reasonable objection might be that a whole lot of those code
> points are things nobody cares about, and so we need to weight the
> results somehow by the actual popularity of the character.  Not sure
> how to take that into account.

I guess whether that's a problem in practice will depend somewhat on
the quality of the algorithms we're able to find. If our best
algorithms still have a 1% failure rate, then yeah, that's an issue,
but in that case I'd suggest that our best algorithms suck and we need
to think harder about alternate solutions. If we're talking about
failing on 5 characters out of a million we can just eyeball them.
I'm not trying to reduce this testing to something that is entirely
mechanic in every way; I'm just saying that I'm not optimistic about
my ability to judge which algorithms will work best in practice
without some kind of automated aid.

> Another issue here is that we need to consider not just whether we find
> a greater character, but "how much greater" it is.  This would apply to
> my suggestion of incrementing the top byte without considering
> lower-order bytes --- we'd be skipping quite a lot of code space for
> each increment, and it's conceivable that that would be quite hurtful in
> some cases.  Not sure how to account for that either.  An extreme
> example here is an "incrementer" that just immediately returns the last
> character in the sort order for any lesser input.

Right... well, this is why I'm not wild about doing this by
incrementing in the first place.

But now that I think about it, what about using some
slightly-less-stupid version of that approach as a fallback strategy?
For example, we could pick, oh, say, 20 characters out of the space of
code points, about evenly distributed under whatever collations we
think are likely to be in use. In the incrementer, we try some kind
of increment-the-bytes strategy for a while and if it doesn't pan out,
we zip through the array and try substituting each of the fallback
characters. If more than one works, we test the survivors against
each other until we're left with just one winner. The bound might not
be real tight, but as long as it's good enough to make the planner
pick an index scan it might not matter very much.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)oss(dot)ntt(dot)co(dot)jp>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [v9.2] make_greater_string() does not return a string in some cases
Date: 2011-09-22 16:35:56
Message-ID: 23159.1316709356@sss.pgh.pa.us
Lists: pgsql-bugs pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> On Thu, Sep 22, 2011 at 11:46 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> Well, the metric that we were indirectly using earlier was the
>> number of characters in a given locale for which the algorithm
>> fails to find a greater one (excluding whichever character is "last",
>> I guess, or you could just recognize there's always at least one).

> What about characters that sort differently in sequence than individually?

Yeah, there's a whole 'nother set of issues there, but the character
incrementer is unlikely to affect that very much either way, I think.

> But now that I think about it, what about using some
> slightly-less-stupid version of that approach as a fallback strategy?
> For example, we could pick, oh, say, 20 characters out of the space of
> code points, about evenly distributed under whatever collations we
> think are likely to be in use.

Sure, if the "increment the top byte" strategy proves to not accomplish
that effectively. But I'd prefer not to design a complex strategy until
it's been proven that a simpler one doesn't work.

regards, tom lane


From: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)oss(dot)ntt(dot)co(dot)jp>
To: tgl(at)sss(dot)pgh(dot)pa(dot)us
Cc: robertmhaas(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [v9.2] make_greater_string() does not return a string in some cases
Date: 2011-09-23 09:16:47
Message-ID: 20110923.181647.101373171.horiguchi.kyotaro@oss.ntt.co.jp
Lists: pgsql-bugs pgsql-hackers

Hi, I think I have now roughly grasped the constructs and the
underlying concept.

At Thu, 22 Sep 2011 12:35:56 -0400, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote in <23159(dot)1316709356(at)sss(dot)pgh(dot)pa(dot)us>
tgl> Sure, if the "increment the top byte" strategy proves to not accomplish
tgl> that effectively. But I'd prefer not to design a complex strategy until
tgl> it's been proven that a simpler one doesn't work.

OK, I vaguely understand that thinking, but I have not yet grasped
your measure of complexity.

The current make_greater_string bumps up the tail of the bare byte
sequence, checks whether the whole byte sequence is valid for the
database encoding, and tries the next value if it is not. On the other
hand, the patch (although its style is corrupted..) looks for the last
CHARACTER, tries to bump that last CHARACTER up, and gives up if that
fails.

Looking only within the scope of the function make_greater_string, I
feel the former is more complex because of its check-and-retry loop
construct.

Yes, `bumping the character up' has complexity of its own, but that
complexity is encapsulated within a single function.

From the performance perspective, the current implementation always
slips 64 times (0xff - 0xbf, for UTF-8), checks the WHOLE pattern
string on every slippage, and eventually gives up for the few but not
negligible 100 code points (counting Japanese characters only). The
check-and-retry loop cannot help in those cases. And it checks the
whole pattern string at least once even when it lands successfully.

Meanwhile, the algorithm in the patch scans the whole pattern string
to find the last character, but makes no slippage over the whole code
space and gives up only at the points where the character length
changes. (Of course it is possible to handle those points too, but to
me that is `too expensive' for the gain :). Not only checking the
whole string, but also checking the character after the increment
operation, is essentially unnecessary with this method.

To summarize my view, the two methods do not seem very different in
performance for the characters that the current method can increment,
and the latter seems rather efficient and applicable for most of the
characters it cannot.

The patch currently checks the validity of the result as a last
resort, because of the possible inconsistency caused by careless
changes to the validity check function (in other words, changing the
character set, which is very unlikely). But that check is itself
unnecessary if the consistency between the incrementer and the
validator has been verified after the last modification. The four-byte
memcpy can be removed by changing the rewinding method. These
modifications make the operation low-cost (I think..) for UTF-8 and
leave the behavior for other encodings untouched.

As I wrote in my previous message, the modification to pg_wchar_table
can be reverted until another incrementer comes along.

May I have another chance to show a new version of the patch along the
lines described above?

Regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)oss(dot)ntt(dot)co(dot)jp>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [v9.2] make_greater_string() does not return a string in some cases
Date: 2011-09-23 12:39:44
Message-ID: CA+TgmoZEoR4UbYoiAb=NOJzwmswv7_dt-pp1ATi_gE2g8XnMoQ@mail.gmail.com
Lists: pgsql-bugs pgsql-hackers

On Thu, Sep 22, 2011 at 10:36 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Anyway, I won't stand in the way of the patch as long as it's modified
> to limit the number of values considered for any one character position
> to something reasonably small.

I think that limit in both the old and new code is 1, except that the
new code does it more efficiently.

Am I confused?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)oss(dot)ntt(dot)co(dot)jp>
Cc: tgl(at)sss(dot)pgh(dot)pa(dot)us, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [v9.2] make_greater_string() does not return a string in some cases
Date: 2011-09-23 12:42:42
Message-ID: CA+TgmoYgUG_0rFmasjDdHi=qxxD+pU8HM=fWMs+0kRyhGUwSkA@mail.gmail.com
Lists: pgsql-bugs pgsql-hackers

On Fri, Sep 23, 2011 at 5:16 AM, Kyotaro HORIGUCHI
<horiguchi(dot)kyotaro(at)oss(dot)ntt(dot)co(dot)jp> wrote:
> Can I have another chance to show the another version of the
> patch according to the above?

You can always post a new version of any patch.

I think what you need to focus on is cleaning up the coding style to
match PG project standards. In particular, you need to fix your brace
positioning and the way function declarations are laid out. Someone
will have to do that before this is committed, and ideally that person
should be you rather than, say, me. :-)

It would be good to fix the memcpy() issue Tom found too, and make a
careful examination of the code for anything else that can be tidied
up or made more bulletproof.

It seems like we are getting close to agreement on the basic behavior,
so that is good. We can always fine-tune that later; that shouldn't
be much work.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)oss(dot)ntt(dot)co(dot)jp>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [v9.2] make_greater_string() does not return a string in some cases
Date: 2011-09-23 12:51:28
Message-ID: 24673.1316782288@sss.pgh.pa.us
Lists: pgsql-bugs pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> On Thu, Sep 22, 2011 at 10:36 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> Anyway, I won't stand in the way of the patch as long as it's modified
>> to limit the number of values considered for any one character position
>> to something reasonably small.

> I think that limit in both the old and new code is 1, except that the
> new code does it more efficiently.

> Am I confused?

Yes, or else I am. Consider a 4-byte UTF8 character at the end of the
string. The existing code increments the last byte up to 255 (rejecting
everything past 0xBF), then gives up and truncates that character away.
So the maximum number of tries for that character position is between 0
and 127 depending on what the original character was (with at most 63 of
the incremented values getting past the verifymbstr test).

The proposed patch is going to iterate through all Unicode code points
up to U+7FFFFF before giving up. Since it's possible that we need to
increment something further left to succeed at all, this doesn't seem
like a good plan.

regards, tom lane


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)oss(dot)ntt(dot)co(dot)jp>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [v9.2] make_greater_string() does not return a string in some cases
Date: 2011-09-23 13:11:59
Message-ID: CA+TgmoavQP28OY0QRkXSQiS131u2sEFc7e72yU3x=pjr0BPU=w@mail.gmail.com
Lists: pgsql-bugs pgsql-hackers

On Fri, Sep 23, 2011 at 8:51 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> On Thu, Sep 22, 2011 at 10:36 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>>> Anyway, I won't stand in the way of the patch as long as it's modified
>>> to limit the number of values considered for any one character position
>>> to something reasonably small.
>
>> I think that limit in both the old and new code is 1, except that the
>> new code does it more efficiently.
>
>> Am I confused?
>
> Yes, or else I am.  Consider a 4-byte UTF8 character at the end of the
> string.  The existing code increments the last byte up to 255 (rejecting
> everything past 0xBF), then gives up and truncates that character away.
> So the maximum number of tries for that character position is between 0
> and 127 depending on what the original character was (with at most 63 of
> the incremented values getting past the verifymbstr test).
>
> The proposed patch is going to iterate through all Unicode code points
> up to U+7FFFFF before giving up.  Since it's possible that we need to
> increment something further left to succeed at all, this doesn't seem
> like a good plan.

I think you're misreading the code. It does this:

while (len > 0)
{
    boring stuff;
    if (charincfunc(lastchar, charlen))
    {
        more boring stuff;
        if (we made a greater string)
            return it;
        cleanup;
    }
    truncate away last character;
}

I don't see how that's ever going to try more than one character in
the same position.

What may be confusing you is that the old code has two loops: an outer
loop that tests whether we've made a greater string, and an inner loop
that tests whether we've made a validly encoded string at all. In the
new code, at least in the UTF-8 case, the inner loop is GONE
altogether. Instead of iterating until we construct a valid
character, we just use our mad UTF-8 skillz to assemble one, and
return it.

Or else I need to go drink a few cups of tea and look at this again.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Marcin Mańk <marcin(dot)mank(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)oss(dot)ntt(dot)co(dot)jp>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [v9.2] make_greater_string() does not return a string in some cases
Date: 2011-09-23 17:35:48
Message-ID: 624360DC-C26A-4FA6-8C6A-26D28EA79422@gmail.com
Lists: pgsql-bugs pgsql-hackers

One idea:
col like 'foo%' could be translated to col >= 'foo' and col <= 'foo' || 'zzz', where 'z' is the largest possible character. This should be good enough for calculating stats.

How to find such a character, I do not know.


From: Peter Eisentraut <peter_e(at)gmx(dot)net>
To: Marcin Mańk <marcin(dot)mank(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)oss(dot)ntt(dot)co(dot)jp>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [v9.2] make_greater_string() does not return a string in some cases
Date: 2011-09-26 10:52:23
Message-ID: 1317034343.1759.13.camel@fsopti579.F-Secure.com
Lists: pgsql-bugs pgsql-hackers

On fre, 2011-09-23 at 20:35 +0300, Marcin Mańk wrote:
> One idea:
> col like 'foo%' could be translated to col >= 'foo' and col <= foo || 'zzz' , where 'z' is the largest possible character. This should be good enough for calculating stats.
>
> How to find such a character, i do not know.

That's what makes this so difficult.

If we knew the largest character, we could probably also find the
largest-1, largest-2, etc. characters and determine the total order of
everything.


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Peter Eisentraut <peter_e(at)gmx(dot)net>
Cc: Marcin Mańk <marcin(dot)mank(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)oss(dot)ntt(dot)co(dot)jp>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [v9.2] make_greater_string() does not return a string in some cases
Date: 2011-09-26 14:08:54
Message-ID: 5358.1317046134@sss.pgh.pa.us
Lists: pgsql-bugs pgsql-hackers

Peter Eisentraut <peter_e(at)gmx(dot)net> writes:
On fre, 2011-09-23 at 20:35 +0300, Marcin Mańk wrote:
>> One idea:
>> col like 'foo%' could be translated to col >= 'foo' and col <= foo || 'zzz' , where 'z' is the largest possible character. This should be good enough for calculating stats.
>> How to find such a character, i do not know.

> That's what makes this so difficult.

> If we knew the largest character, we could probably also find the
> largest-1, largest-2, etc. characters and determine the total order of
> everything.

No, it's a hundred times worse than that, because in collations other
than C there typically *is* no total order. The collation behavior of
many characters is context-sensitive, thanks to the multi-pass behavior
of typical "dictionary" algorithms.

regards, tom lane


From: Peter Eisentraut <peter_e(at)gmx(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Marcin Mańk <marcin(dot)mank(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)oss(dot)ntt(dot)co(dot)jp>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [v9.2] make_greater_string() does not return a string in some cases
Date: 2011-09-26 14:30:32
Message-ID: 1317047432.1759.27.camel@fsopti579.F-Secure.com
Lists: pgsql-bugs pgsql-hackers

On mån, 2011-09-26 at 10:08 -0400, Tom Lane wrote:
> Peter Eisentraut <peter_e(at)gmx(dot)net> writes:
> > On fre, 2011-09-23 at 20:35 +0300, Marcin Mańk wrote:
> >> One idea:
> >> col like 'foo%' could be translated to col >= 'foo' and col <= foo || 'zzz' , where 'z' is the largest possible character. This should be good enough for calculating stats.
> >> How to find such a character, i do not know.
>
> > That's what makes this so difficult.
>
> > If we knew the largest character, we could probably also find the
> > largest-1, largest-2, etc. characters and determine the total order of
> > everything.
>
> No, it's a hundred times worse than that, because in collations other
> than C there typically *is* no total order. The collation behavior of
> many characters is context-sensitive, thanks to the multi-pass behavior
> of typical "dictionary" algorithms.

Well, there is a total order of all strings, but it's not consistent
under string concatenation.

But there is a "largest character". If the collation implementation
uses four weights (the typical case), the largest character is the one
that maps to <FFFF> <FFFF> <FFFF> <FFFF>. If you appended that
character to a string, you would get a larger string. (Unless there are
French backwards levels or other funny things in place, perhaps.) But
we don't know which character that is, and likely there isn't one, so
we'd need the largest character that maps to an actually assigned weight,
and that's not possible without exhaustive search of all collating
elements.

We could possibly try to make this whole thing work differently by
storing the strxfrm results in the histograms.
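
For what it's worth, a minimal sketch of the idea in plain C (the
helper name is mine, and this illustrates strxfrm()'s contract rather
than proposing anything for the statistics code itself):

#include <stdlib.h>
#include <string.h>

/*
 * Return a malloc'd strxfrm() image of s, or NULL on failure.  Two such
 * images compare with memcmp()/strcmp() the same way the original strings
 * compare under strcoll(), so images like this could in principle be what
 * gets stored in the histogram.
 */
static char *
xfrm_copy(const char *s)
{
    size_t  n = strxfrm(NULL, s, 0) + 1;    /* buffer size needed, incl. NUL */
    char   *buf = malloc(n);

    if (buf != NULL)
        strxfrm(buf, s, n);
    return buf;
}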


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Peter Eisentraut <peter_e(at)gmx(dot)net>
Cc: Marcin Mańk <marcin(dot)mank(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)oss(dot)ntt(dot)co(dot)jp>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [v9.2] make_greater_string() does not return a string in some cases
Date: 2011-09-26 15:52:16
Message-ID: 7465.1317052336@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs pgsql-hackers

Peter Eisentraut <peter_e(at)gmx(dot)net> writes:
> On mån, 2011-09-26 at 10:08 -0400, Tom Lane wrote:
>> No, it's a hundred times worse than that, because in collations other
>> than C there typically *is* no total order. The collation behavior of
>> many characters is context-sensitive, thanks to the multi-pass behavior
>> of typical "dictionary" algorithms.

> Well, there is a total order of all strings, but it's not consistent
> under string concatenation.

> But there is a "largest character". If the collation implementation
> uses four weights (the typical case), the largest character is the one
> that maps to <FFFF> <FFFF> <FFFF> <FFFF>. If you appended that
> character to a string, you would get a larger string. (Unless there are
> French backwards levels or other funny things in place, perhaps.)

But the problem is not "make a string greater than this given one".
It is "make a string greater than any string that begins with this
given one". As an example, suppose we are given "xyz" where "z" is
the last letter in the collation. We can probably find characters
such as "~" that are greater than "z", but no string x-y-nonletter
is going to be considered greater than x-y-z-z by a dictionary
sorting method. This is why make_greater_string has to be prepared
to give up and go increment some character before the last one:
the only way to succeed for this input is to construct "xz".
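
Put another way, the overall strategy is roughly the following sketch
(the helper names are made up for illustration; this is not the actual
function):

/* workstr is a modifiable copy of the given prefix, len its length */
while (len > 0)
{
    /* Try to bump the last character to some larger character. */
    if (increment_last_char(workstr, len))       /* hypothetical helper */
    {
        if (string_greater(workstr, prefix))     /* checked per the collation */
            return workstr;                      /* e.g. "xy" + next('y') gives "xz" */
    }

    /* No luck at this position: drop the last character and retry. */
    len = truncate_last_char(workstr, len);      /* hypothetical helper */
    workstr[len] = '\0';
}
return NULL;                                     /* give up entirely */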

regards, tom lane


From: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)oss(dot)ntt(dot)co(dot)jp>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: [v9.2] make_greater_string() does not return a string in some cases
Date: 2011-09-29 10:24:26
Message-ID: 20110929.192426.194906490.horiguchi.kyotaro@oss.ntt.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs pgsql-hackers

This is a new version of the make_greater_string patch.

1. wchar.c:1532 pg_wchar_table: Restore the pg_wchar_table.

2. wchar.c:1371 pg_utf8_increment: Remove dangerous memcpy, but
one memcpy is left because it's safe. Remove code check after
increment.

3. wchar.c:1429 pg_eucjp_increment: Remove dangerous
memcpy. Remove code check after increment. Minor bug fix.

4. wchar.c:1654 pg_database_encoding_character_incrementer:
Select the increment function with a switch statement (a rough
sketch follows below).
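
A rough sketch of what item 4 describes (the exact names and cases in
the patch may differ; this is only an illustration):

mbcharacter_incrementer
pg_database_encoding_character_incrementer(void)
{
    switch (GetDatabaseEncoding())
    {
        case PG_UTF8:
            return pg_utf8_increment;       /* UTF-8 specific incrementer */
        case PG_EUC_JP:
            return pg_eucjp_increment;      /* EUC-JP specific incrementer */
        default:
            return pg_generic_charinc;      /* byte-wise fallback */
    }
}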

horiguchi.kyotaro> This is rebased patch of `Allow encoding specific character
horiguchi.kyotaro> incrementer'(https://commitfest.postgresql.org/action/patch_view?id=602).

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachment Content-Type Size
charinc_20110929_v2.patch text/x-patch 9.1 KB

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)oss(dot)ntt(dot)co(dot)jp>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: [v9.2] make_greater_string() does not return a string in some cases
Date: 2011-10-03 18:13:46
Message-ID: CA+TgmoZbG64f8yF-wzY9JvcMHz4yCRx7rVqv4_DXtjHqJ44Y=w@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs pgsql-hackers

On Thu, Sep 29, 2011 at 6:24 AM, Kyotaro HORIGUCHI
<horiguchi(dot)kyotaro(at)oss(dot)ntt(dot)co(dot)jp> wrote:
> This is a new version of the make_greater_string patch.

According to the comments in the original source code, the purpose of
savelastchar is to avoid confusing pg_mbcliplen(). You've preserved
savelastchar only for the case where datatype == BYTEAOID, while
making it the increment function's job not to do anything permanent
unless it also returns true. But it seems to me that if the datatype
is BYTEAOID then there's no need to restore anything at all, because
we're not going to call pg_mbcliplen() in that case anyway. So I
think the logic here can be simplified.
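
A minimal sketch of the simplification being suggested (illustrative
only, not the committed code; variable names are assumptions):

/*
 * Work out the length of the last "character" of workstr.  Only the text
 * case needs pg_mbcliplen(); for bytea every byte stands on its own, so a
 * half-finished increment cannot confuse anything.
 */
if (datatype == BYTEAOID)
    charlen = 1;
else
    charlen = len - pg_mbcliplen(workstr, len, len - 1);

charptr = workstr + len - charlen;      /* points at the last character */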

Also, you haven't completely fixed the style issues. Function
definitions should look like this:

static void
thingy()
{
}

Not like this:

static void thingy()
{
}

Opening curly braces should be on a line by themselves, not at the end
of the preceding if, while, etc. line.

"finnaly" is spelled incorrectly.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)oss(dot)ntt(dot)co(dot)jp>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: [v9.2] make_greater_string() does not return a string in some cases
Date: 2011-10-04 02:22:54
Message-ID: CA+TgmoZ8n79LMoSESYdNj6MT7YOOFoyi5GHdCcHx7Q12FgLW6Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs pgsql-hackers

On Mon, Oct 3, 2011 at 2:13 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Thu, Sep 29, 2011 at 6:24 AM, Kyotaro HORIGUCHI
> <horiguchi(dot)kyotaro(at)oss(dot)ntt(dot)co(dot)jp> wrote:
>> This is a new version of the make_greater_string patch.
>
> According to the comments in the original source code, the purpose of
> savelastchar is to avoid confusing pg_mbcliplen().  You've preserved
> savelastchar only for the case where datatype == BYTEAOID, while
> making it the increment function's job not to do anything permanent
> unless it also returns true.  But it seems to me that if the datatype
> is BYTEAOID then there's no need to restore anything at all, because
> we're not going to call pg_mbcliplen() in that case anyway.  So I
> think the logic here can be simplified.
>
> Also, you haven't completely fixed the style issues.  Function
> definitions should look like this:
>
> static void
> thingy()
> {
> }
>
> Not like this:
>
> static void thingy()
> {
> }
>
> Opening curly braces should be on a line by themselves, not at the end
> of the preceding if, while, etc. line.
>
> "finnaly" is spelled incorrectly.

Oh, and there's this:

wchar.c: In function ‘pg_utf8_increment’:
wchar.c:1376: warning: unused variable ‘success’
wchar.c: In function ‘pg_eucjp_increment’:
wchar.c:1433: warning: unused variable ‘success’

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)oss(dot)ntt(dot)co(dot)jp>
To: robertmhaas(at)gmail(dot)com
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: [v9.2] make_greater_string() does not return a string in some cases
Date: 2011-10-07 04:22:46
Message-ID: 20111007.132246.187325868.horiguchi.kyotaro@oss.ntt.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs pgsql-hackers

Thank you for reviewing.

The new version of this patch is attached to this message.

> > But it seems to me that if the datatype is BYTEAOID then
> > there's no need to restore anything at all, because we're not
> > going to call pg_mbcliplen() in that case anyway.  So I think
> > the logic here can be simplified.

I agree with you. The original code does not restore the changed
byte. I removed the lastchar preservation from
make_greater_string().

> > Also, you haven't completely fixed the style issues.  Function
> > definitions should look like this:
..
> > Opening curly braces should be on a line by themselves, not at the end
> > of the preceding if, while, etc. line.
> >
> > "finnaly" is spelled incorrectly.

I'm very sorry for leaving so many mixed styles. I think I've
now corrected the style of the function definitions and braces.
The misspelled word is gone along with its whole sentence,
because the re-checking just before the return has been removed.

> Oh, and there's this:
>
> wchar.c: In function ‘pg_utf8_increment’:
> wchar.c:1376: warning: unused variable ‘success’
> wchar.c: In function ‘pg_eucjp_increment’:
> wchar.c:1433: warning: unused variable ‘success’

Oops... I have rebased the patch and removed all warnings.
make check now passes with everything OK.

I confirmed that the planner chooses an index scan with proper
boundaries for LIKE expressions containing characters at the
problematic code points, on both UTF-8 and EUC-JP database
encodings, with the database locale set to C and to ja_JP.UTF8.
I also confirmed it for bytea values ending with 0xff, 0x00, and 0x01.

This is the third version of the patch.

Regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachment Content-Type Size
charinc_20111007_v3.patch text/x-patch 8.7 KB

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)oss(dot)ntt(dot)co(dot)jp>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: [v9.2] make_greater_string() does not return a string in some cases
Date: 2011-10-07 16:25:08
Message-ID: CA+Tgmoao2OoZBMUsfp3ZC0_LgXsv3jBvY9eYR5h+cZYez7j26Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs pgsql-hackers

On Fri, Oct 7, 2011 at 12:22 AM, Kyotaro HORIGUCHI
<horiguchi(dot)kyotaro(at)oss(dot)ntt(dot)co(dot)jp> wrote:
> Thank you for reviewing.
>
> The new version of this patch is attached to this message.

OK, I think this is reasonably close to being committable now. There
are a few remaining style and grammar mistakes but I can fix those up
before committing. One thing I still think it would be useful to add,
though, is some comments to pg_utf8_increment() and
pg_eucjp_increment() describing the algorithm being used. Can you
take a crack at that?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)oss(dot)ntt(dot)co(dot)jp>
To: robertmhaas(at)gmail(dot)com
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: [v9.2] make_greater_string() does not return a string in some cases
Date: 2011-10-11 07:55:00
Message-ID: 20111011.165500.124520773.horiguchi.kyotaro@oss.ntt.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs pgsql-hackers


At Fri, 7 Oct 2011 12:25:08 -0400, Robert Haas <robertmhaas(at)gmail(dot)com> wrote in <CA+Tgmoao2OoZBMUsfp3ZC0_LgXsv3jBvY9eYR5h+cZYez7j26Q(at)mail(dot)gmail(dot)com>
> OK, I think this is reasonably close to being committable now. There
> are a few remaining style and grammar mistakes but I can fix those up
> before committing.

Thank you.

> One thing I still think it would be useful to add,
> though, is some comments to pg_utf8_increment() and
> pg_eucjp_increment() describing the algorithm being used. Can you
> take a crack at that?

Yes I'll do it in a day or two.

Regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center


From: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)oss(dot)ntt(dot)co(dot)jp>
To: robertmhaas(at)gmail(dot)com
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: [v9.2] make_greater_string() does not return a string in some cases
Date: 2011-10-13 03:45:31
Message-ID: 20111013.124530.97708970.horiguchi.kyotaro@oss.ntt.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs pgsql-hackers

Hello, the work is finished.

Version 4 of the patch is attached to this message.

- Added a rough description of the algorithm as comments to
pg_utf8_increment() and pg_eucjp_increment().

- Fixed a bug in pg_utf8_increment() found while working on
this. pg_utf8_increment() and pg_eucjp_increment() were retested
over all valid code points and handle them properly.

- The comment previously pointed out as grammatically wrong is
left untouched. I'm sorry to bother you with my poor English.

At Tue, 11 Oct 2011 16:55:00 +0900 (JST), Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)oss(dot)ntt(dot)co(dot)jp> wrote
> > One thing I still think it would be useful to add,
> > though, is some comments to pg_utf8_increment() and
> > pg_eucjp_increment() describing the algorithm being used. Can you
> > take a crack at that?
>
> Yes I'll do it in a day or two.

Regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachment Content-Type Size
charinc_20111013_v4.patch text/x-patch 11.2 KB

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)oss(dot)ntt(dot)co(dot)jp>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: [v9.2] make_greater_string() does not return a string in some cases
Date: 2011-10-18 03:45:05
Message-ID: CA+TgmoYETjFMP2hFzWwCxEi2OQKA+NP5CY-DMPnasxNCgX+2rg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs pgsql-hackers

On Wed, Oct 12, 2011 at 11:45 PM, Kyotaro HORIGUCHI
<horiguchi(dot)kyotaro(at)oss(dot)ntt(dot)co(dot)jp> wrote:
> Hello, the work is finished.
>
>  Version 4 of the patch is attached to this message.

I went through this in a bit more detail tonight and am cleaning it
up. But I'm a bit confused, looking at pg_utf8_increment() in detail:

- Why does the second byte need special handling for 0xED and 0xF4?
AFAICT, UTF-8 requires all legal strings to have a second byte between
0x80 and 0xBF, just as in byte positions 3 and 4, so these bytes would
be invalid in this position anyway.
- In the first byte, we don't increment if the current value for that
byte is 0x7F, 0xDF, 0xEF, or 0xF4. But why isn't it 0xF7 rather than
0xF4? I see there's a comparable restriction in pg_utf8_islegal(),
but I don't understand why.
- Perhaps on the same point, the comments claim that we will fail for
code points U+0007F, U+007FF, U+0FFFF, and U+10FFFF. But IIUC, a
4-byte unicode character can encode values up to U+1FFFFF, so why is
it U+10FFFF rather than U+1FFFFF?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)oss(dot)ntt(dot)co(dot)jp>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [v9.2] make_greater_string() does not return a string in some cases
Date: 2011-10-18 03:54:45
Message-ID: 25893.1318910085@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> - Why does the second byte need special handling for 0xED and 0xF4?

http://www.faqs.org/rfcs/rfc3629.html

See section 4 in particular. The underlying requirement is to disallow
multiple representations of the same Unicode code point.

regards, tom lane


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)oss(dot)ntt(dot)co(dot)jp>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [v9.2] make_greater_string() does not return a string in some cases
Date: 2011-10-18 04:06:11
Message-ID: CA+TgmobVjxK6DAuGHJ=nLgnfYHP4s7EAp19E86pzd4DudGhY4w@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs pgsql-hackers

On Mon, Oct 17, 2011 at 11:54 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> - Why does the second byte need special handling for 0xED and 0xF4?
>
> http://www.faqs.org/rfcs/rfc3629.html
>
> See section 4 in particular.  The underlying requirement is to disallow
> multiple representations of the same Unicode code point.

I'm still confused. The input string is already known to be valid
UTF-8, so the second byte (if there is one) must be between 0x80 and
0xBF. Therefore it will be neither 0xED nor 0xF4.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)oss(dot)ntt(dot)co(dot)jp>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [v9.2] make_greater_string() does not return a string in some cases
Date: 2011-10-18 04:11:38
Message-ID: 26142.1318911098@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> On Mon, Oct 17, 2011 at 11:54 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> http://www.faqs.org/rfcs/rfc3629.html

> I'm still confused. The input string is already known to be valid
> UTF-8, so the second byte (if there is one) must be between 0x80 and
> 0xBF. Therefore it will be neither 0xED nor 0xF4.

I haven't read the patch lately, but ED and F4 are special as *first*
bytes. Maybe the logic isn't quite right, or you read it wrong?

regards, tom lane


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)oss(dot)ntt(dot)co(dot)jp>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [v9.2] make_greater_string() does not return a string in some cases
Date: 2011-10-18 13:32:24
Message-ID: CA+TgmoZ+3gv=x7ScvFbJgE4pywRfy8QQ=ymDvKoRoWxVDGPWjQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs pgsql-hackers

On Tue, Oct 18, 2011 at 12:11 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> On Mon, Oct 17, 2011 at 11:54 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>>> http://www.faqs.org/rfcs/rfc3629.html
>
>> I'm still confused.  The input string is already known to be valid
>> UTF-8, so the second byte (if there is one) must be between 0x80 and
>> 0xBF.  Therefore it will be neither 0xED nor 0xF4.
>
> I haven't read the patch lately, but ED and F4 are special as *first*
> bytes.  Maybe the logic isn't quite right, or you read it wrong?

I think I'll let the patch author comment on that. It looks wrong to
me, but I just work here.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)oss(dot)ntt(dot)co(dot)jp>
To: robertmhaas(at)gmail(dot)com
Cc: tgl(at)sss(dot)pgh(dot)pa(dot)us, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [v9.2] make_greater_string() does not return a string in some cases
Date: 2011-10-21 01:36:46
Message-ID: 20111021.103646.221883029.horiguchi.kyotaro@oss.ntt.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs pgsql-hackers

Hello,

> > Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> >> - Why does the second byte need special handling for 0xED and 0xF4?
> >
> > http://www.faqs.org/rfcs/rfc3629.html
> >
> > See section 4 in particular.  The underlying requirement is to disallow
> > multiple representations of the same Unicode code point.

The special handling skips the UTF-8 code regions corresponding
to U+D800 - U+DFFF and to everything above U+10FFFF (starting at
U+110000) in UCS-4. The former is reserved for use with the
UTF-16 encoding form (as surrogate pairs), and its code points do
not directly represent characters, as described in section 3 of
RFC 3629. The latter is outside the range UTF-8 is defined to
cover, per the same section.

former> The definition of UTF-8 prohibits encoding character
former> numbers between U+D800 and U+DFFF, which are reserved for
former> use with the UTF-16 encoding form (as surrogate pairs)
former> and do not directly represent characters.

latter> In UTF-8, characters from the U+0000..U+10FFFF range (the
latter> UTF-16 accessible range) are encoded using sequences of 1
latter> to 4 octets.

# However, I wrote this exception simply by mimicking
# pg_utf8_validator()'s behavior in the first place.

This must be the basis of pg_utf8_verifier()'s behavior, and
pg_utf8_increment() has inherited it. It may be good to describe
the origin of this special handling in a comment on these
functions to avoid this sort of confusion.
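
For illustration, the first-byte special cases amount to clamping the
second byte of the sequence, roughly like this (names and layout are
assumptions, not the patch text):

unsigned char limit;

if (charptr[0] == 0xED)
    limit = 0x9F;       /* 0xED 0xA0.. would encode U+D800..U+DFFF (surrogates) */
else if (charptr[0] == 0xF4)
    limit = 0x8F;       /* 0xF4 0x90.. would encode code points above U+10FFFF */
else
    limit = 0xBF;       /* ordinary continuation-byte maximum */

if (charptr[1] < limit)
    charptr[1]++;       /* safe to bump the second byte */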

> I'm still confused. The input string is already known to be valid
> UTF-8, so the second byte (if there is one) must be between 0x80 and
> 0xBF. Therefore it will be neither 0xED nor 0xF4.

--
Kyotaro Horiguchi
NTT Open Source Software Center


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)oss(dot)ntt(dot)co(dot)jp>
Cc: tgl(at)sss(dot)pgh(dot)pa(dot)us, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [v9.2] make_greater_string() does not return a string in some cases
Date: 2011-10-23 02:16:58
Message-ID: CA+Tgmobhk5mWdoSjZFZ+qU_Z5tN2Gd8EVCBoraYydwXd3PtLpw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs pgsql-hackers

On Thu, Oct 20, 2011 at 9:36 PM, Kyotaro HORIGUCHI
<horiguchi(dot)kyotaro(at)oss(dot)ntt(dot)co(dot)jp> wrote:
> This must be the basis of the behavior of pg_utf8_verifier(), and
> pg_utf8_increment() has taken over it. It may be good to describe
> this origin of the special handling as comment of these functions
> to avoid this sort of confusion.

Oh, you know what? I'm misreading this code. *facepalm*

I thought that code was looking for 0xED/0xF4 in the second position,
but it's actually looking for them in the first position, which makes
vastly more sense. Whee!

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: horiguchi(dot)kyotaro(at)oss(dot)ntt(dot)co(dot)jp
To: robertmhaas(at)gmail(dot)com
Cc: tgl(at)sss(dot)pgh(dot)pa(dot)us, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [v9.2] make_greater_string() does not return a string in some cases
Date: 2011-10-29 17:16:03
Message-ID: 20111030.021603.01379645.horiguchi.kyotaro@horiguti.oss.ntt.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs pgsql-hackers

Hello, I feel at a loss what to do...

> I thought that code was looking for 0xED/0xF4 in the second position,
> but it's actually looking for them in the first position, which makes
> vastly more sense. Whee!

Anyway, I will try to describe another aspect of this code for now.

The switch block in pg_utf8_increment() folds together five
individual manipulations according to the byte length of the
sequence. The separation presupposes that the input bytes and
length form a valid UTF-8 sequence.

For a character of 5 or more bytes, it returns false.

For 4 bytes, the sequence ranges between U+10000 and U+1FFFFF.

If charptr[3] is less than 0xbf, increment it and return true.

Else assign 0x80 to charptr[3] and then, if charptr[2] is less
than 0xbf, increment it and return true.

Else assign 0x80 to charptr[2] and then,
if (charptr[1] is less than 0x8f when charptr[0] == 0xf4) or
(charptr[1] is less than 0xbf when charptr[0] != 0xf4),
increment it and return true.

Else assign 0x80 to charptr[1] and then, if charptr[0] is not
0xf4, increment it and return true.

Else the input sequence must be 0xf4 0x8f 0xbf 0xbf, which
represents U+10FFFF, the upper limit of UTF-8 representation.
Restore the sequence and return false.

For 3 bytes, the sequence ranges between U+800 and U+FFFF.

If charptr[2] is less than 0xbf, increment it and return true.

Else assign 0x80 to charptr[2] and then,
if (charptr[1] is less than 0x9f when charptr[0] == 0xed) or
(charptr[1] is less than 0xbf when charptr[0] != 0xed),
increment it and return true.

The sequence 0xed 0x9f 0xbf, which represents U+D7FF, ends up
being incremented to 0xee 0x80 0x80 (U+E000).

Else assign 0x80 to charptr[1] and then, if charptr[0] is not
0xef, increment it and return true.

Else the input sequence must be 0xef 0xbf 0xbf, which represents
U+FFFF, and the next code point needs a 4-byte UTF-8 sequence.
Restore the sequence and return false.

For 2 bytes, the sequence ranges between U+80 and U+7FF.

If charptr[1] is less than 0xbf, increment it and return true.

Else assign 0x80 to charptr[1] and then, if charptr[0] is not
0xdf, increment it and return true.

Else the input sequence must be 0xdf 0xbf, which represents
U+7FF, and the next code point needs a 3-byte UTF-8 sequence.
Restore the sequence and return false.

For 1 byte, the byte ranges between U+0 and U+7F.

If charptr[0] is less than 0x7f, increment it and return true.

Else the input must be 0x7f, which represents U+7F, and the next
code point needs a 2-byte UTF-8 sequence. Restore the sequence
and return false.
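
Condensed into C, the description above amounts to roughly the
following sketch (the function name and the compact brace style are
mine, for illustration only; this is not the text of the patch):

#include <stdbool.h>

static bool
utf8_increment_sketch(unsigned char *charptr, int length)
{
    switch (length)
    {
        case 4:
            if (charptr[3] < 0xbf) { charptr[3]++; return true; }
            charptr[3] = 0x80;
            if (charptr[2] < 0xbf) { charptr[2]++; return true; }
            charptr[2] = 0x80;
            if (charptr[1] < (charptr[0] == 0xf4 ? 0x8f : 0xbf))
                { charptr[1]++; return true; }
            charptr[1] = 0x80;
            if (charptr[0] < 0xf4) { charptr[0]++; return true; }
            /* was 0xf4 0x8f 0xbf 0xbf (U+10FFFF): restore and fail */
            charptr[1] = 0x8f;
            charptr[2] = charptr[3] = 0xbf;
            return false;
        case 3:
            if (charptr[2] < 0xbf) { charptr[2]++; return true; }
            charptr[2] = 0x80;
            if (charptr[1] < (charptr[0] == 0xed ? 0x9f : 0xbf))
                { charptr[1]++; return true; }
            charptr[1] = 0x80;
            if (charptr[0] < 0xef) { charptr[0]++; return true; }
            /* was 0xef 0xbf 0xbf (U+FFFF): restore and fail */
            charptr[1] = charptr[2] = 0xbf;
            return false;
        case 2:
            if (charptr[1] < 0xbf) { charptr[1]++; return true; }
            charptr[1] = 0x80;
            if (charptr[0] < 0xdf) { charptr[0]++; return true; }
            /* was 0xdf 0xbf (U+07FF): restore and fail */
            charptr[1] = 0xbf;
            return false;
        case 1:
            if (charptr[0] < 0x7f) { charptr[0]++; return true; }
            return false;           /* was 0x7f: no 1-byte successor */
        default:
            return false;           /* 5 bytes or more: give up */
    }
}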

--
Kyotaro Horiguchi


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: horiguchi(dot)kyotaro(at)oss(dot)ntt(dot)co(dot)jp
Cc: tgl(at)sss(dot)pgh(dot)pa(dot)us, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [v9.2] make_greater_string() does not return a string in some cases
Date: 2011-10-29 18:26:28
Message-ID: CA+TgmobMOox8xVcJ62OoqfH5CQyBEm-2z4SFywwC-VgAtZDE_A@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs pgsql-hackers

On Sat, Oct 29, 2011 at 1:16 PM, <horiguchi(dot)kyotaro(at)oss(dot)ntt(dot)co(dot)jp> wrote:
> Hello, I feel at a loss what to do...
>
>> I thought that code was looking for 0xED/0xF4 in the second position,
>> but it's actually looking for them in the first position, which makes
>> vastly more sense.  Whee!
>
> Anyway, I try to describe another aspect of this code a the present.

I've committed this, after a good deal of hacking on the comments,
some coding style cleanup, and one bug fix:

+ workstr[len] = '\0';

Without that, when we start fiddling with the second-to-last
character, the last one is still hanging around, which is different
than the old behavior.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: horiguchi(dot)kyotaro(at)oss(dot)ntt(dot)co(dot)jp, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [v9.2] make_greater_string() does not return a string in some cases
Date: 2011-10-29 19:35:27
Message-ID: 28652.1319916927@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> I've committed this, after a good deal of hacking on the comments,
> some coding style cleanup, and one bug fix:

Ummm ... why do the incrementer functions think they need to restore the
previous value on failure? AFAICS that's a waste of code and cycles,
since there is only one caller and it doesn't care in the least.

I'm also quite distressed that you ignored my advice to limit the number
of combinations tried. This patch could be horribly slow when dealing
with wide characters, eg think what will happen when starting from
U+10000.

regards, tom lane


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: horiguchi(dot)kyotaro(at)oss(dot)ntt(dot)co(dot)jp, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [v9.2] make_greater_string() does not return a string in some cases
Date: 2011-10-29 20:15:16
Message-ID: CA+Tgmobs+-oxMrps_rfOFwx_7FknFd2qwnwvGTSiv1tS=CXUrQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs pgsql-hackers

On Sat, Oct 29, 2011 at 3:35 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> I've committed this, after a good deal of hacking on the comments,
>> some coding style cleanup, and one bug fix:
>
> Ummm ... why do the incrementer functions think they need to restore the
> previous value on failure?  AFAICS that's a waste of code and cycles,
> since there is only one caller and it doesn't care in the least.

Well, it might not be strictly necessary for pg_utf8_increment() and
pg_eucjp_increment(), but it's clearly necessary for the generic
incrementer function for exactly the same reason it was needed in the
old coding. I suppose we could weaken the rule to "you must leave a
valid character behind rather than a bunch of bytes that doesn't
encode to a character", but the cycle savings are negligible and the
current rule seems both simpler and more bullet-proof.

> I'm also quite distressed that you ignored my advice to limit the number
> of combinations tried.  This patch could be horribly slow when dealing
> with wide characters, eg think what will happen when starting from
> U+10000.

Uh, I think it will try at most one character in that position and
then truncate away that character entirely, per my last email on this
topic (to which you never responded):

http://archives.postgresql.org/pgsql-hackers/2011-09/msg01195.php

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: horiguchi(dot)kyotaro(at)oss(dot)ntt(dot)co(dot)jp, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [v9.2] make_greater_string() does not return a string in some cases
Date: 2011-10-29 20:36:06
Message-ID: 29562.1319920566@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> On Sat, Oct 29, 2011 at 3:35 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> Ummm ... why do the incrementer functions think they need to restore the
>> previous value on failure? AFAICS that's a waste of code and cycles,
>> since there is only one caller and it doesn't care in the least.

> Well, it might not be strictly necessary for pg_utf8_increment() and
> pg_eucjp_increment(), but it's clearly necessary for the generic
> incrementer function for exactly the same reason it was needed in the
> old coding. I suppose we could weaken the rule to "you must leave a
> valid character behind rather than a bunch of bytes that doesn't
> encode to a character", but the cycle savings are negligible and the
> current rule seems both simpler and more bullet-proof.

No, it's *not* necessary any more, AFAICS. make_greater_string discards
the data and overwrites it with a null instantly upon getting a failure
return from the incrementer. The reason we used to need it was that we
did pg_mbcliplen after failing to increment, but now we do that before
we ever increment anything, so we already know the length of the last
character. It doesn't matter whether those bytes are still valid or
contain garbage.

>> I'm also quite distressed that you ignored my advice to limit the number
>> of combinations tried. This patch could be horribly slow when dealing
>> with wide characters, eg think what will happen when starting from
>> U+10000.

> Uh, I think it will try at most one character in that position and
> then truncate away that character entirely, per my last email on this
> topic (to which you never responded):
> http://archives.postgresql.org/pgsql-hackers/2011-09/msg01195.php

Oh! You are right, I was expecting it to try multiple characters at the
same position before truncating the string. This change seems to have
lobotomized things rather thoroughly. What is the rationale for that?
As an example, when dealing with a single-character string, it will fail
altogether if the next code value sorts out-of-order, so this seems to
me to be a rather large step backwards.

I think we ought to go back to the previous design of incrementing till
failure and then truncating, which puts the onus on the incrementer to
make a reasonable tradeoff of how many combinations to try per character
position. There's a simple tweak we could make to the patch to limit
that: once we've maxed out a lower-order byte of a multibyte char,
*don't reset it to minimum*, just move on to incrementing the next
higher byte. This preserves the old property that the maximum number of
combinations tried is bounded by 256 * string's length in bytes.
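
Sketched out, the proposed tweak looks roughly like this (purely
illustrative; byte_max() and string_is_greater() are stand-ins for
whatever the real code would use):

/* Walk the bytes of the last character, lowest order first. */
for (i = charlen - 1; i >= 0; i--)
{
    while (charptr[i] < byte_max(charptr, charlen, i))  /* per-byte ceiling */
    {
        charptr[i]++;
        if (string_is_greater(workstr, prefix))         /* collation check */
            return true;
    }

    /*
     * Byte i is maxed out: leave it there rather than resetting it to its
     * minimum, and go on to incrementing the next higher-order byte.
     */
}
return false;        /* at most ~256 values tried per byte, as before */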

regards, tom lane


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: horiguchi(dot)kyotaro(at)oss(dot)ntt(dot)co(dot)jp, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [v9.2] make_greater_string() does not return a string in some cases
Date: 2011-10-30 00:04:09
Message-ID: CA+TgmoZbXGOCYCBrb1T6w8sVDxOA4yP-B8CPtXv4FxxDwRWwtA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs pgsql-hackers

On Sat, Oct 29, 2011 at 4:36 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> Well, it might not be strictly necessary for pg_utf8_increment() and
>> pg_eucjp_increment(), but it's clearly necessary for the generic
>> incrementer function for exactly the same reason it was needed in the
>> old coding.  I suppose we could weaken the rule to "you must leave a
>> valid character behind rather than a bunch of bytes that doesn't
>> encode to a character", but the cycle savings are negligible and the
>> current rule seems both simpler and more bullet-proof.
>
> No, it's *not* necessary any more, AFAICS.  make_greater_string discards
> the data and overwrites it with a null instantly upon getting a failure
> return from the incrementer.  The reason we used to need it was that we
> did pg_mbcliplen after failing to increment, but now we do that before
> we ever increment anything, so we already know the length of the last
> character.  It doesn't matter whether those bytes are still valid or
> contain garbage.

Oh, dude. I think you are right. I guess we can rip that crap out
then. That's much cleaner.

>>> I'm also quite distressed that you ignored my advice to limit the number
>>> of combinations tried.  This patch could be horribly slow when dealing
>>> with wide characters, eg think what will happen when starting from
>>> U+10000.
>
>> Uh, I think it will try at most one character in that position and
>> then truncate away that character entirely, per my last email on this
>> topic (to which you never responded):
>> http://archives.postgresql.org/pgsql-hackers/2011-09/msg01195.php
>
> Oh!  You are right, I was expecting it to try multiple characters at the
> same position before truncating the string.  This change seems to have
> lobotomized things rather thoroughly.  What is the rationale for that?
> As an example, when dealing with a single-character string, it will fail
> altogether if the next code value sorts out-of-order, so this seems to
> me to be a rather large step backwards.
>
> I think we ought to go back to the previous design of incrementing till
> failure and then truncating, which puts the onus on the incrementer to
> make a reasonable tradeoff of how many combinations to try per character
> position.  There's a simple tweak we could make to the patch to limit
> that: once we've maxed out a lower-order byte of a multibyte char,
> *don't reset it to minimum*, just move on to incrementing the next
> higher byte.  This preserves the old property that the maximum number of
> combinations tried is bounded by 256 * string's length in bytes.

On this point I believe you are still confused. The old code tried
one character per position, and the new code tries one character per
position. Nothing has been lobotomized in any way. The difference is
that the old code used a "guess and check" approach to generate the
character, so there was an inner loop that was trying to generate a
character (possibly generating various garbage strings that did not
represent a character along the way) and then, upon success, checked
the sort order of that single string before truncating and retrying.
The new code does exactly the same thing in the outer loop - i.e.
truncate one character per iteration - but the inner loop has, at
least for UTF-8 and EUC-JP, been replaced with an algorithm that is
guaranteed to produce a valid character without needing to loop.

Now having said that, I think there is a possibility for some
improvement here. If we know we're not going to spend a lot of time
uselessly screwing around trying to get something that will pass
pg_verifymbstr(), then we could probably afford to call ltproc several
times per position, rather than just once. But that's not restoring
the behavior of the old algorithm; that's improving on the old
algorithm. And before we do that, we need to think about a couple of
things: first, that silly looping behavior is still there for anything
other than UTF-8 and EUC-JP. Until we have constant-time increment
functions for every encoding we support, we probably don't want to get
too jiggy with it; second, we need to convince ourselves that this
will succeed in a meaningful number of cases where the current
algorithm fails. I played around a bit with the UTF-8 case (with
collation = en_US.UTF-8) before committing this and I suspect that
trying 4 or 5 characters per position could be a win - you might for
example be looking at something like an accented E, and you might have
several different versions of E in a row before you get to something
that's no longer E-like. If you get to that point and still don't
find something that compares favorably, it's probably time to throw in
the towel. Another idea would be to do the first increment by one,
and then increment by two, four, eight before giving up, or something
like that. It probably needs some research, and frankly I'm happy to
leave it to someone who is having a real-world problem with it. The
fact that we haven't gotten any complaints before suggests that this
actually works decently well as it stands.
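
For the record, the "several tries per position" idea would look
something like this (illustrative only; MAX_TRIES_PER_POSITION and the
helper names are assumptions, not anything in the tree):

#define MAX_TRIES_PER_POSITION 5

for (tries = 0; tries < MAX_TRIES_PER_POSITION; tries++)
{
    if (!charinc(charptr, charlen))             /* encoding-specific incrementer */
        break;                                  /* no further successor exists */
    if (string_is_greater(workstr, prefix))     /* compare under the collation */
        return workstr;                         /* found a usable upper bound */
}
/* still nothing: truncate this character and move one position left */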

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: horiguchi(dot)kyotaro(at)oss(dot)ntt(dot)co(dot)jp, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [v9.2] make_greater_string() does not return a string in some cases
Date: 2011-10-30 14:58:53
Message-ID: 21986.1319986733@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> On Sat, Oct 29, 2011 at 4:36 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> Oh! You are right, I was expecting it to try multiple characters at the
>> same position before truncating the string. This change seems to have
>> lobotomized things rather thoroughly. What is the rationale for that?
>> As an example, when dealing with a single-character string, it will fail
>> altogether if the next code value sorts out-of-order, so this seems to
>> me to be a rather large step backwards.

> On this point I believe you are still confused. The old code tried
> one character per position, and the new code tries one character per
> position. Nothing has been lobotomized in any way.

No, on this point you are flat out wrong. Try something like

select ... where f1 like 'p%';

in tt_RU locale, wherein 'q' sorts between 'k' and 'l'. The old code
correctly found that 'r' works as a string greater than 'p'. The new
code fails to find a greater string, because it only tries 'q' and then
gives up. This results in a selectivity estimate much poorer than
necessary.

Since the stated purpose of this patch is to fix some corner cases
where the code fails to find a greater string, I fail to see why it's
acceptable to introduce some other corner cases that weren't there
before.

> The difference is
> that the old code used a "guess and check" approach to generate the
> character, so there was an inner loop that was trying to generate a
> character (possibly generating various garbage strings that did not
> represent a character along the way) and then, upon success, checked
> the sort order of that single string before truncating and retrying.

You are misreading the old code. The inner loop increments the last
byte, checks for success, and if it hasn't produced a greater string
then it loops around to increment again.

> The fact that we haven't gotten any complaints before suggests that this
> actually works decently well as it stands.

Well, that's true of the old algorithm ;-)

I had likewise thought of the idea of trying some fixed number of
character values at each position, but it's unclear to me why that's
better than allowing an encoding-specific behavior. I don't believe
that we could get away with trying less than a few dozen values, though.
For example, in a situation where case sensitivity is relevant, you
might need to increment past all the upper-case letters to get to a
suitable lower-case letter. I also think that it's probably useful to
try incrementing higher-order bytes of a multibyte character before
giving up --- we just can't afford to do an exhaustive search.
Thus my proposal to let the low-order bytes max out but not cycle.

regards, tom lane


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: horiguchi(dot)kyotaro(at)oss(dot)ntt(dot)co(dot)jp, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [v9.2] make_greater_string() does not return a string in some cases
Date: 2011-10-30 18:49:40
Message-ID: CA+TgmobobRMe6gC9S2ijS7=9HtptP1fX41AMV5y4L+b1PG6ApQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs pgsql-hackers

On Sun, Oct 30, 2011 at 10:58 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> You are misreading the old code.  The inner loop increments the last
> byte, checks for success, and if it hasn't produced a greater string
> then it loops around to increment again.

Ugh. You're right.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company