Re: B-Tree support function number 3 (strxfrm() optimization)

From: Peter Geoghegan <pg(at)heroku(dot)com>
To: Greg Stark <stark(at)mit(dot)edu>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Stephen Frost <sfrost(at)snowman(dot)net>, Noah Misch <noah(at)leadboat(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Thom Brown <thom(at)linux(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: B-Tree support function number 3 (strxfrm() optimization)
Date: 2014-07-28 23:41:06
Message-ID: CAM3SWZQzsAUYFi2Wf9q3OXg+WuWzBu2zkPqVmROP5CGZm7xKtA@mail.gmail.com
Lists: pgsql-hackers

On Sun, Jul 27, 2014 at 12:34 PM, Peter Geoghegan <pg(at)heroku(dot)com> wrote:
> It's more or less testing for a primary weight level (i.e. the first
> part of the blob) that is no larger than the original characters of
> the string, and has no "header bytes" or other redundancies. It also
> matches secondary and subsequent weight levels to ensure that they
> match, since the two strings tested have identical case, use of
> diacritics, etc (they're both lowercase ASCII-safe strings). I don't
> set a locale, but that shouldn't matter.

Actually, come to think of it that might not quite be true. Consider
this output from Robert's strxfrm test program:

pg(at)hamster:~/code$ ./strxfrm hu_HU.utf8 potyty
"potyty" -> 2826303001090909090109090909 (14 bytes)
pg(at)hamster:~/code$ ./strxfrm hu_HU.utf8 potyta
"potyta" -> 2826302e0c010909090909010909090909 (17 bytes)

This is a very esoteric Hungarian collation rule [1], which at one
point we found we had to plaster over within varstr_cmp() to prevent
indexes giving wrong answers [2]. It turns out that with this
collation, strcoll("potyty", "potty") == 0. The point is that, in
principle, a collation can alter the number of weights that appear in
the primary level of the blob. That implies that the number of
primary-level bytes for the ASCII-safe string "abcdefgh" might not
equal that of "ijklmnop" under some collation, because of the
application of some similarly esoteric rule. In principle, collations
are at liberty to make that happen, even though it hardly ever
occurs in practice (we first heard about it in 2005, although the
Unicode algorithm standard warns of this), and even though in any of
the cases where it does occur, it probably happens not to affect my
little AC_TRY_RUN program. Even so, I'm not comfortable with this
deficiency in the program. I don't want my optimization to
accidentally not apply just because Postgres was built under some
hypothetical collation where this is true. It probably couldn't
happen, but I must admit that guaranteeing that it can't is a mess.

I suppose I could fix this by no longer assuming that the number of
bytes that appear in the primary level is fixed at n for a string of
n ASCII code points. I think that in theory even that could
break, though, because we have no principled way of parsing out the
different weight levels (the Unicode standard has some ideas about how
to do that given strxfrm()'s "no NUL bytes in blob" restriction, but
that's clearly implementation defined).
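
To make the concern concrete, here is a minimal sketch of the kind of
helper one could use to inspect blobs. It is an illustration only, not
Robert's test program and not the AC_TRY_RUN test from the patch; the
names xfrm_blob(), dump_blob() and same_blob_length() are made up for
this example. Because there is no portable way to isolate the primary
level, it compares total blob lengths, which is a coarser proxy.

```c
#include <assert.h>
#include <locale.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Return a malloc'd strxfrm() blob for "s" under the current LC_COLLATE
 * setting, and store its length (excluding the terminating NUL) in *len.
 * Returns NULL if memory cannot be allocated.  (Hypothetical helper.)
 */
char *
xfrm_blob(const char *s, size_t *len)
{
    size_t      need;
    char       *blob;

    assert(s != NULL && len != NULL);

    /* With n == 0 the destination may be NULL; the result is the size needed */
    need = strxfrm(NULL, s, 0);
    blob = malloc(need + 1);
    if (blob == NULL)
        return NULL;

    *len = strxfrm(blob, s, need + 1);
    return blob;
}

/* Print the blob as hex, in the same style as the output quoted above */
void
dump_blob(const char *s)
{
    size_t      len;
    size_t      i;
    char       *blob = xfrm_blob(s, &len);

    if (blob == NULL)
        return;
    printf("\"%s\" -> ", s);
    for (i = 0; i < len; i++)
        printf("%02x", (unsigned char) blob[i]);
    printf(" (%zu bytes)\n", len);
    free(blob);
}

/*
 * Returns 1 if both strings transform to blobs of the same total length,
 * 0 if the lengths differ, and -1 on allocation failure.  This checks
 * the whole blob, not just the primary weight level.
 */
int
same_blob_length(const char *a, const char *b)
{
    size_t      la = 0;
    size_t      lb = 0;
    char       *ba = xfrm_blob(a, &la);
    char       *bb = xfrm_blob(b, &lb);
    int         result = (ba == NULL || bb == NULL) ? -1 : (la == lb);

    free(ba);
    free(bb);
    return result;
}
```

A caller would first select a collation, for example with
setlocale(LC_COLLATE, "hu_HU.utf8"), and then call
same_blob_length("abcdefgh", "ijklmnop"). A result of 0 under some
collation would be an instance of the case I'm worried about. The
bytes that dump_blob() prints depend on the platform and the locale.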

Given that Mac OS X is the only platform that appears to have this
header byte problem at all, I think we'd be better off specifically
disabling the optimization on Mac OS X. I was very surprised to learn
of the problem there. Clearly it's going against the grain by having
the problem.

[1] http://www.postgresql.org/message-id/43A16BB7.7030606@mage.hu
[2] http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=656beff59033ccc5261a615802e1a85da68e8fad
--
Peter Geoghegan
