Re: B-Tree support function number 3 (strxfrm() optimization)

From: Peter Geoghegan <pg(at)heroku(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Stephen Frost <sfrost(at)snowman(dot)net>, Greg Stark <stark(at)mit(dot)edu>, Noah Misch <noah(at)leadboat(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Thom Brown <thom(at)linux(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: B-Tree support function number 3 (strxfrm() optimization)
Date: 2014-08-09 00:11:15
Message-ID: CAM3SWZSMA5moFmb0xgr3x-1dOoFtyCXiXDBiH2vBp=eFz_Mtow@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Sun, Jul 27, 2014 at 12:00 AM, Peter Geoghegan <pg(at)heroku(dot)com> wrote:
> Robert pointed out a case where varying character case of an English
> word did not alter the primary level bytes (and thus the poor man's
> normalized key was the same). He acknowledged that most of the entropy
> of the first 8 bytes of the string went into the first 8 bytes of the
> blob/key. This can actually be very useful to the optimization in some
> cases. In particular, with most Latin alphabets you'll see the same
> pattern when diacritics are used. This means that even though the
> original string has (say) an accented character that would take 2
> bytes to store in UTF-8, the weight in the primary level is the same
> as an equivalent unaccented character (and so only takes one byte to
> store at that level, with differences only in subsequent levels).
> Whole strxfrm() blobs are the same length regardless of how many
> accents appear in otherwise equivalent original Latin string, and you
> get exactly the same high concentrations of entropy in the first 8
> bytes in pretty much all Latin alphabets (the *absence* of diacritics
> is stored in later weight levels too, even with the "en_US.UTF-8"
> collation).

There are many other interesting cases where en_US.UTF-8, and
presumably all other collations concentrate much more entropy into
leading bytes than might be expected. Consider:

pg(at)hamster:~/code$ ./strxfrm-binary en_US.UTF-8 "abc"
"abc" -> 0c0d0e0109090901090909 (11 bytes)
pg(at)hamster:~/code$ ./strxfrm-binary en_US.UTF-8 "# abc"
"# abc" -> 0c0d0e01090909010909090101760135 (16 bytes)
pg(at)hamster:~/code$ ./strxfrm-binary en_US.UTF-8 "***** abc"
"***** abc" -> 0c0d0e010909090109090901017301730173017301730135 (24 bytes)

and, to show you what this looks like when the primary
weights/original codepoints appear backwards:

pg(at)hamster:~/code$ ./strxfrm-binary en_US.UTF-8 "cba"
"cba" -> 0e0d0c0109090901090909 (11 bytes)
pg(at)hamster:~/code$ ./strxfrm-binary en_US.UTF-8 "# cba"
"# cba" -> 0e0d0c01090909010909090101760135 (16 bytes)
pg(at)hamster:~/code$ ./strxfrm-binary en_US.UTF-8 "***** cba"
"***** cba" -> 0e0d0c010909090109090901017301730173017301730135 (24 bytes)

Evidently, the implementation always places primary weights first,
corresponding to "abc" (and later "cba") - the bytes "\0c\0d\0e" (and
later "\0e\0d\0c") - no matter how many "extraneous" characters are
placed in front. They're handled later. Space don't appear in the
primary weight level at all:

pg(at)hamster:~/code$ ./strxfrm-binary en_US.UTF-8 "a b c"
"a b c" -> 0c0d0e01090909010909090102350235 (16 bytes)

Lots of punctuation-type characters will not affect the primary weight level:

pg(at)hamster:~/code$ ./strxfrm-binary en_US.UTF-8 "%(at)!()\/#-+,:^~? a b c"
"%(at)!()\/#-+,:^~? a b c" ->
0c0d0e0109090901090909010177015d013e01500152017401420176013a0178013b013d011201170140013502350235
(48 bytes)

Some non-alphabetic ASCII characters will affect the primary level,
though. For example:

pg(at)hamster:~/code$ ./strxfrm-binary en_US.UTF-8 "1.) a b c"
"1.) a b c" -> 030c0d0e010909090901090909090102440152013502350235 (25 bytes)

There is one extra byte here, in front of the "abc" bytes "\0c\0d\0e",
a primary weight "\03" corresponding to '1'.

--
Peter Geoghegan

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2014-08-09 00:13:33 Re: Defining a foreign key with a duplicate column is broken
Previous Message Larry White 2014-08-08 23:19:02 Re: jsonb format is pessimal for toast compression