Re: [PATCHES] updated hash functions for postgresql v1

Lists: pgsql-hackers
From: Kenneth Marshall <ktm(at)rice(dot)edu>
To: pgsql-hackers(at)postgresql(dot)org, pgsql-patches(at)postgresql(dot)org
Subject: Re: [PATCHES] updated hash functions for postgresql v1
Date: 2008-11-04 20:26:55
Message-ID: 20081104202655.GP18362@it.is.rice.edu
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Sorry about the delay for this update to the new hash
index implementation. I was trying to get the WAL logging
in place and forgot to post the actual patch. The WAL
for hash indexes will need to wait for 8.5, but I did
want to add back in the piece of the Bob Jenkins 2006
hash function that was stripped out of the initial
patch on application due to concerns about the randomness
of the resulting hash values. Here is a re-post of my
initial findings comparing the old/new Jenkins hash
from lookup2 and lookup3. I have added a third column
containing the results for the hash_any() resulting
from the attached patch as well as simple timing test
for a DB reindex both before and after patching.

Also attached is a simple documentation patch updating
the note attached to the hash index description.

Regards,
Ken
----------------------------------------------------
Hi,

I have finally had a chance to do some investigation on
the performance of the old hash mix() function versus
the updated mix()/final() in the new hash function. Here
is a table of my current results for both the old and the
new hash function. In this case cracklib refers to the
cracklib-dict containing 1648379 unique words massaged
in various ways to generate input strings for the hash
functions. The result is the number of collisions in the
hash values generated.

hash input old new newv2
---------- --- --- -----
cracklib 338 316 338
cracklib x 2 (i.e. clibclib) 305 319 300
cracklib x 3 (clibclibclib) 323 329 315
cracklib x 10 302 310 329
cracklib x 100 350 335 298
cracklib x 1000 314 309 315
cracklib x 100 truncated to char(100) 311 327 320

uint32 from 1-1648379 309 319 347
(uint32 1-1948379)*256 309 314 304
(uint32 1-1948379)*16 310 314 324
"a"uint32 (i.e. a00001,a0002...) 320 321 312

uint32uint32 (i.e. uint64) 321 287 309

The different result columns are old = Jenkins 1996 hash
function(lookup2.c), new = Jenkins 2006 hash function
(lookup3.c), and newv2 = adaptation of current hash_any()
to incorporate the separate mix()/final() functions. As
you can see from the results, spliting the mix() and final()
apart does not result in any perceptible loss of randomness
in the hash assignment. I also ran a crude timing for a
reindex of the following database:

CREATE TABLE dict (word text);
CREATE INDEX wordhash ON dict USING hash (word);
INSERT INTO dict (word) VALUES('s;lhkjdpyoijxfg;lktjgh;sdlhkjo');
INSERT INTO dict (SELECT MAX(word)||MAX(word) FROM dict);
... (21 times)

REINDEX TABLE
...

The average time to reindex the table using our current
hash_any() without the separate mix()/final() was 1696ms
and 1482ms with the separate mix()/final() stages giving
almost 13% better performance for this stupid metric.

Attachment Content-Type Size
indices.sgml.patch text/plain 819 bytes
hashfunc.c.patch text/plain 6.0 KB

From: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
To: Kenneth Marshall <ktm(at)rice(dot)edu>
Cc: pgsql-hackers(at)postgresql(dot)org, pgsql-patches(at)postgresql(dot)org
Subject: Re: [PATCHES] updated hash functions for postgresql v1
Date: 2008-11-04 20:32:47
Message-ID: Pine.LNX.4.64.0811042330110.15810@sn.sai.msu.ru
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Just interested if you repeat your tests not with cracklib-dict,
but using 8-bit words. From our experience we found many hash functions
are optimized for 7-bit words and produce too many collisions
for 8-bit words. That's why we use crc32.

Oleg
On Tue, 4 Nov 2008, Kenneth Marshall wrote:

> Sorry about the delay for this update to the new hash
> index implementation. I was trying to get the WAL logging
> in place and forgot to post the actual patch. The WAL
> for hash indexes will need to wait for 8.5, but I did
> want to add back in the piece of the Bob Jenkins 2006
> hash function that was stripped out of the initial
> patch on application due to concerns about the randomness
> of the resulting hash values. Here is a re-post of my
> initial findings comparing the old/new Jenkins hash
> from lookup2 and lookup3. I have added a third column
> containing the results for the hash_any() resulting
> from the attached patch as well as simple timing test
> for a DB reindex both before and after patching.
>
> Also attached is a simple documentation patch updating
> the note attached to the hash index description.
>
> Regards,
> Ken
> ----------------------------------------------------
> Hi,
>
> I have finally had a chance to do some investigation on
> the performance of the old hash mix() function versus
> the updated mix()/final() in the new hash function. Here
> is a table of my current results for both the old and the
> new hash function. In this case cracklib refers to the
> cracklib-dict containing 1648379 unique words massaged
> in various ways to generate input strings for the hash
> functions. The result is the number of collisions in the
> hash values generated.
>
> hash input old new newv2
> ---------- --- --- -----
> cracklib 338 316 338
> cracklib x 2 (i.e. clibclib) 305 319 300
> cracklib x 3 (clibclibclib) 323 329 315
> cracklib x 10 302 310 329
> cracklib x 100 350 335 298
> cracklib x 1000 314 309 315
> cracklib x 100 truncated to char(100) 311 327 320
>
> uint32 from 1-1648379 309 319 347
> (uint32 1-1948379)*256 309 314 304
> (uint32 1-1948379)*16 310 314 324
> "a"uint32 (i.e. a00001,a0002...) 320 321 312
>
> uint32uint32 (i.e. uint64) 321 287 309
>
> The different result columns are old = Jenkins 1996 hash
> function(lookup2.c), new = Jenkins 2006 hash function
> (lookup3.c), and newv2 = adaptation of current hash_any()
> to incorporate the separate mix()/final() functions. As
> you can see from the results, spliting the mix() and final()
> apart does not result in any perceptible loss of randomness
> in the hash assignment. I also ran a crude timing for a
> reindex of the following database:
>
> CREATE TABLE dict (word text);
> CREATE INDEX wordhash ON dict USING hash (word);
> INSERT INTO dict (word) VALUES('s;lhkjdpyoijxfg;lktjgh;sdlhkjo');
> INSERT INTO dict (SELECT MAX(word)||MAX(word) FROM dict);
> ... (21 times)
>
> REINDEX TABLE
> ...
>
> The average time to reindex the table using our current
> hash_any() without the separate mix()/final() was 1696ms
> and 1482ms with the separate mix()/final() stages giving
> almost 13% better performance for this stupid metric.
>

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg(at)sai(dot)msu(dot)su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83


From: Kenneth Marshall <ktm(at)rice(dot)edu>
To: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
Cc: pgsql-hackers(at)postgresql(dot)org, pgsql-patches(at)postgresql(dot)org
Subject: Re: [PATCHES] updated hash functions for postgresql v1
Date: 2008-11-04 21:15:44
Message-ID: 20081104211544.GQ18362@it.is.rice.edu
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Nov 04, 2008 at 11:32:47PM +0300, Oleg Bartunov wrote:
> Just interested if you repeat your tests not with cracklib-dict,
> but using 8-bit words. From our experience we found many hash functions
> are optimized for 7-bit words and produce too many collisions
> for 8-bit words. That's why we use crc32.
>
> Oleg

I think that the lines:

uint32 from 1-1648379 309 319 347
(uint32 1-1948379)*256 309 314 304
(uint32 1-1948379)*16 310 314 324
"a"uint32 (i.e. a00001,a0002...) 320 321 312

uint32uint32 (i.e. uint64) 321 287 309

can count as 8-bit words if taken a byte at a time. In fact
that is how hash_any() treats them, as a character string
and a length. One of the design goals of the original 1997
hash function in lookup2 and the 2006 update in lookup3 is
to support keys of arbitrary arrangements of bits. I can run
any additional checks that you want since the test harness
is perl with Inline::C. If you are using crc32 his article in
Dr. Dobbs shows that CRC has a "2 into 1" funnel-15 and an
"11 into 10" funnel-100 unless you are using a generalized
CRC. Also, unless you can inline your CRC the Jenkins lookup3
is 5n+20 where CRC is 9n+3.

Regards,
Ken

> On Tue, 4 Nov 2008, Kenneth Marshall wrote:
>
>> Sorry about the delay for this update to the new hash
>> index implementation. I was trying to get the WAL logging
>> in place and forgot to post the actual patch. The WAL
>> for hash indexes will need to wait for 8.5, but I did
>> want to add back in the piece of the Bob Jenkins 2006
>> hash function that was stripped out of the initial
>> patch on application due to concerns about the randomness
>> of the resulting hash values. Here is a re-post of my
>> initial findings comparing the old/new Jenkins hash
>> from lookup2 and lookup3. I have added a third column
>> containing the results for the hash_any() resulting
>> from the attached patch as well as simple timing test
>> for a DB reindex both before and after patching.
>>
>> Also attached is a simple documentation patch updating
>> the note attached to the hash index description.
>>
>> Regards,
>> Ken
>> ----------------------------------------------------
>> Hi,
>>
>> I have finally had a chance to do some investigation on
>> the performance of the old hash mix() function versus
>> the updated mix()/final() in the new hash function. Here
>> is a table of my current results for both the old and the
>> new hash function. In this case cracklib refers to the
>> cracklib-dict containing 1648379 unique words massaged
>> in various ways to generate input strings for the hash
>> functions. The result is the number of collisions in the
>> hash values generated.
>>
>> hash input old new newv2
>> ---------- --- --- -----
>> cracklib 338 316 338
>> cracklib x 2 (i.e. clibclib) 305 319 300
>> cracklib x 3 (clibclibclib) 323 329 315
>> cracklib x 10 302 310 329
>> cracklib x 100 350 335 298
>> cracklib x 1000 314 309 315
>> cracklib x 100 truncated to char(100) 311 327 320
>>
>> uint32 from 1-1648379 309 319 347
>> (uint32 1-1948379)*256 309 314 304
>> (uint32 1-1948379)*16 310 314 324
>> "a"uint32 (i.e. a00001,a0002...) 320 321 312
>>
>> uint32uint32 (i.e. uint64) 321 287 309
>>
>> The different result columns are old = Jenkins 1996 hash
>> function(lookup2.c), new = Jenkins 2006 hash function
>> (lookup3.c), and newv2 = adaptation of current hash_any()
>> to incorporate the separate mix()/final() functions. As
>> you can see from the results, spliting the mix() and final()
>> apart does not result in any perceptible loss of randomness
>> in the hash assignment. I also ran a crude timing for a
>> reindex of the following database:
>>
>> CREATE TABLE dict (word text);
>> CREATE INDEX wordhash ON dict USING hash (word);
>> INSERT INTO dict (word) VALUES('s;lhkjdpyoijxfg;lktjgh;sdlhkjo');
>> INSERT INTO dict (SELECT MAX(word)||MAX(word) FROM dict);
>> ... (21 times)
>>
>> REINDEX TABLE
>> ...
>>
>> The average time to reindex the table using our current
>> hash_any() without the separate mix()/final() was 1696ms
>> and 1482ms with the separate mix()/final() stages giving
>> almost 13% better performance for this stupid metric.
>>
>
> Regards,
> Oleg
> _____________________________________________________________
> Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
> Sternberg Astronomical Institute, Moscow University, Russia
> Internet: oleg(at)sai(dot)msu(dot)su, http://www.sai.msu.su/~megera/
> phone: +007(495)939-16-83, +007(495)939-23-83
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers(at)postgresql(dot)org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers
>


From: Kenneth Marshall <ktm(at)rice(dot)edu>
To: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
Cc: pgsql-hackers(at)postgresql(dot)org, pgsql-patches(at)postgresql(dot)org
Subject: Re: [PATCHES] updated hash functions for postgresql v1
Date: 2008-11-04 21:21:26
Message-ID: 20081104212126.GR18362@it.is.rice.edu
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Oleg,

Here is a little more information on the use of CRC32 as
a hash function, with some warning caveats:

http://home.comcast.net/~bretm/hash/8.html

Regards,
Ken

On Tue, Nov 04, 2008 at 03:15:44PM -0600, Kenneth Marshall wrote:
> On Tue, Nov 04, 2008 at 11:32:47PM +0300, Oleg Bartunov wrote:
> > Just interested if you repeat your tests not with cracklib-dict,
> > but using 8-bit words. From our experience we found many hash functions
> > are optimized for 7-bit words and produce too many collisions
> > for 8-bit words. That's why we use crc32.
> >
> > Oleg
>
> I think that the lines:
>
> uint32 from 1-1648379 309 319 347
> (uint32 1-1948379)*256 309 314 304
> (uint32 1-1948379)*16 310 314 324
> "a"uint32 (i.e. a00001,a0002...) 320 321 312
>
> uint32uint32 (i.e. uint64) 321 287 309
>
> can count as 8-bit words if taken a byte at a time. In fact
> that is how hash_any() treats them, as a character string
> and a length. One of the design goals of the original 1997
> hash function in lookup2 and the 2006 update in lookup3 is
> to support keys of arbitrary arrangements of bits. I can run
> any additional checks that you want since the test harness
> is perl with Inline::C. If you are using crc32 his article in
> Dr. Dobbs shows that CRC has a "2 into 1" funnel-15 and an
> "11 into 10" funnel-100 unless you are using a generalized
> CRC. Also, unless you can inline your CRC the Jenkins lookup3
> is 5n+20 where CRC is 9n+3.
>
> Regards,
> Ken