Re: Hash index todo list item

From: Brian Hurt <bhurt(at)janestcapital(dot)com>
To: Kenneth Marshall <ktm(at)rice(dot)edu>
Cc: pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: Hash index todo list item
Date: 2007-09-07 15:08:13
Message-ID: 46E1695D.5000909@janestcapital.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Kenneth Marshall wrote:

>>>
>>>
>>>
>>How likely is it that you will get a hash collision, two strings that are
>>different that will hash to the same value? To avoid this requires a very
>>large hash key (128 bits, minimum)- otherwise you get into birthday attack
>>problems. With a 32-bit hash, the likelyhood is greater than 50% that two
>>strings in a collection of 100,000 will hash to the same value. With a
>>64-bit hash, the likelyhood is greater than 50% that two strings in a
>>collection of 10 billion will has to same value. 10 billion is a large
>>number, but not an unreasonable number, of strings to want to put into a
>>hash table- and it's exactly this case where the O(1) cost of hashtables
>>starts being a real win.
>>
>>Brian
>>
>>
>>
>Yes, there is a non-negligible chance of collision (In a DB is there
>any chance that is non-negligible? :) ) and the values must be checked
>against the actual. The win is the collapse of the index size and only
>needed to check a small fraction of the actual tuples.
>
>
>
>

Ah, OK- I misunderstood you. I thought you were saying that the hash
values would need to be unique, and you wouldn't check the original
values at all. My bad.

Brian

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Dave Page 2007-09-07 15:14:06 Re: [FEATURE REQUEST] Streaming Onlinebackup (Maybe OFFTOPIC)
Previous Message Markus Schiltknecht 2007-09-07 15:01:08 terms for database replication: synchronous vs eager