From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Making hash indexes worthwhile
Date: 2009-10-05 00:41:17
Message-ID: f67928030910041741n7f9b6bd8u9ef3b9fa0852ba6@mail.gmail.com
Lists: pgsql-hackers

I see that the docs were recently changed from discouraging hash
indexes both because there were no known uses in which they were
meaningfully better than btree and because they aren't recoverable,
to discouraging them only because they are not recoverable. Does
that mean there are now uses in which hash indexes are provably better
than btree, if one is willing to overlook the lack of recoverability?
If so, what are those situations?

I've played around a bit with hash indexes, and it seems to me that
making them generally worthwhile will take (at least) reducing or
entirely doing away with the heavyweight locks.

I have a few ideas, which are not necessarily independent of each other.

===metablock===
If each bucket knows how many bits of the hash value are significant
for the entries in that bucket, then there is a way to detect races
between the metablock and the primary bucket page. Rather than
chaining locks from the metablock to the bucket, the process can
compute the bucket and remember the number of significant bits for
that bucket according to the metablock, then release the metablock
lock before attempting to acquire the bucket lock. Once the bucket
lock is obtained, if the bucket's number of significant bits is
greater than the number remembered from the metablock, the process
can drop the bucket lock and try again.
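
For concreteness, a minimal sketch of that retry loop follows. The
helper functions (hash_read_meta_shared, hash_bucket_local_bits, and
so on) are invented names for whatever buffer-access and locking calls
the real code would use; only the types are existing ones.

    /*
     * Hypothetical sketch: map a hash value to its bucket without
     * chaining locks from the metablock to the bucket.  The helpers
     * are invented for illustration only.
     */
    static Buffer
    hash_pin_bucket_for_hashval(Relation rel, uint32 hashval, Bucket *bucket_out)
    {
        for (;;)
        {
            Buffer      metabuf;
            Buffer      bucketbuf;
            Bucket      bucket;
            uint32      meta_bits;

            /* Lock the metablock only long enough to map hashval -> bucket. */
            metabuf = hash_read_meta_shared(rel);
            bucket = hash_hashval_to_bucket(metabuf, hashval);
            meta_bits = hash_meta_bits_for_bucket(metabuf, bucket);
            hash_release_meta(metabuf);        /* released before bucket lock */

            /* Now take the bucket's primary page lock on its own. */
            bucketbuf = hash_read_bucket_shared(rel, bucket);

            /*
             * If the bucket still uses no more bits than the metablock said,
             * the mapping is valid and we keep the bucket lock.
             */
            if (hash_bucket_local_bits(bucketbuf) <= meta_bits)
            {
                *bucket_out = bucket;
                return bucketbuf;
            }

            /* The bucket split after we read the metablock; retry. */
            hash_release_bucket(bucketbuf);
        }
    }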

Once you don't need to chain locks from the metablock to the bucket,
the lock on the metablock can probably be lowered to just a buffer
content lock rather than a heavyweight lock. Even better, maybe the
metablock can be cached in the relcache. Since the changes to the
metablock seem to be entirely unidirectional (other than things like
dropping and rebuilding the index, which I think the normal relcache
flushing mechanisms will take care of), it should always be possible
to detect when you have a too-stale copy and need to go get the
current one.
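
A rough sketch of that cached-copy variant, under the assumption of a
hypothetical rd_amcache-style private copy of the metapage reached
through invented get_cached_metapage()/refresh_cached_metapage()
accessors. Because the bucket count only grows, a stale copy can only
direct us to a bucket that reports more significant bits than the copy
predicted, which is exactly the retry condition from the previous
sketch:

    /*
     * Hypothetical sketch: same retry loop, but driven from a relcache-
     * cached copy of the metablock instead of locking the metablock at
     * all.  The accessor names are invented.
     */
    static Buffer
    hash_pin_bucket_cached(Relation rel, uint32 hashval, Bucket *bucket_out)
    {
        HashMetaPageData *meta = get_cached_metapage(rel);  /* may be stale */

        for (;;)
        {
            Bucket      bucket = hashval_to_bucket(meta, hashval);
            uint32      meta_bits = bits_for_bucket(meta, bucket);
            Buffer      bucketbuf = hash_read_bucket_shared(rel, bucket);

            if (hash_bucket_local_bits(bucketbuf) <= meta_bits)
            {
                *bucket_out = bucket;
                return bucketbuf;       /* caller still holds the bucket lock */
            }

            /* Bucket knows more bits than our copy: the cache is too stale. */
            hash_release_bucket(bucketbuf);
            meta = refresh_cached_metapage(rel);  /* re-read the real metablock */
        }
    }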

===get rid of heavyweight locks===
While there are several potential places for deadlock, it seems like
the main one is that the hash index scan code returns control to the
executor while holding locks on the hash buckets/pages, so you can get
hybrid deadlocks where one side of the blockage is in hash-index code
and the other in executor code. The obvious way to fix this would be
to read an entire bucket's worth of qualifying tids into local memory
and release the bucket lock before returning control to the executor,
kind of like btree does with pages. The main problem here is that a
logical bucket can have an unbounded number of overflow pages, so if a
bucket has too many tids with the sought-for full hash value (more
than will fit in work_mem), we would have to be prepared to spill to
disk. Is that an acceptable trade-off? Obviously there are a lot of
subtleties to work out around insertions and bucket splits, but if
having read-only scans occasionally spill to disk is a no-go, then
there is no point in trying to work out the rest.
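
To make the trade-off concrete, the scan shape might look roughly like
the sketch below. The HashScanTidList structure and the page-walking
helpers are invented; the only point is that every qualifying tid is
copied out (spilling to a temporary file once work_mem is exceeded)
before control goes back to the executor, with no index lock held
afterwards.

    /*
     * Hypothetical sketch of a bucket scan that never returns to the
     * executor while holding an index lock.  HashScanTidList and the
     * page-walking helpers are invented for illustration.
     */
    typedef struct HashScanTidList
    {
        ItemPointerData *tids;      /* in-memory portion of the matching tids */
        int         ntids;
        int         maxtids;        /* sized from work_mem */
        BufFile    *spillfile;      /* non-NULL once work_mem is exceeded */
    } HashScanTidList;

    static void
    hash_collect_bucket_tids(Relation rel, uint32 hashval, HashScanTidList *list)
    {
        Bucket      bucket;
        Buffer      buf = hash_pin_bucket_for_hashval(rel, hashval, &bucket);

        /* Walk the primary page and every overflow page in the chain. */
        while (BufferIsValid(buf))
        {
            /* Copies tids matching hashval; spills to list->spillfile if full. */
            collect_matching_tids_from_page(buf, hashval, list);

            /*
             * Releases the current page and returns the next overflow page,
             * or InvalidBuffer at the end of the chain.
             */
            buf = hash_step_to_next_page(rel, buf);
        }

        /*
         * No index lock is held from here on, so the executor can fetch
         * heap tuples from the collected tids without any risk of a hybrid
         * hash-index/executor deadlock.
         */
    }

Whether the spill path would use a BufFile, a tuplestore, or something
else entirely is a detail; the question about whether spilling is
acceptable at all stands either way.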

Thanks for any feedback,

Jeff
