From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Making hash indexes worthwhile
Date: 2009-10-05 00:41:17
Message-ID: f67928030910041741n7f9b6bd8u9ef3b9fa0852ba6@mail.gmail.com
Lists: pgsql-hackers

I see that the docs were recently changed from discouraging hash
indexes both because there were no known uses in which they were
meaningfully better than btree, and because they aren't recoverable;
to now just discouraging them because they are not recoverable. Does
that mean there are now uses in which hash indexes are provably better
than btree if one is willing to overlook the lack of recoverability?
If so, what are those situations?

I've played around a bit with hash indexes, and it seems to me that
making them generally worthwhile will take (at least) reducing or
entirely doing away with the heavy-weight locks.

I have a few ideas, which are not necessarily independent of each other.

===metablock===
If each bucket knows how many bits of the hash-value are significant
for the entries in that bucket, then there will be a way to detect
races from the metablock to the main bucket block. Rather than
chaining locks from the metablock to the bucket, the process can
compute the bucket and remember the number of significant bits in that
bucket according to the metablock, then release the metablock lock
before attempting to acquire the bucket lock. When the bucket lock
is obtained, if the number of bits is greater than the number
remembered from the metablock, then it can drop the bucket lock and
try again.
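
Here is a rough sketch of that retry protocol, with made-up names
standing in for the real pieces: the per-bucket count of significant
bits is the new field this would require, and meta_lookup plus the
bucket lock/unlock calls stand in for the metapage and bucket-page
accesses.

#include <stdint.h>

typedef struct MetaSnapshot
{
    uint32_t    bucket;         /* bucket computed for the hash key */
    uint32_t    bucket_bits;    /* significant hash bits the metapage
                                 * claimed that bucket was using */
} MetaSnapshot;

/* Lock the metapage briefly, map hashkey to a bucket, note the bit
 * count recorded for that bucket, then release the metapage lock
 * before returning. */
extern MetaSnapshot meta_lookup(uint32_t hashkey);

/* Lock the primary page of a bucket and return the significant-bit
 * count stored in that bucket's header (the proposed new field). */
extern uint32_t bucket_lock_and_get_bits(uint32_t bucket);
extern void bucket_unlock(uint32_t bucket);

/* Find the bucket that currently owns hashkey, without ever holding
 * the metapage lock while waiting for a bucket lock.  On return the
 * caller holds the bucket lock. */
uint32_t
bucket_for_hashkey(uint32_t hashkey)
{
    for (;;)
    {
        MetaSnapshot snap = meta_lookup(hashkey);
        uint32_t    actual_bits = bucket_lock_and_get_bits(snap.bucket);

        if (actual_bits <= snap.bucket_bits)
            return snap.bucket;     /* no split intervened */

        /* The bucket split after we read the metapage; our key may
         * have moved to a newer bucket, so drop the lock and start
         * over. */
        bucket_unlock(snap.bucket);
    }
}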

Once you don't need to chain locks from the metablock to the bucket,
the lock on the metablock can probably be lowered to just a buffer
content lock rather than a heavy lock. Even better, maybe the
metablock can be cached in the relcache. Since the changes to the
metablock seem to be entirely unidirectional (other than things like
index dropping and rebuilding, which I think maybe the normal relcache
flushing mechanisms will take care of), it should always be possible
to detect when you have a too-stale copy and have to go get the
current one.
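
Continuing the sketch above, a backend-local copy of the metapage
(standing in for a relcache-attached one) slots into the same loop;
assuming the metapage really does only ever advance, the only way a
stale copy can mislead us is toward a bucket that has since split, and
the same bit-count check catches that:

/* Backend-local cache of the metapage contents, read without touching
 * the real metapage at all. */
extern MetaSnapshot cached_meta_lookup(uint32_t hashkey);

/* Re-read the real metapage (under its short lock) into the cache. */
extern void refresh_cached_meta(void);

uint32_t
bucket_for_hashkey_cached(uint32_t hashkey)
{
    for (;;)
    {
        MetaSnapshot snap = cached_meta_lookup(hashkey);
        uint32_t    actual_bits = bucket_lock_and_get_bits(snap.bucket);

        if (actual_bits <= snap.bucket_bits)
            return snap.bucket;     /* cached copy was recent enough */

        /* The cache was too stale: the bucket has split since the copy
         * was taken.  Refresh and retry. */
        bucket_unlock(snap.bucket);
        refresh_cached_meta();
    }
}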

===get rid of heavy locks===
While there are several potential places for deadlock, it seems like
the main one is that the hash index scan code returns control to the
executor while holding locks on the hash buckets/pages, so you can get
hybrid deadlocks where one side of the blockage is in hash-code space
and the other in executor-space. The obvious way to fix this would be
to read the entire bucket worth of qualifying tids into local memory
and release the bucket lock before returning control to the executor,
kind of like btree does with pages. The main problem here is that a
logical bucket can have an unbounded number of overflow pages. So if
a bucket has too many tids with the sought-for full hash value (more
than will fit in work_mem), we would have to be prepared to spill to
disk. Is that an acceptable trade-off? Obviously there are a lot of
subtleties to work out with insertions and bucket-splits, but if
having read-only scans occasionally need to spill to disk is a no-go
then there is no point in trying to work the other things out.
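
Something like the following is what I have in mind, again with
made-up names, and with the spill path only hinted at in a comment
since that is exactly the open question:

#include <stdint.h>
#include <stddef.h>
#include <stdbool.h>

/* Stand-in for a heap item pointer (block + offset). */
typedef struct TidStandIn
{
    uint32_t    block;
    uint16_t    offset;
} TidStandIn;

/* Append to *tids (growing it as needed) every TID on the given page
 * of the bucket chain whose stored hash code equals hashcode; return
 * false when there is no further overflow page. */
extern bool scan_bucket_page(uint32_t bucket, uint32_t pageno,
                             uint32_t hashcode,
                             TidStandIn **tids, size_t *ntids,
                             size_t *cap);

/* Same lock stand-ins as in the earlier sketch. */
extern uint32_t bucket_lock_and_get_bits(uint32_t bucket);
extern void bucket_unlock(uint32_t bucket);

/* Collect every qualifying TID from the bucket (primary page plus any
 * overflow pages) while the bucket lock is held, then release the
 * lock before anything is handed back to the executor. */
size_t
collect_bucket_matches(uint32_t bucket, uint32_t hashcode,
                       TidStandIn **out)
{
    size_t      ntids = 0;
    size_t      cap = 0;
    uint32_t    pageno = 0;

    *out = NULL;

    /* in the full protocol we would also verify the bit count here */
    (void) bucket_lock_and_get_bits(bucket);

    while (scan_bucket_page(bucket, pageno++, hashcode, out, &ntids, &cap))
    {
        /* If ntids grows past what work_mem allows, this is the point
         * where the TID list would have to start spilling to disk --
         * the trade-off being asked about above. */
    }

    bucket_unlock(bucket);  /* the executor never sees a held bucket lock */
    return ntids;
}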

Thanks for any feedback,

Jeff


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Making hash indexes worthwhile
Date: 2009-10-05 02:13:28
Message-ID: 17307.1254708808@sss.pgh.pa.us
Lists: pgsql-hackers

Jeff Janes <jeff(dot)janes(at)gmail(dot)com> writes:
> I see that the docs were recently changed from discouraging hash
> indexes both because there were no known uses in which they were
> meaningfully better than btree, and because they aren't recoverable;
> to now just discouraging them because they are not recoverable. Does
> that mean there are now uses in which hash indexes are provably better
> than btree if one is willing to overlook the lack of recoverability?
> If so, what are those situations?

One reason is that you can index values that are too big for btrees;
since hash indexes now store only the hash codes, they don't have a hard
length limit on the underlying value. I'm not sure how useful that
really is in practice, but it's at least an argument for considering
them in certain cases.
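
For illustration only (a made-up struct, not the real on-disk layout):
the entry stored in the index is fixed-size no matter how large the
key is, which is also why equality has to be rechecked against the
heap tuple.

#include <stdint.h>

/* Made-up layout: a fixed-size hash code plus heap pointer, so the
 * length of the indexed value never matters.  Distinct values can
 * share a code, hence the recheck. */
typedef struct HashEntryIllustration
{
    uint32_t    hashcode;       /* the key itself is never stored */
    uint32_t    heap_block;     /* stand-in for the 6-byte heap TID */
    uint16_t    heap_offset;
} HashEntryIllustration;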

> I've played around a bit with hash indexes, and it seems to me that
> making them generally worthwhile will take (at least) reducing or
> entirely doing away with the heavy-weight locks.

Concurrency is really the least of the issues for hash indexes. So far
it's not clear that they're fast enough even in sequential use ...

regards, tom lane


From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Making hash indexes worthwhile
Date: 2009-10-05 06:05:01
Message-ID: f67928030910042305t3e49080cy905beeca9999bf01@mail.gmail.com
Lists: pgsql-hackers

On Sun, Oct 4, 2009 at 7:13 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Jeff Janes <jeff(dot)janes(at)gmail(dot)com> writes:
>
>> I've played around a bit with hash indexes, and it seems to me that
>> making them generally worthwhile will take (at least) reducing or
>> entirely doing away with the heavy-weight locks.
>
> Concurrency is really the least of the issues for hash indexes.  So far
> it's not clear that they're fast enough even in sequential use ...

Do you know why that should be? I've done some work with gprof, and
the results are pretty suspect, because the total gprof time adds up
to only about 1/3 of the total time the backend spends on CPU
(according to "top"), and I don't know where the unaccounted for time
is going. But a good part of the accounted-for time seems to be
associated with the locking system, even when there is only one active
backend.

Cheers,

Jeff


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Making hash indexes worthwhile
Date: 2009-10-05 13:49:43
Message-ID: 6429.1254750583@sss.pgh.pa.us
Lists: pgsql-hackers

Jeff Janes <jeff(dot)janes(at)gmail(dot)com> writes:
> Do you know why that should be? I've done some work with gprof, and
> the results are pretty suspect, because the total gprof time adds up
> to only about 1/3 of the total time the backend spends on CPU
> (according to "top"), and I don't know where the unaccounted for time
> is going.

Are you sure that gprof is delivering trustworthy numbers at all?
I've seen cases where it was consistently mis-timing things.
https://bugzilla.redhat.com/show_bug.cgi?id=151763
Admittedly that was an old Linux version, but ...

regards, tom lane


From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Making hash indexes worthwhile
Date: 2009-10-07 00:29:48
Message-ID: f67928030910061729k6395099erebbcda9189720b55@mail.gmail.com
Lists: pgsql-hackers

On Mon, Oct 5, 2009 at 6:49 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Jeff Janes <jeff(dot)janes(at)gmail(dot)com> writes:
>> Do you know why that should be?  I've done some work with gprof, and
>> the results are pretty suspect, because the total gprof time adds up
>> to only about 1/3 of the total time the backend spends on CPU
>> (according to "top"), and I don't know where the unaccounted for time
>> is going.
>
> Are you sure that gprof is delivering trustworthy numbers at all?
> I've seen cases where it was consistently mis-timing things.
> https://bugzilla.redhat.com/show_bug.cgi?id=151763
> Admittedly that was an old Linux version, but ...

That bug goes in the other direction from what I am seeing,
reporting more time than the clock on the wall would allow. I think
things are more or less good, because profiling simple programs gives
the expected answers, where under that bug they wouldn't. I think
there are some kinds of system calls which are accounted for as
user-space by "top", but which are not interruptible and so don't get
timed by gprof. But are such events evenly distributed throughout the
code or concentrated in certain places? I'll need to experiment with
it a bit more.

In any case, I hacked the hash index code to not take heavy locks on
the meta-block or on bucket blocks (lightweight buffer content locks are
still taken), and the performance in a nested loop (full table scan outer,
hash-index inner scan) went up by over 50%, with only one backend
active. And the heavy-weight locking code dropped out of the top
spots in gprof. So I think I may be on to something. I want to run
it for many more hours to make sure I'm getting good statistics.
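
Roughly the sort of gate involved, paraphrased from memory rather than
quoted from the source; the hack amounts to forcing the test below to
false, the way a temporary relation already would, leaving only the
lightweight buffer content locks:

#include "postgres.h"

#include "storage/block.h"
#include "storage/lmgr.h"
#include "utils/rel.h"

/* Paraphrase of the existing gate (not the actual source): the hash AM
 * only takes the heavyweight page locks for non-temporary relations,
 * so pretending the index is temporary is enough to skip them. */
static void
hash_page_lock_sketch(Relation rel, BlockNumber blkno, LOCKMODE lockmode)
{
    if (!RELATION_IS_LOCAL(rel))    /* force false for the experiment */
        LockPage(rel, blkno, lockmode);
}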

Then it is a question of whether my other ideas for making it safe to
remove the heavy-weight locks on a non-read-only system can be
implemented.

Jeff


From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Making hash indexes worthwhile
Date: 2009-10-31 00:02:59
Message-ID: f67928030910301702k5a8c5fb9s77380aeae42cbe1f@mail.gmail.com
Lists: pgsql-hackers

On Sun, Oct 4, 2009 at 7:13 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Jeff Janes <jeff(dot)janes(at)gmail(dot)com> writes:
>> I see that the docs were recently changed from discouraging hash
>> indexes both because there were no known uses in which they were
>> meaningfully better than btree, and because they aren't recoverable;
>> to now just discouraging them because they are not recoverable.  Does
>> that mean there are now uses in which hash indexes are provably better
>> than btree if one is willing to overlook the lack of recoverability?
>> If so, what are those situations?
>
> One reason is that you can index values that are too big for btrees;
> since hash indexes now store only the hash codes, they don't have a hard
> length limit on the underlying value.  I'm not sure how useful that
> really is in practice, but it's at least an argument for considering
> them in certain cases.

I haven't tested this situation at all yet; I'm hoping to show that there
is a substantially less restrictive use case under which they can be
made to work well.

>
>> I've played around a bit with hash indexes, and it seems to me that
>> making them generally worthwhile will take (at least) reducing or
>> entirely doing away with the heavy-weight locks.
>
> Concurrency is really the least of the issues for hash indexes.  So far
> it's not clear that they're fast enough even in sequential use ...

Hash index performance came up again in another thread, so I wanted to
come back to this and report my findings.

I've done testing using select-only statements with either a btree or a
hash index on pgbench_accounts.aid, using only one active backend.
For all tests, CPU usage was the bottleneck, essentially 100%.

I used 600MB shared_buffers and -s 30. I figure that if each
table-page-fetch is a physical IO, then that will dominate the
performance, so the index's nature is irrelevant. So if the hash
index has a strength, it will show best when the table fits in memory.
(The alternative would be if the table is so large that not even the
btree's index pages fit in memory, resulting in IO for those index
pages as well as the table's, but I didn't have the patience or an
adequate IO subsystem to set up a test for that.)

With pgbench -S (select only, one tuple at a time) the overhead of
communication between pgbench and the backend dominated the run time.
Moving the select loop into a pl/pgsql routine did not help much. So
I made a temp table with 10,000 randomly selected ids (from 1 to
3,000,000) and used that as the outer member to drive an nested loop
join against the pgbench_accounts table, doing a custom query of
select sum(abalance) from pgbench_accounts, foo where aid=random_id

The HEAD code could do 16-17 tps (160,000 to 170,000 inner loop tuple
fetches per second) using hash index, and 18 tps with btree. (But
occasionally the rank order would be reversed, out of many alternating
runs of reindexing, vacuuming, then doing pgbench -T 600 -f
nested_loop.sql). When I disabled the lmgr page locking of the hash
index, by making it think it was a temporary relation, the hash index
jumped to 24 tps.

If I make all the aid values in pgbench_accounts even, and load "foo"
with only odd IDs so that no rows are actually found, then the speed of
the hash index jumps to about 48 tps; so about half the remaining CPU
usage with the hash index pertains to dealing with the rows once they
are found, and not with using the index to attempt to find them. (I don't
know how much of this time is used to recheck the qual, which btree
doesn't need to do.)

So my conclusions are:

1) btree indexes are so good at equality lookups that there may not be
much room for hash indexes to improve on them by a substantial amount.

2) Heavy-weight locks are called that for a reason: they use a lot of
CPU even without contention.

3) The CPU usage of the hash-index code proper is quite small, with
more time being spent in heavy-weight PageLocks (specific to hash
indexes, but not part of the hash index code itself) and in executor
code common to all index methods, than in the hash index code. Based
on this, I don't think that changing the bucket size to be smaller
than an 8K block size is going to make much difference in hash index
performance.

Thinking about how to get rid of heavy-weight locks and add WAL, I was
wondering if we couldn't get rid of bucket splits altogether,
requiring the bucket count to be set immutably at build time based on
the estimated eventual max/equilibrium size of the table. Maybe this
would render hash indexes somewhat of a niche feature, but probably
much less so than they currently are. I think that getting rid of
dynamic bucket splits would make it easier to get rid of heavy-weight
locks and also much easier to implement WAL. WAL logging the addition
of an overflow page seems about the same as a btree node split, while
WAL logging a bucket split, when the bucket potentially has overflow
pages in it, seems much, much harder.
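
To make the contrast concrete, the first function below is essentially
the bucket mapping as it stands (the masks advance each time a bucket
splits), and the second is the sort of immutable mapping I mean, with a
plain modulus standing in for whatever mapping would actually be
chosen; the fixed version leaves nothing for WAL to describe beyond
adding overflow pages:

#include <stdint.h>

/* Essentially the current scheme: which bits of the hash code are
 * significant depends on masks that advance at every bucket split. */
uint32_t
bucket_dynamic(uint32_t hashcode, uint32_t maxbucket,
               uint32_t highmask, uint32_t lowmask)
{
    uint32_t    bucket = hashcode & highmask;

    if (bucket > maxbucket)
        bucket &= lowmask;
    return bucket;
}

/* The proposed alternative: nbuckets is fixed at CREATE INDEX time and
 * never changes, so the mapping is immutable and growth is absorbed
 * purely by overflow pages. */
uint32_t
bucket_fixed(uint32_t hashcode, uint32_t nbuckets)
{
    return hashcode % nbuckets;
}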

Cheers,

Jeff


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Making hash indexes worthwhile
Date: 2009-10-31 04:25:25
Message-ID: 28292.1256963125@sss.pgh.pa.us
Lists: pgsql-hackers

Jeff Janes <jeff(dot)janes(at)gmail(dot)com> writes:
> So my conclusions are:

> 2) Heavy-weight locks are called that for a reason: they use a lot of
> CPU even without contention.

> 3) The CPU usage of the hash-index code proper is quite small, with
> more time being spent in heavy-weight PageLocks (specific to hash
> indexes, but not part of the hash index code itself) and in executor
> code common to all index methods, than in the hash index code.

Interesting. My reaction to that would be to try to replace the
heavyweight locks with LWLocks. IIRC the reason for using heavyweight
locks was fear of deadlocks, but maybe closer analysis and some tweaking
would allow us to eliminate that risk.

regards, tom lane