hash index work

Lists: pgsql-patches
From: Neil Conway <neilc(at)samurai(dot)com>
To: pgsql-patches <pgsql-patches(at)postgresql(dot)org>
Subject: hash index work
Date: 2005-05-13 05:13:48
Message-ID: 4284378C.2080205@samurai.com

This patch implements two changes to hash indexes:

- rather than storing the values of the index keys, we just store their
hash value instead. The hash opclasses have been marked "lossy" so that
the executor will recheck the scan qualifier to catch hash collisions.

- within a page of a hash bucket, the entries are stored sorted by hash
value. When doing a hash index scan, we can then binary search within a
page rather than using a linear search.
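As a rough illustration (this is a sketch, not the patch's actual code), the within-page lookup amounts to a lower-bound binary search over the sorted hash values, with collision handling deferred to the executor's recheck of the scan qualifier:

```c
/* Sketch: binary search for a hash value within a sorted page of
 * hash-only index entries.  Names are illustrative, not PostgreSQL's. */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Return the offset of the first entry whose stored hash equals 'target',
 * or -1 if no entry matches.  'hashes' must be sorted ascending. */
static ptrdiff_t
hash_page_search(const uint32_t *hashes, size_t nentries, uint32_t target)
{
    size_t lo = 0, hi = nentries;

    while (lo < hi)                 /* lower-bound binary search */
    {
        size_t mid = lo + (hi - lo) / 2;

        if (hashes[mid] < target)
            lo = mid + 1;
        else
            hi = mid;
    }
    if (lo < nentries && hashes[lo] == target)
        return (ptrdiff_t) lo;      /* the opclass is lossy, so the caller
                                     * must still recheck the heap tuple to
                                     * weed out hash collisions */
    return -1;
}
```

Since the first matching entry is found rather than an arbitrary one, the scan can then walk forward through the run of equal hash values.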

Unfortunately, I haven't been able to measure any performance
improvement from either of these changes :-\

I've run tests on two types of tables: int4 keys and relatively short
textual keys (I don't think there's much point testing longer keys: the
patches should obviously be a win there, so I'm concerned about speeding
up the common case). In both cases, the difference in index scan
performance isn't significant: it's about the same with and without the
patches. The indexes have been on tables with 1 million rows (of int4)
and 400,000 rows (of text). I would test with larger tables, but
creating hash indexes is so slow that even these sizes are painful
enough. Since indexes of this size should be cached in memory, the
reduction in I/O for the text case isn't being measured (the index is
about 30% smaller than it is when we store the full text value), but
even so I would have expected the binary search to produce noticeable
gains...

Perhaps I've made a coding error that has pessimized the performance
somehow (although nothing obvious shows up in profiles), or perhaps the
linear scan of the pages of the hash bucket isn't a performance problem
in the first place, at least for types with a cheap equality operator.
Some oprofile data seems to support this -- for example, in the int4
case, we spend less than 0.5% of the time in _hash_next and children,
and only 1.8% of the runtime in hashgetmulti and children (with the
vanilla CVS HEAD code).

-Neil

Attachment Content-Type Size
hash_index_opt-10.patch text/x-patch 35.9 KB

From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: Neil Conway <neilc(at)samurai(dot)com>
Cc: pgsql-patches <pgsql-patches(at)postgresql(dot)org>
Subject: Re: hash index work
Date: 2005-05-28 02:42:28
Message-ID: 200505280242.j4S2gSn09485@candle.pha.pa.us
Lists: pgsql-patches


Neil, I have added these items to the TODO list. Do you plan on applying
this?

---------------------------------------------------------------------------

Neil Conway wrote:
> This patch implements two changes to hash indexes:
>
> - rather than storing the values of the index keys, we just store their
> hash value instead. The hash opclasses have been marked "lossy" so that
> the executor will recheck the scan qualifier to catch hash collisions.
>
> - within a page of a hash bucket, the entries are stored sorted by hash
> value. When doing a hash index scan, we can then binary search within a
> page rather than using a linear search.
>
> Unfortunately, I haven't been able to measure any performance
> improvement from either of these changes :-\
>
> I've run tests on two types of tables: int4 keys and relatively short
> textual keys (I don't think there's much point testing longer keys: the
> patches should obviously be a win there, so I'm concerned about speeding
> up the common case). In both cases, the difference in index scan
> performance isn't significant: it's about the same with and without the
> patches. The indexes have been on tables with 1 million rows (of int4)
> and 400,000 rows (of text). I would test with larger tables, but
> creating hash indexes is so slow that even these sizes are painful
> enough. Since indexes of this size should be cached in memory, the
> reduction in I/O for the text case isn't being measured (the index is
> about 30% smaller than it is when we store the full text value), but
> even so I would have expected the binary search to produce noticeable
> gains...
>
> Perhaps I've made a coding error that has pessimized the performance
> somehow (although nothing obvious shows up in profiles), or perhaps the
> linear scan of the pages of the hash bucket isn't a performance problem
> in the first place, at least for types with a cheap equality operator.
> Some oprofile data seems to support this -- for example, in the int4
> case, we spend less than 0.5% of the time in _hash_next and children,
> and only 1.8% of the runtime in hashgetmulti and children (with the
> vanilla CVS HEAD code).
>
> -Neil


--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073


From: Neil Conway <neilc(at)samurai(dot)com>
To: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc: pgsql-patches <pgsql-patches(at)postgresql(dot)org>
Subject: Re: hash index work
Date: 2005-05-28 13:14:38
Message-ID: 42986EBE.4070904@samurai.com
Lists: pgsql-patches

Bruce Momjian wrote:
> Neil, I have added these items to the TODO list. Do you plan on applying
> this?

No, I don't have any immediate plans to apply it, as unfortunately I
didn't see a performance win :-( It's also possible I'm just not
measuring the right workload, although I don't have time to rerun the
benchmarks at the moment.

The patch does two things: (1) change hash indexes to only store the
key's hash value, not the entire key (2) store index elements within a
hash bucket in order of hash key and search for matches via binary
search. #1 is definitely a win in some circumstances (e.g.
indexing large fields or types with expensive equality operators), but
those aren't the common case. I'm surprised that #2 is not a more
noticeable improvement...

One possibility would be to provide an optional implementation of #1,
perhaps via an alternate index operator class. That way people could
select it in those situations in which it is worth using.

-Neil


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Neil Conway <neilc(at)samurai(dot)com>
Cc: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>, pgsql-patches <pgsql-patches(at)postgresql(dot)org>
Subject: Re: hash index work
Date: 2005-05-28 16:03:38
Message-ID: 9723.1117296218@sss.pgh.pa.us
Lists: pgsql-patches

Neil Conway <neilc(at)samurai(dot)com> writes:
> The patch does two things: (1) change hash indexes to only store the
> key's hash value, not the entire key (2) store index elements within a
> hash bucket in order of hash key and search for matches via binary
> search. #1 is definitely a win in some circumstances (e.g.
> indexing large fields or types with expensive equality operators), but
> those aren't the common case. I'm surprised that #2 is not a more
> noticeable improvement...

It occurs to me that change #1 makes it cheaper to skip over index
entries that are in the same bucket but don't have the exact same hash
code; but it makes it significantly more expensive to skip over entries
that have the same hash code but aren't really equal to the key being
sought (since you have to visit the heap to find out they aren't equal).
Maybe your test workload had enough occurrences of the latter case to
balance out the win from the former.

I think it would be worth investigating a variant in which the index
stores both the hash code and the key value. This allows cheap
elimination of non-matching hash codes, and binary sorting of index
entries, without adding any extra trips to the heap. The downside is
that it makes the index larger so you incur more I/O there --- so this
might not be a win either.
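
A minimal sketch of that variant (names are illustrative, not PostgreSQL's): each index entry carries both the hash code and the key itself, so mismatched hash codes are skipped with a cheap integer comparison, and genuine collisions are resolved by the equality operator inside the index, with no trip to the heap:

```c
/* Sketch of the hash-plus-key entry layout.  Illustrative only. */
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef struct
{
    uint32_t    hash;           /* hash of the key, for fast filtering */
    const char *key;            /* full key value, for exact matching */
} IndexEntry;

/* Count entries that truly match 'key': the hash comparison rejects
 * most non-matches cheaply, and strcmp() settles hash collisions in
 * the index itself, without visiting the heap. */
static int
count_matches(const IndexEntry *entries, int n,
              uint32_t keyhash, const char *key)
{
    int matches = 0;

    for (int i = 0; i < n; i++)
    {
        if (entries[i].hash != keyhash)
            continue;           /* cheap elimination by hash code */
        if (strcmp(entries[i].key, key) == 0)
            matches++;          /* true match, resolved in the index */
    }
    return matches;
}
```

The trade-off Tom describes is visible here: the `key` field is what makes the index larger, in exchange for never touching the heap to reject a collision.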

> One possibility would be to provide an optional implementation of #1,
> perhaps via an alternate index operator class. That way people could
> select it in those situations in which it is worth using.

I think it would definitely be a good idea to make the lossy behavior
optional.

regards, tom lane


From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Neil Conway <neilc(at)samurai(dot)com>, pgsql-patches <pgsql-patches(at)postgresql(dot)org>
Subject: Re: hash index work
Date: 2005-06-25 00:03:08
Message-ID: 200506250003.j5P038621441@candle.pha.pa.us
Lists: pgsql-patches


I have added these TODO items based on Neil's ideas:

* Consider sorting hash buckets so entries can be found using a binary
  search, rather than a linear scan
* In hash indexes, consider storing the hash value with, or instead of,
  the key itself

However, Neil's tests don't show much of a win. Should I keep these
items on the TODO list?

---------------------------------------------------------------------------

Tom Lane wrote:
> Neil Conway <neilc(at)samurai(dot)com> writes:
> > The patch does two things: (1) change hash indexes to only store the
> > key's hash value, not the entire key (2) store index elements within a
> > hash bucket in order of hash key and search for matches via binary
> > search. #1 is definitely a win in some circumstances (e.g.
> > indexing large fields or types with expensive equality operators), but
> > those aren't the common case. I'm surprised that #2 is not a more
> > noticeable improvement...
>
> It occurs to me that change #1 makes it cheaper to skip over index
> entries that are in the same bucket but don't have the exact same hash
> code; but it makes it significantly more expensive to skip over entries
> that have the same hash code but aren't really equal to the key being
> sought (since you have to visit the heap to find out they aren't equal).
> Maybe your test workload had enough occurrences of the latter case to
> balance out the win from the former.
>
> I think it would be worth investigating a variant in which the index
> stores both the hash code and the key value. This allows cheap
> elimination of non-matching hash codes, and binary sorting of index
> entries, without adding any extra trips to the heap. The downside is
> that it makes the index larger so you incur more I/O there --- so this
> might not be a win either.
>
> > One possibility would be to provide an optional implementation of #1,
> > perhaps via an alternate index operator class. That way people could
> > select it in those situations in which it is worth using.
>
> I think it would definitely be a good idea to make the lossy behavior
> optional.
>
> regards, tom lane
>

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073