Number of buckets in a hash join

From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgreSQL(dot)org>
Subject: Number of buckets in a hash join
Date: 2013-01-28 10:47:58
Message-ID: 5106575E.4010103@vmware.com
Lists: pgsql-hackers

While testing Alexander's gistchoose patch, I noticed that "perf report"
showed the test case spending a surprisingly large amount of CPU time in
ExecScanHashBucket. That function just scans a hash bucket for matches,
and it should be very quick as long as there are not many collisions.

It turns out that the planner estimated the number of rows in the hash
to be much smaller than it actually contained, and the hash table was
initialized with too few buckets as a result. The target is that each
bucket contains 10 tuples (NTUP_PER_BUCKET), but in this case, the
average was about 100.

The first question is, why do we aim at 10 tuples per bucket? My gut
feeling is that that's way too high. I would expect the target to be 1
tuple per bucket, or maybe a little more, like 2-3. Each bucket consumes
one pointer's worth of RAM, which is not much. There's also some
overhead from empty buckets when scanning the hash table, but as long as
all the buckets have at least one entry, there's no gain from having
more than one entry per bucket.

However, lowering NTUP_PER_BUCKET would not have helped in this case,
because we also have a minimum of 1024 buckets. The estimate was so bad
that even after setting NTUP_PER_BUCKET to 1, it was still pegged at
that minimum of 1024 buckets.
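
To make the arithmetic concrete, here is a rough sketch of the sizing rule
being described (this is not the actual PostgreSQL code, and the function
and variable names are made up; it just shows the relationship between the
row estimate, NTUP_PER_BUCKET, and the 1024-bucket floor):

#define NTUP_PER_BUCKET 10      /* target tuples per bucket */
#define MIN_BUCKETS     1024    /* hard lower bound on the bucket count */

static int
choose_nbuckets(double est_rows)
{
    int     nbuckets = (int) (est_rows / NTUP_PER_BUCKET);
    int     pow2 = 1;

    if (nbuckets < MIN_BUCKETS)
        nbuckets = MIN_BUCKETS;

    while (pow2 < nbuckets)     /* round up to a power of two */
        pow2 <<= 1;
    return pow2;
}

With the 1000-row estimate in the example below, this gives 1000 / 10 = 100,
clamped up to the 1024 minimum; even with NTUP_PER_BUCKET set to 1, that
estimate still only reaches the floor.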

Ideally, the planner would always make a good guess at the number of rows,
but for the situations where it doesn't, it would be good if the hash
table were enlarged when it becomes too full. It's a well-known technique
to double the size of a hash table once the load factor reaches a
certain threshold, and rehash the existing entries. Another idea is to
just collect all the entries in e.g. a linked list as tuples are
inserted into the hash table, and create the buckets lazily, after all the
tuples have been inserted.
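
For illustration, a minimal sketch of the doubling-and-rehash idea (not
PostgreSQL code; all names here are invented for the example): once the
average number of entries per bucket passes a threshold, the bucket array
is doubled and the existing entries are re-linked under the new mask.

#include <stdlib.h>

typedef struct HashEntry
{
    unsigned int      hashvalue;
    struct HashEntry *next;
} HashEntry;

typedef struct
{
    HashEntry **buckets;
    int         nbuckets;       /* always a power of two */
    long        nentries;
} HashTable;

#define MAX_LOAD_FACTOR 1       /* grow when tuples per bucket exceeds this */

static void
maybe_grow(HashTable *ht)
{
    if (ht->nentries / ht->nbuckets <= MAX_LOAD_FACTOR)
        return;

    int         newsize = ht->nbuckets * 2;
    HashEntry **newbuckets = calloc(newsize, sizeof(HashEntry *));

    /* rehash: move every entry into its bucket under the new mask */
    for (int i = 0; i < ht->nbuckets; i++)
    {
        HashEntry *e = ht->buckets[i];

        while (e != NULL)
        {
            HashEntry  *next = e->next;
            int         b = e->hashvalue & (newsize - 1);

            e->next = newbuckets[b];
            newbuckets[b] = e;
            e = next;
        }
    }
    free(ht->buckets);
    ht->buckets = newbuckets;
    ht->nbuckets = newsize;
}

static void
hash_insert(HashTable *ht, HashEntry *entry)
{
    int b = entry->hashvalue & (ht->nbuckets - 1);

    entry->next = ht->buckets[b];
    ht->buckets[b] = entry;
    ht->nentries++;
    maybe_grow(ht);
}

The lazy alternative mentioned above would simply append entries to a list
during the build phase and run the bucket-assignment loop once, after the
true tuple count is known.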

Here's an extreme example of this phenomenon. According to perf, about
95% of the CPU time is spent in ExecScanHashBucket. That would be
eliminated by sizing the hash table correctly:

create table numbers (id int4);
insert into numbers select g from generate_series(1, 10000000) g;

explain analyze select * from numbers a, generate_series(1, 100000) b
where b = a.id;
                                                          QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=22.50..2035430.50 rows=53097600 width=8) (actual time=32.307..29141.348 rows=100000 loops=1)
   Hash Cond: (a.id = b.b)
   ->  Seq Scan on numbers a  (cost=0.00..150443.20 rows=10619520 width=4) (actual time=0.017..716.503 rows=10000000 loops=1)
   ->  Hash  (cost=10.00..10.00 rows=1000 width=4) (actual time=32.268..32.268 rows=100000 loops=1)
         Buckets: 1024  Batches: 1  Memory Usage: 3516kB
         ->  Function Scan on generate_series b  (cost=0.00..10.00 rows=1000 width=4) (actual time=17.966..22.519 rows=100000 loops=1)
 Total runtime: 29146.011 ms
(7 rows)
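
With the 1000-row estimate, the hash table gets the 1024-bucket minimum, so
the 100000 rows that actually arrive average about 100000 / 1024 ≈ 98 tuples
per bucket instead of the intended 10. Most of the ten million outer-side
probes therefore have to walk a chain of roughly a hundred entries in
ExecScanHashBucket, which is where the time goes.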

- Heikki


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Number of buckets in a hash join
Date: 2013-01-28 14:30:33
Message-ID: CA+U5nM+WJaQh3HBEVi-4n+rxu-NUqn0RGZ-gjy9S-zYMvhaWHw@mail.gmail.com
Lists: pgsql-hackers

On 28 January 2013 10:47, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com> wrote:

> There's also some overhead from empty
> buckets when scanning the hash table

Seems like we should measure that overhead. That way we can plot the
cost against the number of tuples per bucket, which sounds like it has a
minimum at 1.0, but that doesn't mean it's symmetrical about that point. We can
then see where the optimal setting should be.
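
As a starting point for that measurement, here is a hedged sketch of a
standalone micro-benchmark (not PostgreSQL code; everything in it is
invented for illustration). It builds a chained hash table at several
tuples-per-bucket targets and times both probing every key, which pays for
long chains, and a full walk of the bucket array, which pays for empty
buckets:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

typedef struct Entry { unsigned int key; struct Entry *next; } Entry;

#define NTUPLES 1000000

static unsigned int hash_u32(unsigned int x) { return x * 2654435761u; }

int main(void)
{
    double  targets[] = {0.5, 1, 2, 5, 10, 50, 100};
    int     ntargets = (int) (sizeof(targets) / sizeof(targets[0]));

    for (int t = 0; t < ntargets; t++)
    {
        /* size the bucket array for this tuples-per-bucket target,
         * rounded down to the next power-of-two bucket count */
        int nbuckets = 1;
        while (nbuckets * targets[t] < NTUPLES)
            nbuckets <<= 1;

        Entry  *pool = malloc(sizeof(Entry) * NTUPLES);
        Entry **buckets = calloc(nbuckets, sizeof(Entry *));

        for (unsigned int i = 0; i < NTUPLES; i++)
        {
            unsigned int b = hash_u32(i) & (nbuckets - 1);

            pool[i].key = i;
            pool[i].next = buckets[b];
            buckets[b] = &pool[i];
        }

        /* (a) probe every key once, walking its chain */
        clock_t c0 = clock();
        long    hits = 0;
        for (unsigned int i = 0; i < NTUPLES; i++)
            for (Entry *e = buckets[hash_u32(i) & (nbuckets - 1)]; e; e = e->next)
                if (e->key == i) { hits++; break; }
        double  probe_s = (double) (clock() - c0) / CLOCKS_PER_SEC;

        /* (b) scan the whole table, empty buckets included */
        c0 = clock();
        long    entries = 0;
        for (int b = 0; b < nbuckets; b++)
            for (Entry *e = buckets[b]; e; e = e->next)
                entries++;
        double  scan_s = (double) (clock() - c0) / CLOCKS_PER_SEC;

        printf("target %5.1f  buckets %8d  probe %.3fs  scan %.3fs  (hits %ld, entries %ld)\n",
               targets[t], nbuckets, probe_s, scan_s, hits, entries);

        free(pool);
        free(buckets);
    }
    return 0;
}

Plotting the two times against the target should show roughly where the
combined cost bottoms out.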

Having said that, the hash bucket estimate is based on ndistinct, which
we know is frequently under-estimated, so it would be useful to err on
the low side.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Number of buckets in a hash join
Date: 2013-01-28 16:58:05
Message-ID: 17849.1359392285@sss.pgh.pa.us
Lists: pgsql-hackers

Heikki Linnakangas <hlinnakangas(at)vmware(dot)com> writes:
> The first question is, why do we aim at 10 tuples per bucket?

I see nothing particularly wrong with that. The problem here is with
having 1000 tuples per bucket.

> Ideally, the planner would always make a good guess at the number of rows,
> but for the situations that it doesn't, it would be good if the hash
> table was enlarged if it becomes too full.

Yeah, possibly. The proposed test case actually doesn't behave very
badly if work_mem is small, because there is logic in there to adjust
the number of batches. You didn't say what work_mem you're testing at,
but it's clearly more than the default 1MB. I think the issue arises if
the initial estimate of hashtable size is a good bit less than work_mem,
so the number of buckets is set to something a good bit less than what
would be optimal if we're using more of work_mem. This seems a little
reminiscent of what we did recently in tuplesort to make better use of
work_mem --- in both cases we have to choose a pointer-array size that
will make best use of work_mem after the tuples themselves are added.
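
One loose reading of that, sketched purely as an assumption (not PostgreSQL
code; work_mem_bytes and tupsize are made-up names): size the bucket
pointer array for the number of tuples that work_mem could hold rather than
for the possibly-too-low row estimate, so a single-batch table never ends
up with far fewer buckets than its contents warrant.

#include <stddef.h>

#define NTUP_PER_BUCKET 10
#define MIN_BUCKETS     1024

static int
nbuckets_for_work_mem(size_t work_mem_bytes, size_t tupsize)
{
    /* how many tuples could work_mem hold, counting each tuple's share
     * of the bucket pointer array? */
    double  max_tuples = (double) work_mem_bytes /
        ((double) tupsize + sizeof(void *) / (double) NTUP_PER_BUCKET);

    int     nbuckets = (int) (max_tuples / NTUP_PER_BUCKET);
    int     pow2 = 1;

    if (nbuckets < MIN_BUCKETS)
        nbuckets = MIN_BUCKETS;

    while (pow2 < nbuckets)     /* round up to a power of two */
        pow2 <<= 1;
    return pow2;
}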

regards, tom lane