Hash Join cost estimates

From: Stephen Frost <sfrost(at)snowman(dot)net>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Hash Join cost estimates
Date: 2013-03-28 23:56:27
Message-ID: 20130328235627.GV4361@tamriel.snowman.net
Lists: pgsql-hackers

All,

Marty's issue w/ NestLoop reminded me that I'd put together a test
case which illustrates a HashJoin doing the wrong thing.

The test case is here:

http://snowman.net/~sfrost/test_case.sql

Simply load that up and then check out this plan:

explain select * from small_table join big_table using (id_short);

We end up deciding to build a hash table of 4M records and then seq
scan the table that only has 41K. Much of the reason for this is
that the statistics point out, correctly, that the column we're
using from the small table to join on isn't unique and therefore the
buckets will be deeper- but it's not *nearly* as bad as we estimate.

The bucket estimate for the small table comes back as 26, while the
reality is that we only look through ~5.5 entries per bucket with
the longest run being 21. With the big table being hashed, the bucket
estimate is 4 (though this seems to vary way more than I would expect,
sometimes 4, sometimes 8, sometimes 2..) while the average number
scanned through is actually ~3.5 and the longest scan ends up being
20.

Without the bucket question, we end up with pretty reasonable results
(directly out of initial_cost_hashjoin):

41K hashed, seqscan 4M:
startup_cost = 1229.46
run_cost = 72307.39

4M hashed, 41K seqscan:
startup_cost = 115030.10
run_cost = 817.70
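
If I'm reading initial_cost_hashjoin right, those break down as follows
(assuming the default cost parameters, cpu_tuple_cost = 0.01 and
cpu_operator_cost = 0.0025, and a single hash clause):

41K hashed:  startup = 714.76 (seqscan small) + (0.01 + 0.0025) * 41176   = 1229.46
             run     = 61626.71 (seqscan big) + 0.0025 * 4272271          = 72307.39
4M hashed:   startup = 61626.71 (seqscan big) + (0.01 + 0.0025) * 4272271 = 115030.10
             run     = 714.76 (seqscan small) + 0.0025 * 41176            = 817.70

In other words, the build side is charged cpu_tuple_cost + cpu_op_cost
per inner row and the probe side just cpu_op_cost per outer row, on top
of the underlying seqscan costs.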

When we get through dealing with the bucketing question in
final_cost_hashjoin (costsize.c:2673), we have some pretty gross
results for the 'good' plan's run_cost:

run_cost = 72307.39 + 138848.81 = 211156.20

While the 'bad' plan's run_cost is only bumped a tiny bit:

run_cost = 817.7 + 411.76 = 1229.46
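
(Decomposing those adders: they're cpu_op_cost * outer rows * estimated
bucket size * 0.5, so 0.0025 * 4272271 * 26 * 0.5 = 138848.81 for the
'good' plan, and the same formula with only 41K outer rows and the big
table's much smaller bucket estimate for the 'bad' one.)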

Resulting in total costs that look all kinds of wacky:

41K hashed, seqscan 4M: 115030.10 + 1229.46 = 116259.56
4M hashed, seqscan 41K: 1229.46 + 211156.20 = 212385.66

Or the 'good' plan being costed at *nearly double*. Now, my laptop
might not be the 'best' system CPU wise, but there's a pretty massive
time difference between these plans:

41K hashed, seqscan 4M: 2252.007 ms http://explain.depesz.com/s/FEq
4M hashed, seqscan 41K: 2708.471 ms http://explain.depesz.com/s/FOU

That's half a second and a good 15+% difference.

Now, back to the bucket estimation- the ndistinct for the small table
is -0.475058 (or 19561 distinct values), which is about right. There
are roughly 19.6K distinct values: 18,795 of them are duplicated and
another 841 have dup counts ranging from 3 to 10. This leads to an
avgfreq of 0.000051; unfortunately, we're going for only 8192 buckets,
so this gets moved up to 0.00012, and then the adjustment for the MCV
(which is 0.00027) bumps this all the way up to 0.00064, leading to our
bucket depth estimate of 26 (the bucket size estimate multiplied by the
inner rows) and the resulting cost dominating the overall costing.
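
Spelling that out, the logic in estimate_hash_bucketsize() is roughly
the following (paraphrased from selfuncs.c from memory, so treat it as
a sketch rather than a quote):

    avgfreq  = (1.0 - nullfrac) / ndistinct;    /* 1/19561 -> 0.000051 */
    estfract = 1.0 / Min(ndistinct, nbuckets);  /* 1/8192  -> 0.00012  */
    if (mcvfreq > avgfreq)
        estfract *= mcvfreq / avgfreq;          /* * 0.00027/0.000051 -> 0.00064 */
    /* bucket size estimate * inner rows = 0.00064 * 41176 ~= 26 */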

If we consider the bucketing estimate to be wrong and look at what the
costs would have been with the actual average number of entries scanned
(and therefore excluding the 0.5 'adjustment'), we get:

seqscan 41K cost: 360.28 (cpu_op_cost * 41K * 3.5)
seqscan 4M cost: 58743.73 (cpu_op_cost * 4M * 5.5)

which isn't exactly going in the 'right' direction for us. Now, I'm
sure that there's a cost to having to consider more buckets, but it
seems to be far less, in comparison to the hash table creation cost,
than what our model would suggest.
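
(Plugging those back into the totals: the 'hash the 41K table' plan would
come out around 1229.46 + 72307.39 + 58743.73 = 132280.58, while the
'hash the 4M table' plan would be about 115030.10 + 817.70 + 360.28 =
116208.08- so even with perfect bucket numbers we would still pick the
slower plan, just by a smaller margin.)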

In the end, I think the problem here is that we are charging far too
much for these bucket costs (particularly when we're getting them so
far wrong) and not nearly enough for the cost of building the hash
table in the first place.

Thoughts? Ideas about how we can 'fix' this? Have others run into
similar issues?

Thanks,

Stephen


From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Stephen Frost <sfrost(at)snowman(dot)net>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Hash Join cost estimates
Date: 2013-03-29 20:23:11
Message-ID: 1364588591.1187.46.camel@sussancws0025
Lists: pgsql-hackers

On Thu, 2013-03-28 at 19:56 -0400, Stephen Frost wrote:
> 41K hashed, seqscan 4M: 115030.10 + 1229.46 = 116259.56
> 4M hashed, seqscan 41K: 1229.46 + 211156.20 = 212385.66

I think those are backwards -- typo?

> In the end, I think the problem here is that we are charging far too
> much for these bucket costs (particularly when we're getting them so
> far wrong) and not nearly enough for the cost of building the hash
> table in the first place.
>
> Thoughts? Ideas about how we can 'fix' this? Have others run into
> similar issues?

Yes, I have run into this issue (or something very similar). I don't
understand why the bucketsize even matters much -- assuming few hash
collisions, we are not actually evaluating the quals any more times than
necessary. So why all of the hashjoin-specific logic in determining the
number of qual evaluations? The only reason I can think of is to model
the cost of comparing the hashes themselves.

Also, searching the archives turns up at least one other report, but I
think I've seen more:

http://www.postgresql.org/message-id/A82128A6-4E3B-43BD-858D-21B129F7BEEB@richrelevance.com

Regards,
Jeff Davis


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Jeff Davis <pgsql(at)j-davis(dot)com>
Cc: Stephen Frost <sfrost(at)snowman(dot)net>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Hash Join cost estimates
Date: 2013-03-29 20:37:25
Message-ID: 2942.1364589445@sss.pgh.pa.us
Lists: pgsql-hackers

Jeff Davis <pgsql(at)j-davis(dot)com> writes:
> Yes, I have run into this issue (or something very similar). I don't
> understand why the bucketsize even matters much -- assuming few hash
> collisions, we are not actually evaluating the quals any more times than
> necessary. So why all of the hashjoin-specific logic in determining the
> number of qual evaluations? The only reason I can think of is to model
> the cost of comparing the hashes themselves.

I think the point is that there may *not* be few hash collisions ...

regards, tom lane


From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Stephen Frost <sfrost(at)snowman(dot)net>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Hash Join cost estimates
Date: 2013-03-29 23:20:44
Message-ID: 1364599244.1187.186.camel@sussancws0025
Lists: pgsql-hackers

On Fri, 2013-03-29 at 16:37 -0400, Tom Lane wrote:
> Jeff Davis <pgsql(at)j-davis(dot)com> writes:
> > Yes, I have run into this issue (or something very similar). I don't
> > understand why the bucketsize even matters much -- assuming few hash
> > collisions, we are not actually evaluating the quals any more times than
> > necessary. So why all of the hashjoin-specific logic in determining the
> > number of qual evaluations? The only reason I can think of is to model
> > the cost of comparing the hashes themselves.
>
> I think the point is that there may *not* be few hash collisions ...

In Stephen's case the table was only 41K rows, so something still seems off.
Maybe we should model the likelihood of a collision based on the
cardinalities (assuming a reasonably good hash function)?

Also, I think I found an important assumption that seems dubious (in
comment for estimate_hash_bucketsize()):

"If the other relation in the join has a key distribution similar to
this one's, then the most-loaded buckets are exactly those that will be
probed most often. Therefore, the "average" bucket size for costing
purposes should really be taken as something close to the "worst case"
bucket size. We try to estimate this by adjusting the fraction if there
are too few distinct data values, and then scaling up by the ratio of
the most common value's frequency to the average frequency."

But the key distribution is not necessarily similar at all... the large
table might have many more distinct values.

Stephen, do you think this could explain your problem?

Regards,
Jeff Davis


From: Stephen Frost <sfrost(at)snowman(dot)net>
To: Jeff Davis <pgsql(at)j-davis(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Hash Join cost estimates
Date: 2013-03-30 20:15:25
Message-ID: 20130330201525.GY4361@tamriel.snowman.net
Lists: pgsql-hackers

Jeff,

* Jeff Davis (pgsql(at)j-davis(dot)com) wrote:
> On Thu, 2013-03-28 at 19:56 -0400, Stephen Frost wrote:
> > 41K hashed, seqscan 4M: 115030.10 + 1229.46 = 116259.56
> > 4M hashed, seqscan 41K: 1229.46 + 211156.20 = 212385.66
>
> I think those are backwards -- typo?

Yes, sorry, those are backwards. The '4M hashed, seqscan 41K' entry
comes out with the lower cost and that's what we end up using.

Thanks,

Stephen


From: Stephen Frost <sfrost(at)snowman(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Jeff Davis <pgsql(at)j-davis(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Hash Join cost estimates
Date: 2013-03-30 20:52:42
Message-ID: 20130330205242.GZ4361@tamriel.snowman.net
Lists: pgsql-hackers

* Tom Lane (tgl(at)sss(dot)pgh(dot)pa(dot)us) wrote:
> I think the point is that there may *not* be few hash collisions ...

Right, but that's actually all entirely based on concerns over there
being duplicates (hence why we consider the MCVs and ndistinct), which
makes *some* sense given that we currently have a single linked-list in
each bucket into which any dups are placed.

It occurs to me that we could get away from this by having a 2-level
system. We hash to a bucket which contains a linked list of linked
lists. The bucket list would be for actual collisions (which we don't
consider at all in the current bucket estimating) and each of those
entries would be a linked list of duplicates for that entry in the
bucket. This would add some small cost for building the hash, since we
would have to step through each entry in the bucket to determine if the
entry being added is a new entry or not, but not very much unless we're
worried about lots of collisions happening (which I certainly hope isn't
the case). Hash tables are generally expected to take more effort to
build initially anyway, so I don't see a problem putting more logic
there. Also, we could skip checking each entry in the bucket when the
input is known to be unique and instead just skip to the end of the
bucket since the new entry can't match any existing one.
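
Roughly, I'm picturing something like this (made-up names and glossing
over where the tuples themselves live- just a sketch, not actual
PostgreSQL structures):

    typedef struct DupEntry
    {
        struct DupEntry *next;      /* next duplicate of the same key */
        void            *tuple;     /* the stored inner tuple */
    } DupEntry;

    typedef struct BucketEntry
    {
        struct BucketEntry *next;   /* next distinct key hashing to this bucket */
        uint32              hashvalue;  /* saved hash code, for cheap skipping */
        DupEntry           *dups;   /* all inner tuples sharing this key */
    } BucketEntry;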

We could still work through the bucketing logic and add some costing to
that case for those situations where we are hashing on only one key of a
multi-key join and we expect a lot of duplicates to exist. I'm not sure
how much that happens though- I would hope that we would use a composite
hash key most of the time that we have multi-key joins that use hash
tables.

Thoughts?

Thanks,

Stephen


From: Stephen Frost <sfrost(at)snowman(dot)net>
To: Jeff Davis <pgsql(at)j-davis(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Hash Join cost estimates
Date: 2013-03-30 21:07:02
Message-ID: 20130330210702.GA4361@tamriel.snowman.net
Lists: pgsql-hackers

* Jeff Davis (pgsql(at)j-davis(dot)com) wrote:
> In Stephen's case the table was only 41K rows, so something still seems off.
> Maybe we should model the likelihood of a collision based on the
> cardinalities (assuming a reasonably good hash function)?

It's not really 'hash collisions' that we're trying to be wary of, per
se, it's the risk of duplicates. To push this very far in the other
direction- if you have 41k of the *same value* in the small table, then
it's currently faster to build the hash table on the large table and
then seq scan the small table (10s vs. 14s on my laptop running w/o
being plugged in, so it's a bit slower).

Now, that's a pretty ridiculous case, but it seems like we're being
pretty dumb here- for every input row from the outer table, we're
looking through all 41K *duplicate* keys in that one hash bucket. This
goes back to the suggestion I just made- if the hash bucket list
contained only unique values (which are a result of actual hash
collisions), then we'd only be doing *one* comparison for each input row
of the outer table that doesn't match- and when we found one which
*did*, we would only need to step through the dup list for that one
entry and blast back all of those rows, forgetting about the rest of the
bucket which we know can't match.

> Also, I think I found an important assumption that seems dubious (in
> comment for estimate_hash_bucketsize()):

I've wondered about that also. It certainly seems quite bogus that we
can end up with an 'estimated # of entries in a bucket' that's larger
than the # of entries we've found for the MCV in the table, especially
*double* that.

> Stephen, do you think this could explain your problem?

As I tried to articulate in my initial email- even if we had a *perfect*
answer to "how many comparisons will we need to do", the current costing
would cause us to pick the plan that, intuitively and empirically, takes
longer (hash the 4M row table), because that cost is multiplied by the
number of outer rows and the cpu_tuple_cost (charged to build the hash
table) isn't high enough relative to the cpu_op_cost (charged to do the
comparisons in the buckets).

Thanks,

Stephen


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Stephen Frost <sfrost(at)snowman(dot)net>
Cc: Jeff Davis <pgsql(at)j-davis(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Hash Join cost estimates
Date: 2013-03-31 19:45:46
Message-ID: 15119.1364759146@sss.pgh.pa.us
Lists: pgsql-hackers

Stephen Frost <sfrost(at)snowman(dot)net> writes:
> * Jeff Davis (pgsql(at)j-davis(dot)com) wrote:
>> In Stephen's case the table was only 41K rows, so something still seems off.
>> Maybe we should model the likelihood of a collision based on the
>> cardinalities (assuming a reasonably good hash function)?

> It's not really 'hash collisions' that we're trying to be wary of, per
> se, it's the risk of duplicates.

I spent some time looking at this. I think the real issue is that the
code is trying to charge something close to cpu_operator_cost for each
time an outer tuple is compared to a hashtable entry. That was actually
reasonable back when the costing code was last touched --- the current
scheme of storing and comparing the hash codes before testing the
equality operator proper only dates to commit
849074f9ae422c64501bb1d53ef840de870bf65c. I punted on changing the cost
estimates at that time, and I think what this example is showing is that
that was a bad decision. Really, when we're traipsing down a bucket
list, skipping over bucket entries with the wrong hash code is just
about free, or at least it's a whole lot cheaper than applying ExecQual.

Perhaps what we should do is charge the hash_qual_cost only for some
small multiple of the number of tuples that we expect will *pass* the
hash quals, which is a number we have to compute anyway. The multiple
would represent the rate of hash-code collisions we expect.

I'd still be inclined to charge something per bucket entry, but it
should be really small, perhaps on the order of 0.01 times
cpu_operator_cost.
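
In code terms, something along these lines in final_cost_hashjoin
(just a sketch of the idea; collision_factor and inner_bucket_entries
are placeholder names, not existing variables):

    /* charge the real qual cost only for expected matches ... */
    run_cost += hash_qual_cost.per_tuple * hashjointuples * collision_factor;
    /* ... plus a token charge per bucket entry visited for the hash-code check */
    run_cost += 0.01 * cpu_operator_cost * outer_path_rows * inner_bucket_entries;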

Or we could just drop that term entirely. It strikes me that the reason
to be worried about skewed distribution in the inner relation is not
really that it changes the run time much, but rather that it risks
blowing out work_mem if specific virtual buckets have too many members
(or at best, we're forced to increase the number of batches more than we
thought to stay under work_mem; which carries runtime costs of its own).
Maybe what we should be doing with the bucketsize numbers is estimating
peak memory consumption to gate whether we'll accept the plan at all,
rather than adding terms to the cost estimate.

regards, tom lane


From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Stephen Frost <sfrost(at)snowman(dot)net>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Hash Join cost estimates
Date: 2013-04-01 06:35:59
Message-ID: 1364798159.25476.7.camel@jdavis-laptop
Lists: pgsql-hackers

On Sun, 2013-03-31 at 15:45 -0400, Tom Lane wrote:
> Really, when we're traipsing down a bucket
> list, skipping over bucket entries with the wrong hash code is just
> about free, or at least it's a whole lot cheaper than applying ExecQual.
>
> Perhaps what we should do is charge the hash_qual_cost only for some
> small multiple of the number of tuples that we expect will *pass* the
> hash quals, which is a number we have to compute anyway. The multiple
> would represent the rate of hash-code collisions we expect.

+1.

> I'd still be inclined to charge something per bucket entry, but it
> should be really small, perhaps on the order of 0.01 times
> cpu_operator_cost.

> Or we could just drop that term entirely.

FWIW, either of those are fine with me based on my limited experience.

> Maybe what we should be doing with the bucketsize numbers is estimating
> peak memory consumption to gate whether we'll accept the plan at all,
> rather than adding terms to the cost estimate.

Sounds reasonable.

Ideally, we'd have a way to continue executing even in that case; but
that's a project by itself, and would make it even more difficult to
cost accurately.

Regards,
Jeff Davis


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Jeff Davis <pgsql(at)j-davis(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Stephen Frost <sfrost(at)snowman(dot)net>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Hash Join cost estimates
Date: 2013-04-01 15:08:30
Message-ID: CA+TgmoYC5yENSoLyVjzAF=nnLdWc6wYYVGYqTEvh2gJtWedptQ@mail.gmail.com
Lists: pgsql-hackers

On Mon, Apr 1, 2013 at 2:35 AM, Jeff Davis <pgsql(at)j-davis(dot)com> wrote:
> On Sun, 2013-03-31 at 15:45 -0400, Tom Lane wrote:
>> Really, when we're traipsing down a bucket
>> list, skipping over bucket entries with the wrong hash code is just
>> about free, or at least it's a whole lot cheaper than applying ExecQual.
>>
>> Perhaps what we should do is charge the hash_qual_cost only for some
>> small multiple of the number of tuples that we expect will *pass* the
>> hash quals, which is a number we have to compute anyway. The multiple
>> would represent the rate of hash-code collisions we expect.
>
> +1.
>
>> I'd still be inclined to charge something per bucket entry, but it
>> should be really small, perhaps on the order of 0.01 times
>> cpu_operator_cost.
>
>> Or we could just drop that term entirely.
>
> FWIW, either of those are fine with me based on my limited experience.

FWIW, I have also seen this problem and the proposed fixes sound
reasonable to me.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Stephen Frost <sfrost(at)snowman(dot)net>, Jeff Davis <pgsql(at)j-davis(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Hash Join cost estimates
Date: 2013-04-01 16:46:26
Message-ID: 14141.1364834786@sss.pgh.pa.us
Lists: pgsql-hackers

I wrote:
> Perhaps what we should do is charge the hash_qual_cost only for some
> small multiple of the number of tuples that we expect will *pass* the
> hash quals, which is a number we have to compute anyway. The multiple
> would represent the rate of hash-code collisions we expect.

I tried the attached quick-hack patch on Stephen's example. With
work_mem set to 16MB I get these results:

regression=# explain analyze select * from small_table join big_table using (id_short);
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
Hash Join (cost=1229.46..74154.49 rows=41176 width=24) (actual time=47.723..1845.869 rows=13731 loops=1)
Hash Cond: (big_table.id_short = small_table.id_short)
-> Seq Scan on big_table (cost=0.00..61626.71 rows=4272271 width=4) (actual time=0.045..506.212 rows=4272271 loops=1)
-> Hash (cost=714.76..714.76 rows=41176 width=24) (actual time=24.944..24.944 rows=41176 loops=1)
Buckets: 8192 Batches: 1 Memory Usage: 2574kB
-> Seq Scan on small_table (cost=0.00..714.76 rows=41176 width=24) (actual time=0.007..11.608 rows=41176 loops=1)
Total runtime: 1847.697 ms
(7 rows)

Forcing the other plan to be chosen, I get

regression=# explain analyze select * from small_table join big_table using (id_short);
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
Hash Join (cost=131719.10..150327.44 rows=41176 width=24) (actual time=1922.942..2810.095 rows=13731 loops=1)
Hash Cond: (small_table.id_short = big_table.id_short)
-> Seq Scan on small_table (cost=0.00..714.76 rows=41176 width=24) (actual time=0.012..10.058 rows=41176 loops=1)
-> Hash (cost=61626.71..61626.71 rows=4272271 width=4) (actual time=1921.962..1921.962 rows=4272271 loops=1)
Buckets: 65536 Batches: 16 Memory Usage: 9412kB
-> Seq Scan on big_table (cost=0.00..61626.71 rows=4272271 width=4) (actual time=0.043..702.898 rows=4272271 loops=1)
Total runtime: 2820.633 ms
(7 rows)

So that's at least going in the right direction.

I have not thought about how the calculation should be adjusted in the
semi/anti join case, nor about how we ought to repurpose the
bucket-size-variance calculations for checking whether work_mem will be
exceeded. So this is a long way from being committable, but it seems
promising.

regards, tom lane

Attachment Content-Type Size
hash-cost-est-1.patch text/x-patch 2.6 KB

From: Stephen Frost <sfrost(at)snowman(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Jeff Davis <pgsql(at)j-davis(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Hash Join cost estimates
Date: 2013-04-04 17:54:24
Message-ID: 20130404175424.GI4361@tamriel.snowman.net
Lists: pgsql-hackers

Tom, all,

* Tom Lane (tgl(at)sss(dot)pgh(dot)pa(dot)us) wrote:
> So that's at least going in the right direction.

I agree that this is going in the right direction; it certainly would
make the plan that I *expect* to be chosen more likely, however..

I've been fiddling with this on the very much larger overall database
where this test case came from and have found that hashing the large
table can actually be *faster* and appears to cause a more consistent
and constant amount of disk i/o (which is good).

The test case exhibits a bit of why this is the case- the per-tuple hash
lookup is way closer to the per-tuple cost of building the hash table
than I'd expect.

per-tuple cost to build the hash table (4.27M tuples): 0.33us
per-tuple cost to scan/do hash lookups (4.27M tuples): 0.29us
(with a small hash table of only 41K entries)

The total difference being: 1233.854 vs. 1428.424, or only 194.570ms in
favor of scanning the big table instead of hashing it.

These numbers are just from those posted with my original email:

http://explain.depesz.com/s/FEq
http://explain.depesz.com/s/FOU

I've seen much worse though- I have one case where hash-the-big-table
took 5s and hash-the-small-table took almost 10s (total times). I'm
trying to see if I can pull that out and isolate how it's different (and
see if it was just due to other load on the box).

What I'm trying to get at in this overall email is: why in the world is
it so expensive to do hash lookups? I would have expected the cost of
building the hash table to be *much* more than the cost of doing a hash
lookup, and that doing hash lookups against a small hash table would be
fast enough to put serious pressure on the i/o subsystem. Instead,
building the hash table actually puts more pressure and can end up being
more efficient overall. We have a process that basically does this a
whole bunch, and the "hash-the-big-table" operation takes about 4.7 hrs,
while the "hash-the-small-table" approach went well past 5 hours and was
only about 70% complete.

Thoughts?

Thanks,

Stephen


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Stephen Frost <sfrost(at)snowman(dot)net>
Cc: Jeff Davis <pgsql(at)j-davis(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Hash Join cost estimates
Date: 2013-04-04 18:19:36
Message-ID: 5420.1365099576@sss.pgh.pa.us
Lists: pgsql-hackers

Stephen Frost <sfrost(at)snowman(dot)net> writes:
> I've been fiddling with this on the very much larger overall database
> where this test case came from and have found that hashing the large
> table can actually be *faster* and appears to cause a more consistent
> and constant amount of disk i/o (which is good).

Interesting.

> What I'm trying to get at in this overall email is: why in the world is
> it so expensive to do hash lookups?

perf or oprofile reveal anything?

Also, I assume that the cases you are looking at are large enough that
even the "small" table doesn't fit in a single hash batch? It could
well be that the answer has to do with some bogus or at least
unintuitive behavior of the batching process, and it isn't really at all
a matter of individual hash lookups being slow.

(You never did mention what work_mem setting you're testing, anyway.)

regards, tom lane


From: Stephen Frost <sfrost(at)snowman(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Jeff Davis <pgsql(at)j-davis(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Hash Join cost estimates
Date: 2013-04-04 18:36:22
Message-ID: 20130404183622.GJ4361@tamriel.snowman.net
Lists: pgsql-hackers

* Tom Lane (tgl(at)sss(dot)pgh(dot)pa(dot)us) wrote:
> > What I'm trying to get at in this overall email is: why in the world is
> > it so expensive to do hash lookups?
>
> perf or oprofile reveal anything?

Working on a test case actually- I've got one now:
http://snowman.net/~sfrost/test_case2.sql

In this example, hashing the large table is actually 2 seconds *faster*
than hashing the small table (again, all on my laptop).

> Also, I assume that the cases you are looking at are large enough that
> even the "small" table doesn't fit in a single hash batch?

No, quite the opposite, sorry for not mentioning that before. Either
side fits completely into memory w/ a single batch. The explain
analyze output that I posted before shows that, either way, there's
only one batch involved.

> (You never did mention what work_mem setting you're testing, anyway.)

The test case above (where I got a 2s faster run time by hashing
the big table) used a work_mem of 1GB.

Thanks!

Stephen


From: Stephen Frost <sfrost(at)snowman(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Jeff Davis <pgsql(at)j-davis(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Hash Join cost estimates
Date: 2013-04-04 19:25:47
Message-ID: 20130404192547.GK4361@tamriel.snowman.net
Lists: pgsql-hackers

* Tom Lane (tgl(at)sss(dot)pgh(dot)pa(dot)us) wrote:
> perf or oprofile reveal anything?

Here's what we get from oprofile (perhaps not too surprising):

Hash the small table / scan the big table:
samples cum. samples % cum. % linenr info image name symbol name
167374 167374 47.9491 47.9491 nodeHash.c:915 postgres ExecScanHashBucket
85041 252415 24.3624 72.3115 mcount.c:60 libc-2.15.so __mcount_internal
28370 280785 8.1274 80.4389 _mcount.S:33 libc-2.15.so mcount
15856 296641 4.5424 84.9814 (no location information) [vdso] (tgid:30643 range:0x7fffe6fff000-0x7fffe6ffffff) [vdso] (tgid:30643 range:0x7fffe6fff000-0x7fffe6ffffff)
6291 302932 1.8022 86.7836 xact.c:682 postgres TransactionIdIsCurrentTransactionId
4555 307487 1.3049 88.0885 instrument.c:70 postgres InstrStopNode
3849 311336 1.1027 89.1912 heapam.c:711 postgres heapgettup_pagemode
3567 314903 1.0219 90.2130 nodeHashjoin.c:63 postgres ExecHashJoin

Hash the big table / scan the small table:
samples cum. samples % cum. % linenr info image name symbol name
112060 112060 39.2123 39.2123 mcount.c:60 libc-2.15.so __mcount_internal
36547 148607 12.7886 52.0009 nodeHash.c:709 postgres ExecHashTableInsert
33570 182177 11.7469 63.7477 _mcount.S:33 libc-2.15.so mcount
16383 198560 5.7328 69.4805 (no location information) [vdso] (tgid:30643 range:0x7fffe6fff000-0x7fffe6ffffff) [vdso] (tgid:30643 range:0x7fffe6fff000-0x7fffe6ffffff)
13200 211760 4.6190 74.0995 (no location information) no-vmlinux /no-vmlinux
6345 218105 2.2203 76.3197 xact.c:682 postgres TransactionIdIsCurrentTransactionId
5250 223355 1.8371 78.1568 nodeHash.c:915 postgres ExecScanHashBucket
4797 228152 1.6786 79.8354 heapam.c:711 postgres heapgettup_pagemode
4661 232813 1.6310 81.4664 aset.c:563 postgres AllocSetAlloc
4588 237401 1.6054 83.0718 instrument.c:70 postgres InstrStopNode
3550 240951 1.2422 84.3140 memcpy-ssse3-back.S:60 libc-2.15.so __memcpy_ssse3_back
3013 243964 1.0543 85.3684 aset.c:1109 postgres AllocSetCheck

Looking at the 'Hash the small table / scan the big table' case, opannotate
claims that this is by far the worst offender:

147588 42.2808 : hashTuple = hashTuple->next;

While most of the time in the 'Hash the big table / scan the small
table' is in:

34572 12.0975 : hashTuple->next = hashtable->buckets[bucketno];

Neither of those strike me as terribly informative though. To be
honest, I've not really played w/ oprofile all that much. Now that I've
got things set up to support this, I'd be happy to provide more info if
anyone has suggestions on how to get something more useful.

It does look like reducing bucket depth, as I outlined before through
the use of a 2-level hashing system, might help speed up
ExecScanHashBucket, as it would hopefully have very few (eg: 1-2)
entries to consider instead of more. Along those same lines, I really
wonder if we're being too generous wrt the bucket-depth goal of '10'
instead of, say, '1', especially when we've got plenty of work_mem
available.

Thanks,

Stephen


From: Stephen Frost <sfrost(at)snowman(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Jeff Davis <pgsql(at)j-davis(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Hash Join cost estimates
Date: 2013-04-04 19:32:23
Message-ID: 20130404193223.GL4361@tamriel.snowman.net
Lists: pgsql-hackers

* Stephen Frost (sfrost(at)snowman(dot)net) wrote:
> 85041 252415 24.3624 72.3115 mcount.c:60 libc-2.15.so __mcount_internal
> 28370 280785 8.1274 80.4389 _mcount.S:33 libc-2.15.so mcount
[...]

And as a side-note, I'm rebuilding w/o profiling, asserts, etc and will
run it again, though I don't really expect the top-hitters to change.

Thanks,

Stephen


From: Stephen Frost <sfrost(at)snowman(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Jeff Davis <pgsql(at)j-davis(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Hash Join cost estimates
Date: 2013-04-04 20:16:12
Message-ID: 20130404201612.GM4361@tamriel.snowman.net
Lists: pgsql-hackers

* Stephen Frost (sfrost(at)snowman(dot)net) wrote:
> It does look like reducing bucket depth, as I outlined before through
> the use of a 2-level hashing system, might help speed up
> ExecScanHashBucket, as it would hopefully have very few (eg: 1-2)
> entries to consider instead of more. Along those same lines, I really
> wonder if we're being too generous wrt the bucket-depth goal of '10'
> instead of, say, '1', especially when we've got plenty of work_mem
> available.

Rerunning using a minimally configured build (only --enable-openssl
and --enable-debug passed to configure) with NTUP_PER_BUCKET set to '1'
results in a couple of interesting things-

First, the planner actually picks the plan to hash the small table and
seqscan the big one. That also, finally, turns out to be *faster* for
this test case.

explain analyze results here:
Hash small table / seqscan big table: http://explain.depesz.com/s/nP1
Hash big table / seqscan small table: http://explain.depesz.com/s/AUv

Here's the oprofile reports:

Hash small table / seqscan big table:
samples cum. samples % cum. % linenr info image name symbol name
39023 39023 52.8574 52.8574 nodeHash.c:915 postgres ExecScanHashBucket
3743 42766 5.0700 57.9273 xact.c:682 postgres TransactionIdIsCurrentTransactionId
3110 45876 4.2126 62.1399 nodeHashjoin.c:63 postgres ExecHashJoin
2561 48437 3.4689 65.6088 heapam.c:711 postgres heapgettup_pagemode
2427 50864 3.2874 68.8962 heapam.c:300 postgres heapgetpage
2395 53259 3.2441 72.1403 heaptuple.c:1028 postgres slot_deform_tuple
2395 55654 3.2441 75.3843 heaptuple.c:1135 postgres slot_getattr
2383 58037 3.2278 78.6122 nodeHash.c:786 postgres ExecHashGetHashValue
1811 59848 2.4530 81.0652 tqual.c:1044 postgres HeapTupleSatisfiesMVCC
1796 61644 2.4327 83.4979 execScan.c:111 postgres ExecScan
1298 62942 1.7582 85.2561 hashfunc.c:517 postgres hash_uint32
1274 64216 1.7257 86.9817 execProcnode.c:356 postgres ExecProcNode
1011 65227 1.3694 88.3511 heapam.c:1453 postgres heap_getnext
905 66132 1.2258 89.5770 execTuples.c:333 postgres ExecStoreTuple
858 66990 1.1622 90.7392 fmgr.c:1291 postgres FunctionCall1Coll
835 67825 1.1310 91.8702 execQual.c:668 postgres ExecEvalScalarVarFast
834 68659 1.1297 92.9999 mcxt.c:126 postgres MemoryContextReset
818 69477 1.1080 94.1078 nodeSeqscan.c:48 postgres SeqNext

Hash big table / seqscan small table:
samples cum. samples % cum. % linenr info image name symbol name
38612 38612 41.2901 41.2901 nodeHash.c:709 postgres ExecHashTableInsert
7435 46047 7.9507 49.2408 (no location information) no-vmlinux /no-vmlinux
4900 50947 5.2399 54.4806 aset.c:563 postgres AllocSetAlloc
3803 54750 4.0668 58.5474 xact.c:682 postgres TransactionIdIsCurrentTransactionId
3335 58085 3.5663 62.1137 heapam.c:711 postgres heapgettup_pagemode
2532 60617 2.7076 64.8213 nodeHash.c:786 postgres ExecHashGetHashValue
2523 63140 2.6980 67.5193 memcpy-ssse3-back.S:60 libc-2.15.so __memcpy_ssse3_back
2518 65658 2.6926 70.2119 heaptuple.c:1028 postgres slot_deform_tuple
2378 68036 2.5429 72.7549 heapam.c:300 postgres heapgetpage
2374 70410 2.5387 75.2935 heaptuple.c:1135 postgres slot_getattr
1852 72262 1.9805 77.2740 nodeHash.c:915 postgres ExecScanHashBucket
1831 74093 1.9580 79.2320 tqual.c:1044 postgres HeapTupleSatisfiesMVCC
1732 75825 1.8521 81.0841 heapam.c:1453 postgres heap_getnext
1320 77145 1.4116 82.4957 nodeHash.c:76 postgres MultiExecHash
1219 78364 1.3035 83.7992 heaptuple.c:1529 postgres minimal_tuple_from_heap_tuple
1212 79576 1.2961 85.0953 execProcnode.c:356 postgres ExecProcNode
1209 80785 1.2929 86.3881 hashfunc.c:517 postgres hash_uint32
1197 81982 1.2800 87.6682 execScan.c:111 postgres ExecScan
1139 83121 1.2180 88.8862 execTuples.c:333 postgres ExecStoreTuple
1010 84131 1.0801 89.9662 execTuples.c:662 postgres ExecFetchSlotMinimalTuple
961 85092 1.0277 90.9939 aset.c:821 postgres AllocSetFree

Looking with opannotate, there are two main hotspots in
ExecScanHashBucket:

12846 17.4001 : hashTuple = hashtable->buckets[hjstate->hj_CurBucketNo];

and

22100 29.9348 : hashTuple = hashTuple->next;

I'm certainly curious about those, but I'm also very interested in the
possibility of making NTUP_PER_BUCKET much smaller, or perhaps variable
depending on the work_mem setting. It's only used in
ExecChooseHashTableSize, so while making it variable or depending on
work_mem could slow planning down a bit, it's not a per-tuple cost item.
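
For reference, the only place it comes into play is roughly this
(paraphrasing ExecChooseHashTableSize() from memory- a sketch, not a
quote):

    dbuckets = ceil(ntuples / NTUP_PER_BUCKET);  /* aim for ~10 tuples per bucket */
    dbuckets = Min(dbuckets, max_pointers);      /* bounded by what work_mem allows */
    nbuckets = (int) dbuckets;                   /* then rounded up to a power of 2,
                                                  * which is how 41176/10 ends up as
                                                  * the 8192 buckets we saw */

so making the divisor depend on work_mem (or dropping it entirely when
everything fits in a single batch) would be a very localized change.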

Thoughts?

Thanks,

Stephen


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Stephen Frost <sfrost(at)snowman(dot)net>
Cc: Jeff Davis <pgsql(at)j-davis(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Hash Join cost estimates
Date: 2013-04-04 20:29:59
Message-ID: 10502.1365107399@sss.pgh.pa.us
Lists: pgsql-hackers

Stephen Frost <sfrost(at)snowman(dot)net> writes:
> Looking with opannotate, there's two main hotspots in
> ExecScanHashBucket:

> 12846 17.4001 : hashTuple = hashtable->buckets[hjstate->hj_CurBucketNo];
> and
> 22100 29.9348 : hashTuple = hashTuple->next;

Those are, of course, pretty trivial statements; so the way I read this
is that the fundamental slowdown comes from the hash table being large
compared to the CPU's cache, so that you're incurring lots of cache
misses at these particular fetches. (You might be able to confirm that
if you can set oprofile to count cache misses rather than wall clock
time.)

> I'm certainly curious about those, but I'm also very interested in the
> possibility of making NTUP_PER_BUCKET much smaller, or perhaps variable
> depending on the work_mem setting.

Not sure about that. That would make the hash-bucket-header array
larger without changing the size of the rest of the hash table, thus
probably making the CPU cache situation worse not better (which would
manifest as more time at the first of these two statements relative to
the second).

Can you add some instrumentation code to get us information about the
average/max length of the bucket chains? And maybe try to figure out
how many distinct hash values per bucket, which would give us a clue
whether your two-level-list idea is worth anything.
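
Something quick-and-dirty along these lines would probably suffice
(untested sketch; it only counts chain lengths, so counting distinct
hash codes per bucket would take a bit more bookkeeping):

    static void
    report_bucket_stats(HashJoinTable hashtable)
    {
        long    totaltuples = 0;
        int     maxlen = 0;
        int     i;

        for (i = 0; i < hashtable->nbuckets; i++)
        {
            HashJoinTuple tuple = hashtable->buckets[i];
            int     len = 0;

            for (; tuple != NULL; tuple = tuple->next)
                len++;
            totaltuples += len;
            if (len > maxlen)
                maxlen = len;
        }
        elog(LOG, "hashjoin: %d buckets, %ld tuples, avg chain %.2f, max chain %d",
             hashtable->nbuckets, totaltuples,
             (double) totaltuples / hashtable->nbuckets, maxlen);
    }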

regards, tom lane


From: "ktm(at)rice(dot)edu" <ktm(at)rice(dot)edu>
To: Stephen Frost <sfrost(at)snowman(dot)net>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Jeff Davis <pgsql(at)j-davis(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Hash Join cost estimates
Date: 2013-04-04 21:11:13
Message-ID: 20130404211113.GN32580@aart.rice.edu
Lists: pgsql-hackers

On Thu, Apr 04, 2013 at 04:16:12PM -0400, Stephen Frost wrote:
> * Stephen Frost (sfrost(at)snowman(dot)net) wrote:
> > It does look like reducing bucket depth, as I outlined before through
> > the use of a 2-level hashing system, might help speed up
> > ExecScanHashBucket, as it would hopefully have very few (eg: 1-2)
> > entries to consider instead of more. Along those same lines, I really
> > wonder if we're being too generous wrt the bucket-depth goal of '10'
> > instead of, say, '1', especially when we've got plenty of work_mem
> > available.
>
> Rerunning using a minimally configured build (only --enable-openssl
> and --enable-debug passed to configure) with NTUP_PER_BUCKET set to '1'
> results in a couple of interesting things-
>
> First, the planner actually picks the plan to hash the small table and
> seqscan the big one. That also, finally, turns out to be *faster* for
> this test case.
>
> ...
>
> I'm certainly curious about those, but I'm also very interested in the
> possibility of making NTUP_PER_BUCKET much smaller, or perhaps variable
> depending on the work_mem setting. It's only used in
> ExecChooseHashTableSize, so while making it variable or depending on
> work_mem could slow planning down a bit, it's not a per-tuple cost item.
>
+1 for adjusting this based on work_mem value.

Ken


From: Matthias <nitrogenycs(at)gmail(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Stephen Frost" <sfrost(at)snowman(dot)net>
Cc: "Jeff Davis" <pgsql(at)j-davis(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Hash Join cost estimates
Date: 2013-04-05 11:54:29
Message-ID: op.wu2go3b10uf2nk@nitrogenycs3
Lists: pgsql-hackers

> In this example, hashing the large table is actually 2 seconds *faster*
> than hashing the small table (again, all on my laptop).

Are you running the laptop on battery? The last time I benchmarked pgsql
I used my laptop as well, and it only occurred to me after a lot of
trying that laptops (even with all energy saving disabled in my case)
don't always make for reliable benchmark machines. Things like your CPU
clock speed being dynamically adjusted can produce really strange results.

Also, when I was running on battery the performance numbers could not be
compared in any way to when I was running with the laptop connected
straight to a socket. Things like the IO/CPU ratio were completely
different. And the numbers on the final testing servers were different
again.

Of course your test case might not be affected by this at all, but it's
something to watch out for.

-Matthias


From: Stephen Frost <sfrost(at)snowman(dot)net>
To: Matthias <nitrogenycs(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Jeff Davis <pgsql(at)j-davis(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Hash Join cost estimates
Date: 2013-04-05 12:08:24
Message-ID: 20130405120824.GN4361@tamriel.snowman.net
Lists: pgsql-hackers

* Matthias (nitrogenycs(at)gmail(dot)com) wrote:
> >In this example, hashing the large table is actually 2 seconds *faster*
> >than hashing the small table (again, all on my laptop).
>
> Are you running the laptop on battery? When I've benchmarked pgsql
> last time I used my laptop as well and it only occured to me after a
> lot of trying that laptops (even with all energy saving disabled in
> my case) don't always make for reliable benchmark machines. Things
> like your CPU clockspeed being dynamically adjusted can produce
> really strange results.

Those runs were with the laptop plugged in, but I've also run it on
battery and, while the performance is certainly different between those
two cases, the relative speed of hashing vs. hash-lookup has been
consistent. Also, that's why I provided the test case- feel free (and
please do!) to test it on any/all hardware you can find. I'd love to
hear reports from others on their experiences. Also, the relative speeds
in my laptop runs matched the performance on the big server where this
all originated (the laptop was slower, but slower in both paths in a
comparable way).

> Of course your test case might not be affected by this at all, but
> it's something to watch out for.

Certainly.

Thanks,

Stephen


From: Stephen Frost <sfrost(at)snowman(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Jeff Davis <pgsql(at)j-davis(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Hash Join cost estimates
Date: 2013-04-13 03:06:19
Message-ID: 20130413030619.GN4361@tamriel.snowman.net
Lists: pgsql-hackers

* Tom Lane (tgl(at)sss(dot)pgh(dot)pa(dot)us) wrote:
> Stephen Frost <sfrost(at)snowman(dot)net> writes:
> > I'm certainly curious about those, but I'm also very interested in the
> > possibility of making NTUP_PER_BUCKET much smaller, or perhaps variable
> > depending on the work_mem setting.
>
> Not sure about that. That would make the hash-bucket-header array
> larger without changing the size of the rest of the hash table, thus
> probably making the CPU cache situation worse not better (which would
> manifest as more time at the first of these two statements relative to
> the second).

In the testing that I've done, I've yet to find a case where having a
smaller NTUP_PER_BUCKET makes things worse, and it can be a dramatic
improvement. Regarding the memory situation, I'm not sure that fewer,
deeper buckets really help, though I'm no CPU architect. However,
assuming that the CPU is smart enough to only pull in the bits of memory
that it needs, I'm not sure how having to pull in parts of a large array
of pointers is worse than having to go fetch randomly placed entries in
a bucket, especially since we have to step through an entire bucket each
time and the bucket tuples are allocated independently, while the hash
array is allocated once. Indeed, it seems you'd be more likely to get
other things you need in a given pull into cache with the hash table
array than with the bucket tuple entries, which might well include
unrelated garbage.

> Can you add some instrumentation code to get us information about the
> average/max length of the bucket chains? And maybe try to figure out
> how many distinct hash values per bucket, which would give us a clue
> whether your two-level-list idea is worth anything.

I ended up implementing the two-level system and doing some testing with
it. It ends up making hash table building take quite a bit longer and
only improves the scan performance in very select cases. The
improvement requires an individual bucket to have both lots of dups and
a lot of distinct values, because it only helps when it can skip a
significant number of tuples. If we move to a system where there's
rarely more than one distinct value in a bucket, then the chances of a
serious improvement from the two-level system go down that much more.

As such, I've come up with this (trivial) patch which simply modifies
ExecChooseHashTableSize() to ignore NTUP_PER_BUCKET (essentially
treating it as 1) when work_mem is large enough to fit the entire hash
table (which also implies that there is only one batch). I'd love to
hear feedback from others on what this does under different conditions.
This also makes the "hash-the-small-table" case faster for the test case
which I provided earlier (http://snowman.net/~sfrost/test_case2.sql),
and it uses quite a bit less memory too. Now that I've convinced myself
that the two-level system isn't practical and the "hash-the-small-table"
case is actually faster than "hash-the-big-table", I'll start looking
at improving on your proposal to change the costing to favor the smaller
table being hashed more often.
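
(The gist of it, paraphrased rather than quoting the attached patch, is
just to drop the NTUP_PER_BUCKET divisor in the single-batch branch of
ExecChooseHashTableSize():

    dbuckets = ceil(ntuples);                /* was: ceil(ntuples / NTUP_PER_BUCKET) */
    dbuckets = Min(dbuckets, max_pointers);
    nbuckets = (int) dbuckets;

so nothing changes for the multi-batch case.)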

Thanks,

Stephen

Attachment Content-Type Size
ignore-ntup-per-bucket.patch text/x-diff 681 bytes