From: Tomas Vondra <tv(at)fuzzy(dot)cz>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: bad estimation together with large work_mem generates terrible slow hash joins
Date: 2014-06-29 23:24:44
Message-ID: 53B0A03C.3080805@fuzzy.cz

On 26.6.2014 23:48, Tomas Vondra wrote:
> On 26.6.2014 20:43, Tomas Vondra wrote:
>> Attached is v2 of the patch, with some cleanups / minor improvements:
>>
>> * there's a single FIXME, related to counting tuples in the
>
> Meh, I couldn't resist resolving this FIXME, so attached is v3 of the
> patch. This just adds a proper 'batch tuples' counter to the hash table.
>
> All comments, measurements on different queries etc. welcome. We'll
> certainly do a lot of testing, because this was a big issue for us.

Attached is v4 of the patch, with a few minor improvements. The only
thing worth mentioning is overflow protection, similar to what's done in
the ExecChooseHashTableSize() function. Otherwise it's mostly about
improving comments.

Also attached is a v4 with a GUC, which makes it easier to compare the
effect of the patch: simply set "enable_hashjoin_bucket" to "off"
(original behaviour) or "on" (new behaviour).
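For illustration, a comparison session might look like this (just a
sketch, assuming the v4-with-guc patch is applied - the actual queries
A-D are in the attached hashjoin.sql, and the "fact"/"dim" tables here
are hypothetical stand-ins):

    SET work_mem = '64MB';

    SET enable_hashjoin_bucket = off;   -- original behaviour
    EXPLAIN ANALYZE
    SELECT count(*) FROM fact f JOIN dim d ON f.dim_id = d.id;

    SET enable_hashjoin_bucket = on;    -- new behaviour (dynamic buckets)
    EXPLAIN ANALYZE
    SELECT count(*) FROM fact f JOIN dim d ON f.dim_id = d.id;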

And finally there's an SQL script demonstrating the effect of the patch
with various work_mem settings. For example, what I see on my desktop
is this (averages of 3 runs):

===== SMALL WORK MEM (2MB) =====
         no dynamic buckets   dynamic buckets
query A             5945 ms           5969 ms
query B             6080 ms           5943 ms
query C             6531 ms           6822 ms
query D             6962 ms           6618 ms

===== MEDIUM WORK MEM (16MB) =====
         no dynamic buckets   dynamic buckets
query A             7955 ms           7944 ms
query B             9970 ms           7569 ms
query C             8643 ms           8560 ms
query D            33908 ms           7700 ms

===== LARGE WORK MEM (64MB) =====
         no dynamic buckets   dynamic buckets
query A            10235 ms          10233 ms
query B            32229 ms           9318 ms
query C            14500 ms          10554 ms
query D           213344 ms           9145 ms

Where "A" is "exactly estimated" and the other queries suffer by various
underestimates. My observations from this are:

(1) For small work_mem values the patch does not really matter, thanks
    to caching effects (the whole hash table fits into the CPU's L2
    cache).

(2) For medium work_mem values (not really huge, but exceeding the CPU
    caches), the differences are negligible, except for the last query,
    which has the most severe underestimate. In that case the new
    behaviour is much faster.

(3) For large work_mem values, the speedup is pretty obvious, and it
    grows with the severity of the underestimate.
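The underestimates themselves are easy to spot in the EXPLAIN ANALYZE
output, by comparing the planner's estimate on the inner Hash node with
the actual row count (again a hypothetical join, not the actual
queries):

    EXPLAIN ANALYZE
    SELECT count(*) FROM fact f JOIN dim d ON f.dim_id = d.id;
    -- on the Hash node, compare the planned "rows=" with the actual
    -- "rows="; a big gap means nbuckets was sized for far fewer tuples
    -- than actually arrived, which is exactly the case this patch
    -- addresses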

An obvious question is why one would choose large work_mem values when
the smaller values actually perform better here. Well, the example
tables are not perfectly representative. When the outer table is much
larger and does not fit into RAM so easily (as is the case with large
fact tables or joins), the rescans caused by the larger number of
batches get more expensive and outweigh the caching benefits.

Also, work_mem is a limit shared by other node types, e.g. aggregates,
and decreasing it just because of hash joins would hurt them.
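To illustrate (again with a hypothetical table):

    SET work_mem = '2MB';
    -- an aggregate that would fit into a 64MB in-memory HashAggregate
    -- may now be planned as Sort + GroupAggregate, spilling to disk
    EXPLAIN SELECT dim_id, count(*) FROM fact GROUP BY dim_id;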

regards
Tomas

Attachment                                  Content-Type     Size
hashjoin-nbuckets-growth-v4.patch           text/x-diff      9.4 KB
hashjoin-nbuckets-growth-v4-with-guc.patch  text/x-diff      11.0 KB
hashjoin.sql                                application/sql  10.6 KB
