Re: bad estimation together with large work_mem generates terrible slow hash joins

From: Tomas Vondra <tv(at)fuzzy(dot)cz>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: bad estimation together with large work_mem generates terrible slow hash joins
Date: 2014-08-19 18:37:55
Message-ID: 53F39983.2020303@fuzzy.cz
Lists: pgsql-hackers

On 19.8.2014 19:05, Robert Haas wrote:
> On Sat, Aug 16, 2014 at 9:31 AM, Tomas Vondra <tv(at)fuzzy(dot)cz> wrote:
>> On 12.8.2014 00:30, Tomas Vondra wrote:
>>> On 11.8.2014 20:25, Robert Haas wrote:
>>>> It also strikes me that when there's only 1 batch, the set of bits
>>>> that map onto the batch number is zero-width, and one zero-width bit
>>>> range is as good as another. In other words, if you're only planning
>>>> to do one batch, you can easily grow the number of buckets on the fly.
>>>> Growing the number of buckets only becomes difficult once you have
>>>> more than one batch.
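
(For context: nodeHash.c's ExecHashGetBucketAndBatch takes the bucket and
batch numbers from disjoint bit ranges of the hash value, roughly like
this, simplified:

    /* nbuckets and nbatch are always powers of two */
    *bucketno = hashvalue & (nbuckets - 1);
    if (nbatch > 1)
        *batchno = (hashvalue >> log2_nbuckets) & (nbatch - 1);
    else
        *batchno = 0;       /* the "zero-width" batch bit range */

With nbatch = 1 the batch bits are zero-width, so nbuckets can grow
without invalidating any batch assignment.)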
>>
>> ...
>>
>>> I was considering reversing the bits of the hash, because that's
>>> pretty much the simplest solution. But I think you're right it might
>>> actually work like this:
>>>
>>> * Are more batches needed?
>>>
>>> (yes) => just use nbuckets = my_log2(work_mem / tuple_size)
>>>
>>> (no) => go ahead, until processing all tuples or hitting work_mem
>>>
>>> (work_mem) => meh, use the same nbuckets above
>>>
>>> (all tuples) => compute optimal nbuckets / resize
>>>
>>>
>>> But I need to think about this a bit. So far it seems to me there's no
>>> way additional batches might benefit from increasing nbuckets further.
>>
>> I think this is a simple and solid solution, solving the batchno
>> computation issues quite nicely. Attached is v10 patch (bare and
>> combined with the dense allocation), that does this:
>>
>> 1) when we know we'll need batching, buckets are sized for full work_mem
>> (using the estimated tuple width, etc.)
>>
>> 2) without the batching, we estimate the 'right number of buckets' for
>> the estimated number of tuples, and keep track of the optimal number
>> as tuples are added to the hash table
>>
>> - if we discover we need to start batching, we keep the current
>> optimal value (which should be the same as the max number of
>> buckets) and don't mess with it anymore (making it possible to
>> compute batch IDs just like before)
>>
>> - also, on the first rebatch (nbatch=1 => nbatch=2) the hash table
>> is resized as part of the rebatch
>>
>> - if the hash build completes without batching, we do the resize
>>
>> I believe the patch is pretty much perfect. I plan to do more thorough
>> testing on a wide range of queries in the next few days.
>>
>> I also removed the 'enable_hash_resize' GUC, because it would be more
>> complex to implement this properly after doing the resize as part of
>> rebatch, etc. So either it would make the patch more complex, or it
>> wouldn't do what the name promises.
>
> A variety of trivial comments on this:
>
> PostgreSQL style is un-cuddled curly braces. Also, multi-line
> comments need to start with a line containing only /* and end with a
> line containing only */. In one place you've added curly braces
> around a single-line block that is otherwise unmodified; please don't
> do that. In one place, you have "becase" instead of "because". In
> another place, you write "add if after it" but it should say "add it
> after it" or maybe better "add the new one after it". Avoid using
> punctuation like "=>" in comments to illustrate the connection between
> sentences; instead, use a connecting word like "then" or "therefore"
> or whatever is appropriate; in this instance, a period followed by the
> start of a new sentence seems sufficient. Revert the removal of a
> single line of whitespace near the top of nodeHash.c.
>
> There are too many things marked XXX in this patch. They should
> either be fixed, if they are real problems, or they should be
> commented in a way that doesn't give rise to the idea that they're
> problems if they aren't.

OK, thanks for pointing this out. Attached is v11 of the patch (both
separate and combined with the dense allocation, as before).

I fixed as many of those issues as possible. All the XXX items were
obsolete, except for one in the chunk_alloc function.

I have also removed one constant.
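
For anyone skimming the thread, the sizing rule for the batched case
quoted above boils down to something like this (a minimal sketch with
illustrative names, not the patch's actual code; my_log2 behaves like
PostgreSQL's, i.e. ceil(log2(n)), so the quoted "nbuckets =
my_log2(work_mem / tuple_size)" is shorthand for rounding up to the next
power of two):

    /* ceil(log2(num)), same behavior as PostgreSQL's my_log2() */
    static int
    my_log2(long num)
    {
        int     i;
        long    limit;

        for (i = 0, limit = 1; limit < num; i++, limit <<= 1)
            ;
        return i;
    }

    /*
     * Batched case: size the bucket array for a full work_mem worth of
     * tuples up front.  With NTUP_PER_BUCKET = 1 that is one bucket per
     * expected tuple, rounded up to a power of two.
     */
    static long
    nbuckets_for_full_work_mem(long work_mem_bytes, long tuple_size)
    {
        return 1L << my_log2(work_mem_bytes / tuple_size);
    }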

>
> OK, now on to some more substantive stuff:
>
> 1. It's not clear to me what the overall effect of this patch on
> memory utilization is. Reducing NTUP_PER_BUCKET from 10 to 1 is going
> to use, on the average, 10x as much bucket-header memory per tuple.
> Specifically, I think it means we'll use about 8 bytes of
> bucket-header memory per tuple instead of 0.8 bytes per tuple. If the
> tuples are narrow, that could be significant; concerns have been
> expressed about that here in the past. Increasing the number of
> buckets could also increase memory usage. On the other hand, the
> dense allocation stuff probably saves a ton of memory, so maybe we come
> out ahead overall, but I'm not sure. Your thoughts, and maybe some test
> results with narrow and wide tuples, would be appreciated.

The effect of the dense allocation was briefly discussed in this thread,
along with some quick measurements:

http://www.postgresql.org/message-id/53BEEA9E.2080009@fuzzy.cz

The dense allocation removes pretty much all the palloc overhead. For a
40B tuple, I got this before the dense allocation:

HashBatchContext: 1451221040 total in 182 blocks; 2826592 free (11
chunks); 1448394448 used

and this with the dense allocation patch applied:

HashBatchContext: 729651520 total in 21980 blocks; 0 free (0 chunks);
729651520 used

So it's pretty much a 50% reduction in memory consumption.

Now, the palloc chunk header is >=16B per tuple, and removing it more
than compensates for the increased number of buckets (in the worst case
there may be ~2 buckets per tuple, which is 2x8B = 16B of bucket
pointers per tuple).
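
For anyone who hasn't read that thread: the trick is that instead of a
separate palloc per tuple, tuples are packed densely into large chunks,
so the per-allocation header is paid once per ~32kB chunk rather than
once per tuple. A minimal sketch of the idea (illustrative names and
signature; the patch's allocator also MAXALIGNs the size, allocates from
the hash batch context, and special-cases over-sized tuples):

    #include <stdlib.h>

    #define CHUNK_DATA_SIZE (32 * 1024)     /* 32kB of tuple data per chunk */

    typedef struct TupleChunk
    {
        struct TupleChunk *next;            /* chunks form a linked list */
        size_t      used;                   /* bytes handed out so far */
        char        data[CHUNK_DATA_SIZE];  /* densely packed tuples */
    } TupleChunk;

    /*
     * Carve 'size' bytes out of the current chunk, starting a new chunk
     * when the current one cannot fit the request.  Assumes size <=
     * CHUNK_DATA_SIZE; error handling omitted in this sketch.
     */
    static void *
    dense_alloc(TupleChunk **chunks, size_t size)
    {
        TupleChunk *chunk = *chunks;

        if (chunk == NULL || chunk->used + size > CHUNK_DATA_SIZE)
        {
            chunk = malloc(sizeof(TupleChunk));   /* palloc() in the patch */
            chunk->next = *chunks;
            chunk->used = 0;
            *chunks = chunk;
        }

        chunk->used += size;
        return chunk->data + (chunk->used - size);
    }

That matches the context dump above: 21980 blocks of roughly 32kB each,
instead of one palloc per tuple.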

I did a number of tests, and the results are quite consistent with this.

> 2. But, on the positive side, modulo the memory utilization questions
> mentioned above, I would expect the impact on hash join performance to
> be positive. Going from 10 tuples per bucket to just 1 should help,
> and on cases where the actual load factor would have ended up much
> higher because of poor estimation, increasing the number of buckets on
> the fly should help even more. I haven't tested this, though.

Yes, those are the results I was getting.

I've done a number of tests, and not a single one showed a slowdown so
far. Most well-estimated queries were 2-3x faster, and the poorly
estimated ones were orders of magnitude faster.

We're working on testing this on a range of workloads; I'll post some
results once I have them.

> I haven't had a chance to completely go through this yet, so these are
> just some initial thoughts.

Thanks!

Tomas

Attachment Content-Type Size
hashjoin-nbuckets-growth-v11.patch text/x-diff 18.7 KB
hashjoin-nbuckets-growth-v11-combined.patch text/x-diff 20.8 KB
