Re: bad estimation together with large work_mem generates terrible slow hash joins

From: Tomas Vondra <tv(at)fuzzy(dot)cz>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: bad estimation together with large work_mem generates terrible slow hash joins
Date: 2014-08-11 22:30:31
Message-ID: 53E94407.9060705@fuzzy.cz
Lists: pgsql-hackers

On 11.8.2014 20:25, Robert Haas wrote:
> On Sat, Aug 9, 2014 at 9:13 AM, Tomas Vondra <tv(at)fuzzy(dot)cz> wrote:
>> Adding the least-significant bit does not work, we need to get back
>> to adding the most-significant one. Not sure what's the least complex
>> way to do that, though.
>>
>> I'm thinking about computing the nbuckets limit (how many buckets may
>> fit into work_mem):
>>
>> tupsize = HJTUPLE_OVERHEAD +
>> MAXALIGN(sizeof(MinimalTupleData)) +
>> MAXALIGN(tupwidth);
>>
>> nbuckets_bits_max = my_log2(work_mem / tupsize)
>>
>> and then start with nbatches from this bit, like this:
>>
>>  31             23      max      15               7                0
>> |----------------|----------------|----------------|----------------|
>> |                |  <- batches    |      free      |  <- buckets    |
>
> That doesn't seem unreasonable provided there's enough bit space, but,
> as you point out, the day might not be far off when there isn't.

Thinking about this a bit more, I believe the danger is not that
imminent. 32 bits mean the Hash node can handle 2^32 tuples in total
(possibly split into multiple batches).

The limit used to be effectively "2^32 buckets" thanks to
NTUP_PER_BUCKET=10, but the patch lowers that to 1, so it's "2^32
tuples" now.
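
Just to put that in perspective - rough numbers, assuming something
like ~50 bytes per in-memory tuple (HJTUPLE_OVERHEAD plus the
MinimalTupleData header already eat a good part of that):

    2^32 tuples * ~50 B/tuple ~= 4.3e9 * 50 B ~= 215 GB

so we'd be hashing a couple hundred gigabytes of tuples before the 32
bits actually run out.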

But while we're a bit closer to exhausting the 32 bits, I don't think
it's really that big an issue. Not only is hashing a table with ~4e9
rows going to be a hellish experience, but if we really want to support
that, we should probably switch to 64-bit hashes.

Because adding some checks is not going to help - it may detect the
issue, but it won't fix it. And actually, even if the two bit ranges
overlap, that does not mean the hashjoin stops working. There'll be a
dependency between batch and bucket numbers, causing non-uniform usage
of the buckets, but it won't blow up.
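
To illustrate, here's a simplified sketch (not the exact code in
nodeHash.c) of how the 32-bit hash gets split into bucket and batch
numbers:

    #include <stdint.h>

    /*
     * Simplified sketch: the low log2(nbuckets) bits of the hash pick
     * the bucket, the next log2(nbatch) bits pick the batch.  Even if
     * the two bit ranges ever overlapped, every tuple would still get
     * a well-defined bucket and batch - the bucket usage within each
     * batch just becomes non-uniform.
     */
    static void
    get_bucket_and_batch(uint32_t hashvalue,
                         int log2_nbuckets, int log2_nbatch,
                         int *bucketno, int *batchno)
    {
        *bucketno = (int) (hashvalue & ((1U << log2_nbuckets) - 1));
        *batchno = (int) ((hashvalue >> log2_nbuckets) &
                          ((1U << log2_nbatch) - 1));
    }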

So I don't think we need to worry about exhausting the 32 bits for now.

> It also strikes me that when there's only 1 batch, the set of bits
> that map onto the batch number is zero-width, and one zero-width bit
> range is as good as another. In other words, if you're only planning
> to do one batch, you can easily grow the number of buckets on the fly.
> Growing the number of buckets only becomes difficult once you have
> more than one batch.

Yes, that's true. It's also roughly what the patch does IMHO.

If you know you'll need more batches, you can start with the maximum
number of buckets right away (because you expect there's more data than
work_mem), so shoot for

    nbuckets = (1 << my_log2(work_mem / tuple_size))

or something like that. It's also true that if you're starting with a
single batch, you can do whatever you want with nbuckets until you need
to do (nbatches *= 2).
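
A rough sketch of that work_mem-based cap, using the tupsize formula
quoted above (log2_ceil here is just a stand-in for what my_log2 does
in the backend; work_mem is in kilobytes):

    #include <stddef.h>

    /* ceil(log2(num)) - a stand-in for the backend's my_log2() */
    static int
    log2_ceil(long num)
    {
        int     bits = 0;
        long    limit;

        for (limit = 1; limit < num; limit <<= 1)
            bits++;
        return bits;
    }

    /*
     * The most bucket bits that can ever be useful: one bucket per
     * tuple that fits into work_mem, rounded up to a power of two.
     * (Not the actual ExecChooseHashTableSize logic, just the idea.)
     */
    static int
    max_nbuckets_bits(long work_mem_kb, size_t tupsize)
    {
        long    max_tuples = (work_mem_kb * 1024L) / (long) tupsize;

        return log2_ceil(max_tuples);
    }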

It's also true that once you're done with the first batch, all the
other batches will be just fine with that number of buckets, because
the batches will be about equally sized.

But I think this is pretty much what the patch does, in the end.

> And I mention that because, in the scenario mentioned in your original
> post ("a hash table with small number of buckets (8192) containing
> large number of tuples [~3.3M]"), presumably what happens is you start
> out thinking you are going to have one batch with 8K buckets. Then,
> as more tuples continue to pour in, you see the load factor rising and
> respond by bumping up the size of the hash table. Now eventually
> you realize, gosh, this isn't even going to fit into work_mem, so you
> need to start batching it. But by that point you've presumably done
> as much as you're going to do in terms of increasing the number of
> buckets; after that, you're just going to add more batches as
> necessary. So not being able to further increase the number of
> buckets once the number of batches is no longer 1 wouldn't be a
> problem in that case.
>
> Now if you start out planning for multiple batches, and then you
> discover you'd like to keep the same number of batches but have more
> buckets per batch, that's gonna be hard. It's not impossible, because
> you could steal some of the unused high-order bits above the number of
> batches, and then concatenate them with the low-order bits that you
> already had, but that seems like kind of an ugly kludge, and AFAICS it
> only helps if the new tuples are narrower by enough to justify the
> cost of splitting all the buckets. I won't say that couldn't happen,
> but I think it would be better to keep that complexity out of the
> patch unless we're absolutely certain it's justified.

I'm definitely in favor of keeping the patch as simple as possible.

I was considering reversing the bits of the hash, because that's
pretty much the simplest solution. But I think you're right that it
might actually work like this:

* Are more batches needed?

  (yes) => just use nbuckets = (1 << my_log2(work_mem / tuple_size))

  (no)  => go ahead, until we've processed all tuples or hit work_mem

      (hit work_mem) => meh, use the same nbuckets as above

      (all tuples)   => compute the optimal nbuckets / resize
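
In other words, something along these lines (a sketch only, with
hypothetical names - not the actual patch):

    /*
     * Sketch of the flow above; nbuckets_max_bits is the work_mem
     * based cap discussed earlier.
     */
    static int
    choose_initial_nbuckets_bits(int expect_multiple_batches,
                                 int nbuckets_max_bits,
                                 int estimated_nbuckets_bits)
    {
        if (expect_multiple_batches)
        {
            /* more data than work_mem anyway - start at the cap */
            return nbuckets_max_bits;
        }

        /*
         * Single batch expected: start from the estimate; then, while
         * loading tuples,
         *  - if work_mem is exceeded, switch to batching and stick
         *    with the same cap as above (no further growth needed),
         *  - if all tuples fit, resize once to the optimal nbuckets
         *    before the probe phase starts.
         */
        return estimated_nbuckets_bits;
    }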

But I need to think about this a bit. So far it seems to me there's no
way additional batches might benefit from increasing nbuckets further.

Each batch should contain about the same number of tuples, or more
correctly, the same number of distinct hash values. But if there are
multiple tuples with the same hash value, increasing the number of
buckets is not going to help (they're all still going to end up in the
same bucket).

And the number of tuples in a batch can only get lower as batches are
added (e.g. when a batch somewhere in the middle gets a lot of tuples
with the same hash value and exceeds work_mem).

I need to think about this a bit more, but so far it seems like a good
clean solution to me.

regards
Tomas
