Re: 9.5: Memory-bounded HashAgg

From: "Tomas Vondra" <tv(at)fuzzy(dot)cz>
To: "Jeff Davis" <pgsql(at)j-davis(dot)com>
Cc: "Tomas Vondra" <tv(at)fuzzy(dot)cz>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: 9.5: Memory-bounded HashAgg
Date: 2014-08-13 10:31:37
Message-ID: c12f491d2768d8753270a94f9464a2e7.squirrel@sq.gransy.com
Lists: pgsql-hackers

On 13 August 2014, 7:02, Jeff Davis wrote:
> On Tue, 2014-08-12 at 14:58 +0200, Tomas Vondra wrote:
>> CREATE AGGREGATE myaggregate (
>> ...
>> SERIALIZE_FUNC = 'dump_data',
>> DESERIALIZE_FUNC = 'read_data',
>> ...
>> );
>
> Seems reasonable.
>
>> I don't see why it should get messy? In the end, you have a chunk of
>> data and a hash for it.
>
> Perhaps it's fine; I'd have to see the approach.
>
>> It just means you need to walk through the hash table, look at the
>> hashes and dump ~50% of the groups to a file.
>
> If you have fixed-size states, why would you *want* to remove the group?
> What is gained?

You're right that for your batching algorithm (based on lookups), that's
not really needed, and keeping everything in memory is a good initial
approach.

My understanding of the batching algorithm (and I may be wrong on this
one) is that once you choose the number of batches, it's pretty much
fixed. Is that the case?

But what will happen in case of a significant cardinality underestimate?
I.e. what happens if you decide to use 16 batches, and then find out 256
would be more appropriate? I believe you'll end up with batches 16x the
size you'd want, most likely exceeding work_mem.

Do I understand that correctly?
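
Just to illustrate with completely made-up numbers (a back-of-the-envelope
sketch, nothing from your patch):

    #include <stdio.h>

    int main(void)
    {
        double state_size  = 64;    /* avg transition state, bytes (made up) */
        double est_groups  = 2e6;   /* estimate -> 16 batches look fine      */
        double real_groups = 32e6;  /* reality  -> 256 would be needed       */
        int    nbatch      = 16;    /* fixed up-front from the estimate      */

        printf("per-batch memory, estimated: %.0f MB\n",
               est_groups / nbatch * state_size / 1e6);     /* ~8 MB   */
        printf("per-batch memory, actual:    %.0f MB\n",
               real_groups / nbatch * state_size / 1e6);    /* ~128 MB */
        return 0;
    }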

But back to the removal of aggregate states from memory (irrespective
of their size) - this is what makes the hashjoin-style batching possible,
because it:

(a) makes the batching decision simple (peeking at the hash - see the
    sketch below)
(b) makes it possible to repeatedly increase the number of batches
(c) provides a simple trigger for the increase of batch count
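
By "peeking at the hash" I mean something along these lines - just a
sketch with an invented helper name, but the bit-twiddling mirrors how
hash join derives batchno from the hash value (assuming nbatch is a
power of two):

    #include <stdint.h>
    #include <stdbool.h>

    static inline bool
    group_in_current_batch(uint32_t hashvalue, int log2_nbuckets,
                           int nbatch, int curbatch)
    {
        /* low bits pick the bucket, the next log2(nbatch) bits the batch */
        int batchno = (hashvalue >> log2_nbuckets) & (nbatch - 1);

        return (batchno == curbatch);
    }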

Some of this might be achievable even while keeping the states in memory.
I mean, you can add more batches on the fly, and handle this similarly
to hash join while reading tuples from a batch (moving the tuples to
the proper batch, if needed).

The problem is that once you have the memory full, there's no trigger
to alert you that you should increase the number of batches again.
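
With the eviction available, the trigger is trivial - whenever the hash
table would grow past work_mem, double nbatch and dump the groups that
no longer belong to the current batch. A sketch (all the names below are
invented, it's only meant to show the shape of it):

    #include <stddef.h>

    typedef struct GroupHashTable GroupHashTable;   /* opaque, sketch only */

    extern size_t hash_table_memory(GroupHashTable *ht);
    extern void   spill_groups_not_in_batch(GroupHashTable *ht,
                                            int nbatch, int curbatch);

    static void
    consider_growing_nbatch(GroupHashTable *ht, size_t work_mem_bytes,
                            int *nbatch, int curbatch)
    {
        if (hash_table_memory(ht) > work_mem_bytes)
        {
            *nbatch *= 2;
            /* evict the ~50% of groups whose batchno no longer matches */
            spill_groups_not_in_batch(ht, *nbatch, curbatch);
        }
    }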

> One thing I like about my simple approach is that it returns a good
> number of groups after each pass, and then those are completely finished
> (returned to the operator above, even). That's impossible with HashJoin
> because the hashing all needs to be done before the probe phase begins.

The hash-join approach returns ~1/N of the groups after each pass, so I
fail to see how this is better.

> The weakness of my approach is the array_agg case that you mention,
> because this approach doesn't offer a way to dump out transition states.
> It seems like that could be added later, but let me know if you see a
> problem there.

Right. Let's not solve this in the first version of the patch.

>> I think you're missing the point, here. You need to compute the hash in
>> both cases. And then you either can do a lookup or just peek at the
>> first few bits of the hash to see whether it's in the current batch or not.
>
> I understood that. The point I was trying to make (which might or might
> not be true) was that: (a) this only matters for a failed lookup,
> because a successful lookup would just go in the hash table anyway; and
> (b) a failed lookup probably doesn't cost much compared to all of the
> other things that need to happen along that path.

OK. I don't have numbers proving otherwise at hand, and you're probably
right that once the batching kicks in, the other parts are likely more
expensive than this.

> I should have chosen a better example though. For instance: if the
> lookup fails, we need to write the tuple, and writing the tuple is sure
> to swamp the cost of a failed hash lookup.
>
>> is much faster than a lookup. Also, as the hash table grows (beyond L3
>> cache size, which is a few MBs today), it becomes much slower in my
>> experience - that's one of the lessons I learnt while hacking on the
>> hashjoin. And we're dealing with hashagg not fitting into work_mem, so
>> this seems to be relevant.
>
> Could be, but this is also the path that goes to disk, so I'm not sure
> how significant it is.

It may or may not go to disk, actually. The fact that you're doing
batching means the data is written to a temporary file, but with large
amounts of RAM it may never actually reach the disk.

That's because work_mem is only a very soft guarantee - a query may
use multiple work_mem buffers in a perfectly legal way. So users tend
to set it rather conservatively. For example, we have >256GB of RAM in
each machine and usually <24 queries running at the same time, yet we
have only work_mem=800MB. On the few occasions when a hash join is
batched, it usually remains in the page cache and never actually gets
written to disk. Or maybe it gets written, but it's still in the page
cache, so the backend never notices that.

It's true there are other costs though - I/O calls, etc. So it's not free.

>
>> > Because I suspect there are costs in having an extra file around that
>> > I'm not accounting for directly. We are implicitly assuming that the
>> > OS will keep around enough buffers for each BufFile to do sequential
>> > writes when needed. If we create a zillion partitions, then either we
>> > end up with random I/O or we push the memory burden into the OS buffer
>> > cache.
>>
>> Assuming I understand it correctly, I think this logic is broken. Are
>> you saying "We'll try to do memory-bounded hashagg, but not for the
>> really large datasets because of fear we might cause random I/O"?
>
> No, the memory is still bounded even for very high cardinality inputs
> (ignoring array_agg case for now). When a partition is processed later,
> it also might exhaust work_mem, and need to write out tuples to its own
> set of partitions. This allows memory-bounded execution to succeed even
> if the number of partitions each iteration is one, though it will result
> in repeated I/O for the same tuple.

Aha! And the new batches are 'private' to the work item, making it a bit
recursive, right? Is there any reason not to just double the number of
batches globally? I mean, why not just say

nbatches *= 2

which effectively splits each batch in two - half the groups stay in
the current batch, half move to a new one.

That keeps the I/O almost perfectly sequential, because you're reading
a single batch, keeping half the tuples and writing the other half to
a new batch. If you increase the number of batches a bit more, e.g.

nbatches *= 4

then you keep 1/4 of the groups and write into 3 new batches.

That seems like a better solution to me.
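
Sketched out (the batch file API below is invented, it's only meant to
show the idea): doubling nbatch adds one more significant bit to the
batch number, so rereading old batch b splits it into b and
b + old_nbatch - roughly half the tuples stay, the other half goes to
a single new file, which is what keeps the I/O mostly sequential.

    #include <stdint.h>

    typedef struct BatchFile BatchFile;                  /* invented, opaque */
    typedef struct { uint32_t hashvalue; /* + tuple */ } SpilledTuple;

    extern int  batch_read(BatchFile *f, SpilledTuple *tup);    /* 0 at EOF */
    extern void batch_write(BatchFile *f, const SpilledTuple *tup);

    static void
    split_batch(BatchFile *oldfile, BatchFile *keep, BatchFile *moved,
                int log2_nbuckets, int new_nbatch, int curbatch)
    {
        SpilledTuple tup;

        while (batch_read(oldfile, &tup))
        {
            int batchno = (tup.hashvalue >> log2_nbuckets) & (new_nbatch - 1);

            if (batchno == curbatch)
                batch_write(keep, &tup);   /* stays in the current batch    */
            else
                batch_write(moved, &tup);  /* moves to batch b + old_nbatch */
        }
    }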

>
>> While I certainly understand your concerns about generating excessive
>> amount of random I/O, I think the modern filesystem are handling that
>> just fine (coalescing the writes into mostly sequential writes, etc.).
>> Also, current hardware is really good at handling this (controllers
>> with write cache, SSDs etc.).
>
> All of that requires memory. We shouldn't dodge a work_mem limit by
> using the kernel's memory, instead.

Sure, saving memory in one place just to waste it somewhere else is
a poor solution. But I don't think work_mem is a memory-saving tool -
I see it as a memory-limiting protection.

>> Also, if hash-join does not worry about number of batches, why should
>> hashagg worry about that? I expect the I/O patterns to be very similar.
>
> One difference with HashJoin is that, to create a large number of
> batches, the inner side must be huge, which is not the expected
> operating mode for HashJoin[1]. Regardless, every partition that is
> active *does* have a memory cost. HashJoin might ignore that cost, but
> that doesn't make it right.
>
> I think the right analogy here is to Sort's poly-phase merge -- it
> doesn't merge all of the runs at once; see the comments at the top of
> tuplesort.c.
>
> In other words, sometimes it's better to have fewer partitions (for
> hashing) or merge fewer runs at once (for sorting). It does more
> repeated I/O, but the I/O is more sequential.

OK. I don't have a clear opinion on this yet. I don't think the costs
are that high, but maybe I'm wrong about that.

It also seems to me that using HASH_DISK_MAX_PARTITIONS, and then
allowing each work item to create its own set of additional partitions,
effectively renders HASH_DISK_MAX_PARTITIONS futile.

>
>> In any case, trying to fix this by limiting number of partitions seems
>> like a bad approach. I think factoring those concerns into a costing
>> model is more appropriate.
>
> Fair enough. I haven't modeled the cost yet; and I agree that an upper
> limit is quite crude.

OK, let's keep the HASH_DISK_MAX_PARTITIONS for now and improve this later.

>
>> (a) COUNT(DISTINCT) -> this is solved by a custom aggregate
>
> Is there a reason we can't offer a hash-based strategy for this one? It
> would have to be separate hash tables for different aggregates, but it
> seems like it could work.

I don't know the exact reasoning, but apparently that's how the current
planner works. Whenever it sees a COUNT(DISTINCT) it enforces a sort. I
suspect this is out of fear of the memory requirements (a DISTINCT
requires keeping all the items), which might have been perfectly valid
when this was designed.

>
>> (b) bad estimate of required memory -> this is common for aggregates
>> passing 'internal' state (planner uses some quite high defaults)
>
> Maybe some planner hooks? Ideas?

My plan is to add this to CREATE AGGREGATE somehow - either as a
constant parameter (allowing a custom constant state size to be set) or
as a callback to a 'sizing' function (estimating the size based on the
number of items, average width and ndistinct in the group). In any case,
this is independent of this patch.

I think that for this patch we can either (#1) keep the current batching
strategy and proceed with the TODO items you listed in your first patch,

or (#2) investigate the alternative (hash-join-like) batching strategy.
I suppose this could be done after the TODO items, but I'm afraid it may
impact some of them (e.g. the costing). It can be done with the simple
aggregates (using fixed-size types for states), but eventually it will
require adding the serialize/deserialize functions to CREATE AGGREGATE.

Now, I'm very much in favor of option #2 (because that's what works best
with the aggregates I need to use), but I'm also a big fan of the
'available features beat unavailable features 100% of the time' principle.

So if you decide to go with #1 for now, I'm fine with that. I'm open to
doing the next step - either as a follow-up patch, or maybe as an
alternative spin-off of your patch.

regards
Tomas
