Re: 9.5: Memory-bounded HashAgg

From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Tomas Vondra <tv(at)fuzzy(dot)cz>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: 9.5: Memory-bounded HashAgg
Date: 2014-08-13 05:02:22
Message-ID: 1407906142.2335.47.camel@jeff-desktop
Lists: pgsql-hackers

On Tue, 2014-08-12 at 14:58 +0200, Tomas Vondra wrote:
> CREATE AGGREGATE myaggregate (
> ...
> SERIALIZE_FUNC = 'dump_data',
> DESERIALIZE_FUNC = 'read_data',
> ...
> );

Seems reasonable.
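
For a fixed-size state, the pair of support functions could be as simple
as flattening the struct into a bytea and back. A minimal sketch follows,
assuming a made-up MyAggState struct and reusing the dump_data/read_data
names from your example; the real signatures and memory-context handling
would be up to the patch:

#include "postgres.h"
#include "fmgr.h"

/* hypothetical fixed-size transition state, for illustration only */
typedef struct MyAggState
{
    int64       count;
    float8      sum;
} MyAggState;

PG_FUNCTION_INFO_V1(dump_data);
Datum
dump_data(PG_FUNCTION_ARGS)
{
    MyAggState *state = (MyAggState *) PG_GETARG_POINTER(0);
    bytea      *buf = (bytea *) palloc(VARHDRSZ + sizeof(MyAggState));

    /* flatten the in-memory state so it can be written to a batch file */
    SET_VARSIZE(buf, VARHDRSZ + sizeof(MyAggState));
    memcpy(VARDATA(buf), state, sizeof(MyAggState));
    PG_RETURN_BYTEA_P(buf);
}

PG_FUNCTION_INFO_V1(read_data);
Datum
read_data(PG_FUNCTION_ARGS)
{
    bytea      *buf = PG_GETARG_BYTEA_PP(0);
    MyAggState *state = (MyAggState *) palloc(sizeof(MyAggState));

    /* rebuild the transition state from the flattened form */
    memcpy(state, VARDATA_ANY(buf), sizeof(MyAggState));
    PG_RETURN_POINTER(state);
}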

> I don't see why it should get messy? In the end, you have a chunk of
> data and a hash for it.

Perhaps it's fine; I'd have to see the approach.

> It just means you need to walk through the hash table, look at the
> hashes and dump ~50% of the groups to a file.

If you have fixed-size states, why would you *want* to remove the group?
What is gained?

One thing I like about my simple approach is that it returns a good
number of groups after each pass, and then those are completely finished
(returned to the operator above, even). That's impossible with HashJoin,
where all of the hashing needs to be done before the probe phase begins.

The weakness of my approach is the array_agg case that you mention,
because this approach doesn't offer a way to dump out transition states.
It seems like that could be added later, but let me know if you see a
problem there.
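
For clarity, the per-pass structure I'm describing looks roughly like
this (a sketch only; helper names such as read_next_tuple,
hashtable_has_room and buffile_write_tuple are made up, not the patch's
actual functions):

/*
 * Sketch of the simple approach.  Each pass aggregates what fits in
 * work_mem, spills everything else to a single file, then finalizes
 * and returns all in-memory groups before the next pass begins.
 */
for (;;)
{
    TupleTableSlot *slot;

    while ((slot = read_next_tuple()) != NULL)
    {
        uint32      hash = compute_group_hash(slot);
        GroupState *group = hash_lookup(hashtable, slot, hash);

        if (group != NULL)
            advance_transition(group, slot);    /* group already in memory */
        else if (hashtable_has_room(hashtable))
            advance_transition(hash_insert(hashtable, slot, hash), slot);
        else
            buffile_write_tuple(spill_file, slot);  /* defer to a later pass */
    }

    /* every group in the table is complete: finalize and emit it now */
    emit_all_groups(hashtable);
    reset_hash_table(hashtable);

    if (spill_file_is_empty(spill_file))
        break;
    /* the spill file becomes the input for the next pass */
    switch_input_to(spill_file);
}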

> I think you're missing the point, here. You need to compute the hash in
> both cases. And then you either can do a lookup or just peek at the first
> few bits of the hash to see whether it's in the current batch or not.

I understood that. The point I was trying to make (which might or might
not be true) was that (a) this only matters for a failed lookup, because
on a successful lookup the tuple just goes into the hash table anyway;
and (b) a failed lookup probably doesn't cost much compared to all of the
other things that need to happen along that path.

I should have chosen a better example though. For instance: if the
lookup fails, we need to write the tuple, and writing the tuple is sure
to swamp the cost of a failed hash lookup.
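
For reference, the "peek at the first few bits" check is essentially what
hashjoin already does in ExecHashGetBucketAndBatch(); a simplified sketch
(batch_file, curbatch and the helpers are placeholders here):

/* the hash is computed regardless of whether batching is in effect */
uint32      hashvalue = compute_group_hash(slot);

if (nbatch > 1)
{
    /* the same bit-splitting hashjoin uses: low bits pick the bucket,
     * the next bits pick the batch */
    int         batchno = (hashvalue >> log2_nbuckets) & (nbatch - 1);

    if (batchno != curbatch)
    {
        /* not ours: write the tuple out, no hash table lookup at all */
        buffile_write_tuple(batch_file[batchno], slot);
        continue;
    }
}
/* otherwise fall through to the normal lookup/insert path */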

> is much faster than a lookup. Also, as the hash table grows (beyond L3
> cache size, which is a few MBs today), it becomes much slower in my
> experience - that's one of the lessons I learnt while hacking on the
> hashjoin. And we're dealing with hashagg not fitting into work_mem, so
> this seems to be relevant.

Could be, but this is also the path that goes to disk, so I'm not sure
how significant it is.

> > Because I suspect there are costs in having an extra file around that
> > I'm not accounting for directly. We are implicitly assuming that the OS
> > will keep around enough buffers for each BufFile to do sequential writes
> > when needed. If we create a zillion partitions, then either we end up
> > with random I/O or we push the memory burden into the OS buffer cache.
>
> Assuming I understand it correctly, I think this logic is broken. Are you
> saying "We'll try to do memory-bounded hashagg, but not for the really
> large datasets because of fear we might cause random I/O"?

No, the memory is still bounded even for very high cardinality inputs
(ignoring the array_agg case for now). When a partition is processed
later, it might also exhaust work_mem and need to write out tuples to
its own set of partitions. This allows memory-bounded execution to
succeed even if the number of partitions at each iteration is one,
though it will result in repeated I/O for the same tuple.
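
To put rough (purely illustrative) numbers on it: if the input has 1M
distinct groups and work_mem holds 100K of them, the degenerate
one-partition case needs 10 passes, and a tuple whose group isn't handled
until the final pass gets written and re-read 9 times. More partitions
per pass cut the rewrites down quickly, but the hash table itself never
exceeds work_mem either way.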

> While I certainly understand your concerns about generating excessive
> amount of random I/O, I think the modern filesystem are handling that just
> fine (coalescing the writes into mostly sequential writes, etc.). Also,
> current hardware is really good at handling this (controllers with write
> cache, SSDs etc.).

All of that requires memory. We shouldn't dodge the work_mem limit by
using the kernel's memory instead.

> Also, if hash-join does not worry about number of batches, why should
> hashagg worry about that? I expect the I/O patterns to be very similar.

One difference with HashJoin is that, to create a large number of
batches, the inner side must be huge, which is not the expected
operating mode for HashJoin[1]. Regardless, every partition that is
active *does* have a memory cost. HashJoin might ignore that cost, but
that doesn't make it right.

I think the right analogy here is to Sort's poly-phase merge -- it
doesn't merge all of the runs at once; see the comments at the top of
tuplesort.c.

In other words, sometimes it's better to have fewer partitions (for
hashing) or merge fewer runs at once (for sorting). It does more
repeated I/O, but the I/O is more sequential.
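
A rough illustration of that trade-off: if the spilled data is 1000x
work_mem, a fan-out of 32 partitions per pass finishes the partitioning
in two passes (32 * 32 > 1000), so each spilled tuple is written twice
before it can be aggregated; a fan-out of 1024 needs only one pass, but
each of those 1024 BufFiles wants its own BLCKSZ buffer (8MB total at 8KB
each) and the writes are scattered across that many files. Tuplesort
faces the same choice when it caps how many runs it merges at once.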

> In any case, trying to fix this by limiting number of partitions seems
> like a bad approach. I think factoring those concerns into a costing
> model is more appropriate.

Fair enough. I haven't modeled the cost yet, and I agree that a hard
upper limit on the number of partitions is quite crude.

> (a) COUNT(DISTINCT) -> this is solved by a custom aggregate

Is there a reason we can't offer a hash-based strategy for this one? It
would have to use separate hash tables for different aggregates, but it
seems like it could work.
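
(For example, in something like SELECT count(DISTINCT a),
count(DISTINCT b) FROM t GROUP BY g, one table would deduplicate (g, a)
pairs and another (g, b) pairs, with the per-group counts assembled
afterwards.)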

> (b) bad estimate of required memory -> this is common for aggregates
> passing 'internal' state (planner uses some quite high defaults)

Maybe some planner hooks? Ideas?

Regards,
Jeff Davis
