Re: 9.5: Better memory accounting, towards memory-bounded HashAgg

From: Tomas Vondra <tv(at)fuzzy(dot)cz>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: 9.5: Better memory accounting, towards memory-bounded HashAgg
Date: 2014-10-19 00:46:07
Message-ID: 544309CF.80103@fuzzy.cz

Hi,

On 16.10.2014 03:26, Jeff Davis wrote:
> On Fri, 2014-08-29 at 10:12 -0400, Tom Lane wrote:
>> What about removing the callback per se and just keeping the
>> argument, as it were. That is, a MemoryContext has a field of type
>> "size_t *" that is normally NULL, but if set to non-NULL, then we
>> increment the pointed-to value for pallocs and decrement for
>> pfrees.
>
> I like that idea; patch attached. Two differences:
> * At block level, not chunk level.
> * I use a uint64, because a size_t is only guaranteed to hold one
> allocation, and we need to sum up many allocations.
>
> When unused, it does still appear to have a little overhead in
> Robert's test on his power machine. It seems to be between a 0.5-1.0%
> regression. There's not much extra code on that path, just a branch,
> pointer dereference, and an addition; so I don't think it will get
> much cheaper than it is.
>
> This regression seems harder to reproduce on my workstation, or
> perhaps it's just noisier.

I assume it's the "reindex pgbench_accounts_pkey" test, correct?

I did the same test on my workstation (with an i7-4770R CPU), and ISTM
there's a measurable difference, although in surprising directions. For
binaries compiled with gcc 4.7.3, clang 3.3 and clang 3.5, the results
look like this (5 runs for each combination):

gcc 4.7.3 / master
------------------
LOG: internal sort ended, ... CPU 1.24s/4.32u sec elapsed 9.78 sec
LOG: internal sort ended, ... CPU 1.29s/4.31u sec elapsed 9.80 sec
LOG: internal sort ended, ... CPU 1.26s/4.31u sec elapsed 9.80 sec
LOG: internal sort ended, ... CPU 1.26s/4.37u sec elapsed 9.78 sec
LOG: internal sort ended, ... CPU 1.33s/4.29u sec elapsed 9.78 sec

gcc 4.7.3 / patched
-------------------
LOG: internal sort ended, ... CPU 1.35s/4.27u sec elapsed 9.78 sec
LOG: internal sort ended, ... CPU 1.33s/4.30u sec elapsed 9.74 sec
LOG: internal sort ended, ... CPU 1.34s/4.27u sec elapsed 9.75 sec
LOG: internal sort ended, ... CPU 1.27s/4.26u sec elapsed 9.78 sec
LOG: internal sort ended, ... CPU 1.35s/4.26u sec elapsed 9.78 sec

clang 3.3 / master
------------------
LOG: internal sort ended, ... CPU 1.32s/4.61u sec elapsed 10.00 sec
LOG: internal sort ended, ... CPU 1.27s/4.48u sec elapsed 9.95 sec
LOG: internal sort ended, ... CPU 1.35s/4.48u sec elapsed 9.99 sec
LOG: internal sort ended, ... CPU 1.32s/4.49u sec elapsed 9.97 sec
LOG: internal sort ended, ... CPU 1.32s/4.47u sec elapsed 10.04 sec

clang 3.3 / patched
-------------------
LOG: internal sort ended, ... CPU 1.35s/4.59u sec elapsed 10.13 sec
LOG: internal sort ended, ... CPU 1.31s/4.61u sec elapsed 10.06 sec
LOG: internal sort ended, ... CPU 1.28s/4.63u sec elapsed 10.10 sec
LOG: internal sort ended, ... CPU 1.27s/4.58u sec elapsed 10.01 sec
LOG: internal sort ended, ... CPU 1.29s/4.60u sec elapsed 10.05 sec

clang 3.5 / master
------------------
LOG: internal sort ended, ... CPU 1.30s/4.46u sec elapsed 9.96 sec
LOG: internal sort ended, ... CPU 1.30s/4.49u sec elapsed 9.96 sec
LOG: internal sort ended, ... CPU 1.30s/4.53u sec elapsed 9.95 sec
LOG: internal sort ended, ... CPU 1.25s/4.51u sec elapsed 9.95 sec
LOG: internal sort ended, ... CPU 1.30s/4.50u sec elapsed 9.95 sec

clang 3.5 / patched
-------------------
LOG: internal sort ended, ... CPU 1.32s/4.59u sec elapsed 9.97 sec
LOG: internal sort ended, ... CPU 1.27s/4.49u sec elapsed 9.91 sec
LOG: internal sort ended, ... CPU 1.29s/4.48u sec elapsed 9.88 sec
LOG: internal sort ended, ... CPU 1.31s/4.49u sec elapsed 9.93 sec
LOG: internal sort ended, ... CPU 1.23s/4.56u sec elapsed 9.90 sec

So while with clang 3.3 I see about a 1% slowdown, with the other two
compilers it seems a tad faster (though that might easily be noise).

It'd be interesting to know what compiler/version Robert used ...

>> One thought in either case is that we don't have to touch the API
>> for MemoryContextCreate: rather, the feature can be enabled in a
>> separate call after making the context.
>
> That seems fairly awkward to me because the pointer needs to be
> inherited from the parent context when not specified. I left the
> extra API call in.

I don't see a big difference between adding a new API "create" method
and adding a separate "enable tracking" method.

What does seem awkward, however, is passing the 'uint64 *' directly as
a parameter, instead of an 'enable_tracking' flag as in the previous
versions. Is there a reason for that?

This allows supplying a pointer to a uint64 variable, i.e. something
like this:

    {
        uint64          mem_used = 0;
        MemoryContext   context
                        = MemoryContextCreateTracked(..., &mem_used);

        ...

        if (mem_used > work_mem * 1024L)
        {
            ... do something ...
        }
    }

Thanks to exposing total_mem like this, it's possible to read the
current value directly, without the API methods available in the
previous versions. It also makes it possible to "detach" the accounting
from the parent (i.e. not count the memory against the parent).

But is this intended/sane?

> The inheritance is awkward anyway, though. If you create a tracked
> context as a child of an already-tracked context, allocations in the
> newer one won't count against the original. I don't see a way around
> that without introducing even more performance problems.

I see this as the main drawback of the current design :-(

The question is whether we can accept such a limitation, and whether we
can enforce it somehow. For example, if you request a new context with
tracking and the parent context already has tracking enabled, we could
simply error out. That would effectively make this an "internal only"
feature, because user code (e.g. custom aggregates in extensions)
couldn't rely on getting a non-tracked parent context, and thus couldn't
use tracking at all. Maybe that's an acceptable limitation?
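
A minimal sketch of what such a check might look like - assuming the
tracking pointer is kept in MemoryContextData as 'track_mem' (as in the
snippet quoted in (b) below); the variable names are illustrative:

    if (track_mem != NULL &&
        parent != NULL &&
        parent->track_mem != NULL)
        elog(ERROR, "cannot enable memory accounting: parent context "
                    "already has accounting enabled");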

If we can get into such a conflicting situation without any external
code, it's pretty much DOA. I can't think of an example, but I'm not
convinced there isn't one ...

Regarding the "performance problems" - ISTM it's important to
differentiate between impact on performance when no memory accounting is
enabled (but the code is there), and impact with the accounting.

In other words:

(a) free when not used (or as close to 'free' as possible)
(b) cheap when used (as cheap as possible)

I think the current patch achieves (a) - it's difficult to beat a
single (track_mem != NULL) check.
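
For illustration, on the allocation path this is just the mirror image
of the deallocation snippet quoted in comment (b) below - a sketch with
block-level granularity as in the v6 patch, where 'blksize' stands for
the size of the newly malloc'ed block:

    if (context->track_mem != NULL)
        *context->track_mem += blksize;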

We haven't seen any benchmarks for (b), at least not with the current
patch. But even if we introduce some additional overhead (and it's
impossible not to), we get something in return - the question is whether
the cost is acceptable. Also, without this accounting, the overhead of
the various local implementations may easily be higher.

> If this patch is unacceptable, my only remaining idea is to add the
> memory only for the current context with no inheritance (thereby
> eliminating the branch, also). That will require HashAgg to traverse all
> the child contexts to check whether the memory limit has been exceeded.
> As Tomas pointed out, that could be a lot of work in the case of
> array_agg with many groups.

I can't really imagine dropping the inheritance ...

A hierarchy of memory contexts with accounting that does not support
hierarchies seems rather strange IMNSHO, making it either useless or at
least very expensive in the interesting cases (array_agg with a context
per group being a prime example).
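
To make that cost concrete, without inheritance HashAgg would have to
walk the whole context subtree every time it checks the limit - roughly
like this (a sketch only; the function is hypothetical, firstchild and
nextchild are the existing MemoryContextData links, and 'track_mem' is
assumed to hold just that context's own memory):

    static uint64
    context_memory_recurse(MemoryContext context)
    {
        uint64          total = (context->track_mem != NULL)
                                    ? *context->track_mem : 0;
        MemoryContext   child;

        for (child = context->firstchild; child != NULL;
             child = child->nextchild)
            total += context_memory_recurse(child);

        return total;
    }

With array_agg creating a context per group, that means one step per
group for every single check.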

ISTM the reason for the issues with inheritance (with two memory
contexts with tracking enabled) is the "single uint64 counter".

In one of my previous messages (from 23/8), I proposed using this
struct to build a "parallel" accounting hierarchy (tweaked to match the
v6 patch):

    typedef struct MemoryAccountingData
    {
        uint64      track_mem;      /* accumulated memory */
        struct MemoryAccountingData *parent;    /* nearest tracked ancestor */
    } MemoryAccountingData;

Each context either has a private instance (if tracking was enabled for
it), or points to the instance of the nearest ancestor with tracking
enabled (if there is one).

In the 'tracking disabled' case, this should have exactly the same
overhead as the v6 patch (still a single (pointer != NULL) check).

In the 'tracking enabled' case, it'll be more expensive, because the
changes need to propagate up the tree. But that happens only if there
are multiple contexts with tracking enabled - if there's a single one,
it should be exactly the same as the v6 patch.
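
A sketch of how the update might propagate (the helper name is made up;
with a single tracked context the loop body runs exactly once, matching
the v6 behaviour):

    static inline void
    MemoryAccountingUpdate(MemoryAccountingData *accounting, int64 size)
    {
        /* walk up the chain of tracked ancestors; size may be negative */
        while (accounting != NULL)
        {
            accounting->track_mem += size;
            accounting = accounting->parent;
        }
    }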

I haven't tried measuring the impact, though, because we don't have
actual code using it. The measurements in the 23/8 message say that with
palloc_bench the overhead was ~3% (compared to 'tracking disabled'). And
since that benchmark does nothing but palloc, real workloads should see
considerably less than +3%.

A few minor comments regarding the patch:

(a) I see we're not tracking self/total values any more. I don't think
we really need those two numbers - the total is enough. Correct?

(b) both AllocSetReset/AllocSetDelete update the accounting info for
each block separately:

        if (context->track_mem != NULL)
            *context->track_mem -= block->endptr - ((char *) block);

Wouldn't it be better to accumulate the total into a local variable and
update track_mem just once at the end (see the sketch after these
comments)? This might matter more when using the MemoryAccountingData
structure with multiple tracking-enabled contexts, though.

(c) Most of the changes in AllocSetContextCreateTracked are only
renames 'context => set', which can be easily eliminated by using

        AllocSet    context = (AllocSet) MemoryContextCreate(...

and casting to MemoryContext at the one place where it's needed.
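
Regarding (b), the accumulate-then-update variant would look roughly
like this (a sketch only - the loop over the blocks being released is
elided, and 'freed' is a made-up local variable):

    Size    freed = 0;

    /* ... inside the loop releasing the blocks ... */
    freed += block->endptr - ((char *) block);
    /* ... */

    if (context->track_mem != NULL)
        *context->track_mem -= freed;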

regards
Tomas
