Re: Memory usage during sorting

From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Hitoshi Harada <umi(dot)tanuki(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Memory usage during sorting
Date: 2012-02-11 19:34:31
Message-ID: CAMkU=1yjkNRNu8HwfaZ-H=dqcoRLdYHApJT=6q73Lj-Rif2K4A@mail.gmail.com
Lists: pgsql-hackers

On Wed, Feb 8, 2012 at 1:01 AM, Hitoshi Harada <umi(dot)tanuki(at)gmail(dot)com> wrote:
> On Sun, Jan 15, 2012 at 4:59 PM, Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:
>>
>> The attached patch allows it to reuse that memory.  On my meager
>> system it reduced the building of an index on an integer column in a
>> skinny 200 million row totally randomly ordered table by about 3% from
>> a baseline of 25 minutes.
>>
>
> Just to give a standard review, this patch is one line change and
> applies cleanly, builds ok.
>
> I'm not pretty sure what exactly you're trying to accomplish, but it
> seems to me that it's avoiding the first dumptuples cycle by forcing
> availMem = 0 even if it's negative.

Yes. Currently, when it switches to the TSS_BUILDRUNS part of a
tape-sort, it starts by calling WRITETUP a large number of times
consecutively, to work off the memory deficit incurred by the 3 blocks
per tape of tape overhead, and only after that does it call WRITETUP
about once per puttuple. Under my patch, it would only call WRITETUP
about once per puttuple, right from the beginning.
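
To make that concrete, here is a toy standalone sketch of the
accounting involved (this is not PostgreSQL code; the constants are
made up, and the clamp under -DPATCHED is just the idea, not the
actual patch). Built as-is it reports the size of the initial
consecutive WRITETUP burst; built with -DPATCHED the burst goes away:

/*
 * Toy model of the availMem accounting around the switch to
 * TSS_BUILDRUNS.  All sizes are illustrative only.
 */
#include <stdio.h>

#define MEM_LIMIT     (16L * 1024 * 1024)     /* work_mem analogue: 16 MB */
#define TUPLE_SIZE    64L                     /* pretend cost per tuple   */
#define BLCKSZ        8192L
#define NTAPES        7L                      /* illustrative tape count  */
#define TAPE_OVERHEAD (3L * BLCKSZ * NTAPES)  /* "3 blocks per tape"      */

int main(void)
{
    long availMem = MEM_LIMIT;
    long burst = 0;

    /* Phase 1: fill memory with incoming tuples until it is full. */
    availMem -= (MEM_LIMIT / TUPLE_SIZE) * TUPLE_SIZE;

    /* Switching to TSS_BUILDRUNS: charge the per-tape buffer overhead. */
    availMem -= TAPE_OVERHEAD;

#ifdef PATCHED
    /* The idea under discussion: treat the overdraft as already spent
     * rather than paying it back with a long run of consecutive writes. */
    if (availMem < 0)
        availMem = 0;
#endif

    /* Work off any remaining deficit by writing tuples out back-to-back. */
    while (availMem < 0)
    {
        availMem += TUPLE_SIZE;   /* each write frees one tuple's memory */
        burst++;
    }

    printf("tuples written in the initial consecutive burst: %ld\n", burst);
    return 0;
}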

> I read your comments as it'd be
> avoiding to alternate reading/writing back and force with scattered
> memory failing memory cache much during merge phase, but actually it
> doesn't affect merge phase but only init-dump phase, does it?

It affects the building of the runs. But this building of the runs is
not a simple dump; it is itself a mini merge phase, in which it merges
the existing in-memory priority queue against the still-incoming
tuples from the node which invoked the sort. By using less memory
than it could, the resulting runs are smaller than they could be, and
so will sometimes necessitate an additional layer of merging later on.
(This effect is particularly large for the very first run being built.
Generally, by merging incoming tuples into the in-memory tuples, you
can create runs that are about 1.7 times the size of what fits in
memory. By wasting some memory, we are getting 1.7 times the size of
a smaller starting point. But for the first run, it is worse than
that. Most of the benefit that leads to that 1.7 multiplier comes at
the very early stage of each run-build, so by initially using the full
memory but then writing out a bunch of tuples without merging in any
incoming ones, we have truncated the part that gives the most benefit.)
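
To make the run-building behavior concrete, here is a small standalone
sketch of replacement selection (again not PostgreSQL code; the heap
size, input length, and seed are arbitrary). It merges a fixed-size
in-memory heap against the still-incoming stream, exactly as described
above, and prints each run's length as a multiple of what fits in
memory; on random input the first run comes out around 1.7 times the
heap size and the later ones somewhat longer:

/*
 * Standalone sketch: replacement selection over a fixed-size heap,
 * illustrating why runs built this way are longer than what fits in
 * memory when the input is randomly ordered.
 */
#include <stdio.h>
#include <stdlib.h>

#define HEAP_SIZE 10000                 /* "tuples that fit in work_mem" */
#define NINPUT    (HEAP_SIZE * 10)

typedef struct { int run; int val; } HeapItem;

static HeapItem heap[HEAP_SIZE];
static int      heap_count = 0;

/* Order by run number first, then value: the current run sorts ahead. */
static int item_less(HeapItem a, HeapItem b)
{
    return (a.run != b.run) ? (a.run < b.run) : (a.val < b.val);
}

static void heap_push(HeapItem it)
{
    int i = heap_count++;
    heap[i] = it;
    while (i > 0 && item_less(heap[i], heap[(i - 1) / 2]))
    {
        HeapItem tmp = heap[i];
        heap[i] = heap[(i - 1) / 2];
        heap[(i - 1) / 2] = tmp;
        i = (i - 1) / 2;
    }
}

static HeapItem heap_pop(void)
{
    HeapItem top = heap[0];
    int i = 0;
    heap[0] = heap[--heap_count];
    for (;;)
    {
        int l = 2 * i + 1, r = 2 * i + 2, min = i;
        if (l < heap_count && item_less(heap[l], heap[min])) min = l;
        if (r < heap_count && item_less(heap[r], heap[min])) min = r;
        if (min == i) break;
        HeapItem tmp = heap[i]; heap[i] = heap[min]; heap[min] = tmp;
        i = min;
    }
    return top;
}

int main(void)
{
    int produced = 0, current_run = 1;
    long run_len = 0;
    srand(12345);

    /* Fill memory first, as tuplesort does before switching to run building. */
    while (heap_count < HEAP_SIZE && produced < NINPUT)
    {
        HeapItem it = { 1, rand() };
        heap_push(it);
        produced++;
    }

    /* Merge the in-memory heap against the still-incoming tuples. */
    while (heap_count > 0)
    {
        HeapItem out = heap_pop();
        if (out.run != current_run)
        {
            printf("run %d: %ld tuples (%.2fx memory)\n",
                   current_run, run_len, (double) run_len / HEAP_SIZE);
            current_run = out.run;
            run_len = 0;
        }
        run_len++;
        if (produced < NINPUT)
        {
            HeapItem in = { current_run, rand() };
            /* A value below the one just written must wait for the next run. */
            if (in.val < out.val)
                in.run = current_run + 1;
            heap_push(in);
            produced++;
        }
    }
    printf("run %d: %ld tuples (%.2fx memory)\n",
           current_run, run_len, (double) run_len / HEAP_SIZE);
    return 0;
}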

My analysis that the "freed" memory is never reused (because we refuse
to reuse it ourselves and it is too fragmented to be reused by anyone
else, like the palloc or VM system) only applies to the run-building
phase. So "never" was a bit of an overstatement. By the time the last
initial run is completely written out to tape, the heap used for the
priority queue should be totally empty. So at this point the
allocator would have the chance to congeal all of the fragmented
memory back into larger chunks, or maybe it parcels the allocations
back out again in an order so that the unused space is contiguous and
could be meaningfully paged out.

But is it worth worrying about how much we fragment memory, and
whether we overshoot our promises by 10 or 20%?

> If so,
> I'm not so convinced your benchmark gave 3 % gain by this change.
> Correct me as I'm probably wrong.

I've now done more complete testing: building an index on a
200,000,000-row table with an integer column populated in random order
with integers from 1..500,000,000, non-unique, on a machine with 2GB
of RAM and 600MB of shared_buffers.
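
The test was along these lines (the exact table and index definitions
here are only a plausible reconstruction, since I didn't post the
original script):

-- repeated for each maintenance_work_mem setting in the table below
SET maintenance_work_mem = '16384kB';

-- 200M rows, non-unique integers in 1..500,000,000, random order
CREATE TABLE sort_test AS
    SELECT (floor(random() * 500000000) + 1)::int AS id
    FROM generate_series(1, 200000000);

\timing on
CREATE INDEX sort_test_id_idx ON sort_test (id);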

It improves things by 6-7 percent at the low end of the
maintenance_work_mem range; the rest are in the noise, except at 77936
kB, where it reproducibly makes things 4% worse, for reasons I haven't
figured out. So maybe the best thing to do, rather than micromanaging
memory usage, is simply not to set maintenance_work_mem too low. (But
that low end is the default.)

maintenance_work_mem (kB)    % change with patch
 16384                        6.2
 19484                        7.8
 23170                       -0.1
 27554                        0.1
 32768                        1.9
 38968                       -1.9
 46341                        0.4
 55109                        0.4
 65536                        0.6
 77936                       -4.3
 92682                       -0.3
110218                        0.1
131072                        1.7

> Anyway, it's nice to modify the comment just above the change, since
> it's now true with it.

Much of that comment refers to other things (some of which I don't
understand). How about an additional comment, just before my code,
along the lines of:

/*
 * If we have already used, and thus fragmented, the memory, then we
 * might as well continue to make use of it, as no one else can.
 */

(That might not actually be true if the tuples we are sorting are very
large, or perhaps if they arrive in already reverse-sorted order.)

Thanks,

Jeff
