Improving the memory allocator

From: Andres Freund <andres(at)anarazel(dot)de>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Improving the memory allocator
Date: 2011-04-25 21:45:32
Message-ID: 201104252345.33408.andres@anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

In the thread http://archives.postgresql.org/message-id/4DA69D60.4000108@2ndquadrant.com /
"Single client performance on trivial SELECTs" I wondered whether it might be
possible to increase the performance of the current allocator by using some
slab allocator like concept.

While avoiding doing my tax returns and having a long and delayed train ride
I started to play around.

The profile I used in this case was:

pgbench -h /tmp/ -p5433 -s 30 pgbench -S -T 20

I hacked together a patch which records every allocation to a file. That patch
obviously slows everything down quite a bit so that I only get about a third
of the normal tps.

The current profile for that looks like that (hope thats recognizable). I
guess the only unclear column is "compile_time". Thats the result of
__builtin_constant_p(size) for the size argument of the various allocation
operations which is true for the case where size is known at compile time.

action | file | func | line | size | compile_time | count
--------------------------+------------------+--------------------------------------+-------+-------+--------------+--------
chunk_alloc | list.c | new_list | 68 | 16 | t | 940753
chunk_alloc | list.c | new_list | 72 | 24 | t | 940753
chunk_alloc | list.c | new_tail_cell | 112 | 16 | t | 202908
chunk_alloc_zero | bitmapset.c | bms_make_singleton | 189 | 8 | f | 129122
chunk_alloc_zero_aligned | value.c | makeString | 55 | 16 | t | 129122
chunk_alloc_zero_aligned | makefuncs.c | makeVar | 73 | 40 | t | 110676
chunk_alloc_zero_aligned | makefuncs.c | makeTargetEntry | 216 | 48 | t | 92230
chunk_alloc | stringinfo.c | initStringInfo | 50 | 1024 | t | 73799
chunk_alloc | bitmapset.c | bms_copy | 119 | 8 | f | 73784
chunk_alloc | lock.c | LockAcquireExtended | 559 | 128 | t | 55441
chunk_alloc | nodeFuncs.c | expression_tree_mutator | 2036 | 40 | t | 55338
chunk_alloc | snapmgr.c | PushActiveSnapshot | 282 | 24 | t | 55338
chunk_alloc | nbtsearch.c | _bt_search | 121 | 24 | t | 36908
chunk_alloc | resowner.c | ResourceOwnerEnlargeCatCacheRefs | 612 | 128 | t | 36893
chunk_alloc | resowner.c | ResourceOwnerEnlargeRelationRefs | 754 | 128 | t | 36893
chunk_alloc_zero | resowner.c | ResourceOwnerCreate | 135 | 168 | t | 36893
chunk_alloc | nodeFuncs.c | expression_tree_mutator | 2027 | 40 | t | 36892
chunk_alloc | snapmgr.c | CopySnapshot | 217 | 72 | f | 36892
chunk_alloc_zero_aligned | copyfuncs.c | _copyVar | 1041 | 40 | t | 36892
chunk_alloc_zero_aligned | equivclass.c | add_eq_member | 452 | 32 | t | 36892
chunk_alloc_zero_aligned | execQual.c | ExecInitExpr | 4231 | 24 | t | 36892
chunk_alloc_zero_aligned | execTuples.c | MakeTupleTableSlot | 118 | 96 | t | 36892
chunk_alloc_zero_aligned | gram.y | makeColumnRef | 12286 | 24 | t | 36892
chunk_alloc | list.c | new_head_cell | 93 | 16 | t | 18529
chunk_alloc | genam.c | RelationGetIndexScan | 77 | 104 | t | 18489
chunk_alloc | nbtree.c | btbeginscan | 385 | 6608 | t | 18489
chunk_alloc | tupdesc.c | CreateTemplateTupleDesc | 63 | 160 | f | 18479
chunk_alloc | genam.c | RelationGetIndexScan | 89 | 72 | f | 18471
chunk_alloc | nbtree.c | btbeginscan | 388 | 72 | f | 18471
chunk_alloc | resowner.c | ResourceOwnerEnlargeBuffers | 524 | 64 | t | 18448
chunk_alloc | trigger.c | AfterTriggerBeginXact | 3607 | 112 | t | 18447
chunk_alloc | trigger.c | AfterTriggerBeginXact | 3619 | 192 | t | 18447

splitted to contexts:

self | action | file | func | line | size | compile_time | count
-----------------------------------------+--------------------------+------------------+--------------------------------------+-------+-------+--------------+--------
MessageContext | chunk_alloc | list.c | new_list | 68 | 16 | t | 848520
MessageContext | chunk_alloc | list.c | new_list | 72 | 24 | t | 848520
MessageContext | chunk_alloc | list.c | new_tail_cell | 112 | 16 | t | 166016
MessageContext | chunk_alloc_zero | bitmapset.c | bms_make_singleton | 189 | 8 | f | 129122
MessageContext | chunk_alloc_zero_aligned | value.c | makeString | 55 | 16 | t | 129122
MessageContext | chunk_alloc_zero_aligned | makefuncs.c | makeVar | 73 | 40 | t | 110676
ExecutorState | chunk_alloc | list.c | new_list | 68 | 16 | t | 92230
ExecutorState | chunk_alloc | list.c | new_list | 72 | 24 | t | 92230
MessageContext | chunk_alloc_zero_aligned | makefuncs.c | makeTargetEntry | 216 | 48 | t | 92230
MessageContext | chunk_alloc | bitmapset.c | bms_copy | 119 | 8 | f | 73784
TopMemoryContext | chunk_alloc | lock.c | LockAcquireExtended | 559 | 128 | t | 55441
MessageContext | chunk_alloc | nodeFuncs.c | expression_tree_mutator | 2036 | 40 | t | 55338
TopTransactionContext | chunk_alloc | snapmgr.c | PushActiveSnapshot | 282 | 24 | t | 55338
MessageContext | chunk_alloc | stringinfo.c | initStringInfo | 50 | 1024 | t | 36894
TopMemoryContext | chunk_alloc | resowner.c | ResourceOwnerEnlargeCatCacheRefs | 612 | 128 | t | 36893
TopMemoryContext | chunk_alloc | resowner.c | ResourceOwnerEnlargeRelationRefs | 754 | 128 | t | 36893
TopMemoryContext | chunk_alloc_zero | resowner.c | ResourceOwnerCreate | 135 | 168 | t | 36893
ExecutorState | chunk_alloc | list.c | new_tail_cell | 112 | 16 | t | 36892
ExecutorState | chunk_alloc | nbtsearch.c | _bt_search | 121 | 24 | t | 36892
ExecutorState | chunk_alloc | stringinfo.c | initStringInfo | 50 | 1024 | t | 36892
ExecutorState | chunk_alloc_zero_aligned | execQual.c | ExecInitExpr | 4231 | 24 | t | 36892
ExecutorState | chunk_alloc_zero_aligned | execTuples.c | MakeTupleTableSlot | 118 | 96 | t | 36892
MessageContext | chunk_alloc | nodeFuncs.c | expression_tree_mutator | 2027 | 40 | t | 36892
MessageContext | chunk_alloc_zero_aligned | copyfuncs.c | _copyVar | 1041 | 40 | t | 36892
MessageContext | chunk_alloc_zero_aligned | equivclass.c | add_eq_member | 452 | 32 | t | 36892
MessageContext | chunk_alloc_zero_aligned | gram.y | makeColumnRef | 12286 | 24 | t | 36892
TopTransactionContext | chunk_alloc | snapmgr.c | CopySnapshot | 217 | 72 | f | 36892
TopMemoryContext | chunk_alloc | resowner.c | ResourceOwnerEnlargeBuffers | 524 | 64 | t | 18448
ExecutorState | chunk_alloc | genam.c | RelationGetIndexScan | 77 | 104 | t | 18447
ExecutorState | chunk_alloc | genam.c | RelationGetIndexScan | 89 | 72 | f | 18447
ExecutorState | chunk_alloc | nbtree.c | btbeginscan | 385 | 6608 | t | 18447
ExecutorState | chunk_alloc | nbtree.c | btbeginscan | 388 | 72 | f | 18447
MessageContext | chunk_alloc | list.c | new_head_cell | 93 | 16 | t | 18447
TopTransactionContext | chunk_alloc | trigger.c | AfterTriggerBeginXact | 3607 | 112 | t | 18447
TopTransactionContext | chunk_alloc | trigger.c | AfterTriggerBeginXact | 3619 | 192 | t | 18447

1:
For that allocation profile I found that one significant part of the costs is
AllocSetFreeIndex (which is inlined in optimized mode, but still).

If we would move more stuff to the headers those very common cases with
compile time known sizes could eliminate the AllocSetFreeIndex computation
at runtime as its only dependent on the size.

The problem with that is obviously that it would violate the whole mctx.c
abstraction as it has to be known at compile time which memory manager is
used.

Are we willing to reduce that abstraction?

2:
aset.c fragments horribly for longliving contexts. I haven't generated
sensibly accessible data for that as I currently just store it in an sql
database and sql (at least my queries :P) isn't really that efficient
for figuring out when how much was allocated from that:
Table "public.allocations"
Column | Type | Modifiers
--------------+---------+----------------------------------------------------------
id | integer | not null default nextval('allocations_id_seq'::regclass)
action | text |
parent | text |
self | text |
size | bigint |
addr | bigint |
file | text |
func | text |
line | integer |
compile_time | boolean |

schema when the data is sensibly large. A sensible benchmark easily produces
three digit million rows...

I guess some custom implementation for evaluating that data should be way much
better at that.

That fragmentation causes two main problems. 1 space wastage and 2. (more
importantly imo) cache misses.

3:
Given the frequentness of very small allocations the current space overhead
of at least 16 bytes on 64bit Platforms seems quite high.

I don't think its too hard to devise a scheme (actually I have started that)
where the overhead is only sizeof(*void). But again that would reduce the
the abstraction because it would make it harder to implement something like
pfree/repalloc/GetMemoryChunkContext which need to figure out the context
from a pointer.

If we had a separate api for small allocations we could even devise a schema
where a single chunk wouldn't have any memory overhead, but that seems to
complicated for now.

So after all this my question basically is: How important do we think the
mctx.c abstraction is?
I personally found the speed of AllocSetAlloc being the most limiting factor
in several workloads so I am willing to sacrifice some abstraction to gain
speed here. Especially as I hope its possible to write a single allocator
which is "good enough" for everything.

I currently started working on two pieces:
* A new memory manager implementation. For that the equivalent of
AllocSetFreeIdx seems to be the biggest limit so far. Its not really ready
for anything yet though.

* recording and replaying MemoryContext* to get a somewhat reasonable
and reproducable micro benchmark. That unfortunately doesn't model the
effects of cache misses that well but I got no better idea so far.

Any opinions, input, ideas?

Thanks for reading,

Andres

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2011-04-25 23:00:15 Re: branching for 9.2devel
Previous Message Darren Duncan 2011-04-25 21:28:50 Re: "stored procedures" - use cases?