Scaling shared buffer eviction

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Scaling shared buffer eviction
Date: 2014-05-15 05:41:47
Message-ID: CAA4eK1LSTcMwXNO8ovGh7c0UgCHzGbN=+PjggfzQDukKr3q_DA@mail.gmail.com
Lists: pgsql-hackers

As I mentioned previously, I am interested in improving shared
buffer eviction, especially by reducing contention around
BufFreelistLock, and I would like to share my progress on that
work.

The test used for this work is mainly the case where all the
data doesn't fit in shared buffers but does fit in memory.
It is largely based on a previous comparison done by Robert
for a similar workload:
http://rhaas.blogspot.in/2012/03/performance-and-scalability-on-ibm.html

To start with, I have collected an LWLOCK_STATS report to confirm
the contention around BufFreelistLock; the data for HEAD
is as follows:

M/c details
IBM POWER-7 16 cores, 64 hardware threads
RAM - 64GB
Test
scale factor = 3000
shared_buffers = 8GB
number_of_threads = 64
duration = 5mins
./pgbench -c 64 -j 64 -T 300 -S postgres

LWLOCK_STATS data for BufFreeListLock
PID 11762 lwlock main 0: shacq 0 exacq 253988 blk 29023

Here the high *blk* count for scale factor 3000 clearly shows
that backends have to wait to find a usable buffer when the data
doesn't fit in shared buffers.

To solve this issue, I have implemented a patch which ensures
that there are always enough buffers on the freelist so that
backends rarely need to run the clock sweep. The implementation
idea is more or less the same as discussed previously in the
thread below; I will explain it at the end of this mail.
http://www.postgresql.org/message-id/006e01ce926c$c7768680$56639380$@kapila@huawei.com

LWLOCK_STATS data after the patch (same test as used for HEAD):

BufFreeListLock
PID 7257 lwlock main 0: shacq 0 exacq 165 blk 18 spindelay 0

Here the low *exacq* and *blk* counts show that the need for
backends to run the clock sweep has reduced significantly.

Performance Data
-------------------------------
shared_buffers = 8GB
number of threads = 64
sc = scale factor

        sc      tps
Head    3000    45569
Patch   3000    46457
Head    1000    93037
Patch   1000    92711

The above data shows that there is no significant change in
performance or scalability even though contention around
BufFreelistLock has been reduced significantly.

I have analyzed the patch with both perf record and
LWLOCK_STATS; both indicate that there is high
contention around the BufMappingLocks.

Data With perf record -a -g
-----------------------------------------

+  10.14%  swapper   [kernel.kallsyms]  [k] .pseries_dedicated_idle_sleep
+   7.77%  postgres  [kernel.kallsyms]  [k] ._raw_spin_lock
+   6.88%  postgres  [kernel.kallsyms]  [k] .function_trace_call
+   4.15%  pgbench   [kernel.kallsyms]  [k] .try_to_wake_up
+   3.20%  swapper   [kernel.kallsyms]  [k] .function_trace_call
+   2.99%  pgbench   [kernel.kallsyms]  [k] .function_trace_call
+   2.41%  postgres  postgres           [.] AllocSetAlloc
+   2.38%  postgres  [kernel.kallsyms]  [k] .try_to_wake_up
+   2.27%  pgbench   [kernel.kallsyms]  [k] ._raw_spin_lock
+   1.49%  postgres  [kernel.kallsyms]  [k] ._raw_spin_lock_irq
+   1.36%  postgres  postgres           [.] AllocSetFreeIndex
+   1.09%  swapper   [kernel.kallsyms]  [k] ._raw_spin_lock
+   0.91%  postgres  postgres           [.] GetSnapshotData
+   0.90%  postgres  postgres           [.] MemoryContextAllocZeroAligned

Expanded graph
------------------------------

-  10.14%  swapper   [kernel.kallsyms]  [k] .pseries_dedicated_idle_sleep
   - .pseries_dedicated_idle_sleep
      - 10.13% .pseries_dedicated_idle_sleep
         - 10.13% .cpu_idle
            - 10.00% .start_secondary
                 .start_secondary_prolog
-   7.77%  postgres  [kernel.kallsyms]  [k] ._raw_spin_lock
   - ._raw_spin_lock
      - 6.63% ._raw_spin_lock
         - 5.95% .double_rq_lock
              .load_balance
            - 5.95% .__schedule
                 .schedule
               - 3.27% .SyS_semtimedop
                    .SyS_ipc
                    syscall_exit
                    semop
                    PGSemaphoreLock
                    LWLockAcquireCommon
                  - LWLockAcquire
                     - 3.27% BufferAlloc
                          ReadBuffer_common
                        - ReadBufferExtended
                           - 3.27% ReadBuffer
                              - 2.73% ReleaseAndReadBuffer
                                 - 1.70% _bt_relandgetbuf
                                      _bt_search
                                      _bt_first
                                      btgettuple

It shows the LWLock acquisition inside BufferAlloc as the top
contributor, and BufferAlloc is where the BufMappingLocks are taken.
I have checked the other expanded call stacks as well;
StrategyGetBuffer is not among the top contributors.

Data with LWLOCK_STATS
----------------------------------------------
BufMappingLocks

PID 7245 lwlock main 38: shacq 41117 exacq 34561 blk 36274 spindelay 101
PID 7310 lwlock main 39: shacq 40257 exacq 34219 blk 25886 spindelay 72
PID 7308 lwlock main 40: shacq 41024 exacq 34794 blk 20780 spindelay 54
PID 7314 lwlock main 40: shacq 41195 exacq 34848 blk 20638 spindelay 60
PID 7288 lwlock main 41: shacq 84398 exacq 34750 blk 29591 spindelay 128
PID 7208 lwlock main 42: shacq 63107 exacq 34737 blk 20133 spindelay 81
PID 7245 lwlock main 43: shacq 278001 exacq 34601 blk 53473 spindelay 503
PID 7307 lwlock main 44: shacq 85155 exacq 34440 blk 19062 spindelay 71
PID 7301 lwlock main 45: shacq 61999 exacq 34757 blk 13184 spindelay 46
PID 7235 lwlock main 46: shacq 41199 exacq 34622 blk 9031 spindelay 30
PID 7324 lwlock main 46: shacq 40906 exacq 34692 blk 8799 spindelay 14
PID 7292 lwlock main 47: shacq 41180 exacq 34604 blk 8241 spindelay 25
PID 7303 lwlock main 48: shacq 40727 exacq 34651 blk 7567 spindelay 30
PID 7230 lwlock main 49: shacq 60416 exacq 34544 blk 9007 spindelay 28
PID 7300 lwlock main 50: shacq 44591 exacq 34763 blk 6687 spindelay 25
PID 7317 lwlock main 50: shacq 44349 exacq 34583 blk 6861 spindelay 22
PID 7305 lwlock main 51: shacq 62626 exacq 34671 blk 7864 spindelay 29
PID 7301 lwlock main 52: shacq 60646 exacq 34512 blk 7093 spindelay 36
PID 7324 lwlock main 53: shacq 39756 exacq 34359 blk 5138 spindelay 22

This data shows that after the patch there is no contention
on BufFreeListLock; instead, there is heavy contention around
the BufMappingLocks. I have checked that HEAD also has contention
around the BufMappingLocks.

As per my analysis so far, I think reducing contention around
BufFreelistLock is not sufficient to improve scalability; we need
to work on reducing contention around the BufMappingLocks as well.
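
For context, every page lookup or replacement must take one of the
BufMappingLocks: the buffer mapping hash table is split across
NUM_BUFFER_PARTITIONS partitions (16 here, which matches the sixteen
lock ids 38..53 in the LWLOCK_STATS output above), and the partition
is chosen by hashing the buffer tag, along the lines of
BufTableHashPartition()/BufMappingPartitionLock() in buf_internals.h.
The following is only a minimal, self-contained model of that idea;
the tag layout and the hash function are simplified stand-ins, not
the real ones.

#include <stdint.h>
#include <stdio.h>

#define NUM_BUFFER_PARTITIONS 16    /* matches the 16 lock ids (38..53) above */

typedef struct BufferTag            /* simplified: the real tag has rnode, fork, block */
{
    uint32_t    rel;
    uint32_t    block;
} BufferTag;

/* Stand-in for the real hash over the buffer tag. */
static uint32_t
tag_hash(const BufferTag *tag)
{
    return tag->rel * 2654435761u ^ tag->block * 40503u;
}

/* Which of the NUM_BUFFER_PARTITIONS mapping locks protects this page. */
static int
buf_mapping_partition(const BufferTag *tag)
{
    return (int) (tag_hash(tag) % NUM_BUFFER_PARTITIONS);
}

int
main(void)
{
    BufferTag   tag = {16384, 12345};

    /*
     * A lookup takes this partition's lock in shared mode; replacing a
     * buffer takes the old and new mappings' partition locks in
     * exclusive mode.
     */
    printf("page maps to partition %d\n", buf_mapping_partition(&tag));
    return 0;
}

With data much larger than shared_buffers, pages are replaced at a high
rate, so these few partition locks become the next bottleneck once
BufFreelistLock is out of the way.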

Details of patch
------------------------
1. Changed bgwriter to move buffers with usage_count zero onto the
freelist, based on a threshold (high_watermark), and to decrement the
usage count of buffers whose usage_count is greater than zero.
2. StrategyGetBuffer() wakes bgwriter when the number of
buffers on the freelist drops below low_watermark.
Currently I am using hard-coded values for the watermarks; we can
make them configurable later if required.
3. Getting a buffer from the freelist is done under a spinlock,
while the clock sweep still runs under BufFreelistLock.
A simplified, illustrative sketch of this scheme follows below.
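
To make points 1-3 concrete, here is a minimal, single-threaded model
of the watermark scheme. It is only a sketch, not the patch itself:
the low_watermark/high_watermark names come from the description
above, while the array-based freelist, the clock hand, and the helper
names are illustrative; the real code works on BufferDesc entries,
protects the freelist with a spinlock, and wakes bgwriter via its
latch.

#include <stdio.h>

#define NBUFFERS        16
#define LOW_WATERMARK   2       /* hard-coded for now, as noted in point 2 */
#define HIGH_WATERMARK  6

static int usage_count[NBUFFERS];   /* per-buffer usage counts */
static int on_freelist[NBUFFERS];   /* avoid putting a buffer on the list twice */
static int freelist[NBUFFERS];
static int freelist_len;
static int clock_hand;

/*
 * Point 1: bgwriter walks the clock, moving usage_count == 0 buffers to
 * the freelist and decrementing the rest, until high_watermark is reached.
 */
static void
bgwriter_fill_freelist(void)
{
    while (freelist_len < HIGH_WATERMARK)
    {
        int     buf = clock_hand;

        clock_hand = (clock_hand + 1) % NBUFFERS;
        if (on_freelist[buf])
            continue;
        if (usage_count[buf] == 0)
        {
            freelist[freelist_len++] = buf;
            on_freelist[buf] = 1;
        }
        else
            usage_count[buf]--;
    }
}

/*
 * Points 2 and 3: a backend pops from the freelist (spinlock-protected in
 * the patch) and wakes bgwriter when the list drops below low_watermark;
 * the clock sweep under BufFreelistLock remains the fallback.
 */
static int
strategy_get_buffer(void)
{
    int     buf;

    if (freelist_len == 0)
        bgwriter_fill_freelist();   /* fallback: backend sweeps itself */
    buf = freelist[--freelist_len];
    on_freelist[buf] = 0;
    usage_count[buf] = 1;           /* the newly loaded page starts out in use */
    if (freelist_len < LOW_WATERMARK)
        bgwriter_fill_freelist();   /* stands in for a SetLatch() wakeup */
    return buf;
}

int
main(void)
{
    bgwriter_fill_freelist();
    for (int i = 0; i < 10; i++)
        printf("victim buffer: %d\n", strategy_get_buffer());
    return 0;
}

The point of this split is that the clock sweep and the usage_count
maintenance move into bgwriter, so a backend's common path is just a
pop from the freelist under a short spinlock.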

This is still a WIP patch, and some of the changes are just a
prototype to check the idea. For example, I have hacked the bgwriter
code so that it continuously fills the freelist until it has put
enough buffers on it to reach high_watermark, and I have commented
out some parts of the previous code.

Thoughts?

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachment Content-Type Size
scalable_buffer_eviction_v1.patch application/octet-stream 14.2 KB
