Re: Scaling shared buffer eviction

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Scaling shared buffer eviction
Date: 2014-06-05 08:43:30
Message-ID: CAA4eK1+5bQh3KyO14Pqn+VuLex41V8cwt0kw6hRJASdcbaabtg@mail.gmail.com
Lists: pgsql-hackers

On Sat, May 17, 2014 at 6:02 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

> On Fri, May 16, 2014 at 10:51 AM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
> wrote:
>
>>
>>
>>              Thrds (64)   Thrds (128)
>> HEAD             45562         17128
>> HEAD + 64        57904         32810
>> V1 + 64         105557         81011
>> HEAD + 128       58383         32997
>> V1 + 128        110705        114544
>>
>
> I haven't actually reviewed the code, but this sort of thing seems like
> good evidence that we need your patch, or something like it. The fact that
> the patch produces little performance improvement on its own (though it
> does produce some) shouldn't be held against it - the fact that the
> contention shifts elsewhere when the first bottleneck is removed is not
> your patch's fault.
>
>
>
I have improved the patch by making the following changes:
a. Improved the bgwriter logic for logging xl_running_xacts info and
removed the hibernate logic, as bgwriter will now work only when
there is a scarcity of buffers on the freelist. The basic idea is that
when the number of buffers on the freelist drops below the low
threshold, the allocating backend sets the latch; bgwriter wakes up
and begins adding buffers to the freelist until it reaches the high
threshold, and then goes back to sleep. (A toy sketch of this scheme
appears after this list.)

b. New stats for the number of buffers on the freelist have been added;
some old ones like maxwritten_clean can be removed, as the new logic
for syncing buffers and moving them to the freelist doesn't use them.
However, I think it's better to remove them once the new logic is
accepted. Added some new logs for freelist-related info under
BGW_DEBUG.

c. Used the already existing bgwriterLatch in BufferStrategyControl to
wake bgwriter when the number of buffers on the freelist drops below
the threshold.

d. Autotuned the low and high thresholds for the freelist for various
configurations. Generally, keeping a small number (200~2000) of buffers
always available on the freelist appears to be sufficient even for a
high shared_buffers setting like 15GB; when shared_buffers is smaller,
we need a much smaller number. I think we could also provide these as
config knobs for the user, but for now, based on LWLOCK_STATS results,
I have chosen hard-coded values for the low and high thresholds.
The values are decided based on the total number of shared buffers:
I have divided them into 5 categories (16~100, 100~1000, 1000~10000,
10000~100000, 100000 and above) and ran tests (read-only pgbench) for
various configurations falling under these categories. The reason for
keeping fewer categories for larger shared_buffers settings is that a
small number (200~2000) of buffers on the freelist seems to be
sufficient even for quite high loads; however, as the total number of
shared buffers decreases, we need to be more careful: if we keep the
number too low, backends will do more clock sweeps (which means
freelist lock contention), and if we keep it higher, bgwriter will
evict many useful buffers. Results based on LWLOCK_STATS are at the
end of this mail.

e. One reason why I think the number of buffer partitions is hard-coded
to 16 is that the minimum number of shared buffers allowed is 16
(128kB). However, there is handling in the code (in function
init_htab()) which ensures that even if the number of partitions is
greater than the number of shared buffers, it is handled safely.
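
To make the scheme from points (a), (c) and (d) concrete, below is a toy,
self-contained C sketch of the low/high watermark idea. It is not the
actual patch: a pthread condition variable stands in for the
bgwriterLatch, the names (choose_thresholds, backend_get_buffer,
bgwriter_main) are made up for illustration, and all threshold pairs
except the two quoted later in this mail (lw=50/hg=200 and
lw=100/hg=1000) are placeholders. The real code also has to run the
clock sweep and sync dirty buffers before pushing them to the freelist.
It can be compiled with something like: cc -pthread freelist_sketch.c

#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

static int NBuffers = 16384;        /* e.g. 128MB of 8kB buffers */
static int freelist_len;            /* buffers currently on the freelist */
static int low_threshold;
static int high_threshold;
static bool shutdown_requested = false;

static pthread_mutex_t freelist_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t bgwriter_latch = PTHREAD_COND_INITIALIZER; /* stand-in for the latch */

/*
 * Pick watermarks from the shared_buffers size, mirroring the five
 * categories from point (d). Only the >1000 and >10000 pairs come from
 * this mail; the others are placeholders.
 */
static void
choose_thresholds(void)
{
    if (NBuffers >= 100000)
        { low_threshold = 200; high_threshold = 2000; }
    else if (NBuffers >= 10000)
        { low_threshold = 100; high_threshold = 1000; }
    else if (NBuffers >= 1000)
        { low_threshold = 50; high_threshold = 200; }
    else if (NBuffers >= 100)
        { low_threshold = 10; high_threshold = 50; }
    else
        { low_threshold = 2; high_threshold = 5; }
}

/* Backend side: take a buffer; if the list runs low, wake the "bgwriter". */
static void
backend_get_buffer(void)
{
    pthread_mutex_lock(&freelist_lock);
    if (freelist_len > 0)
        freelist_len--;     /* the real code falls back to the clock sweep when empty */
    if (freelist_len < low_threshold)
        pthread_cond_signal(&bgwriter_latch);   /* "SetLatch(bgwriterLatch)" */
    pthread_mutex_unlock(&freelist_lock);
}

/* Bgwriter side: sleep until woken, then refill up to the high watermark. */
static void *
bgwriter_main(void *arg)
{
    (void) arg;
    pthread_mutex_lock(&freelist_lock);
    while (!shutdown_requested)
    {
        while (!shutdown_requested && freelist_len >= low_threshold)
            pthread_cond_wait(&bgwriter_latch, &freelist_lock); /* "WaitLatch" */
        while (freelist_len < high_threshold)
            freelist_len++; /* stands in for: sweep, write if dirty, push to freelist */
    }
    pthread_mutex_unlock(&freelist_lock);
    return NULL;
}

int
main(void)
{
    pthread_t bgwriter;
    int i;

    choose_thresholds();
    freelist_len = high_threshold;
    pthread_create(&bgwriter, NULL, bgwriter_main, NULL);

    for (i = 0; i < 100000; i++)
        backend_get_buffer();   /* simulated buffer allocations */

    pthread_mutex_lock(&freelist_lock);
    shutdown_requested = true;
    pthread_cond_signal(&bgwriter_latch);
    pthread_mutex_unlock(&freelist_lock);
    pthread_join(bgwriter, NULL);

    printf("final freelist length: %d (low=%d, high=%d)\n",
           freelist_len, low_threshold, high_threshold);
    return 0;
}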

I have checked bgwriter's CPU usage with and without the patch
for various configurations, and the observation is that for most of
the loads bgwriter's CPU usage with the patch is between 8~20%,
whereas on HEAD it is 0~2%. This shows that with the patch, when
shared buffers are under use by backends, bgwriter is constantly
doing work to ease the work of backends. Detailed data is provided
later in the mail.

Performance Data:
-------------------------------

Configuration and Db Details

IBM POWER-7 16 cores, 64 hardware threads

RAM = 64GB

Database Locale = C

checkpoint_segments=256

checkpoint_timeout =15min

shared_buffers=8GB

scale factor = 3000

Client Count = number of concurrent sessions and threads (ex. -c 8 -j 8)

Duration of each individual run = 5mins

Client Count/patch_ver (tps)      8      16      32      64     128
HEAD                          26220   48686   70779   45232   17310
Patch                         26402   50726   75574  111468  114521

Data is taken using the script (perf_buff_mgmt.sh) attached with this mail.
This is read-only pgbench data with different numbers of client
connections. All the numbers are in tps. Each data point is the median of
three 5-min pgbench read-only runs. Please find the detailed data for the
3 runs in the attached OpenOffice document (perf_read_scalability_data_v3.ods).

This data clearly shows that the patch has improved performance
up to 5~6 times.

Results of BGwriter CPU usage:
--------------------------------------------------

Here sc is scale factor and sb is shared buffers and the data is
for read-only pgbench runs.

./pgbench -c 64 -j 64 -S -T 300 postgres
sc - 3000, sb - 8GB
HEAD
CPU usage - 0~2.3%
Patch v_3
CPU usage - 8.6%

sc - 100, sb - 128MB
./pgbench -c 64 -j 64 -S -T 300 postgres
HEAD
CPU Usage - 1~2%
tps- 36199.047132
Patch v_3
CPU usage - 12~13%
tps = 109182.681827

sc - 50, sb - 75MB
./pgbench -c 64 -j 64 -S -T 300 postgres
HEAD
CPU Usage - 0.7~2%
tps- 37760.575128
Patch v_3
CPU usage - 20~22%
tps = 106310.744198

./pgbench -c 16 -j 16 -S -T 300 postgres
sc - 100, sb - 128kb
(pgbench needed a change for this; see the note with the last
LWLOCK_STATS result below)
HEAD
CPU Usage - 0~0.3%
tps- 40979.529254
Patch v_3
CPU usage - 35~40%
tps = 42956.785618

Results of LWLOCK_STATS based on low-high threshold values of freelist:
--------------------------------------------------------------------------------------------------------------

In the results, the values of exacq and blk show the contention on the
freelist lock. sc is scale factor and sb is shared_buffers. The results
below show that for all configurations except one (1MB), the contention
around BufFreelistLock is reduced significantly. For the 1MB case as well,
the exacq count is reduced, which shows that the clock sweep was performed
fewer times.

sc - 3000, sb - 15GB --(sb > 100000)
./pgbench -c 64 -j 64 -S -T 300 postgres
HEAD
PID 4406 lwlock main 0: shacq 0 exacq 84482 blk 5139 spindelay 62
Patch v_3
PID 4864 lwlock main 0: shacq 0 exacq 34 blk 1 spindelay 0

sc - 3000, sb - 8GB --(sb > 100000)
./pgbench -c 64 -j 64 -S -T 300 postgres
HEAD
PID 24124 lwlock main 0: shacq 0 exacq 285155 blk 33910 spindelay 548
Patch v_3
PID 7257 lwlock main 0: shacq 0 exacq 165 blk 18 spindelay 0

sc - 100, sb - 768MB --(sb > 10000)
./pgbench -c 64 -j 64 -S -T 300 postgres
HEAD
PID 9144 lwlock main 0: shacq 0 exacq 284636 blk 34091 spindelay 555
Patch v-3 (lw=100,hg=1000)
PID 9428 lwlock main 0: shacq 0 exacq 306 blk 59 spindelay 0

sc - 100, sb - 128MB --(sb > 10000)
./pgbench -c 64 -j 64 -S -T 300 postgres
HEAD
PID 5405 lwlock main 0: shacq 0 exacq 285449 blk 32345 spindelay 714
Patch v-3
PID 8625 lwlock main 0: shacq 0 exacq 740 blk 178 spindelay 0

sc - 50, sb - 75MB --(sb > 1000)
./pgbench -c 64 -j 64 -S -T 300 postgres
HEAD
PID 12681 lwlock main 0: shacq 0 exacq 289347 blk 34064 spindelay 773
Patch v3
PID 12800 lwlock main 0: shacq 0 exacq 76287 blk 15183 spindelay 28

sc - 50, sb - 10MB --(sb > 1000)
./pgbench -c 64 -j 64 -S -T 300 postgres
HEAD
PID 10283 lwlock main 0: shacq 0 exacq 287500 blk 32177 spindelay 864
Patch v3 (for > 1000, lw = 50, hg = 200)
PID 11629 lwlock main 0: shacq 0 exacq 60139 blk 12978 spindelay 40

sc - 1, sb - 7MB --(sb > 100)
./pgbench -c 64 -j 64 -S -T 300 postgres
HEAD
PID 47127 lwlock main 0: shacq 0 exacq 289462 blk 37057 spindelay 119
Patch v3
PID 47283 lwlock main 0: shacq 0 exacq 9507 blk 1656 spindelay 0

sc - 1, sb - 1MB --(sb > 100)
./pgbench -c 64 -j 64 -S -T 300 postgres
HEAD
PID 43215 lwlock main 0: shacq 0 exacq 301384 blk 36740 spindelay 902
Patch v3
PID 46542 lwlock main 0: shacq 0 exacq 197231 blk 37532 spindelay 294

sc - 100, sb - 128kb --(sb > 16)
./pgbench -c 16 -j 16 -S -T 300 postgres (for this, I needed to reduce
the value of naccounts to 2500, else it always gave "no unpinned buffers
available")
HEAD
PID 49751 lwlock main 0: shacq 0 exacq 1821276 blk 130119 spindelay 7
Patch v3
PID 50768 lwlock main 0: shacq 0 exacq 382610 blk 46543 spindelay 1

More data points and work:
a. I have yet to take data after merging this with Andres's scalable lwlock
   patch (https://commitfest.postgresql.org/action/patch_view?id=1313).
   There are many conflicts with that patch, so I am waiting for an updated
   version.
b. Read-only data for more configurations.
c. Data for write workloads (tpc-b of pgbench, bulk insert (COPY)).
d. Update docs and remove unused code.

Suggestions?

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachment Content-Type Size
scalable_buffer_eviction_v3.patch application/octet-stream 21.6 KB
perf_buff_mgmt.sh application/x-sh 1.6 KB
perf_read_scalability_data_v3.ods application/vnd.oasis.opendocument.spreadsheet 17.8 KB
