Re: Scaling shared buffer eviction

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Scaling shared buffer eviction
Date: 2014-09-16 12:18:35
Message-ID: CAA4eK1KHTX3wa34N7F_4vCnFWEBTO_J=ak2nDKL_ZzcrsGCL7A@mail.gmail.com
Lists: pgsql-hackers

On Sun, Sep 14, 2014 at 12:23 PM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
wrote:

> On Fri, Sep 12, 2014 at 11:55 AM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
> wrote:
> > On Thu, Sep 11, 2014 at 4:31 PM, Andres Freund <andres(at)2ndquadrant(dot)com>
> wrote:
> > > On 2014-09-10 12:17:34 +0530, Amit Kapila wrote:
>
> I will post the data with the latest patch separately (where I will focus
> on new cases discussed between Robert and Andres).
>
>
Performance data with the latest version of the patch.
All the data shown below is the median of 3 runs; for the
individual run data, refer to the attached document
(perf_read_scalability_data_v9.ods).

Performance Data for Read-only test
-----------------------------------------------------
Configuration and Db Details
IBM POWER-7 16 cores, 64 hardware threads
RAM = 64GB
Database Locale = C
checkpoint_segments = 256
checkpoint_timeout = 15min
shared_buffers = 8GB
scale factor = 3000
Client Count = number of concurrent sessions and threads (ex. -c 8 -j 8)
Duration of each individual run = 5mins

All the data is in tps and was taken using a pgbench read-only load.

Client_Count/Patch_Ver      8       16      32      64      128
HEAD                        58614   107370  140717  104357  65010
sbe_v9                      62943   119064  172246  220174  220904

Observations
---------------------
1. It scales well, as with previous versions of the patch, but
the performance seems slightly better in a few cases, maybe
because I have removed a statement (if check) or two in
bgreclaimer (those were done under spinlock), or it could be
just run-to-run difference.
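
To illustrate the spinlock point in general terms, here is a minimal
sketch (not the actual patch code; the struct and field names are
hypothetical) of keeping only the shared read inside the spinlock and
doing all further checks on a local copy after release:

/*
 * Generic illustration, not the actual patch code: every backend that
 * allocates a buffer contends on this spinlock, so any extra check done
 * while holding it lengthens the serialized path.  Struct and field
 * names here are hypothetical.
 */
#include "postgres.h"
#include "storage/spin.h"

typedef struct FreelistShared
{
    slock_t     mutex;      /* protects num_free */
    int         num_free;   /* current freelist length */
} FreelistShared;

static int
read_freelist_length(FreelistShared *shared)
{
    int     num_free;

    SpinLockAcquire(&shared->mutex);
    num_free = shared->num_free;    /* only the shared read stays inside */
    SpinLockRelease(&shared->mutex);

    /*
     * Comparisons against low/high water marks, latch wakeups, etc. can
     * run here on the local copy, after the lock is released.
     */
    return num_free;
}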

> (1) A read-only pgbench workload that is just a tiny bit larger than
> shared_buffers, say size of shared_buffers plus 0.01%. Such workloads
> tend to stress buffer eviction heavily.

When the data is just a tiny bit larger than shared buffers, there
is actually no scalability problem even in HEAD, because I think
most of the requests will be satisfied from the existing buffer pool.
I have taken data for some loads where the database size is a
bit larger than shared buffers, and it is as follows:

Scale Factor - 800
Shared_Buffers - 12286MB (Total db size is 12288MB)

Client_Count/Patch_Ver      1      8       16      32      64      128
HEAD                        8406   68712   132222  198481  290340  289828
sbe_v9                      8504   68546   131926  195789  289959  289021

Scale Factor - 800
Shared_Buffers - 12166MB (Total db size is 12288MB)

Client_Count/Patch_Ver      1      8       16      32      64      128
HEAD                        8428   68609   128092  196596  292066  293812
sbe_v9                      8386   68546   126926  197126  289959  287621

Observations
---------------------
In most cases performance with the patch is slightly less than
HEAD; the difference is generally less than 1%, and in a case
or two close to 2%. I think the main reason for the slight difference
is that when the size of shared buffers is almost the same as the data
size, the number of buffers needed from the clock sweep is very small.
As an example, in the first case (shared buffers of 12286MB), it
actually needs at most 256 additional buffers (2MB) via clock sweep,
whereas bgreclaimer will put 2000 additional buffers (the high water
mark, since 0.5% of shared buffers is greater than 2000) in the free
list, so bgreclaimer does some extra work when it is not required, and
it also leads to the condition you mentioned below (the freelist will
contain buffers that have already been touched since we added them).
For case 2 (12166MB), we need more than 2000 additional buffers, but
not too many, so it can have a similar effect.
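
To make the arithmetic concrete, here is a rough back-of-the-envelope
calculation; the high-water-mark formula (0.5% of shared buffers,
capped at 2000) is my reading of the patch's behaviour, so treat it
as an assumption:

#include <stdio.h>

int
main(void)
{
    long    db_size_mb = 12288;         /* total database size, scale 800 */
    long    shared_buffers_mb = 12286;  /* configured shared_buffers */
    long    block_kb = 8;               /* standard PostgreSQL block size */

    /* buffers the workload can demand beyond what fits in shared buffers */
    long    deficit = (db_size_mb - shared_buffers_mb) * 1024 / block_kb;

    /* assumed high water mark: 0.5% of NBuffers, capped at 2000 */
    long    nbuffers = shared_buffers_mb * 1024 / block_kb;
    long    high_water_mark = nbuffers / 200;

    if (high_water_mark > 2000)
        high_water_mark = 2000;

    /* prints: deficit = 256, high water mark = 2000 */
    printf("deficit = %ld, high water mark = %ld\n", deficit, high_water_mark);
    return 0;
}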

I think we have the below options related to this observation:
a. Some further tuning in bgreclaimer, so that instead of putting
buffers up to the high water mark into the freelist, it puts just
1/4th or 1/2 of the high water mark and then checks whether the
free list still contains less than or equal to the low water mark;
if yes it continues, and if not it can wait (or maybe some other
way); a rough sketch of this idea appears after this list.
b. Instead of waking bgreclaimer when the number of buffers falls
below the low water mark, wake it when the number of times backends
run the clock sweep crosses a certain threshold.
c. Provide the low and high water marks as config knobs, so that in
some rare cases users can use them for tuning.
d. Do nothing: if a user chooses such a configuration, he should
be educated to configure shared buffers in a better way, and/or the
performance hit doesn't seem to justify any further work.
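
For illustration only, a minimal sketch of what option 'a' could look
like inside the bgreclaimer loop; the helper names
(get_freelist_length, move_buffers_to_freelist) and the water-mark
variables are hypothetical and are not taken from the v9 patch:

/* Hypothetical sketch of option 'a'; not the v9 patch code. */
extern int  get_freelist_length(void);          /* hypothetical helper */
extern void move_buffers_to_freelist(int n);    /* hypothetical helper */

static int  low_water_mark = 1000;              /* illustrative values */
static int  high_water_mark = 2000;

static void
reclaim_in_batches(void)
{
    int     batch = high_water_mark / 4;        /* 1/4th-sized refill steps */

    for (;;)
    {
        /* Run the clock sweep and move one batch of buffers to the freelist. */
        move_buffers_to_freelist(batch);

        /*
         * After each batch, re-check demand: keep refilling only while
         * backends have drained the list to the low water mark or below;
         * otherwise stop and wait for the next wakeup.
         */
        if (get_freelist_length() > low_water_mark)
            break;
    }
}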

Now if we do either of 'a' or 'b', then I think there is a chance
that the gain might not be the same for cases where users can
easily benefit from this patch, and there is a chance that it
degrades performance in some other case.

> (2) A workload that maximizes the rate of concurrent buffer eviction
> relative to other tasks. Read-only pgbench is not bad for this, but
> maybe somebody's got a better idea.

I think the first test of pgbench (scale_factor-3000;shared_buffers-8GB)
addresses this case.

> As I sort of mentioned in what I was writing for the bufmgr README,
> there are, more or less, three ways this can fall down, at least that
> I can see: (1) if the high water mark is too high, then we'll start
> finding buffers in the freelist that have already been touched since
> we added them:

I think I am able to see this effect (though mild) in one of the above tests.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachment Content-Type Size
perf_read_scalability_data_v9.ods application/vnd.oasis.opendocument.spreadsheet 18.3 KB
