Re: Scaling shared buffer eviction

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Scaling shared buffer eviction
Date: 2014-08-06 15:15:01
Message-ID: CA+TgmoY2EE=LrVeC8U3foH=im=Vnm5LeKuK2tJzpdt+gCBQ_yQ@mail.gmail.com
Lists: pgsql-hackers

On Wed, Aug 6, 2014 at 6:12 AM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>> If I'm reading this right, the new statistic is an incrementing counter
>> where, every time you update it, you add the number of buffers currently on
>> the freelist. That makes no sense.
>
> I think that by using 'number of buffers currently on the freelist' and
> 'number of recently allocated buffers' for consecutive cycles,
> we can figure out approximately how many buffer allocations
> need the clock sweep, assuming the low and high threshold water
> marks are fixed. However, there can be cases where it is not
> easy to estimate that number.

Counters should be designed in such a way that you can read one, then
read it again later, and make sense of the result - you should not need
to read the counter on *consecutive* cycles to interpret it.

>> I think what you should be counting is the number of allocations that are
>> being satisfied from the free-list. Then, by comparing the rate at which
>> that value is incrementing to the rate at which buffers_alloc is
>> incrementing, somebody can figure out what percentage of allocations are
>> requiring a clock-sweep run. Actually, I think it's better to flip it
>> around: count the number of allocations that require an individual backend
>> to run the clock sweep (vs. being satisfied from the free-list); call it,
>> say, buffers_backend_clocksweep. We can then try to tune the patch to make
>> that number as small as possible under varying workloads.
>
> This can give us a clear idea of how to tune the patch; however, we would
> need to maintain 3 counters for it in the code (recent_alloc, needed for
> the current bgwriter logic, plus the other 2 suggested by you). Do you
> want to retain such counters in the code, or are they just a kind of debug
> aid for the patch?

I only mean to propose one new counter, and I'd imagine including that
in the final patch. We already have a counter of total buffer
allocations; that's buffers_alloc. I'm proposing to add an additional
counter for the number of those allocations not satisfied from the
free list, with a name like buffers_alloc_clocksweep (I said
buffers_backend_clocksweep above, but that's probably not best, as the
existing buffers_backend counts buffer *writes*, not allocations). I
think we would definitely want to retain this counter in the final
patch, as an additional column in pg_stat_bgwriter.
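
Just to illustrate the bookkeeping I have in mind, here is a toy,
self-contained sketch - the names and structure are made up for
illustration, not taken from your patch or from the existing code:

/*
 * Toy illustration only, not actual PostgreSQL code.  Every allocation
 * increments the total (modeling the existing buffers_alloc statistic);
 * only allocations that cannot be satisfied from the freelist also
 * increment the proposed buffers_alloc_clocksweep counter.
 */
static long buffers_alloc = 0;             /* models the existing total */
static long buffers_alloc_clocksweep = 0;  /* proposed: freelist misses */

static int
allocate_buffer(int *freelist, int *freelist_len)
{
    buffers_alloc++;

    if (*freelist_len > 0)
        return freelist[--(*freelist_len)];   /* satisfied from the freelist */

    buffers_alloc_clocksweep++;               /* backend must run the clock sweep */
    return -1;   /* placeholder: the real code would run the clock sweep here */
}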

>>> d. Autotune the low and high threshold for freelist for various
>>> configurations.
>>
>> I think we need to come up with some kind of formula here rather than just
>> a list of hard-coded constants.
>
> That was my initial intention as well, and I tried basing the thresholds
> on the number of shared buffers, for example keeping the threshold values
> as a percentage of shared buffers, but nothing could satisfy different
> kinds of workloads. The current values I have chosen are based
> on experiments for various workloads at different thresholds. I have
> shown the lwlock_stats data for various loads based on the current
> thresholds upthread. Another way could be to make them config
> knobs and use the values given by the user when provided,
> else go with the fixed values.

How did you go about determining the optimal value for a particular workload?

When the list is kept short, it's less likely that a value on the list
will be referenced or dirtied again before the page is actually
recycled. That's clearly good. But when the list is long, it's less
likely to become completely empty and thereby force individual
backends to run the clock-sweep. My suspicion is that, when the
number of buffers is small, the impact of the list being too short
isn't likely to be very significant, because running the clock-sweep
isn't all that expensive anyway - even if you have to scan through the
entire buffer pool multiple times, there aren't that many buffers.
But when the number of buffers is large, those repeated scans can
cause a major performance hit, so having an adequate pool of free
buffers becomes much more important.

I think your list of high-watermarks is far too generous for low
buffer counts. With more than 100k shared buffers, you've got a
high-watermark of 2k buffers, which means that 2% or less of the
buffers will be on the freelist, which seems a little on the high side
to me, but probably in the ballpark of what we should be aiming for.
But at 10001 shared buffers, you can have 1000 of them on the
freelist, which is 10% of the buffer pool; that seems high. At 101
shared buffers, 75% of the buffers in the system can be on the
freelist; that seems ridiculous. The chances of a buffer still being
unused by the time it reaches the head of the freelist seem very
small.

Based on your existing list of thresholds, and taking the above into
account, I'd suggest something like this: let the high-watermark for
the freelist be 0.5% of the total number of buffers, with a maximum of
2000 and a minimum of 5. Let the low-watermark be 20% of the
high-watermark. That might not be best, but I think some kind of
formula like that can likely be made to work. I would suggest
focusing your testing on configurations with *large* settings for
shared_buffers, say 1-64GB, rather than small configurations. Anyone
who cares greatly about performance isn't going to be running with
only 8MB of shared_buffers anyway. Arguably we shouldn't even run the
reclaim process on very small configurations; I think there should
probably be a GUC (PGC_SIGHUP) to control whether it gets launched.
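
As a rough sketch of the sort of formula I mean - the function name is
made up, and the constants are just the ones suggested above:

static void
compute_freelist_watermarks(int nbuffers, int *high_wm, int *low_wm)
{
    int     high = nbuffers / 200;      /* 0.5% of shared buffers */

    if (high > 2000)
        high = 2000;                    /* cap for very large buffer pools */
    if (high < 5)
        high = 5;                       /* floor for very small buffer pools */

    *high_wm = high;
    *low_wm = high / 5;                 /* low watermark = 20% of high */
}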

I think it would be a good idea to analyze how frequently the reclaim
process gets woken up. In the worst case, this happens once per (high
watermark - low watermark) allocations; that is, the system reaches
the low watermark and then does no further allocations until the
reclaim process brings the freelist back up to the high watermark.
But if more allocations occur between the time the reclaim process is
woken and the time it reaches the high watermark, then it should run
for longer, until the high watermark is reached. At least for
debugging purposes, I think it would be useful to have a counter of
reclaim wakeups. I'm not sure whether that's worth including in the
final patch, but it might be.
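
To make that concrete, plugging in the formula suggested above for a large
configuration: with a high watermark of 2000 and a low watermark of 400,
the reclaim process would be woken at most once per 1600 allocations.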

> That will certainly help in retaining the current behaviour of
> bgwriter and make the idea cleaner. I will modify the patch
> to have a new background process unless somebody thinks
> otherwise.
>
> If we go with this approach, one thing which we need to decide
> is what to do in case a buffer which has usage_count zero is *dirty*,
> as I don't think it is a good idea to put it on the freelist.

I thought a bit about this yesterday. I think the problem is that we
might be in a situation where buffers are being dirtied faster than
they can be cleaned. In that case, if we only put clean buffers on the
freelist, then every backend in the system will be fighting over the
ever-dwindling supply of clean buffers until, in the worst case,
there's maybe only 1 clean buffer which is getting evicted repeatedly
at top speed - or maybe even no clean buffers, and the reclaim process
just spins in an infinite loop looking for clean buffers that aren't
there.

To put that another way, the rate at which buffers are being dirtied
can't indefinitely exceed the rate at which they are being cleaned.
Eventually, somebody is going to have to wait. Having the backends
wait by being forced to write some dirty buffers does not seem like a
bad way to accomplish that. So I favor just putting the buffers on the
freelist without regard to whether they are clean or dirty. If this
turns out not to work well we can look at other options (probably some
variant of (b) from your list).

>> Instead, it would just run the clock sweep (i.e. the last loop inside
>> StrategyGetBuffer) and put the buffers onto the free list.
>
> Don't we need to do more than just the last loop inside StrategyGetBuffer(),
> as the clock sweep in StrategyGetBuffer() is responsible for getting one
> buffer with usage_count = 0, whereas we need to run the loop till it
> finds and moves enough such buffers to populate the freelist
> with a number of buffers equal to its high water mark?

Yeah, that's what I meant. Of course, it should add each buffer to
the freelist individually, not batch them up and add them all at once.
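
In other words, roughly this shape - a sketch only; the helper functions
here are hypothetical stand-ins, not anything from the patch:

extern int  freelist_length(void);       /* current length of the freelist */
extern int  run_clock_sweep_once(void);  /* find one buffer with usage_count == 0 */
extern void add_to_freelist(int buf);    /* push a single buffer onto the freelist */

static void
reclaim_until_high_watermark(int high_watermark)
{
    /* Keep sweeping until the freelist is back up to the high watermark. */
    while (freelist_length() < high_watermark)
    {
        int     buf = run_clock_sweep_once();

        /* Add each buffer individually rather than batching them up. */
        add_to_freelist(buf);
    }
}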

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
