Re: Scaling shared buffer eviction

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Scaling shared buffer eviction
Date: 2014-08-05 15:51:51
Message-ID: CA+TgmobCe7nyZ_3uA8WvmJbZvLtFiveT3EupMj4M=wYocbApew@mail.gmail.com
Lists: pgsql-hackers

On Thu, Jun 5, 2014 at 4:43 AM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:

> I have improved the patch by making the following changes:
> a. Improved the bgwriter logic to log the xl_running_xacts info and
> removed the hibernate logic, as bgwriter will now work only when
> there is a scarcity of buffers in the free list. The basic idea is: when the
> number of buffers on the freelist drops below the low threshold, the
> allocating backend sets the latch, and bgwriter wakes up and begins
> adding buffers to the freelist until it reaches the high threshold, and
> then goes back to sleep.
>
This essentially removes BgWriterDelay, but it's still mentioned in
BgBufferSync(). Looking further, I see that with the patch applied,
BgBufferSync() is still present in the source code but is no longer called
from anywhere. Please don't submit patches that render things unused
without actually removing them; it makes it much harder to see what you've
changed. I realize you probably left it that way for testing purposes, but
you need to clean such things up before submitting. Likewise, if you've
rendered GUCs or statistics counters obsolete, you need to rip them out, so
that the scope of the changes you've made is clear to reviewers.

A comparison of BgBufferSync() with BgBufferSyncAndMoveBuffersToFreelist()
reveals that you've removed at least one behavior that some people (at
least, me) will care about, which is the guarantee that the background
writer will scan the entire buffer pool at least every couple of minutes.
This is important because it guarantees that dirty data doesn't sit in
memory forever. When the system becomes busy again after a long idle
period, users will expect the system to have used the idle time to flush
dirty buffers to disk. This also improves data recovery prospects if, for
example, somebody loses their pg_xlog directory - there may be dirty
buffers whose contents are lost, of course, but they won't be months old.
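
For reference, the existing code enforces that guarantee with a rate floor
along these lines; this is an approximate sketch of the BgBufferSync() logic
rather than a verbatim excerpt, so treat the exact arithmetic as illustrative:

    /*
     * Approximate sketch of the minimum-progress floor in BgBufferSync().
     * Even with no allocation activity, the bgwriter advances its scan by
     * at least this many buffers per round, so a complete pass over shared
     * buffers finishes within roughly scan_whole_pool_milliseconds.
     */
    int     scan_whole_pool_milliseconds = 120000;  /* ~2 minutes */
    int     min_scan_buffers;

    min_scan_buffers = (int) (NBuffers /
                              (scan_whole_pool_milliseconds / BgWriterDelay));

    if (upcoming_alloc_est < (min_scan_buffers + reusable_buffers_est))
        upcoming_alloc_est = min_scan_buffers + reusable_buffers_est;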

> b. New stats for the number of buffers on the freelist have been added;
> some old ones like maxwritten_clean can be removed, as the new logic for
> syncing buffers and moving them to the free list doesn't use them.
> However, I think it's better to remove them once the new logic is
> accepted. Added some new logs for info related to the free list under
> BGW_DEBUG.
>

If I'm reading this right, the new statistic is an incrementing counter
where, every time you update it, you add the number of buffers currently on
the freelist. That makes no sense. I think what you should be counting is
the number of allocations that are being satisfied from the free-list.
Then, by comparing the rate at which that value is incrementing to the rate
at which buffers_alloc is incrementing, somebody can figure out what
percentage of allocations are requiring a clock-sweep run. Actually, I
think it's better to flip it around: count the number of allocations that
require an individual backend to run the clock sweep (vs. being satisfied
from the free-list); call it, say, buffers_backend_clocksweep. We can then
try to tune the patch to make that number as small as possible under
varying workloads.
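
To make that concrete, here's roughly where such a counter might get bumped
inside StrategyGetBuffer(); this is only a sketch, and
numBufferBackendClocksweep (and the stats counter built on it) is a
hypothetical name, not something in the tree or in your patch:

    /*
     * Sketch only: count allocations that could NOT be satisfied from the
     * freelist and therefore forced this backend to run the clock sweep.
     * Comparing this against buffers_alloc shows what fraction of
     * allocations fell back to the sweep.
     */
    SpinLockAcquire(&StrategyControl->freelist_lck);
    StrategyControl->numBufferAllocs++;
    if (StrategyControl->firstFreeBuffer >= 0)
    {
        /* pop a buffer from the freelist, as today */
    }
    else
    {
        /* freelist empty: this allocation will need a clock sweep */
        StrategyControl->numBufferBackendClocksweep++;  /* hypothetical */
    }
    SpinLockRelease(&StrategyControl->freelist_lck);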

> c. Used the already existing bgwriterLatch in BufferStrategyControl to
> wake bgwriter when the number of buffers on the freelist drops below the
> threshold.
>

Seems like a good idea.

> d. Autotune the low and high threshold for the freelist for various
> configurations. Generally, if we keep a small number (200~2000) of buffers
> always available on the freelist, then even for high shared_buffers
> settings like 15GB it appears to be sufficient. However, when the value
> of shared_buffers is smaller, we need a much smaller number. I
> think we can provide these as config knobs for the user as well, but for
> now, based on LWLOCK_STATS results, I have chosen some hard-coded
> values for the low and high threshold values for the freelist.
> Values for the low and high threshold have been decided based on the total
> number of shared buffers; basically I have divided them into 5
> categories (16~100, 100~1000, 1000~10000, 10000~100000,
> 100000 and above) and then ran tests (read-only pgbench) for various
> configurations falling under these categories. The reason for keeping
> fewer categories for larger shared_buffers is that a small number
> (200~2000) of buffers available on the free list seems to be sufficient
> for quite high loads; however, as the total number of shared buffers
> decreases, we need to be more careful: if we keep the number too low,
> it will lead to more clock sweeps by backends (which means freelist
> lock contention), and if we keep the number too high, bgwriter will
> evict many useful buffers. Results based on LWLOCK_STATS are at the
> end of the mail.
>

I think we need to come up with some kind of formula here rather than just
a list of hard-coded constants. And it definitely needs some comments
explaining the logic behind the choices.
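
As a straw man for the kind of formula I have in mind (entirely
illustrative; the divisors and clamps below are placeholders, not values
taken from the patch or from any testing):

    /*
     * Illustrative only: derive the freelist watermarks from NBuffers
     * instead of hard-coding a table of constants.
     */
    static void
    StrategyComputeFreelistThresholds(int nbuffers, int *low, int *high)
    {
        /* keep roughly 0.5% of the pool on the freelist, within sane bounds */
        *high = Max(5, Min(2000, nbuffers / 200));

        /* wake the reclaiming process when we fall below a quarter of that */
        *low = Max(1, *high / 4);
    }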

Aside from those specific remarks, I think the elephant in the room is the
question of whether it really makes sense to have one process which is
responsible both for populating the free list and for writing buffers to
disk. One problem, which I alluded to above (the whole-pool scan
guarantee), is that we
might sometimes want to ensure that dirty buffers are written out to disk
without decrementing usage counts or adding anything to the free list.
This is a potentially solvable problem, though, because we can figure out
the number of buffers that we need to scan for freelist population and the
number that we need to scan for minimum buffer pool cleaning (one cycle
every 2 minutes). Once we've met the first goal, any further buffers we
run into under the second goal get cleaned if appropriate but their usage
counts don't get pushed down nor do they get added to the freelist. Once
we meet the second goal, we can go back to sleep.
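
In pseudocode, one cycle of that combined scan might look something like
this; every helper name here is invented, so read it as a sketch of the
control flow rather than proposed code:

    /*
     * Sketch: one bgwriter cycle with two separate targets.  While the
     * freelist target is unmet, buffers get their usage counts decremented
     * and, if reusable, are added to the freelist; after that, remaining
     * buffers are only written out if dirty, leaving usage counts and the
     * freelist alone.  When both targets are met, go back to sleep.
     */
    while (buffers_to_scan_for_freelist > 0 || buffers_to_scan_for_cleaning > 0)
    {
        volatile BufferDesc *buf = ClockSweepAdvance();     /* hypothetical */

        if (buffers_to_scan_for_freelist > 0)
        {
            buffers_to_scan_for_freelist--;
            if (DecrementUsageCountAndCheckReusable(buf))   /* hypothetical */
                AddBufferToFreelist(buf);                   /* hypothetical */
        }
        else if (BufferIsDirtyAndUnpinned(buf))             /* hypothetical */
            WriteOneDirtyBuffer(buf);                       /* hypothetical */

        if (buffers_to_scan_for_cleaning > 0)
            buffers_to_scan_for_cleaning--;
    }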

But the other problem, which I think is likely unsolvable, is that writing
a dirty page can take a long time on a busy system (multiple seconds) and
the freelist can be emptied much, much quicker than that (milliseconds).
Although your benchmark results show great speed-ups on read-only
workloads, we're not really going to get the benefit consistently on
read-write workloads -- unless of course the background writer fails to
actually write anything, which should be viewed as a bug, not a feature --
because the freelist will often be empty while the background writer is
blocked on I/O.

I'm wondering if it would be a whole lot simpler and better to introduce a
new background process, maybe with a name like bgreclaim. That process
wouldn't write dirty buffers. Instead, it would just run the clock sweep
(i.e. the last loop inside StrategyGetBuffer) and put the buffers onto the
free list. Then, we could leave the bgwriter logic more or less intact.
It certainly needs improvement, but that could be another patch.
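
Roughly, the main loop of such a bgreclaim process might look like this,
assuming we keep the freelist_lck from your patch; the helpers and the
numFreeListBuffers/high-watermark names are hypothetical:

    /*
     * Sketch of a bgreclaim main loop (illustrative only).  It never writes
     * buffers; it just refills the freelist up to the high watermark
     * whenever a backend signals that it has dropped below the low
     * watermark.
     */
    for (;;)
    {
        int     nfree = 0;

        ResetLatch(&MyProc->procLatch);

        do
        {
            /* run the clock sweep and push one reusable buffer, if any */
            if (!ReclaimOneBufferToFreelist())              /* hypothetical */
                break;

            SpinLockAcquire(&StrategyControl->freelist_lck);
            nfree = StrategyControl->numFreeListBuffers;    /* hypothetical */
            SpinLockRelease(&StrategyControl->freelist_lck);
        } while (nfree < freelist_high_watermark);          /* hypothetical */

        /* sleep until a backend notices the freelist is low and sets our latch */
        WaitLatch(&MyProc->procLatch, WL_LATCH_SET | WL_POSTMASTER_DEATH, -1L);
    }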

Incidentally, while I generally think your changes to the locking regimen
in StrategyGetBuffer() are going in the right direction, they need
significant cleanup. Your patch adds two new spinlocks, freelist_lck and
victimbuf_lck, that mostly but not-quite replace BufFreelistLock, and
you've now got StrategyGetBuffer() running with no lock at all when
accessing some things that used to be protected by BufFreelistLock;
specifically, you're doing StrategyControl->numBufferAllocs++ and
SetLatch(StrategyControl->bgwriterLatch) without any locking. That's not
OK. I think you should get rid of BufFreelistLock completely and just
decide that freelist_lck will protect the freeNext links, plus
everything in StrategyControl except for nextVictimBuffer. victimbuf_lck
will protect nextVictimBuffer and nothing else.
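
Concretely, the division of labor would look roughly like this (a sketch:
the fields other than the two spinlocks are the ones BufferStrategyControl
already has, and the comments are the contract I'm proposing):

    typedef struct
    {
        slock_t     freelist_lck;   /* protects everything below except
                                     * nextVictimBuffer, plus the freeNext
                                     * links in the buffer descriptors */
        slock_t     victimbuf_lck;  /* protects nextVictimBuffer, and
                                     * nothing else */

        int         firstFreeBuffer;    /* head of freelist, -1 if empty */
        int         lastFreeBuffer;     /* tail of freelist */
        int         nextVictimBuffer;   /* clock-sweep hand */

        uint32      completePasses;     /* complete clock-sweep cycles */
        uint32      numBufferAllocs;    /* buffers allocated since reset */

        Latch      *bgwriterLatch;      /* set when the freelist runs low */
    } BufferStrategyControl;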

Then, in StrategyGetBuffer, acquire the freelist_lck at the point where the
LWLock is acquired today. Increment StrategyControl->numBufferAllocs; save
the value of StrategyControl->bgwriterLatch; pop a buffer off the freelist
if there is one, saving its identity. Release the spinlock. Then, set the
bgwriterLatch if needed. In the first loop, first check whether the buffer
we previously popped from the freelist is pinned or has a non-zero usage
count and return it if not, holding the buffer header lock. Otherwise,
reacquire the spinlock just long enough to pop a new potential victim and
then loop around.
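
Put together, the allocation path I'm describing looks approximately like
the fragment below. It is a sketch of the StrategyGetBuffer() body only:
strategy-ring handling, the StrategyRejectBuffer() dance, and error paths
are all elided:

    volatile BufferDesc *buf = NULL;
    Latch      *bgwriterLatch;

    SpinLockAcquire(&StrategyControl->freelist_lck);
    StrategyControl->numBufferAllocs++;
    bgwriterLatch = StrategyControl->bgwriterLatch;
    StrategyControl->bgwriterLatch = NULL;
    if (StrategyControl->firstFreeBuffer >= 0)
    {
        buf = &BufferDescriptors[StrategyControl->firstFreeBuffer];
        StrategyControl->firstFreeBuffer = buf->freeNext;
        buf->freeNext = FREENEXT_NOT_IN_LIST;
    }
    SpinLockRelease(&StrategyControl->freelist_lck);

    /* spinlock no longer held, so it is now safe to kick the bgwriter */
    if (bgwriterLatch)
        SetLatch(bgwriterLatch);

    for (;;)
    {
        if (buf != NULL)
        {
            LockBufHdr(buf);
            if (buf->refcount == 0 && buf->usage_count == 0)
                return buf;         /* returned with buffer header lock held */
            if (buf->refcount == 0)
                buf->usage_count--; /* clock-sweep aging */
            UnlockBufHdr(buf);
        }

        /* reacquire a spinlock just long enough to advance the clock hand */
        SpinLockAcquire(&StrategyControl->victimbuf_lck);
        buf = &BufferDescriptors[StrategyControl->nextVictimBuffer];
        if (++StrategyControl->nextVictimBuffer >= NBuffers)
            StrategyControl->nextVictimBuffer = 0;
        SpinLockRelease(&StrategyControl->victimbuf_lck);
    }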

Under this locking strategy, StrategyNotifyBgWriter would use
freelist_lck. Right now, the patch removes the only caller, and should
therefore remove the function as well, but if we go with the new-process
idea listed above that part would get reverted, and then you'd need to make
it use the correct spinlock. You should also go through this patch and
remove all the commented-out bits and pieces that you haven't cleaned up;
those are distracting and unhelpful.
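
For completeness, the spinlock-based version of that function would be
trivial, something along these lines (sketch):

    void
    StrategyNotifyBgWriter(Latch *bgwriterLatch)
    {
        /* record the latch to set when the freelist next runs low */
        SpinLockAcquire(&StrategyControl->freelist_lck);
        StrategyControl->bgwriterLatch = bgwriterLatch;
        SpinLockRelease(&StrategyControl->freelist_lck);
    }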

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
