Re: Scaling shared buffer eviction

From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Scaling shared buffer eviction
Date: 2014-09-11 11:01:43
Message-ID: 20140911110143.GV24649@awork2.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On 2014-09-10 12:17:34 +0530, Amit Kapila wrote:
> include $(top_srcdir)/src/backend/common.mk
> diff --git a/src/backend/postmaster/bgreclaimer.c b/src/backend/postmaster/bgreclaimer.c
> new file mode 100644
> index 0000000..3df2337
> --- /dev/null
> +++ b/src/backend/postmaster/bgreclaimer.c

A fair number of comments in that file refer to bgwriter...

> @@ -0,0 +1,302 @@
> +/*-------------------------------------------------------------------------
> + *
> + * bgreclaimer.c
> + *
> + * The background reclaimer (bgreclaimer) is new as of Postgres 9.5. It
> + * attempts to keep regular backends from having to run clock sweep (which
> + * they would only do when they don't find a usable shared buffer from
> + * freelist to read in another page).

That's not really accurate. Freelist pages are often also needed to
write new pages, without reading anything in. I'd phrase it as "which
they only need to do if they don't find a victim buffer from the
freelist"

> In the best scenario all requests
> + * for shared buffers will be fulfilled from freelist as the background
> + * reclaimer process always tries to maintain buffers on freelist. However,
> + * regular backends are still empowered to run clock sweep to find a usable
> + * buffer if the bgreclaimer fails to maintain enough buffers on freelist.

"empowered" sounds strange to me. 'still can run the clock sweep'?

> + * The bgwriter is started by the postmaster as soon as the startup subprocess
> + * finishes, or as soon as recovery begins if we are doing archive recovery.

Why only archive recovery? I guess (only read this far...) it's not just
during InArchiveRecoveryb recovery but also StandbyMode? But I don't see
why we shouldn't use it during normal crash recovery. That's also often
painfully slow and the reclaimer could help? Less, but still.

> +static void bgreclaim_quickdie(SIGNAL_ARGS);
> +static void BgreclaimSigHupHandler(SIGNAL_ARGS);
> +static void ReqShutdownHandler(SIGNAL_ARGS);
> +static void bgreclaim_sigusr1_handler(SIGNAL_ARGS);

This looks inconsistent.

> + /*
> + * If an exception is encountered, processing resumes here.
> + *
> + * See notes in postgres.c about the design of this coding.
> + */
> + if (sigsetjmp(local_sigjmp_buf, 1) != 0)
> + {
> + /* Since not using PG_TRY, must reset error stack by hand */
> + error_context_stack = NULL;
> +
> + /* Prevent interrupts while cleaning up */
> + HOLD_INTERRUPTS();
> +
> + /* Report the error to the server log */
> + EmitErrorReport();
> +
> + /*
> + * These operations are really just a minimal subset of
> + * AbortTransaction(). We don't have very many resources to worry
> + * about in bgreclaim, but we do have buffers and file descriptors.
> + */
> + UnlockBuffers();
> + AtEOXact_Buffers(false);
> + AtEOXact_Files();

No LWLockReleaseAll(), AbortBufferIO(), ...? Unconvinced that that's a
good idea, regardless of it possibly being true today (which I'm not
sure about yet).

> + /*
> + * Now return to normal top-level context and clear ErrorContext for
> + * next time.
> + */
> + MemoryContextSwitchTo(bgreclaim_context);
> + FlushErrorState();
> +
> + /* Flush any leaked data in the top-level context */
> + MemoryContextResetAndDeleteChildren(bgreclaim_context);
> +
> + /* Now we can allow interrupts again */
> + RESUME_INTERRUPTS();

Other processes sleep for a second here, I think that's a good
idea. E.g. that bit:
/*
* Sleep at least 1 second after any error. A write error is likely
* to be repeated, and we don't want to be filling the error logs as
* fast as we can.
*/
pg_usleep(1000000L);

> + /*
> + * Loop forever
> + */
> + for (;;)
> + {
> + int rc;
> +
> +
> + /*
> + * Backend will signal bgreclaimer when the number of buffers in
> + * freelist falls below than low water mark of freelist.
> + */
> + rc = WaitLatch(&MyProc->procLatch,
> + WL_LATCH_SET | WL_POSTMASTER_DEATH,
> + -1);

That's probably not going to work well directly after a (re)start of
bgreclaim (depending on how you handle the water mark, I'll see in a
bit). Maybe it should rather be

ResetLatch();
BgMoveBuffersToFreelist();
pgstat_send_bgwriter();
rc = WaitLatch()
if (rc & WL_POSTMASTER_DEATH)
exit(1)

> +Background Reclaimer's Processing
> +---------------------------------
> +
> +The background reclaimer is designed to move buffers to freelist that are
> +likely to be recycled soon, thereby offloading the need to perform
> +clock sweep work from active backends. To do this, it runs the clock sweep
> +and move the the unpinned and zero usage count buffers to freelist. It
> +keeps on doing this until the number of buffers in freelist reaches the
> +high water mark.
> +
> +Two water mark indicators are used to maintain sufficient number of buffers
> +on freelist. Low water mark indicator is used by backends to wake bgreclaimer
> +when the number of buffers in freelist falls below it. High water mark
> +indicator is used by bgreclaimer to move buffers to freelist.

For me the description of the high water as stated here doesn't seem to
explain anything.

This section should have a description of how the reclaimer interacts
with the bgwriter logic. Do we put dirty buffers on the freelist that
are then cleaned by the bgwriter? Which buffers does the bgwriter write
out?

> /*
> + * Move buffers with reference and usage_count as zero to freelist.
> + * By maintaining enough number of buffers on freelist (equal to
> + * high water mark of freelist), we drastically reduce the odds for
> + * backend's to perform clock sweep.

Move buffers with reference and a usage_count *of* zero to freelist. By
maintaining enough buffers in the freelist (up to the list's high water
mark), we drastically reduce the likelihood of individual backends
having to perform the clock sweep themselves.

> + * This is called by the background reclaim process when the number
> + * of buffers in freelist falls below low water mark of freelist.
> + */

The logic used here *definitely* needs to be documented in another form
somewhere in the source.

> +void
> +BgMoveBuffersToFreelist(void)
> +{
> + uint32 num_to_free = 0;
> + uint32 recent_alloc = 0;
> + uint32 recent_backend_clocksweep = 0;
> + volatile uint32 next_victim = 0;
> +
> + /* Execute the clock sweep */
> + for (;;)
> + {
> + uint32 tmp_num_to_free;
> + uint32 tmp_recent_alloc;
> + uint32 tmp_recent_backend_clocksweep;
> +
> + StrategyGetFreelistAccessInfo(&tmp_num_to_free,
> + &tmp_recent_alloc,
> + &tmp_recent_backend_clocksweep);
> +
> + num_to_free += tmp_num_to_free;
> + recent_alloc += tmp_recent_alloc;
> + recent_backend_clocksweep += tmp_recent_backend_clocksweep;
> +
> + if (tmp_num_to_free == 0)
> + break;

num_to_free isn't a convincing name if I understand what this is doing
correctly. Maybe 'move_to_freelist', 'freelist_needed',
'needed_on_freelist' or something like that?

I wonder if we don't want to increase the high watermark when
tmp_recent_backend_clocksweep > 0?

> + while (tmp_num_to_free > 0)
> + {
> + volatile BufferDesc *bufHdr;
> + bool add_to_freelist = false;
> +
> + /*
> + * Choose next victim buffer to look if that can be moved
> + * to freelist.
> + */
> + StrategySyncNextVictimBuffer(&next_victim);
> +
> + bufHdr = &BufferDescriptors[next_victim];
> +
> + /*
> + * If the buffer is pinned or has a nonzero usage_count, we cannot
> + * move it to freelist; decrement the usage_count (unless pinned)
> + * and keep scanning.
> + */
> + LockBufHdr(bufHdr);

Hm. Perhaps we should do a bufHdr->refcount != zero check without
locking here? The atomic op will transfer the cacheline exclusively to
the reclaimer's CPU. Even though it very shortly afterwards will be
touched afterwards by the pinning backend.

> +/*
> + * Water mark indicators for maintaining buffers on freelist. When the
> + * number of buffers on freelist drops below the low water mark, the
> + * allocating backend sets the latch and bgreclaimer wakesup and begin
> + * adding buffer's to freelist until it reaches high water mark and then
> + * again goes back to sleep.
> + */

s/wakesup/wakes up/; s/begin adding/begins adding/; s/buffer's/buffers/;
/to freelist/to the freelist/; s/reaches high water/reaches the high water/

> +int freelistLowWaterMark;
> +int freelistHighWaterMark;
> +
> +/* Percentage indicators for maintaining buffers on freelist */
> +#define HIGH_WATER_MARK_FREELIST_BUFFERS_PERCENT 0.005
> +#define LOW_WATER_MARK_FREELIST_BUFFERS_PERCENT 0.2
> +#define MIN_HIGH_WATER_MARK 5
> +#define MAX_HIGH_WATER_MARK 2000

I'm confused. The high water mark percentage is smaller than the low
water mark?

What's the reasoning for these numbers? What's the justification for the
max of 2k buffers for the high watermark? That's not much on a busy
database with large s_b?

> - *lock_held = true;
> - LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);
> + SpinLockAcquire(&StrategyControl->freelist_lck);
>
> /*
> - * We count buffer allocation requests so that the bgwriter can estimate
> - * the rate of buffer consumption. Note that buffers recycled by a
> - * strategy object are intentionally not counted here.
> + * We count buffer allocation requests so that the bgwriter or bgreclaimer
> + * can know the rate of buffer consumption and report it as stats. Note
> + * that buffers recycled by a strategy object are intentionally not counted
> + * here.
> */
> StrategyControl->numBufferAllocs++;
>
> /*
> - * If bgwriterLatch is set, we need to waken the bgwriter, but we should
> - * not do so while holding BufFreelistLock; so release and re-grab. This
> - * is annoyingly tedious, but it happens at most once per bgwriter cycle,
> - * so the performance hit is minimal.
> + * Remember the values of bgwriter and bgreclaimer latch so that they can
> + * be set outside spin lock and try to get a buffer from the freelist.
> */
> + bgreclaimerLatch = StrategyControl->bgreclaimerLatch;
> bgwriterLatch = StrategyControl->bgwriterLatch;

I don't understand why these need to be grabbed under the spinlock?

> if (bgwriterLatch)
> - {
> StrategyControl->bgwriterLatch = NULL;
> - LWLockRelease(BufFreelistLock);
> - SetLatch(bgwriterLatch);
> - LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);
> - }
>
> - /*
> - * Try to get a buffer from the freelist. Note that the freeNext fields
> - * are considered to be protected by the BufFreelistLock not the
> - * individual buffer spinlocks, so it's OK to manipulate them without
> - * holding the spinlock.
> - */
> - while (StrategyControl->firstFreeBuffer >= 0)
> + numFreeListBuffers = StrategyControl->numFreeListBuffers;
> +
> + if (StrategyControl->firstFreeBuffer >= 0)
> {
> buf = &BufferDescriptors[StrategyControl->firstFreeBuffer];
> Assert(buf->freeNext != FREENEXT_NOT_IN_LIST);
> @@ -169,35 +198,82 @@ StrategyGetBuffer(BufferAccessStrategy strategy, bool *lock_held)
> /* Unconditionally remove buffer from freelist */
> StrategyControl->firstFreeBuffer = buf->freeNext;
> buf->freeNext = FREENEXT_NOT_IN_LIST;
> + --StrategyControl->numFreeListBuffers;
> + }
> + else
> + StrategyControl->numBufferBackendClocksweep++;
> +
> + SpinLockRelease(&StrategyControl->freelist_lck);

> + /* If bgwriterLatch is set, we need to waken the bgwriter */
> + if (bgwriterLatch)
> + SetLatch(bgwriterLatch);
> +
> + /*
> + * If the number of free buffers has fallen below the low water mark,
> + * awaken the bgreclaimer to repopulate it. bgreclaimerLatch is initialized in
> + * early phase of BgReclaimer startup, however we still check before using
> + * it to avoid any problem incase we reach here before its initializion.
> + */
> + if (numFreeListBuffers < freelistLowWaterMark && bgreclaimerLatch)
> + SetLatch(StrategyControl->bgreclaimerLatch);
> +
> + if (buf != NULL)
> + {
> /*
> - * If the buffer is pinned or has a nonzero usage_count, we cannot use
> - * it; discard it and retry. (This can only happen if VACUUM put a
> - * valid buffer in the freelist and then someone else used it before
> - * we got to it. It's probably impossible altogether as of 8.3, but
> - * we'd better check anyway.)
> + * Try to get a buffer from the freelist. Note that the freeNext fields
> + * are considered to be protected by the freelist_lck not the
> + * individual buffer spinlocks, so it's OK to manipulate them without
> + * holding the buffer spinlock.
> */
> - LockBufHdr(buf);
> - if (buf->refcount == 0 && buf->usage_count == 0)
> + for(;;)
> {
> - if (strategy != NULL)
> - AddBufferToRing(strategy, buf);
> - return buf;
> + /*
> + * If the buffer is pinned or has a nonzero usage_count, we cannot use
> + * it; discard it and retry.
> + */
> + LockBufHdr(buf);
> + if (buf->refcount == 0 && buf->usage_count == 0)
> + {
> + if (strategy != NULL)
> + AddBufferToRing(strategy, buf);
> + return buf;
> + }
> + UnlockBufHdr(buf);
> +
> + SpinLockAcquire(&StrategyControl->freelist_lck);
> +
> + if (StrategyControl->firstFreeBuffer >= 0)
> + {
> + buf = &BufferDescriptors[StrategyControl->firstFreeBuffer];
> + Assert(buf->freeNext != FREENEXT_NOT_IN_LIST);
> +
> + /* Unconditionally remove buffer from freelist */
> + StrategyControl->firstFreeBuffer = buf->freeNext;
> + buf->freeNext = FREENEXT_NOT_IN_LIST;
> + --StrategyControl->numFreeListBuffers;
> +
> + SpinLockRelease(&StrategyControl->freelist_lck);
> + }
> + else
> + {
> + StrategyControl->numBufferBackendClocksweep++;
> + SpinLockRelease(&StrategyControl->freelist_lck);
> + break;
> + }
> }
> - UnlockBufHdr(buf);
> }

I think it makes sense to break out this bit into its own
function. That'll make StrategyGetBuffer() a good bit easier to read.

> /* Nothing on the freelist, so run the "clock sweep" algorithm */
> trycounter = NBuffers;
> +
> for (;;)
> {
> - buf = &BufferDescriptors[StrategyControl->nextVictimBuffer];
> + volatile uint32 next_victim;
>
> - if (++StrategyControl->nextVictimBuffer >= NBuffers)
> - {
> - StrategyControl->nextVictimBuffer = 0;
> - StrategyControl->completePasses++;
> - }
> + StrategySyncNextVictimBuffer(&next_victim);
> +
> + buf = &BufferDescriptors[next_victim];

I'd also move this into its own function, but thats's more debatable.

> /*
> * If the buffer is pinned or has a nonzero usage_count, we cannot use
> @@ -241,7 +317,7 @@ StrategyGetBuffer(BufferAccessStrategy strategy, bool *lock_held)
> void
> StrategyFreeBuffer(volatile BufferDesc *buf)
> {
> - LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);
> + SpinLockAcquire(&StrategyControl->freelist_lck);
>
> /*
> * It is possible that we are told to put something in the freelist that
> @@ -253,11 +329,50 @@ StrategyFreeBuffer(volatile BufferDesc *buf)
> if (buf->freeNext < 0)
> StrategyControl->lastFreeBuffer = buf->buf_id;
> StrategyControl->firstFreeBuffer = buf->buf_id;
> + ++StrategyControl->numFreeListBuffers;
> + }
> +
> + SpinLockRelease(&StrategyControl->freelist_lck);
> +}
> +
> +/*
> + * StrategyMoveBufferToFreeListEnd: put a buffer on the end of freelist
> + */
> +bool
> +StrategyMoveBufferToFreeListEnd(volatile BufferDesc *buf)
> +{

Should maybe rather be named *Tail?

> + bool freed = false;
> + SpinLockAcquire(&StrategyControl->freelist_lck);
> +
> + /*
> + * It is possible that we are told to put something in the freelist that
> + * is already in it; don't screw up the list if so.
> + */

When/Why is that possible?

> /*
> + * StrategyGetFreelistAccessInfo -- get information required by bgreclaimer
> + * to move unused buffers to freelist.
> + *
> + * The result is the number of buffers that are required to be moved to
> + * freelist and count of recent buffer allocs and buffer allocs not
> + * satisfied from freelist.
> + */
> +void
> +StrategyGetFreelistAccessInfo(uint32 *num_buf_to_free, uint32 *num_buf_alloc,
> + uint32 *num_buf_backend_clocksweep)
> +{
> + int curfreebuffers;
> +
> + SpinLockAcquire(&StrategyControl->freelist_lck);
> + curfreebuffers = StrategyControl->numFreeListBuffers;
> + if (curfreebuffers < freelistHighWaterMark)
> + *num_buf_to_free = freelistHighWaterMark - curfreebuffers;
> + else
> + *num_buf_to_free = 0;
> +
> + if (num_buf_alloc)
> + {
> + *num_buf_alloc = StrategyControl->numBufferAllocs;
> + StrategyControl->numBufferAllocs = 0;
> + }
> + if (num_buf_backend_clocksweep)
> + {
> + *num_buf_backend_clocksweep = StrategyControl->numBufferBackendClocksweep;
> + StrategyControl->numBufferBackendClocksweep = 0;
> + }
> + SpinLockRelease(&StrategyControl->freelist_lck);
> +
> + return;
> +}

Do we need the if (num_buf_alloc) bits? Can't we make it unconditional?

Some more general remarks:
* I think there's a fair amount of unexplained heuristics in here
* Are we sure that the freelist_lck spinlock won't cause pain? Right now
there will possibly be dozens of processes busily spinning on it... I
think it's a acceptable risk, but we should think about it.
* I'm not convinced that these changes can be made without also changing
the bgwriter logic. Have you measured whether there are differences in
how effective the bgwriter is? Not that it's very effective right now :)

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message David Rowley 2014-09-11 11:14:33 Re: Patch to support SEMI and ANTI join removal
Previous Message Heikki Linnakangas 2014-09-11 10:52:44 Re: Fix MSVC isnan warning from e80252d4