Re: Scaling shared buffer eviction

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Scaling shared buffer eviction
Date: 2014-09-12 06:25:05
Message-ID: CAA4eK1+mYWp_mdDJ=h=ueTGe3DDJEfcE2Z8K7FMwiN26MZbGag@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Sep 11, 2014 at 4:31 PM, Andres Freund <andres(at)2ndquadrant(dot)com>
wrote:
> On 2014-09-10 12:17:34 +0530, Amit Kapila wrote:
> > +++ b/src/backend/postmaster/bgreclaimer.c
>
> A fair number of comments in that file refer to bgwriter...

Will fix.

> > @@ -0,0 +1,302 @@
> >
+/*-------------------------------------------------------------------------
> > + *
> > + * bgreclaimer.c
> > + *
> > + * The background reclaimer (bgreclaimer) is new as of Postgres 9.5.
It
> > + * attempts to keep regular backends from having to run clock sweep
(which
> > + * they would only do when they don't find a usable shared buffer from
> > + * freelist to read in another page).
>
> That's not really accurate. Freelist pages are often also needed to
> write new pages, without reading anything in.

Agreed, but the same is used in bgwriter file as well; so if we
change here, we might want to change bgwriter file header as well.

> I'd phrase it as "which
> they only need to do if they don't find a victim buffer from the
> freelist"

victim buffer sounds more like a buffer which it will get from
clock sweep, how about next candidate (same is used in function
header of StrategyGetBuffer()).

> > In the best scenario all requests
> > + * for shared buffers will be fulfilled from freelist as the background
> > + * reclaimer process always tries to maintain buffers on freelist.
However,
> > + * regular backends are still empowered to run clock sweep to find a
usable
> > + * buffer if the bgreclaimer fails to maintain enough buffers on
freelist.
>
> "empowered" sounds strange to me. 'still can run the clock sweep'?

No harm in changing like what you are suggesting, however the same is
used in file header of bgwriter.c as well, so I think lets keep this usage
as
it is because there is no correctness issue here.

> > + * The bgwriter is started by the postmaster as soon as the startup
subprocess
> > + * finishes, or as soon as recovery begins if we are doing archive
recovery.
>
> Why only archive recovery? I guess (only read this far...) it's not just
> during InArchiveRecoveryb recovery but also StandbyMode?

It will be for both.

> But I don't see
> why we shouldn't use it during normal crash recovery. That's also often
> painfully slow and the reclaimer could help? Less, but still.

Yes, it might improve a bit, however the main benefit with this patch is
under heavy load which means that it mainly addresses contention and
in case of crash recovery there will not be any contention because there
will be no backend processes.

Also I think to enable bgreclaimer during crash recovery, we need to have
one new signal which is not a problem, but it will make more sense to enable
it later based on benefit if it is there.

>
> > +static void bgreclaim_quickdie(SIGNAL_ARGS);
> > +static void BgreclaimSigHupHandler(SIGNAL_ARGS);
> > +static void ReqShutdownHandler(SIGNAL_ARGS);
> > +static void bgreclaim_sigusr1_handler(SIGNAL_ARGS);
>
> This looks inconsistent.

I have kept based on bgwriter, so not sure if it's good to change.
However I we want consistent in naming, I would like to keep
something like:

ReclaimShutdownHandler
ReclaimQuickDieHandler
..
..

> > + /*
> > + * If an exception is encountered, processing resumes here.
> > + *
> > + * See notes in postgres.c about the design of this coding.
> > + */
> > + if (sigsetjmp(local_sigjmp_buf, 1) != 0)
> > + {
> > + /* Since not using PG_TRY, must reset error stack by hand
*/
> > + error_context_stack = NULL;
..
>
> No LWLockReleaseAll(), AbortBufferIO(), ...? Unconvinced that that's a
> good idea, regardless of it possibly being true today (which I'm not
> sure about yet).

I will add LWLockReleaseAll() in exception handling as discussed
elsewhere in thread.

> > +
> > + /* Now we can allow interrupts again */
> > + RESUME_INTERRUPTS();
>
> Other processes sleep for a second here, I think that's a good
> idea. E.g. that bit:

Agreed, will make change as per suggestion.

>
> > + /*
> > + * Loop forever
> > + */
> > + for (;;)
> > + {
> > + int rc;
> > +
> > +
> > + /*
> > + * Backend will signal bgreclaimer when the number of
buffers in
> > + * freelist falls below than low water mark of freelist.
> > + */
> > + rc = WaitLatch(&MyProc->procLatch,
> > + WL_LATCH_SET |
WL_POSTMASTER_DEATH,
> > + -1);
>
> That's probably not going to work well directly after a (re)start of
> bgreclaim (depending on how you handle the water mark, I'll see in a
> bit).

Could you please be more specific here?

>
> > +Background Reclaimer's Processing
> > +---------------------------------
..
> > +Two water mark indicators are used to maintain sufficient number of
buffers
> > +on freelist. Low water mark indicator is used by backends to wake
bgreclaimer
> > +when the number of buffers in freelist falls below it. High water mark
> > +indicator is used by bgreclaimer to move buffers to freelist.
>
> For me the description of the high water as stated here doesn't seem to
> explain anything.
>
> This section should have a description of how the reclaimer interacts
> with the bgwriter logic. Do we put dirty buffers on the freelist that
> are then cleaned by the bgwriter? Which buffers does the bgwriter write
> out?

As discussed in thread, I will change this accordingly.

> > /*
> > + * Move buffers with reference and usage_count as zero to freelist.
> > + * By maintaining enough number of buffers on freelist (equal to
> > + * high water mark of freelist), we drastically reduce the odds for
> > + * backend's to perform clock sweep.
>
> Move buffers with reference and a usage_count *of* zero to freelist. By
> maintaining enough buffers in the freelist (up to the list's high water
> mark), we drastically reduce the likelihood of individual backends
> having to perform the clock sweep themselves.

Okay will rephrase it as per your suggestion.

> > + * This is called by the background reclaim process when the number
> > + * of buffers in freelist falls below low water mark of freelist.
> > + */
>
> The logic used here *definitely* needs to be documented in another form
> somewhere in the source.

I think the way Robert has suggested to modify Readme adresses
this to an extent, however if you think it is better to go in more
detail, then I can expand function header of function
BgMoveBuffersToFreelist() or on top of bgreclaimer.c, what do you prefer?

> > +
> > + if (tmp_num_to_free == 0)
> > + break;
>
> num_to_free isn't a convincing name if I understand what this is doing
> correctly. Maybe 'move_to_freelist', 'freelist_needed',
> 'needed_on_freelist' or something like that?

I think keeping num or count in name could be more helpful as it is used
to loop for finding usable buffers. How about 'num_needed_on_freelist',
without any count or num it sounds more like a boolean variable.

>
> I wonder if we don't want to increase the high watermark when
> tmp_recent_backend_clocksweep > 0?
>
> > + while (tmp_num_to_free > 0)
> > + {
..
> > + /*
> > + * If the buffer is pinned or has a nonzero
usage_count, we cannot
> > + * move it to freelist; decrement the usage_count
(unless pinned)
> > + * and keep scanning.
> > + */
> > + LockBufHdr(bufHdr);
>
> Hm. Perhaps we should do a bufHdr->refcount != zero check without
> locking here? The atomic op will transfer the cacheline exclusively to
> the reclaimer's CPU. Even though it very shortly afterwards will be
> touched afterwards by the pinning backend.

As per discussion, the conclusion seems to be that we can do some
more tests to see if we need such a change, I will do more tests on
the lines suggested by Robert in below mails and then we can decide
if any thing is required.

> > +/*
> > + * Water mark indicators for maintaining buffers on freelist. When the
> > + * number of buffers on freelist drops below the low water mark, the
> > + * allocating backend sets the latch and bgreclaimer wakesup and begin
> > + * adding buffer's to freelist until it reaches high water mark and
then
> > + * again goes back to sleep.
> > + */
>
> s/wakesup/wakes up/; s/begin adding/begins adding/; s/buffer's/buffers/;
> /to freelist/to the freelist/; s/reaches high water/reaches the high
water/

Will change as per suggestions.

> > +int freelistLowWaterMark;
> > +int freelistHighWaterMark;
> > +
> > +/* Percentage indicators for maintaining buffers on freelist */
> > +#define HIGH_WATER_MARK_FREELIST_BUFFERS_PERCENT 0.005
> > +#define LOW_WATER_MARK_FREELIST_BUFFERS_PERCENT 0.2
> > +#define MIN_HIGH_WATER_MARK 5
> > +#define MAX_HIGH_WATER_MARK 2000
>
> I'm confused. The high water mark percentage is smaller than the low
> water mark?

High water mark is percentage of NBuffers (shared_buffers).
Low water mark is percentage of High water mark.

I will add a comment.

> What's the reasoning for these numbers?

Based on experiments with different amount of shared buffers

> What's the justification for the
> max of 2k buffers for the high watermark? That's not much on a busy
> database with large s_b?

I have done tests at various loads upto 15GB of shared_buffers
and having data size as 44GB, 2000 turns out to be good enough
number of buffers on freelist(based on performance data).

Another thing I have mentioned in begining was to have guc for
these thresholds, but I think it will be much more difficult for
user to decide these values.

I think for now it is okay to fix these numbers as in patch
and we can anyway go back to add guc's or change them based
on more tests if anything turns out that needs further tuning.

> > - *lock_held = true;
> > - LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);
> > + SpinLockAcquire(&StrategyControl->freelist_lck);
> >
..
> > + bgreclaimerLatch = StrategyControl->bgreclaimerLatch;
> > bgwriterLatch = StrategyControl->bgwriterLatch;
>
> I don't understand why these need to be grabbed under the spinlock?

In earlier version of patch, it was done without spinklock,
however Robert has given comment that as it was previously done
with BufFreelist lock, these should be under spinlock (atleast
that is what I understood) and I tested the patch again by having
them under spinlock and didn't find any difference, so I have moved
them under spinlock.

..
> > + if (numFreeListBuffers < freelistLowWaterMark &&
bgreclaimerLatch)
> > + SetLatch(StrategyControl->bgreclaimerLatch);
> > +
..
> > }
> > - UnlockBufHdr(buf);
> > }
>
> I think it makes sense to break out this bit into its own
> function. That'll make StrategyGetBuffer() a good bit easier to read.

I will try to move it into a new function GetBufferFromFreelist().

> > /* Nothing on the freelist, so run the "clock sweep" algorithm */
> > trycounter = NBuffers;
> > +
...
> > + buf = &BufferDescriptors[next_victim];
>
> I'd also move this into its own function, but thats's more debatable.

As we have not touched much of this part of code, so lets refactor
this code separately if required.

> > +/*
> > + * StrategyMoveBufferToFreeListEnd: put a buffer on the end of freelist
> > + */
> > +bool
> > +StrategyMoveBufferToFreeListEnd(volatile BufferDesc *buf)
> > +{
>
> Should maybe rather be named *Tail?

Tail is better suited than End, I will change it.

> > + bool freed = false;
> > + SpinLockAcquire(&StrategyControl->freelist_lck);
> > +
> > + /*
> > + * It is possible that we are told to put something in the
freelist that
> > + * is already in it; don't screw up the list if so.
> > + */
>
> When/Why is that possible?

As we are doing clocksweep which can come across the same buffer
again incase it didn't find sufficient buffers to put on freelist in one
cycle, so this function needs to ensure the same.

> > +void
> > +StrategyGetFreelistAccessInfo(uint32 *num_buf_to_free, uint32
*num_buf_alloc,
> > + uint32
*num_buf_backend_clocksweep)
> > +
> > + if (num_buf_alloc)
> > + {
> > + *num_buf_alloc = StrategyControl->numBufferAllocs;
> > + StrategyControl->numBufferAllocs = 0;
> > + }
> > + if (num_buf_backend_clocksweep)
> > + {
> > + *num_buf_backend_clocksweep =
StrategyControl->numBufferBackendClocksweep;
> > + StrategyControl->numBufferBackendClocksweep = 0;
> > + }
> > + SpinLockRelease(&StrategyControl->freelist_lck);
> > +
> > + return;
> > +}
>
> Do we need the if (num_buf_alloc) bits? Can't we make it unconditional?

This API is more or less designed on lines of StrategySyncStart().
Currently there is no case where we can't do it unconditional (same is
true for num_buf_backend_clocksweep), however there is some merit
to keep it in sync with existing API. Having said that if you feel we
should
go about doing it unconditionally, I will change it in next patch.

>
> Some more general remarks:
> * I think there's a fair amount of unexplained heuristics in here

I think after addressing the comments given by you above,
the situation will be better.

> * Are we sure that the freelist_lck spinlock won't cause pain? Right now
> there will possibly be dozens of processes busily spinning on it... I
> think it's a acceptable risk, but we should think about it.

The situation seems to be better by having spinlock, rather than
by using LWLock, however if there is anything that causes pain,
I think we might want to consider partitioning the free list, but lets
keep that for other day.

> * I'm not convinced that these changes can be made without also changing
> the bgwriter logic. Have you measured whether there are differences in
> how effective the bgwriter is? Not that it's very effective right now :)

Current changes doesn't do anything to make bgwriter more or
less effective, however as discussed elsewhere in thread that it
makes sense for bgreclaimer to notify bgwriter in some situtions,
that can make bgwriter more effective than now, however as concluded
there lets do it in separate patch.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Michael Paquier 2014-09-12 06:46:06 Re: Turning off HOT/Cleanup sometimes
Previous Message Michael Paquier 2014-09-12 06:19:32 Re: Turning off HOT/Cleanup sometimes