Quick Links

Re: Scaling shared buffer eviction

From:	Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To:	Andres Freund <andres(at)2ndquadrant(dot)com>
Cc:	Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Scaling shared buffer eviction
Date:	2014-09-14 06:53:07
Message-ID:	CAA4eK1JmORgyXNpFKtRyeLwSvwMRrHZSHYM1eBp2TnL1SPQVhg@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

On Fri, Sep 12, 2014 at 11:55 AM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
wrote:
> On Thu, Sep 11, 2014 at 4:31 PM, Andres Freund <andres(at)2ndquadrant(dot)com>
wrote:
> > On 2014-09-10 12:17:34 +0530, Amit Kapila wrote:
> > > +++ b/src/backend/postmaster/bgreclaimer.c
> >
> > A fair number of comments in that file refer to bgwriter...
>
> Will fix.

Fixed.

> > > @@ -0,0 +1,302 @@
> > >
+/*-------------------------------------------------------------------------
> > > + *
> > > + * bgreclaimer.c
> > > + *
> > > + * The background reclaimer (bgreclaimer) is new as of Postgres 9.5.
It
> > > + * attempts to keep regular backends from having to run clock sweep
(which
> > > + * they would only do when they don't find a usable shared buffer
from
> > > + * freelist to read in another page).
> >
> > That's not really accurate. Freelist pages are often also needed to
> > write new pages, without reading anything in.
>
> Agreed, but the same is used in bgwriter file as well; so if we
> change here, we might want to change bgwriter file header as well.

I have just changed bgreclaimer for this comment, if the same
is required for bgwriter, I can create a separate patch as that
change is not related to this patch, so I thought it is better
to keep it separate.

> > I'd phrase it as "which
> > they only need to do if they don't find a victim buffer from the
> > freelist"
>
> victim buffer sounds more like a buffer which it will get from
> clock sweep, how about next candidate (same is used in function
> header of StrategyGetBuffer()).

Fixed.

> > > In the best scenario all requests
> > > + * for shared buffers will be fulfilled from freelist as the
background
> > > + * reclaimer process always tries to maintain buffers on freelist.
However,
> > > + * regular backends are still empowered to run clock sweep to find a
usable
> > > + * buffer if the bgreclaimer fails to maintain enough buffers on
freelist.
> >
> > "empowered" sounds strange to me. 'still can run the clock sweep'?
>
> No harm in changing like what you are suggesting, however the same is
> used in file header of bgwriter.c as well, so I think lets keep this
usage as
> it is because there is no correctness issue here.

Not changed anything for this comment.

> >
> > > +static void bgreclaim_quickdie(SIGNAL_ARGS);
> > > +static void BgreclaimSigHupHandler(SIGNAL_ARGS);
> > > +static void ReqShutdownHandler(SIGNAL_ARGS);
> > > +static void bgreclaim_sigusr1_handler(SIGNAL_ARGS);
> >
> > This looks inconsistent.
>
> I have kept based on bgwriter, so not sure if it's good to change.
> However I we want consistent in naming, I would like to keep
> something like:
>
> ReclaimShutdownHandler
> ReclaimQuickDieHandler
> ..
> ..

Changed function names to make them consistent.

> > > + /*
> > > + * If an exception is encountered, processing resumes here.
> > > + *
> > > + * See notes in postgres.c about the design of this coding.
> > > + */
> > > + if (sigsetjmp(local_sigjmp_buf, 1) != 0)
> > > + {
> > > + /* Since not using PG_TRY, must reset error stack by
hand */
> > > + error_context_stack = NULL;
> ..
> >
> > No LWLockReleaseAll(), AbortBufferIO(), ...? Unconvinced that that's a
> > good idea, regardless of it possibly being true today (which I'm not
> > sure about yet).
>
> I will add LWLockReleaseAll() in exception handling as discussed
> elsewhere in thread.

Done.

>
> > > +
> > > + /* Now we can allow interrupts again */
> > > + RESUME_INTERRUPTS();
> >
> > Other processes sleep for a second here, I think that's a good
> > idea. E.g. that bit:
>
> Agreed, will make change as per suggestion.

Done.

> >
> > > + /*
> > > + * Loop forever
> > > + */
> > > + for (;;)
> > > + {
> > > + int rc;
> > > +
> > > +
> > > + /*
> > > + * Backend will signal bgreclaimer when the number of
buffers in
> > > + * freelist falls below than low water mark of freelist.
> > > + */
> > > + rc = WaitLatch(&MyProc->procLatch,
> > > + WL_LATCH_SET |
WL_POSTMASTER_DEATH,
> > > + -1);
> >
> > That's probably not going to work well directly after a (re)start of
> > bgreclaim (depending on how you handle the water mark, I'll see in a
> > bit).
>
> Could you please be more specific here?

I wasn't sure if any change is required here, so kept the code
as it is.

> >
> > > +Background Reclaimer's Processing
> > > +---------------------------------
> ..
> > > +Two water mark indicators are used to maintain sufficient number of
buffers
> > > +on freelist. Low water mark indicator is used by backends to wake
bgreclaimer
> > > +when the number of buffers in freelist falls below it. High water
mark
> > > +indicator is used by bgreclaimer to move buffers to freelist.
> >
> > For me the description of the high water as stated here doesn't seem to
> > explain anything.
> >
> > This section should have a description of how the reclaimer interacts
> > with the bgwriter logic. Do we put dirty buffers on the freelist that
> > are then cleaned by the bgwriter? Which buffers does the bgwriter write
> > out?
>
> As discussed in thread, I will change this accordingly.

Done

> > Move buffers with reference and a usage_count *of* zero to freelist. By
> > maintaining enough buffers in the freelist (up to the list's high water
> > mark), we drastically reduce the likelihood of individual backends
> > having to perform the clock sweep themselves.
>
> Okay will rephrase it as per your suggestion.

Done

> > > + * This is called by the background reclaim process when the number
> > > + * of buffers in freelist falls below low water mark of freelist.
> > > + */
> >
> > The logic used here *definitely* needs to be documented in another form
> > somewhere in the source.
>
> I think the way Robert has suggested to modify Readme adresses
> this to an extent, however if you think it is better to go in more
> detail, then I can expand function header of function
> BgMoveBuffersToFreelist() or on top of bgreclaimer.c, what do you prefer?
>
> > > +
> > > + if (tmp_num_to_free == 0)
> > > + break;
> >
> > num_to_free isn't a convincing name if I understand what this is doing
> > correctly. Maybe 'move_to_freelist', 'freelist_needed',
> > 'needed_on_freelist' or something like that?
>
> I think keeping num or count in name could be more helpful as it is used
> to loop for finding usable buffers. How about 'num_needed_on_freelist',
> without any count or num it sounds more like a boolean variable.

Changed.

> >
> > I wonder if we don't want to increase the high watermark when
> > tmp_recent_backend_clocksweep > 0?
> >
> > > + while (tmp_num_to_free > 0)
> > > + {
> ..
> > > + /*
> > > + * If the buffer is pinned or has a nonzero
usage_count, we cannot
> > > + * move it to freelist; decrement the
usage_count (unless pinned)
> > > + * and keep scanning.
> > > + */
> > > + LockBufHdr(bufHdr);
> >
> > Hm. Perhaps we should do a bufHdr->refcount != zero check without
> > locking here? The atomic op will transfer the cacheline exclusively to
> > the reclaimer's CPU. Even though it very shortly afterwards will be
> > touched afterwards by the pinning backend.
>
> As per discussion, the conclusion seems to be that we can do some
> more tests to see if we need such a change, I will do more tests on
> the lines suggested by Robert in below mails and then we can decide
> if any thing is required.

Still performance data related to this needs to be collected.

> > > +/*
> > > + * Water mark indicators for maintaining buffers on freelist. When
the
> > > + * number of buffers on freelist drops below the low water mark, the
> > > + * allocating backend sets the latch and bgreclaimer wakesup and
begin
> > > + * adding buffer's to freelist until it reaches high water mark and
then
> > > + * again goes back to sleep.
> > > + */
> >
> > s/wakesup/wakes up/; s/begin adding/begins adding/; s/buffer's/buffers/;
> > /to freelist/to the freelist/; s/reaches high water/reaches the high
water/
>
> Will change as per suggestions.

Done

> > > +int freelistLowWaterMark;
> > > +int freelistHighWaterMark;
> > > +
> > > +/* Percentage indicators for maintaining buffers on freelist */
> > > +#define HIGH_WATER_MARK_FREELIST_BUFFERS_PERCENT 0.005
> > > +#define LOW_WATER_MARK_FREELIST_BUFFERS_PERCENT 0.2
> > > +#define MIN_HIGH_WATER_MARK 5
> > > +#define MAX_HIGH_WATER_MARK 2000
> >
> > I'm confused. The high water mark percentage is smaller than the low
> > water mark?
>
> High water mark is percentage of NBuffers (shared_buffers).
> Low water mark is percentage of High water mark.
>
> I will add a comment.

Added a comment.

>
>
> > > - *lock_held = true;
> > > - LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);
> > > + SpinLockAcquire(&StrategyControl->freelist_lck);
> > >
> ..
> > > + bgreclaimerLatch = StrategyControl->bgreclaimerLatch;
> > > bgwriterLatch = StrategyControl->bgwriterLatch;
> >
> > I don't understand why these need to be grabbed under the spinlock?
>
> In earlier version of patch, it was done without spinklock,
> however Robert has given comment that as it was previously done
> with BufFreelist lock, these should be under spinlock (atleast
> that is what I understood) and I tested the patch again by having
> them under spinlock and didn't find any difference, so I have moved
> them under spinlock.

Not changed anything related to this.

> ..
> > > + if (numFreeListBuffers < freelistLowWaterMark &&
bgreclaimerLatch)
> > > + SetLatch(StrategyControl->bgreclaimerLatch);
> > > +
> ..
> > > }
> > > - UnlockBufHdr(buf);
> > > }
> >
> > I think it makes sense to break out this bit into its own
> > function. That'll make StrategyGetBuffer() a good bit easier to read.
>
> I will try to move it into a new function GetBufferFromFreelist().

Moved this part of code into new function.

>
> > > +/*
> > > + * StrategyMoveBufferToFreeListEnd: put a buffer on the end of
freelist
> > > + */
> > > +bool
> > > +StrategyMoveBufferToFreeListEnd(volatile BufferDesc *buf)
> > > +{
> >
> > Should maybe rather be named *Tail?
>
> Tail is better suited than End, I will change it.

Changed.

> > > +void
> > > +StrategyGetFreelistAccessInfo(uint32 *num_buf_to_free, uint32
*num_buf_alloc,
> > > + uint32
*num_buf_backend_clocksweep)
> > > +
> > > + if (num_buf_alloc)
> > > + {
> > > + *num_buf_alloc = StrategyControl->numBufferAllocs;
> > > + StrategyControl->numBufferAllocs = 0;
> > > + }
> > > + if (num_buf_backend_clocksweep)
> > > + {
> > > + *num_buf_backend_clocksweep =
StrategyControl->numBufferBackendClocksweep;
> > > + StrategyControl->numBufferBackendClocksweep = 0;
> > > + }
> > > + SpinLockRelease(&StrategyControl->freelist_lck);
> > > +
> > > + return;
> > > +}
> >
> > Do we need the if (num_buf_alloc) bits? Can't we make it unconditional?
>
> This API is more or less designed on lines of StrategySyncStart().
> Currently there is no case where we can't do it unconditional (same is
> true for num_buf_backend_clocksweep), however there is some merit
> to keep it in sync with existing API. Having said that if you feel we
should
> go about doing it unconditionally, I will change it in next patch.

I have made it unconditional.

> Hm, right. But then let's move BgWriterStats.m_buf_alloc =+,
> ... pgstat_send_bgwriter(); into that loop. Otherwise it'd possibly end
> up being continously busy without being visible.

Done and I have removed pgstat_send_bgwriter() call from
bgreclaimer loop, as after this change calling it there becomes
redundant.

Apart from this, I have changed kid of newly added function as
due to recent commit, the oid I was using is no longer available.

I have taken tpc-b data as well, it is with previous version of patch,
however the recent version hasn't changed much to impact performance
data. It takes long time to get tpc-b data, thats why I could not report
it previously.

I will post the data with the latest patch separately (where I will focus
on new cases discussed between Robert and Andres).

TPC-B Performance Data
----------------------------------------
Configuration and Db Details
IBM POWER-7 16 cores, 64 hardware threads
RAM = 64GB
Database Locale =C
checkpoint_segments=256
checkpoint_timeout =15min
wal_buffers = 256MB
shared_buffers=8GB
scale factor = 3000
Client Count = number of concurrent sessions and threads (ex. -c 8 -j 8)
Duration of each individual run = 30mins

Client_count/Patch_ver 32 64 128 HEAD 420 535 556 Patch (sbe_v8) 431 537
568

About performance data
-------------------------------------
1. This data is a median of 3 runs, individual run data can
be found in attached document (perf_read_scalability_data_v8.ods)
2. There is not much difference in performance between Head and patch
which shows that this patch hasn't regressed tpc-b case.

Steps to take tpc-b data, for each run:
1. start server
2. drop db
3. create db
4. initialize db
5. run tpc-b pgbench (used prepared statement)
6. checkpoint
7. stop server

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachment	Content-Type	Size
perf_read_scalability_data_v8.ods	application/vnd.oasis.opendocument.spreadsheet	17.3 KB
scalable_buffer_eviction_v9.patch	application/octet-stream	62.4 KB

In response to

Re: Scaling shared buffer eviction at 2014-09-12 06:25:05 from Amit Kapila

Responses

Re: Scaling shared buffer eviction at 2014-09-16 12:18:35 from Amit Kapila
Re: Scaling shared buffer eviction at 2014-09-20 08:14:40 from Amit Kapila

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Mark Kirkwood	2014-09-14 06:57:01	Re: Postgres code for a query intermediate dataset
Previous Message	Peter Geoghegan	2014-09-14 05:32:04	Re: PoC: Partial sort