Just-in-time Background Writer Patch+Test Results

From: Greg Smith <gsmith(at)gregsmith(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Just-in-time Background Writer Patch+Test Results
Date: 2007-09-06 03:31:56
Message-ID: Pine.GSO.4.64.0709052324020.25284@westnet.com
Lists: pgsql-hackers

Tom gets credit for naming the attached patch, which is my latest attempt to
finalize what has been called the "Automatic adjustment of
bgwriter_lru_maxpages" patch for 8.3; that's not what it does anymore but
that's where it started.

Background on testing
---------------------

I decided to use pgbench for running my tests. The scripting framework to
collect all that data and usefully summarize it is now available as
pgbench-tools-0.2 at
http://www.westnet.com/~gsmith/content/postgresql/pgbench-tools.htm

I hope to expand and actually document use of pgbench-tools in the future but
didn't want to hold the rest of this up on that work. That page includes basic
information about what my testing environment was and why I felt this was an
appropriate way to test background writer efficiency.

Quite a bit of raw data for all of the test sets summarized here is at
http://www.westnet.com/~gsmith/content/bgwriter/

The patches attached to this message are also available at:
http://www.westnet.com/~gsmith/content/postgresql/buf-alloc-2.patch
http://www.westnet.com/~gsmith/content/postgresql/jit-cleaner.patch
(This is my second attempt to send this message; I don't know why the
earlier one failed. I'm using gzip'd patches this time, and hopefully there
won't be a dupe.)

Baseline test results
---------------------

The first patch to apply attached to this message is the latest buf-alloc-2
that adds counters to pgstat_bgwriter for everything the background writer is
doing. Here's what we get out of the standard 8.3 background writer before and
after applying that patch, at various settings:

info | set | tps | cleaner_pct
------------------------------------+-----+------+-------------
HEAD nobgwriter | 5 | 994 |
HEAD+buf-alloc-2 nobgwriter | 6 | 1012 | 0
HEAD+buf-alloc-2 LRU=0.5%/500 | 16 | 974 | 15.94
HEAD+buf-alloc-2 LRU=5%/500 | 19 | 983 | 98.47
HEAD+buf-alloc-2 LRU=10%/500 | 7 | 997 | 99.95

cleaner_pct is the percentage of writes done by the BGW LRU cleaner relative
to a total that also includes the client backend writes; writes done by
checkpoints are not included in this summary computation, so it just shows the
balance of backend vs. BGW writes.
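
To be concrete, the computation amounts to something like the sketch below.
The counter names are illustrative rather than the exact ones the buf-alloc-2
patch adds to pgstat_bgwriter:

    /* Sketch of the cleaner_pct computation: share of non-checkpoint writes
     * handled by the LRU cleaner.  Counter names are illustrative only. */
    float
    cleaner_pct(long buffers_lru_cleaned, long buffers_backend)
    {
        long    total = buffers_lru_cleaned + buffers_backend;

        if (total == 0)
            return 0.0;             /* nothing written outside checkpoints */
        /* checkpoint writes are deliberately left out of the total */
        return (float) (100.0 * buffers_lru_cleaned / total);
    }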

The /500 means bgwriter_lru_maxpages=500, which I already knew was about as
many pages as this server ever dirties in a 200ms cycle. Without the
buf-alloc-2 patch I don't get statistics on the LRU cleaner; I include that
number as a baseline just to suggest that the buf-alloc-2 patch itself isn't
pulling down results.

Here we see that in order to get most of the writes to happen via the LRU
cleaner rather than having the backends handle them, you'd need to play with
the settings until the bgwriter_lru_percent was somewhere between 5% and 10%,
and it seems obvious that doing this doesn't improve the TPS results. The
margin of error here is big enough that I consider all these basically the same
performance. The question then is how to get this high level of writes by the
background writer automatically, without having to know what percentage to
scan; I wanted to remove bgwriter_lru_percent, while still keeping
bgwriter_lru_maxpages strictly as a way to throttle overall BGW activity.

First JIT Implementation
------------------------

The method I described in my last message on this topic (
http://archives.postgresql.org/pgsql-hackers/2007-08/msg00887.php ) implemented
a weighted moving average of how many pages were allocated, and based on
feedback from that I improved the code to allow a multiplier factor on top of
that. Here's the summary of those results:

info | set | tps | cleaner_pct
------------------------------------+-----+------+-------------
jit cleaner multiplier=1.0/500 | 9 | 981 | 94.3
jit cleaner multiplier=2.0/500 | 8 | 1005 | 99.78
jit multiplier=1.0/100 | 10 | 985 | 68.14

That's pretty good. As long as maxpages is set intelligently, it gets most of
the writes even with the multiplier of 1.0, and cranking it up to the 2.0
suggested by the original Itagaki Takahiro patch gets nearly all of them.
Again, there's really no performance change here in throughput by any of this.
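
The core of that estimate amounts to something like the following sketch; the
names are illustrative and the real code is in the patch's BgBufferSync, so
treat this only as an outline of the idea:

    #include <math.h>

    /* Sketch of the JIT target: smooth the recent allocation count with a
     * weighted moving average, scale it by the multiplier, and cap it at
     * bgwriter_lru_maxpages.  Illustrative names, not the patch's own. */
    #define SMOOTHING_SAMPLES 16

    static float smoothed_alloc = 0;

    int
    jit_clean_target(int recent_alloc, float multiplier, int lru_maxpages)
    {
        int     target;

        /* weighted moving average of buffers allocated per cycle */
        smoothed_alloc += ((float) recent_alloc - smoothed_alloc) /
                          SMOOTHING_SAMPLES;

        target = (int) rint(multiplier * smoothed_alloc);

        /* bgwriter_lru_maxpages still acts as the overall throttle */
        if (target > lru_maxpages)
            target = lru_maxpages;
        return target;
    }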

Coping with idle periods
------------------------

While I was basically happy with these results, the data Kevin Grittner
submitted in response to my last call for commentary left me concerned. While
the JIT approach works fine as long as your system is active, it does
absolutely nothing if the system is idle. I noticed that a lot of the writes
that were being done by the client backends were after idle periods where the
JIT writer just didn't react fast enough during the ramp-up. For example, if
the system went from idle for a while to full-speed just as the 200ms sleep
started, by the time the BGW woke up again the backends could have needed to
write many buffers already themselves.

Ideally, idle periods should be used to slowly trickle dirty pages out, so that
there are fewer of them hanging around when a checkpoint shows up, or so that
reusable pages are already available. The question then is how fast to go about
that trickle. Heikki's background writer tests and my own suggest that if you
make the rate during quiet periods too high, you'll clog the underlying buffers
with some writes that end up being duplicated and lower overall efficiency.
But all of those tests had the background writer going at a constant and
relatively high speed.

I wanted to keep the ability to scan the entire buffer cache, using the latest
idea of never looking at the same buffer twice, but to do that slowly when idle
and using the JIT rate otherwise. This is sort of a hybrid of the old LRU
cleaner behavior (scan a fixed %) at a low speed with the new approach (scan
based on allocations, however many of them there are). I started with the old
default of 0.5% used by bgwriter_lru_percent (a tunable already removed by the
patch at this point), added logic to tack that onto the JIT approach
intelligently, and got these results:

info | set | tps | cleaner_pct
------------------------------------+-----+------+-------------
jit multiplier=1.0 min scan=0.5% | 13 | 882 | 100
jit multiplier=1.5 min scan=0.5% | 12 | 871 | 100
jit multiplier=2.0 min scan=0.5% | 11 | 910 | 100
jit multiplier=1.0 min scan=0.25% | 14 | 982 | 98.34

It's nice to see fully 100% of the buffers written by the cleaner with the
hybrid approach; I feel that validates my idea that just a bit more work needs
to be done during idle periods to completely fix the issue with it not reacting
fast enough during the idle/full speed transition. But look at the drop in
TPS. While I'm willing to say a couple of percent change isn't significant in
a pgbench result, those <900 results are clearly bad. This is crossing that
line where inefficient writes are being done. I'm happier with the result
using the smaller min scan=0.25% even though it doesn't quite get every write
that way.

Making percentage independent of delay
--------------------------------------

But a new problem here is that if you lower bgwriter_delay, the minimum scan
percentage needs to drop too, and my goal was to reduce the number of tunables
people need to tinker with. Assuming you're not stopped by the maxpages
parameter, with the default delay=200ms a scan that hits 0.5% each time will
cover 5*0.5%=2.5% of the buffer cache per second, which means it will take 40
seconds to scan the entire pool. Using 0.25% means 80 seconds per full scan. I
improved the overall algorithm a bit and decided to set this parameter in an
alternate way: by how long it should take to creep its way through the entire
buffer cache if the JIT code is idle. I decided I liked 120 seconds as the
value for that parameter, which is a slower rate than any of the above but
still a reasonable one for a typical application. Here's what the results look
like using that approach:

info | set | tps | cleaner_pct
------------------------------------+-----+------+-------------
jit multiplier=1.0 scan_whole=120s | 18 | 970 | 99.99
jit multiplier=1.5 scan_whole=120s | 15 | 995 | 99.93
jit multiplier=2.0 scan_whole=120s | 17 | 981 | 99.98

Now here are results I'm happy with. The TPS results are almost unchanged from
where we started, with minimal inefficient writes, but almost all the
writes are being done by the cleaner process. The results appear much less
sensitive to what you set the multiplier to. And unless you use an unreasonably
low value for maxpages (which will quickly become obvious if you monitor
pg_stat_bgwriter and look for maxwritten_clean increasing fast), you'll get a
complete scan of the buffer cache within 2 minutes even if there's no system
activity. But once that's done, until more buffers are allocated the code
won't even look at the buffer cache again (as opposed to the current code,
which is always looking at buffers and acquiring locks even if nothing is going
on).
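
For reference, the way scan_whole_pool_seconds turns into a per-cycle minimum
works out roughly like this; it is only a sketch with made-up names, the actual
logic lives in BgBufferSync:

    #include <math.h>

    /* Sketch: convert scan_whole_pool_seconds into a minimum number of
     * buffers to examine each cycle, independent of bgwriter_delay.
     * Illustrative only. */
    int
    min_scan_per_cycle(int NBuffers, int delay_ms, float scan_whole_pool_s)
    {
        /* cycles available to complete one full sweep at this delay */
        float   cycles = scan_whole_pool_s * 1000.0 / delay_ms;

        return (int) ceil(NBuffers / cycles);
    }

    /*
     * Example: 20000 buffers, delay=200ms, scan_whole=120s gives 600 cycles,
     * so a floor of about 34 buffers examined per cycle.  The cleaner then
     * uses the larger of this floor and the JIT estimate, still capped by
     * bgwriter_lru_maxpages.
     */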

I think I can safely say there is a level of intelligence going into what the
LRU background writer does with this patch that has never been applied to this
problem before. There have been a lot of good ideas thrown out in this area,
but it took a hybrid approach that included and carefully balanced all of them
to actually get results that I felt were usable. What I don't know is whether
that will also be true for other testers.

Patch review
------------

The attached jit-cleaner.patch implements this approach, and if you just want
to look at the main code involved without having to apply the patch you can
browse the BgBufferSync function in bufmgr.c starting around line 1120 at
http://www.westnet.com/~gsmith/content/postgresql/bufmgr.c

There is a lot of internal debugging information dumped into the logs if you
toggle on #define BGW_DEBUG. The gross summary of the two most important things
that show what the code is doing is logged at DEBUG1 (but should probably be
pushed lower before committing).

This code is as good as you're going to get from me before the 8.3 close. I
could do some small rewriting and certainly can document all this further as
part of getting this patch moved toward committed, but I'm out of resources to
do too much more here. Along with the big question of whether this whole idea
is worth following at all as part of 8.3, here are the remaining small
questions I feel review feedback would be valuable on related to my specific
code:

-The way I'm getting the passes number back from the freelist.c strategy code
seems like it will eventually overflow the long I'm using for the intermediate
results when I execute statements like this:

strategy_position=(long)strategy_passes * NBuffers + strategy_buf_id;

I'm not sure if the code would be better if I were to use a 64-bit integer for
strategy_position instead, or if I should just rewrite the code to separate out
the passes multiplication--which will make it less elegant to read but should
make overflow issues go away.
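
For reference, the two alternatives look roughly like this (sketch only,
illustrative names):

    #include <stdint.h>

    /* Alternative 1: do the multiplication in 64 bits so the product of
     * passes * NBuffers can't overflow a 32-bit long. */
    int64_t
    strategy_position64(uint32_t strategy_passes, int NBuffers,
                        int strategy_buf_id)
    {
        return (int64_t) strategy_passes * NBuffers + strategy_buf_id;
    }

    /* Alternative 2: never form the full product; work only with the small
     * delta in passes and buffer position since the previous cycle. */
    long
    strategy_delta(long prev_passes, int prev_buf_id,
                   long cur_passes, int cur_buf_id, int NBuffers)
    {
        return (cur_passes - prev_passes) * (long) NBuffers +
               (cur_buf_id - prev_buf_id);
    }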

-Heikki didn't like the way I pass information from SyncOneBuffer back to
the background writer. The bitmask approach I'm using adds flexibility for
writing more intelligent background writers in the future. In the past I have
written more complicated ones than any of the approaches mentioned here, using
things like the usage_count information returned, but the simpler
implementation here ignores that. I could simplify this interface if I
had to, but I like what I've done as a solid structure for future coding as
it's written right now.

-There are two magic constants in the code:

int smoothing_samples = 16;
float scan_whole_pool_seconds = 120.0;

I believe I've done enough testing recently and in the past to say these are
reasonable numbers for most installations, and high-throughput systems are
going to care more about tuning the multiplier GUC than either of these. In
the interest of having less knobs people can fool with and break, I personally
don't feel like these constants need to be exposed for tuning purposes; they
don't have a significant impact on how the underlying model works. Determining
whether these should be exposed as GUC tunables is certainly an open question
though.

-I bumped the default for bgwriter_lru_maxpages to 100 so that typical low-end
systems should get an automatically tuning LRU background writer out of the box
in 8.3. This is a big change from the 5 that was used in the older releases.
If you keep everything at the defaults this represents a maximum theoretical
write rate for the BGW of 4MB/s (100 8KB pages per 200ms cycle, 5 cycles per
second), which isn't very much relative to modern hardware.

--
* Greg Smith gsmith(at)gregsmith(dot)com http://www.gregsmith.com Baltimore, MD

Attachment Content-Type Size
jit-cleaner.patch.gz application/octet-stream 6.5 KB
buf-alloc-2.patch.gz application/octet-stream 4.6 KB

From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "Greg Smith" <gsmith(at)gregsmith(dot)com>,<pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Just-in-time Background Writer Patch+Test Results
Date: 2007-09-06 14:20:31
Message-ID: 46DFC65F.EE98.0025.0@wicourts.gov
Lists: pgsql-hackers

>>> On Wed, Sep 5, 2007 at 10:31 PM, in message
<Pine(dot)GSO(dot)4(dot)64(dot)0709052324020(dot)25284(at)westnet(dot)com>, Greg Smith
<gsmith(at)gregsmith(dot)com> wrote:
>
> -There are two magic constants in the code:
>
> int smoothing_samples = 16;
> float scan_whole_pool_seconds = 120.0;
>

> I personally
> don't feel like these constants need to be exposed for tuning purposes;

> Determining
> whether these should be exposed as GUC tunables is certainly an open
> question though.

If you exposed the scan_whole_pool_seconds as a tunable GUC, that would
allay all of my concerns about this patch. Basically, our problems were
resolved by getting all dirty buffers out to the OS cache within two
seconds; any longer than that and the OS cache didn't reach its trigger
point for pushing out to the controller cache in time to prevent the glut
which locks everything up. I also suspect that this interval kept the OS
cache more aware of frequently updated pages, so that it could avoid
unnecessary physical writes under its own logic.

While I'm hoping that the new checkpoint techniques will be a better
solution, I can't count on that without significant testing in our
environment, and I really want a fall-back. The metric you emphasized was
the percentage of PostgreSQL writes to the OS cache which were handled by
the background writer; that doesn't necessarily correspond to a solution
to the glut, which is based on the peak number of total writes presented
to the controller by the OS within a small window of time.

-Kevin


From: Greg Smith <gsmith(at)gregsmith(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Just-in-time Background Writer Patch+Test Results
Date: 2007-09-06 16:27:44
Message-ID: Pine.GSO.4.64.0709061121020.14491@westnet.com
Lists: pgsql-hackers

On Thu, 6 Sep 2007, Kevin Grittner wrote:

> If you exposed the scan_whole_pool_seconds as a tunable GUC, that would
> allay all of my concerns about this patch. Basically, our problems were
> resolved by getting all dirty buffers out to the OS cache within two
> seconds

Unfortunately it wouldn't make my concerns about your system go away or
I'd have recommended exposing it specifically to address your situation.
I have been staring carefully at your configuration recently, and I would
wager that you could turn off the LRU writer altogether and still meet
your requirements in 8.2. Here's what you've got right now:

> shared_buffers = 160MB (=20000 buffers)
> bgwriter_lru_percent = 20.0
> bgwriter_lru_maxpages = 200
> bgwriter_all_percent = 10.0
> bgwriter_all_maxpages = 600

With the default delay of 200ms, this has the LRU-writer scanning the
whole pool every 1 second, while the all-writer scans every two
seconds--assuming they don't hit the write limits. If some event were to
dirty the whole pool in 200ms, it might take as much as 6.7 seconds to
write everything out (20000 / 600 * 200 ms) via the all-scan. The
all-scan is already gone in 8.3. Your LRU scan will take much longer than
that to clear everything out: at least (20000 / 200 * 200ms) 20 seconds
to clear a fully dirty cache.

But in fact, it's impossible to even bound how long it will take before
the LRU writer (which is the only part this new patch tries to improve)
gets around to writing even a single dirty buffer no matter what
bgwriter_lru_percent (8.2) or scan_whole_pool_seconds (JIT patch) is set
to.

There's a second low-level issue involved here. When a page becomes
dirty, that implies it was also recently used, which means the LRU writer
won't touch it. That page can't be written out by the LRU writer until an
entire pass has been made over the shared_buffer pool while looking for
buffers to allocate for new activity. When the allocation clock-sweep
passes over the newly dirtied buffer again, its usage count will drop by
one and it will no longer be considered recently used. At that point the
LRU writer can write it out. So unless there is other allocation activity
going on, the scan_whole_pool_seconds mechanism will never provide the
bound on time to scan and write everything you hope it will.
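
To spell out the mechanism, here is a rough sketch of the rules involved
(illustrative, not the actual bufmgr code):

    #include <stdbool.h>

    typedef struct
    {
        int     usage_count;    /* bumped when the buffer is used; dirtying
                                 * a buffer implies it was just used */
        bool    dirty;
    } BufferSketch;

    /* allocation clock sweep: each pass over a buffer ages it by one */
    bool
    sweep_found_victim(BufferSketch *buf)
    {
        if (buf->usage_count > 0)
        {
            buf->usage_count--;     /* not reusable yet, just decrement */
            return false;
        }
        return true;                /* usage_count == 0: reusable */
    }

    /* LRU cleaner: only writes buffers that are both reusable and dirty,
     * so a freshly dirtied buffer can't be written until the sweep has
     * passed over it enough times to bring its count back to zero */
    bool
    lru_cleaner_should_write(const BufferSketch *buf)
    {
        return buf->usage_count == 0 && buf->dirty;
    }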

And if there's other allocations going on, the much more powerful JIT
mechanism will scan the whole pool plenty fast if you bump the already
exposed multiplier tunable up. In my tests where the buffer cache was
filled with mostly dirty buffers that couldn't be re-used (something
relatively easy to trigger with pgbench tests), I've actually watched the
new code scan >90% of the buffer cache looking for those few reusable
buffers in the pool in a single invocation. This would be like setting
bgwriter_lru_percent=90.0 in the old configuration, but it only gets that
aggressive when the distribution of pages in the buffer cache demands it,
and when it has reason to believe going that fast will be helpful.

The completely understandable line of thinking that led to your request
here is one of my concerns with exposing scan_whole_pool_seconds as a
tunable. It may suggest to people that if they set the number very low,
it will assure all dirty buffers will be scanned and written within that
time bound. That's certainly not the case; both the maxpages and the
usage count information will actually drive the speed that mechanism plods
through the buffer cache. It really isn't useful for scanning fast.

--
* Greg Smith gsmith(at)gregsmith(dot)com http://www.gregsmith.com Baltimore, MD


From: Decibel! <decibel(at)decibel(dot)org>
To: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: Greg Smith <gsmith(at)gregsmith(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Just-in-time Background Writer Patch+Test Results
Date: 2007-09-06 22:50:41
Message-ID: 20070906225040.GQ38801@decibel.org
Lists: pgsql-hackers

On Thu, Sep 06, 2007 at 09:20:31AM -0500, Kevin Grittner wrote:
> >>> On Wed, Sep 5, 2007 at 10:31 PM, in message
> <Pine(dot)GSO(dot)4(dot)64(dot)0709052324020(dot)25284(at)westnet(dot)com>, Greg Smith
> <gsmith(at)gregsmith(dot)com> wrote:
> >
> > -There are two magic constants in the code:
> >
> > int smoothing_samples = 16;
> > float scan_whole_pool_seconds = 120.0;
> >
>
> > I personally
> > don't feel like these constants need to be exposed for tuning purposes;
>
> > Determining
> > whether these should be exposed as GUC tunables is certainly an open
> > question though.
>
> If you exposed the scan_whole_pool_seconds as a tunable GUC, that would
> allay all of my concerns about this patch. Basically, our problems were

I like the idea of not having that as a GUC, but I'm doubtful that it
can be hard-coded like that. What if checkpoint_timeout is set to 120?
Or 60? Or 2000?

I don't know that there should be a direct correlation, but ISTM that
scan_whole_pool_seconds should take checkpoint intervals into account
somehow.
--
Decibel!, aka Jim Nasby decibel(at)decibel(dot)org
EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "Greg Smith" <gsmith(at)gregsmith(dot)com>,<pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Just-in-time Background Writer Patch+Test Results
Date: 2007-09-07 00:20:13
Message-ID: 46E052ED.EE98.0025.0@wicourts.gov
Lists: pgsql-hackers

>>> On Thu, Sep 6, 2007 at 11:27 AM, in message
<Pine(dot)GSO(dot)4(dot)64(dot)0709061121020(dot)14491(at)westnet(dot)com>, Greg Smith
<gsmith(at)gregsmith(dot)com> wrote:
> On Thu, 6 Sep 2007, Kevin Grittner wrote:
>
> I have been staring carefully at your configuration recently, and I would
> wager that you could turn off the LRU writer altogether and still meet
> your requirements in 8.2.

I totally agree that it is of minor benefit compared to the all-writer,
if it even matters at all. I knew that when I chose the settings.

> Here's what you've got right now:
>
>> shared_buffers = 160MB (=20000 buffers)
>> bgwriter_lru_percent = 20.0
>> bgwriter_lru_maxpages = 200
>> bgwriter_all_percent = 10.0
>> bgwriter_all_maxpages = 600
>
> With the default delay of 200ms, this has the LRU-writer scanning the
> whole pool every 1 second,

Whoa! Apparently I've totally misread the documentation. I thought that
the bgwriter_lru_percent was scanned from the lru end each time; I would
not expect that it would ever get beyond the oldest 10%. I put that in
just as a guard to keep the backends from having to wait for the OS write.
I've always doubted whether it was helping, but "it wasn't broke"....

> while the all-writer scans every two
> seconds--assuming they don't hit the write limits. If some event were to
> dirty the whole pool in 200ms, it might take as much as 6.7 seconds to
> write everything out (20000 / 600 * 200 ms) via the all-scan.

Right. Since the file system didn't seem to be able to accept writes
faster than 800 PostgreSQL pages per second, and I wanted to leave a
LITTLE slack, I set that limit. We don't seem to hit it, as far as I can
tell. In fact, the output rate would be naturally fairly smooth, if not
for the "hold all dirty pages until the last possible moment, then write
them all to the OS and fsync" approach.

> There's a second low-level issue involved here. When a page becomes
> dirty, that implies it was also recently used, which means the LRU writer
> won't touch it. That page can't be written out by the LRU writer until an
> entire pass has been made over the shared_buffer pool while looking for
> buffers to allocate for new activity. When the allocation clock-sweep
> passes over the newly dirtied buffer again, its usage count will drop by
> one and it will no longer be considered recently used. At that point the
> LRU writer can write it out.

How low does the count have to go, or does it track the count when it
becomes dirty and look for a decrease?

> So unless there is other allocation activity
> going on, the scan_whole_pool_seconds mechanism will never provide the
> bound on time to scan and write everything you hope it will.

That may not be an issue for the environment where this has been a problem
for us -- the web hits are coming in at a pretty good rate 24/7. (We have
a couple dozen large companies scanning data through HTTP SOAP requests
all the time.) This should keep us reading new pages, which covers this,
yes?

> where the buffer cache was
> filled with mostly dirty buffers that couldn't be re-used

That would be the condition that would be the killer with a synchronous
checkpoint if the OS cache has already had some dirty pages trickled out.
If we can hit this condition in our web database, either the load
distributed checkpoint will save us, or we can't use 8.3. Period.

> The completely understandable line of thinking that led to your request
> here is one of my concerns with exposing scan_whole_pool_seconds as a
> tunable. It may suggest to people that if they set the number very low,
> it will assure all dirty buffers will be scanned and written within that
> time bound. That's certainly not the case; both the maxpages and the
> usage count information will actually drive the speed that mechanism plods
> through the buffer cache. It really isn't useful for scanning fast.

I'm not clear on the benefit of not writing the recently accessed dirty
pages when there are no less recently used dirty pages. I do trust the OS
to not write them before they age out in that cache, and the OS cache
doesn't start writing dirty pages from its cache until they reach a
certain percentage of the cache space, so I'd just as soon let the OS know
that the MRU dirty pages are there, so it knows that it's time to start
working on the LRU pages in its cache.

-Kevin


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: "Greg Smith" <gsmith(at)gregsmith(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Just-in-time Background Writer Patch+Test Results
Date: 2007-09-07 01:08:53
Message-ID: 8063.1189127333@sss.pgh.pa.us
Lists: pgsql-hackers

"Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov> writes:
> On Thu, Sep 6, 2007 at 11:27 AM, in message
> <Pine(dot)GSO(dot)4(dot)64(dot)0709061121020(dot)14491(at)westnet(dot)com>, Greg Smith
> <gsmith(at)gregsmith(dot)com> wrote:
>> With the default delay of 200ms, this has the LRU-writer scanning the
>> whole pool every 1 second,
>
> Whoa! Apparently I've totally misread the documentation. I thought that
> the bgwriter_lru_percent was scanned from the lru end each time; I would
> not expect that it would ever get beyond the oldest 10%.

I believe you're correct and Greg got this wrong. I won't draw any
conclusions about whether the LRU stuff is actually doing you any good
though.

regards, tom lane


From: Greg Smith <gsmith(at)gregsmith(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Just-in-time Background Writer Patch+Test Results
Date: 2007-09-07 02:28:39
Message-ID: Pine.GSO.4.64.0709062159040.10836@westnet.com
Lists: pgsql-hackers

On Thu, 6 Sep 2007, Kevin Grittner wrote:

> I thought that the bgwriter_lru_percent was scanned from the lru end
> each time; I would not expect that it would ever get beyond the oldest
> 10%.

You're correct; I stated that badly. What I should have said is that your
LRU writer could potentially scan the pool as fast as once per second if
there were enough allocations going on.

> How low does the count have to go, or does it track the count when it
> becomes dirty and look for a decrease?

The usage count has to be 0 before a page can be re-used for a new
allocation, and the LRU background writer only writes out potentially
reusable pages that are dirty. So the count has to be 0 before it will
write it.

> This should keep us reading new pages, which covers this, yes?

One would hope. Your whole arrangement of shared_buffers,
checkpoint_segments, and related parameters will need to be reconsidered
for 8.3; you've got a delicately balanced arrangement for your 8.2 setup
right now that's working for you, but just translating it straight to 8.3
won't get you what you want. I'll get back to the message you already
sent on that subject when I get enough time to address it fully.

> I'm not clear on the benefit of not writing the recently accessed dirty
> pages when there are no less recently used dirty pages.

This presumes PostgreSQL has some notion of the balance of recently
accessed vs. not accessed dirty pages, which it does not. Buffers get
updated individually, and there's no mechanism summarizing what's in
there; you have to scan the buffer cache yourself to figure that out. I
do some of that in this new patch, tracking things like how many buffers
are scanned on average to find reusable ones.

Many months ago, I wrote a very complicated re-implementation of the
all-scan portion of the background writer that tracked the usage count of
everything it looked at, kept statistics about how many pages were dirty
at each usage count, then targeted how high of a usage count could be
written given some information about what I/O rate you felt your devices
could sustain. This did exactly what you're asking for here: wrote
whatever dirty pages were around starting with the ones that hadn't been
recently used, then worked its way up to pages with a higher usage count
if the recently used ones were all clean.

As far as I've been able to tell, and from Heikki's test results, the load
distributed checkpoint was a better answer to this problem. Rather than
constantly fight to get pages with high usage counts out all the time,
just spread the checkpoint out instead and deal with them only then. I
gave up on that branch of code while he removed the all-scan writer
altogether as part of committing LDC. I suspect the path I was following
was exactly what you think you'd like to have, but it seems that it's not
actually needed.

--
* Greg Smith gsmith(at)gregsmith(dot)com http://www.gregsmith.com Baltimore, MD


From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: Greg Smith <gsmith(at)gregsmith(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Just-in-time Background Writer Patch+Test Results
Date: 2007-09-07 10:03:43
Message-ID: 1189159423.4175.501.camel@ebony.site
Lists: pgsql-hackers

On Wed, 2007-09-05 at 23:31 -0400, Greg Smith wrote:

> Tom gets credit for naming the attached patch, which is my latest attempt to
> finalize what has been called the "Automatic adjustment of
> bgwriter_lru_maxpages" patch for 8.3; that's not what it does anymore but
> that's where it started.

This is a big undertaking, so well done for going for it.

> I decided to use pgbench for running my tests. The scripting framework to
> collect all that data and usefully summarize it is now available as
> pgbench-tools-0.2 at
> http://www.westnet.com/~gsmith/content/postgresql/pgbench-tools.htm

For me, the main role of the bgwriter is to avoid dirty writes in
backends. The purpose of doing that is to improve the response time
distribution as perceived by users. I think that is what we should be
measuring, perhaps in a simple way such as calculating the 90th
percentile of the response time distribution. Looking at the "headline
numbers", especially tps, it is notoriously difficult to determine any
meaning from test results.
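
Something as simple as the following, run over the per-transaction latencies
pgbench already records, would do (sketch only, not part of any existing
tool):

    #include <stdlib.h>

    /* Sketch: nearest-rank 90th percentile of response times in ms.
     * Purely illustrative; not part of pgbench or pgbench-tools. */
    static int
    cmp_double(const void *a, const void *b)
    {
        double  da = *(const double *) a;
        double  db = *(const double *) b;

        return (da > db) - (da < db);
    }

    double
    percentile_90(double *latencies, size_t n)
    {
        qsort(latencies, n, sizeof(double), cmp_double);
        return latencies[(size_t) (0.9 * (n - 1))];
    }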

Looking at the tps also tempts us to run a test which maxes out the
server, an area we already know and expect the bgwriter to be unhelpful
in.

If I run a server at or below 70% capacity, what settings of the
bgwriter help maintain my response time distribution?

> Coping with idle periods
> ------------------------
>
> While I was basically happy with these results, the data Kevin Grittner
> submitted in response to my last call for commentary left me concerned. While
> the JIT approach works fine as long as your system is active, it does
> absolutely nothing if the system is idle. I noticed that a lot of the writes
> that were being done by the client backends were after idle periods where the
> JIT writer just didn't react fast enough during the ramp-up. For example, if
> the system went from idle for a while to full-speed just as the 200ms sleep
> started, by the time the BGW woke up again the backends could have needed to
> write many buffers already themselves.

You've hit the nail on the head there. I can't see how you can do
anything sensible when the bgwriter keeps going to sleep for long
periods.

The bgwriter's activity curve should ideally be the same shape as a
critically damped harmonic oscillator. It should wake up, do lots of
writing if needed, then trail off over time. The only way to do that
seems to be to vary the sleep automatically, or make short sleeps.

For me, the bgwriter should sleep for at most 10ms at a time. If it has
nothing to do it can go straight back to sleep again. Trying to set that
time is fairly difficult, so it would be better not to have to set it at
all.

If you've changed bgwriter so it doesn't scan if no blocks have been
allocated, I don't see any reason to keep the _delay parameter at all.

> I think I can safely say there is a level of intelligence going into what the
> LRU background writer does with this patch that has never been applied to this
> problem before. There have been a lot of good ideas thrown out in this area,
> but it took a hybrid approach that included and carefully balanced all of them
> to actually get results that I felt were usable. What I don't know is whether
> that will also be true for other testers.

I get the feeling that what we have here is better than what we had
before, but I guess I'm a bit disappointed we still have 3 magic
parameters, or 5 if you count your hard-coded ones also.

There's still no formal way to tune these. As long as we have *any*
magic parameters, we need a way to tune them in the field, or they are
useless. At the very least we need a plan for how people will report results
during Beta. That means we need a log_bgwriter (better name, please...)
parameter that provides information to assist with tuning. At the very
least we need this to be present during Beta, if not beyond.

--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com


From: Greg Smith <gsmith(at)gregsmith(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Just-in-time Background Writer Patch+Test Results
Date: 2007-09-07 15:48:42
Message-ID: Pine.GSO.4.64.0709071119140.16702@westnet.com
Lists: pgsql-hackers

On Fri, 7 Sep 2007, Simon Riggs wrote:

> I think that is what we should be measuring, perhaps in a simple way
> such as calculating the 90th percentile of the response time
> distribution.

I do track the 90th percentile numbers, but in these pgbench tests where
I'm writing as fast as possible they're actually useless--in many cases
they're *smaller* than the average response, because there are enough
cases where there is a really, really long wait that they skew the average
up really hard. Take a look at any of the individual test graphs and
you'll see what I mean.

> Looking at the tps also tempts us to run a test which maxes out the
> server, an area we already know and expect the bgwriter to be unhelpful
> in.

I tried to turn that around and make my thinking be that if I built a
bgwriter that did most of the writes without badly impacting the measure
we know and expect it to be unhelpful in, that would be more likely to
yield a robust design. It kept me out of areas where I might have built
something that had to be disclaimed with "don't run this when the server
is maxed out".

> For me, the bgwriter should sleep for at most 10ms at a time. If it has
> nothing to do it can go straight back to sleep again. Trying to set that
> time is fairly difficult, so it would be better not to have to set it at
> all.

I wanted to get this patch out there so people could start thinking about
what I'd done and consider whether this still fit into the 8.3 timeline.
What I'm doing myself right now is running tests with a much lower setting
for the delay time--am testing 20ms right now. I personally would be
happy saying it's 10ms and that's it. Is anyone using a time lower than
that right now? I seem to recall that 10ms was also the shortest interval
Heikki used in his tests.

> I get the feeling that what we have here is better than what we had
> before, but I guess I'm a bit disappointed we still have 3 magic
> parameters, or 5 if you count your hard-coded ones also.

I may be able to eliminate more of them, but I didn't want to take them
out before beta. If it can be demonstrated that some of these parameters
can be set to specific values and still work across a wider range of
applications than what I've tested, then there's certainly room to fix
some of these, which actually makes some things easier. For example, I'd
be more confident fixing the weighted average smoothing period to a
specific number if I knew the delay was fixed, and there's two parameters
gone. And the multiplier is begging to be eliminated, just need some more
data to confirm that's true.

> There's still no formal way to tune these. As long as we have *any*
> magic parameters, we need a way to tune them in the field, or they are
> useless. At very least we need a plan for how people will report results
> during Beta. That means we need a log_bgwriter (better name, please...)
> parameter that provides information to assist with tuning.

Once I got past the "does it work?" stage, I've been doing all the tuning
work using a before/after snapshot of pg_stat_bgwriter data during a
representative snapshot of activity and looking at the delta. Been a
while since I actually looked into the logs for anything. It's very
straightforward to put together a formal tuning plan using the data in
there, particularly compared to the impossibility of creating such a
plan in the current code.

--
* Greg Smith gsmith(at)gregsmith(dot)com http://www.gregsmith.com Baltimore, MD


From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: Greg Smith <gsmith(at)gregsmith(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Just-in-time Background Writer Patch+Test Results
Date: 2007-09-07 17:08:42
Message-ID: 1189184922.4175.545.camel@ebony.site
Lists: pgsql-hackers

On Fri, 2007-09-07 at 11:48 -0400, Greg Smith wrote:
> On Fri, 7 Sep 2007, Simon Riggs wrote:
>
> > I think that is what we should be measuring, perhaps in a simple way
> > such as calculating the 90th percentile of the response time
> > distribution.
>
> I do track the 90th percentile numbers, but in these pgbench tests where
> I'm writing as fast as possible they're actually useless--in many cases
> they're *smaller* than the average response, because there are enough
> cases where there is a really, really long wait that they skew the average
> up really hard. Take a look at any of the individual test graphs and
> you'll see what I mean.

I've looked at the graphs now, but I'm not any wiser, I'm very sorry to
say. We need something like a frequency distribution curve, not just the
actual times. Bottom line is we need a good way to visualise the
detailed effects of the patch.

I think we should do some more basic tests to see where those outliers
come from. We need to establish a clear link between number of dirty
writes and response time. If there is one, which we all believe, then it
is worth minimising those with these techniques. We might just be
chasing the wrong thing.

Perhaps output the number of dirty blocks written on the same line as
the output of log_min_duration_statement so that we can correlate
response time to dirty-block-writes on that statement.

For me, we can enter Beta while this is still partially in the air. We
won't be able to get this right without lots of other feedback. So I
think we should concentrate now on making sure we've got the logging in
place so we can check whether your patch works when its out there. I'd
say lets include what you've done and then see how it works during Beta.
We've been trying to get this right for years now, so we have to allow
some slack to make sure we get this right. We can reduce or strip out
logging once we go RC.

--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com


From: Greg Smith <gsmith(at)gregsmith(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Just-in-time Background Writer Patch+Test Results
Date: 2007-09-07 17:48:30
Message-ID: Pine.GSO.4.64.0709071324380.7439@westnet.com
Lists: pgsql-hackers

On Fri, 7 Sep 2007, Simon Riggs wrote:

> I think we should do some more basic tests to see where those outliers
> come from. We need to establish a clear link between number of dirty
> writes and response time.

With the test I'm running, which is specifically designed to aggravate
this behavior, the outliers on my system come from how Linux buffers
writes. I can adjust them a bit by playing with the parameters as
described at http://www.westnet.com/~gsmith/content/linux-pdflush.htm but
on the hardware I've got here (single 7200RPM disk for database, another
for WAL) they don't move much. Once /proc/meminfo shows enough Dirty
memory that pdflush starts blocking writes, game over; you're looking at
multi-second delays before my plain old IDE disks clear enough debris out
to start responding to new requests even with the Areca controller I'm
using.

> Perhaps output the number of dirty blocks written on the same line as
> the output of log_min_duration_statement so that we can correlate
> response time to dirty-block-writes on that statement.

On Linux at least, I'd expect this won't reveal much. There, the
interesting correlation is with how much dirty data is in the underlying
OS buffer cache. And exactly how that plays into things is a bit strange
sometimes. If you go back to Heikki's DBT2 tests with the background
writer schemes he tested, he got frustrated enough with that disconnect
that he wrote a little test program just to map out the underlying
weirdness:
http://archives.postgresql.org/pgsql-hackers/2007-07/msg00261.php

I've confirmed his results on my system and done some improvements to that
program myself, but pushed further work on it to the side to finish up the
main background writer task instead. I may circle back to that. I'd
really like to run all this on another OS as well (I have Solaris 10 on my
server box but not fully setup yet), but I can only volunteer so much time
to work on all this right now.

If there's anything that needs to be looked at more carefully during tests
in this area, it's getting more data about just what the underlying OS is
doing while all this is going on. Just the output from vmstat/iostat is
very informative. Those using DBT2 for their tests get some nice graphs
of this already. I've done some pgbench-based tests before that included that
data and were very enlightening, but sadly that system isn't available
to me anymore.

--
* Greg Smith gsmith(at)gregsmith(dot)com http://www.gregsmith.com Baltimore, MD


From: Greg Smith <gsmith(at)gregsmith(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Just-in-time Background Writer Patch+Test Results
Date: 2007-09-08 17:39:21
Message-ID: Pine.GSO.4.64.0709081245210.14490@westnet.com
Lists: pgsql-hackers

On Fri, 7 Sep 2007, Simon Riggs wrote:

> For me, the bgwriter should sleep for at most 10ms at a time.

Here's the results I got when I pushed the time down significantly from
the defaults, with some of the earlier results for comparison:

info | set | tps | cleaner_pct
-----------------------------------------------+-----+------+-------------
jit multiplier=2.0 scan_whole=120s delay=200ms| 17 | 981 | 99.98
jit multiplier=1.0 scan_whole=120s delay=200ms| 18 | 970 | 99.99

jit multiplier=1.0 scan_whole=120s delay=20ms | 20 | 956 | 92.34
jit multiplier=2.0 scan_whole=120s delay=20ms | 21 | 967 | 99.94

jit multiplier=1.5 scan_whole=120s delay=10ms | 22 | 944 | 97.91
jit multiplier=2.0 scan_whole=120s delay=10ms | 23 | 981 | 99.7

It seems I have to push the multiplier higher to get good results when
using a much lower interval, which was expected, but the fundamentals all
scale down to running much faster the way I'd hoped.

I'm tempted to make the default 10ms, adjust some of the other constants
just a bit to optimize better for that time scale: make the default
multiplier 2.0, increase the weighted average sample period, and perhaps
reduce scan_whole a bit because that's barely doing anything at 10ms. If
no one discovers any problems with working that way during beta, then
consider locking them in for the RC. That would leave just the multiplier
and maxpages as the exposed tunables, and it's very easy to tune maxpages
just by watching pg_stat_bgwriter. This would obviously be a very
aggressive plan--it would be eliminating GUCs and reducing flexibility for
people in the field, aiming instead at making this more automatic for the
average case.

If anyone has a reason why they feel the bgwriter_delay needs to be a
tunable or why the rate might need to run even faster than 10ms, now would
be a good time to say why.

--
* Greg Smith gsmith(at)gregsmith(dot)com http://www.gregsmith.com Baltimore, MD


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Greg Smith <gsmith(at)gregsmith(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Just-in-time Background Writer Patch+Test Results
Date: 2007-09-08 18:17:28
Message-ID: 26174.1189275448@sss.pgh.pa.us
Lists: pgsql-hackers

Greg Smith <gsmith(at)gregsmith(dot)com> writes:
> If anyone has a reason why they feel the bgwriter_delay needs to be a
> tunable or why the rate might need to run even faster than 10ms, now would
> be a good time to say why.

You'd be hard-wiring the thing to wake up 100 times per second? Doesn't
sound like a good plan from here. Keep in mind that not everyone wants
their machine to be dedicated to Postgres, and some people even would
like their CPU to go to sleep now and again.

I've already gotten flak about the current default of 200ms:
https://bugzilla.redhat.com/show_bug.cgi?id=252129
I can't imagine that folk with those types of goals will tolerate
an un-tunable 10ms cycle.

In fact, given the numbers you show here, I'd say you should leave the
default cycle time at 200ms. The 10ms value is eating way more CPU and
producing absolutely no measured benefit relative to 200ms...

regards, tom lane


From: Greg Smith <gsmith(at)gregsmith(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Just-in-time Background Writer Patch+Test Results
Date: 2007-09-08 19:01:27
Message-ID: Pine.GSO.4.64.0709081449431.2440@westnet.com
Lists: pgsql-hackers

On Sat, 8 Sep 2007, Tom Lane wrote:

> I've already gotten flak about the current default of 200ms:
> https://bugzilla.redhat.com/show_bug.cgi?id=252129
> I can't imagine that folk with those types of goals will tolerate an
> un-tunable 10ms cycle.

That's the counter-example I was looking for as to why lowering the default
is unacceptable. Scratch bgwriter_delay off the list of things that might
be fixed to a specific value.

Will return to the drawing board to figure out a way to incorporate what
I've learned about running at 10ms into a tuning plan that still works
fine at 200ms or higher. The good news as far as I'm concerned is that I
haven't had to adjust the code so far, just tweak the existing knobs.

> In fact, given the numbers you show here, I'd say you should leave the
> default cycle time at 200ms. The 10ms value is eating way more CPU and
> producing absolutely no measured benefit relative to 200ms...

My server is a bit underpowered to run at 10ms and gain anything when
doing a stress test like this; I was content that it didn't degrade
performance significantly, that was the best I could hope for. I would
expect the class of systems that Simon and Heikki are working with could
show significant benefit from running the BGW that often.

--
* Greg Smith gsmith(at)gregsmith(dot)com http://www.gregsmith.com Baltimore, MD


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Greg Smith <gsmith(at)gregsmith(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Just-in-time Background Writer Patch+Test Results
Date: 2007-09-08 19:22:22
Message-ID: 27316.1189279342@sss.pgh.pa.us
Lists: pgsql-hackers

Greg Smith <gsmith(at)gregsmith(dot)com> writes:
> On Sat, 8 Sep 2007, Tom Lane wrote:
>> In fact, given the numbers you show here, I'd say you should leave the
>> default cycle time at 200ms. The 10ms value is eating way more CPU and
>> producing absolutely no measured benefit relative to 200ms...

> My server is a bit underpowered to run at 10ms and gain anything when
> doing a stress test like this; I was content that it didn't degrade
> performance significantly, that was the best I could hope for. I would
> expect the class of systems that Simon and Heikki are working with could
> show significant benefit from running the BGW that often.

Quite possibly. So it sounds like we still need to expose
bgwriter_delay as a tunable.

It might be interesting to consider making the delay auto-tune: if you
wake up and find nothing (much) to do, sleep longer the next time,
conversely shorten the delay when work picks up. Something for 8.4,
though, at this point.

regards, tom lane


From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: "Greg Smith" <gsmith(at)gregsmith(dot)com>
Cc: <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Just-in-time Background Writer Patch+Test Results
Date: 2007-09-08 19:34:00
Message-ID: 87abrxx87r.fsf@oxford.xeocode.com
Lists: pgsql-hackers

"Greg Smith" <gsmith(at)gregsmith(dot)com> writes:

> On Sat, 8 Sep 2007, Tom Lane wrote:
>
>> I've already gotten flak about the current default of 200ms:
>> https://bugzilla.redhat.com/show_bug.cgi?id=252129
>> I can't imagine that folk with those types of goals will tolerate an
>> un-tunable 10ms cycle.
>
> That's the counter-example I was looking for as to why lowering the default is
> unacceptable. Scratch bgwriter_delay off the list of things that might be fixed
> to a specific value.

Ok, time for the obligatory contrarian voice here. It's all well and good to
aim to eliminate GUC variables but I don't think it's productive to do so by
simply hard-wiring them.

Firstly that doesn't really make life any easier than simply finding good
defaults and documenting that DBAs probably shouldn't be bothering to tweak
them.

Secondly it's unlikely to work. The variables under consideration may have
reasonable defaults but they're not likely to have defaults that will work in every
case. This example is pretty typical. There aren't many variables that will
have a reasonable default which will work for both an interactive desktop
where Postgres is running in the background and Sun's 1000+ process
benchmarks.

What I think is more likely to work is looking for ways to make these
variables auto-tuning. That eliminates the knob not by just hiding it away and
declaring it doesn't exist but by architecting the system so that there really
is no knob that might need tweaking.

Perhaps what would work better here is having a semaphore which bgwriter
sleeps on which backends wake up whenever the clock sweep hand completes a
cycle. Or gets within a certain fraction of a cycle of catching up.

Or perhaps bgwriter shouldn't be adjusting the number of pages it processes at
all and instead it should only be adjusting the sleep time. So it would always
process a full cycle for example but adjust the sleep time based on what
percentage of the cycle the backends used up in the last sleep time.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com


From: Greg Smith <gsmith(at)gregsmith(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Just-in-time Background Writer Patch+Test Results
Date: 2007-09-08 20:15:34
Message-ID: Pine.GSO.4.64.0709081525260.2440@westnet.com
Lists: pgsql-hackers

On Sat, 8 Sep 2007, Tom Lane wrote:

> It might be interesting to consider making the delay auto-tune: if you
> wake up and find nothing (much) to do, sleep longer the next time,
> conversely shorten the delay when work picks up. Something for 8.4,
> though, at this point.

I have a couple of pages of notes on how to tune the delay automatically.
The tricky part is applications that go from 0 to full speed with little
warning; the first few seconds of the stock market open come to mind.
What I was working toward was considering what you set the delay to as a
steady-state value, and then the delay cranks downward as activity levels
go up. As activity dies off, it slowly returns to the default again.
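
As a minimal sketch of that shape of behavior (purely illustrative, nothing
like this is in the current patch):

    /* Sketch of an auto-tuning delay: crank the sleep down while there is
     * work to do, let it drift back toward the configured steady-state
     * value as things go idle.  Illustrative only. */
    int
    next_delay_ms(int current_delay_ms, int buffers_written,
                  int min_delay_ms, int steady_state_delay_ms)
    {
        if (buffers_written > 0)
            current_delay_ms /= 2;          /* activity: wake up more often */
        else
            current_delay_ms += 10;         /* idle: drift back up slowly */

        if (current_delay_ms < min_delay_ms)
            current_delay_ms = min_delay_ms;
        if (current_delay_ms > steady_state_delay_ms)
            current_delay_ms = steady_state_delay_ms;
        return current_delay_ms;
    }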

But I realized that I needed to get all this other stuff working, all the
statistics counters exposed usefully, and then collect a lot more data
before I could implement that plan. Definitely something that might fit
into 8.4, completely impossible for 8.3.

--
* Greg Smith gsmith(at)gregsmith(dot)com http://www.gregsmith.com Baltimore, MD


From: Greg Smith <gsmith(at)gregsmith(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Just-in-time Background Writer Patch+Test Results
Date: 2007-09-08 20:26:15
Message-ID: Pine.GSO.4.64.0709081501480.2440@westnet.com
Lists: pgsql-hackers

On Thu, 6 Sep 2007, Decibel! wrote:

> I don't know that there should be a direct correlation, but ISTM that
> scan_whole_pool_seconds should take checkpoint intervals into account
> somehow.

Any direct correlation is weak at this point. The LRU cleaner has a small
impact on checkpoints, in that it's writing out buffers that may make the
checkpoint quicker. But this particular write trickling mechanism is not
aimed directly at flushing the whole pool; it's more about smoothing out
idle periods a bit.

Also, computing the checkpoint interval is itself tricky. Heikki had to
put some work into getting something that took into account both the
timeout and segments mechanisms to gauge progress, and I'm not sure I can
directly re-use that because it's really only doing that while the
checkpoint is active. I'm not saying it's a bad idea to have the expected
interval as an input to the model, just that it's not obvious to me how to
do it and whether it would really help.

> I like the idea of not having that as a GUC, but I'm doubtful that it
> can be hard-coded like that. What if checkpoint_timeout is set to 120?
> Or 60? Or 2000?

Someone using 60 or 120 has checkpoint problems way bigger than the LRU
cleaner can be expected to help with. How fast the reusable buffers it
can write are pushed out is the least of their problems. Also, I'd expect
that the only cases using such a low value for a good reason are doing so
because they have enormous amounts of activity on their system, and in
that case the primary JIT mechanism should dominate how the LRU cleaner
treats them. scan_whole_pool_seconds doesn't do anything if the primary
mechanism was already planning to scan more buffers than it aims for.

Someone who has very infrequent checkpoints and therefore low activity,
like your 2000 case, can expect that the LRU cleaner will lap and catch up
to the strategy point about 2 minutes after any activity and then follow
directly behind it with the way I've set this up. If that's cleaning the
buffer cache too aggressively, I think those in that situation would be
better served by constraining the maxpages parameter; that's directly
adjusting what I'd expect their real issue is, how fast pages can flush to
disk, rather than the secondary one of how fast the pool is being scanned.

I picked 2 minutes for that value because it's as slow as I can make it
and still serve its purpose, while not feeling to me like it's too fast
for a relatively idle system even if someone set maxpages=1000.

--
* Greg Smith gsmith(at)gregsmith(dot)com http://www.gregsmith.com Baltimore, MD


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Greg Smith <gsmith(at)gregsmith(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Just-in-time Background Writer Patch+Test Results
Date: 2007-09-09 01:53:37
Message-ID: 20070909015337.GD6331@alvh.no-ip.org
Lists: pgsql-hackers

Greg Smith wrote:
> On Sat, 8 Sep 2007, Tom Lane wrote:
>
>> It might be interesting to consider making the delay auto-tune: if you
>> wake up and find nothing (much) to do, sleep longer the next time,
>> conversely shorten the delay when work picks up. Something for 8.4,
>> though, at this point.
>
> I have a couple of pages of notes on how to tune the delay automatically.
> The tricky part are applications that go from 0 to full speed with little
> warning; the first few seconds of the stock market open come to mind.

Maybe have the backends send a signal to bgwriter when they see it
sleeping and are overwhelmed by work. That way, bgwriter can sleep for
a few seconds, safe in the knowledge that somebody else will wake it up
if needed sooner. Backends would detect that bgwriter is sleeping via
an atomic flag bgwriter keeps in shared memory, which it sets only when
it's going to sleep for a long time (so if it's going to sleep for,
say, 100ms or less, it doesn't set the flag and the backends won't
signal it). To avoid a huge number of signals when all backends
suddenly start working at the same instant, have the signal itself be
sent only by the first backend that manages to LWLockConditionalAcquire
an lwlock used solely for that purpose.
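
In pseudo-C the protocol would look roughly like this; C11 atomics
stand in for the shared memory flag, the signal, and
LWLockConditionalAcquire, and none of the names are real backend code:

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

/* set by bgwriter before any nap long enough to be worth interrupting */
static atomic_bool bgwriter_sleeping_long = false;

/* stands in for the single-purpose lwlock that rate-limits the wakeup */
static atomic_flag wakeup_in_progress = ATOMIC_FLAG_INIT;

static void
bgwriter_about_to_sleep(int sleep_ms)
{
    if (sleep_ms > 100)
        atomic_store(&bgwriter_sleeping_long, true);
}

static void
bgwriter_woke_up(void)
{
    atomic_store(&bgwriter_sleeping_long, false);
}

/* called by a backend that finds itself writing too many buffers */
static void
backend_request_wakeup(void)
{
    if (!atomic_load(&bgwriter_sleeping_long))
        return;                 /* bgwriter is already active enough */

    if (!atomic_flag_test_and_set(&wakeup_in_progress))
    {
        printf("this backend sends the wakeup signal\n");
        /* real code would signal the bgwriter process here */
        atomic_flag_clear(&wakeup_in_progress);
    }
    /* backends that lose the race do nothing; one signal is enough */
}

int
main(void)
{
    bgwriter_about_to_sleep(2000);
    backend_request_wakeup();   /* sends the wakeup */
    bgwriter_woke_up();
    backend_request_wakeup();   /* no-op: flag already cleared */
    return 0;
}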

--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


From: Greg Smith <gsmith(at)gregsmith(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Just-in-time Background Writer Patch+Test Results
Date: 2007-09-18 04:37:47
Message-ID: Pine.GSO.4.64.0709172352050.4502@westnet.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sat, 8 Sep 2007, Greg Smith wrote:

> Here's the results I got when I pushed the time down significantly from the
> defaults
> info | set | tps | cleaner_pct
> -----------------------------------------------+-----+------+-------------
> jit multiplier=1.0 scan_whole=120s delay=20ms | 20 | 956 | 92.34
> jit multiplier=2.0 scan_whole=120s delay=20ms | 21 | 967 | 99.94
>
> jit multiplier=1.5 scan_whole=120s delay=10ms | 22 | 944 | 97.91
> jit multiplier=2.0 scan_whole=120s delay=10ms | 23 | 981 | 99.7
> It seems I have to push the multiplier higher to get good results when using
> a much lower interval

Since I'm not exactly overwhelmed processing field reports, I've continued
this line of investigation myself...increasing the multiplier to 3.0 got
me another nine on the buffers written by the LRU BGW without a
significant change in performance:

info | set | tps | cleaner_pct
-----------------------------------------------+-----+------+-------------
jit multiplier=3.0 scan_whole=120s delay=10ms | 24 | 967 | 99.95

After thinking for a bit about why the 10ms case wasn't working so well
without a big multiplier, I considered that the default moving-average
smoothing leaves the sample period covering such a short span of time
(10ms * 16 = 160ms) that it's unlikely to cover a typical pause one
might want to smooth over. My initial thinking was to increase the
smoothing period so that it covers a similar length of time to the
default case even when the interval goes down, but that didn't really
improve anything (note that the smoothing=16 case here is the default
setup with just the delay at 10ms, which was a missing piece of data
from the above as well--above I only tested 10ms with larger
multipliers):

info | set | tps | cleaner_pct
----------------------------------------------+-----+------+-------------
jit multiplier=1.0 delay=10ms smoothing=16 | 27 | 982 | 89.4
jit multiplier=1.0 delay=10ms smoothing=64 | 26 | 946 | 89.55
jit multiplier=1.0 delay=10ms smoothing=320 | 25 | 970 | 89.53

What I realized is that because the buffer counts are rounded to
integers, dividing the small amount of activity from such a short
period by the smoothing constant usually truncated the smoothed value
to 0, so it wasn't doing much. This made me wonder how much the
weighted-average smoothing was really accomplishing even in the default
case. I put that code in months ago and hadn't looked recently at its
effectiveness. Here's a comparison:

info | set | tps | cleaner_pct
----------------------------------------------+-----+------+-------------
jit multiplier=1.0 delay=200ms smoothing=16 | 18 | 970 | 99.99
jit multiplier=1.0 delay=200ms smoothing=off | 28 | 957 | 97.16

All this data supports my suggestion that the exact value of the
smoothing period constant isn't really critical. Having that logic on
appears moderately helpful in some cases, and the default value doesn't
seem to hurt the cases where I'd expect it to be least effective.
Tuning the multiplier is much more powerful and useful than ever
touching this constant. I could probably even pull the smoothing logic
out altogether, at the cost of putting more of the burden of correctly
tuning the multiplier on the administrator. So far it looks more
reasonable to leave it as an untunable that helps the default
configuration, and I'll just add a documentation note that if you
decrease the interval you'll probably have to increase the multiplier.
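
To show why the rounding matters, here is a toy version of the sort of
integer-based smoothing I'm describing; it's my own simplification
rather than the patch code:

#include <stdio.h>

/*
 * Toy illustration: with delay=10ms the 16-sample window covers only
 * 160ms, and a cycle that saw just a few allocations contributes 0
 * after integer division, so the smoothed estimate barely moves.
 * Simplified; not the actual patch code.
 */
int
main(void)
{
    int smoothing_samples = 16;
    int smoothed_alloc = 0;

    /* ten consecutive cycles that each allocated 3 buffers */
    for (int cycle = 0; cycle < 10; cycle++)
    {
        int recent_alloc = 3;

        /* integer weighted average: (3 - 0) / 16 truncates to 0 */
        smoothed_alloc += (recent_alloc - smoothed_alloc) /
                          smoothing_samples;
        printf("cycle %d: smoothed_alloc = %d\n", cycle, smoothed_alloc);
    }
    return 0;
}

Every cycle prints smoothed_alloc = 0, which is exactly the "dropping
to 0 and not doing much" behavior mentioned above.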

After going through this, the extra data gives more useful baselines to do
a similar sensitivity analysis of the other item that's untunable in the
current patch:

float scan_whole_pool_seconds = 120.0;

But I'll be travelling for the next week and won't have time to look into
that myself until I get back.

--
* Greg Smith gsmith(at)gregsmith(dot)com http://www.gregsmith.com Baltimore, MD


From: Greg Smith <gsmith(at)gregsmith(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Just-in-time Background Writer Patch+Test Results
Date: 2007-09-18 15:51:12
Message-ID: Pine.GSO.4.64.0709181105470.27154@westnet.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

It was suggested to me today that I should clarify how others can test
this patch themselves by writing a sort of performance reviewer's
guide; that information has so far been scattered among material
covering development. That's what you'll find below. Let me know if
any of it seems confusing and I'll try to clarify. I'll be checking my
mail and responding intermittently while I'm away; I just won't be able
to run any tests myself until next week.

The latest version of the background writer code that I've been reporting
on is attached to the first message in this thread:

http://archives.postgresql.org/pgsql-hackers/2007-09/msg00214.php

I haven't found any reason so far to update that code; the exposed
tunables still appear sufficient for all the situations I've
encountered.

Track Buffer Allocations and Cleaner Efficiency
-----------------------------------------------

First you apply the patch inside buf-alloc-2.patch.gz, which adds several
entries to pg_stat_bgwriter; it applied cleanly to HEAD at the point when
I generated it. I'd suggest testing that one to collect baseline
information with the current background writer, and to confirm that the
overhead of tracking the buffer allocations by itself doesn't cause a
performance hit, before applying the second patch. I keep two clusters
configured on the same port, one with just buf-alloc-2 and one with
both patches, with only one active at a time, so that I can make such
comparisons.
You'll need to run initdb to create a database with the new stats in it
after applying the patch.

What I've been doing to test the effectiveness of any LRU background
writer method using this patch is to take a before/after snapshot of
pg_stat_bgwriter. Then I compute the delta over the test run in order
to figure out what percentage of buffers were written by the background
writer vs. the client backends; that's the number I'm reporting as
cleaner_pct in my tests. Here is an example of how to compute that
against all activity recorded in pg_stat_bgwriter:

select round(buffers_clean * 10000 / (buffers_backend + buffers_clean)) /
100 as cleaner_pct from pg_stat_bgwriter;
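
If you'd rather compute that over just the test window, one approach
(mine; the scratch table name here is made up and isn't part of the
patch or pgbench-tools) is to snapshot the counters before the run and
subtract afterwards from the same session:

-- snapshot taken just before starting the test
CREATE TEMP TABLE bgw_snap AS
    SELECT buffers_clean, buffers_backend FROM pg_stat_bgwriter;

-- ... run the pgbench test ...

-- delta-based cleaner_pct; will divide by zero if nothing was written
SELECT round((b.buffers_clean - s.buffers_clean) * 10000 /
             ((b.buffers_backend - s.buffers_backend) +
              (b.buffers_clean - s.buffers_clean))) / 100 AS cleaner_pct
  FROM pg_stat_bgwriter b, bgw_snap s;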

You should also monitor maxwritten_clean to make sure you've set
bgwriter_lru_maxpages high enough that it's not limiting writes. You
can always turn the background writer off by setting maxpages to 0
(after applying the patch below, that's the only way to do so).

For reference, the exact code I'm using to save the deltas and compute
everything is available within pgbench-tools-0.2 at
http://www.westnet.com/~gsmith/content/postgresql/pgbench-tools.htm

The code inside the benchwarmer script uses a table called test_bgwriter
(schema in init/resultdb.sql), populates it before the test, then computes
the delta afterwards. bufsummary.sql generates the results I've been
putting in my messages. I assume there's a cleaner way to compute just
these numbers by resetting the statistics before the test instead, but
that didn't fit into what I was working towards.

New Background Writer Logic
---------------------------

The second patch in jit-cleaner.patch.gz applies on top of buf-alloc-2.
It modifies the LRU background writer with the just-in-time logic as I
described in the message the patches were attached to. The main tunable
there is bgwriter_lru_multiplier, which replaces bgwriter_lru_percent.
The effective range seems to be 1.0 to 3.0. You can take an existing 8.3
postgresql.conf, rename bgwriter_lru_percent to bgwriter_lru_multiplier,
adjust the value to be in the right range, and then it will work with this
patched version.
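
As a mental model of what the multiplier does (my own simplified
restatement, with made-up names, not the exact patch code): each round
the cleaner takes a smoothed estimate of recent buffer allocations,
scales it by the multiplier to predict upcoming demand, and writes
however many buffers are needed to keep that many clean ones ahead of
the strategy point, capped by bgwriter_lru_maxpages.

#include <stdio.h>

/* Simplified model of the just-in-time write target; illustrative only. */
static int
jit_cleaning_target(double smoothed_alloc,   /* smoothed buffers/round */
                    double lru_multiplier,   /* bgwriter_lru_multiplier */
                    int reusable_ahead,      /* clean buffers already ahead */
                    int lru_maxpages)        /* bgwriter_lru_maxpages cap */
{
    int upcoming = (int) (smoothed_alloc * lru_multiplier + 0.5);
    int to_write = upcoming - reusable_ahead;

    if (to_write < 0)
        to_write = 0;
    if (to_write > lru_maxpages)
        to_write = lru_maxpages;
    return to_write;
}

int
main(void)
{
    /* backends have been allocating ~200 buffers per round */
    printf("write %d buffers this round\n",
           jit_cleaning_target(200.0, 2.0, 150, 500));  /* 250 */
    printf("write %d buffers this round\n",
           jit_cleaning_target(200.0, 2.0, 600, 500));  /* 0, already ahead */
    return 0;
}

Tuning bgwriter_lru_multiplier up or down simply scales how far ahead
of the backends the cleaner tries to stay.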

For comparing the patched vs. original BGW behavior, I've taken to keeping
definitions for both variables in a common postgresql.conf, and then I
just comment/uncomment the one I need based on which version I'm running:

bgwriter_lru_multiplier = 1.0
#bgwriter_lru_percent = 5

The main thing I've noticed so far is that as you decrease bgwriter_delay
from the default of 200ms, the multiplier has needed to be larger to
maintain the same cleaner percentage in my tests.

--
* Greg Smith gsmith(at)gregsmith(dot)com http://www.gregsmith.com Baltimore, MD


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Greg Smith <gsmith(at)gregsmith(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Just-in-time Background Writer Patch+Test Results
Date: 2007-09-25 20:14:57
Message-ID: 26266.1190751297@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Greg Smith <gsmith(at)gregsmith(dot)com> writes:
> Tom gets credit for naming the attached patch, which is my latest attempt to
> finalize what has been called the "Automatic adjustment of
> bgwriter_lru_maxpages" patch for 8.3; that's not what it does anymore but
> that's where it started.

I've applied this patch with some revisions.

> -The way I'm getting the passes number back from the freelist.c
> strategy code seems like it will eventually overflow

Yup ... I rewrote that. I also revised the collection of backend-write
count events, which didn't seem to me to be something the freelist.c
code should have anything to do with. It turns out that we can count
them with essentially no overhead by attaching the counter to
the existing fsync-request reporting machinery.

> -Heikki didn't like the way I pass information back from SyncOneBuffer
> back to the background writer.

I didn't either --- it was too complicated and not actually doing
anything useful. I simplified it down to the two bits that were being
used. We can always add more as needed, but since this routine isn't
even exported, I see no need to make it do more than the known callers
need it to do.

I did some marginal tweaking to the way you were doing the moving
averages --- in particular, use a float to avoid strange roundoff
behavior and force the smoothed_alloc average up when a new peak
occurs, instead of only letting it affect the behavior for one
cycle.
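
The revised update rule is roughly of this shape (a simplified sketch
of the idea, not the committed code; see bufmgr.c for the real thing):

#include <stdio.h>

/* Float-based moving average that snaps up to new allocation peaks. */
static float smoothed_alloc = 0.0f;

static void
update_smoothed_alloc(int recent_alloc, int smoothing_samples)
{
    if ((float) recent_alloc >= smoothed_alloc)
        smoothed_alloc = (float) recent_alloc;  /* track peaks immediately */
    else
        smoothed_alloc += ((float) recent_alloc - smoothed_alloc) /
                          smoothing_samples;    /* otherwise decay slowly */
}

int
main(void)
{
    int history[] = {0, 0, 400, 10, 10, 10};

    for (int i = 0; i < 6; i++)
    {
        update_smoothed_alloc(history[i], 16);
        printf("recent=%3d smoothed=%.1f\n", history[i], smoothed_alloc);
    }
    return 0;
}

The float keeps small per-cycle contributions from truncating to zero,
and jumping straight to a new peak lets the estimate stay high for a
while after a burst rather than influencing only a single cycle.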

Also, I set the default value of bgwriter_lru_multiplier to 2.0,
as 1.0 seemed to be leaving too many writes to the backends in my
testing. That's something we can play with during beta when we'll
have more testing resources available.

I did some other cleanup in BgBufferSync too, like trying to reduce
the chattiness of the debug output, but I don't believe I made any
fundamental change in your algorithm.

Nice work --- thanks for seeing it through!

regards, tom lane


From: Greg Smith <gsmith(at)gregsmith(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Just-in-time Background Writer Patch+Test Results
Date: 2007-09-25 22:41:36
Message-ID: Pine.GSO.4.64.0709251823130.2193@westnet.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, 25 Sep 2007, Tom Lane wrote:

>> -Heikki didn't like the way I pass information back from SyncOneBuffer
>> back to the background writer.
> I didn't either --- it was too complicated and not actually doing
> anything useful.

I suspect someone (possibly me) may want to put back some of that same
additional complication in the future, but I'm fine with it not being
there yet. The main thing I wanted accomplished was changing the return
to a bitmask of some sort and that's there now; adding more data to that
interface later is at least easier now.

> Also, I set the default value of bgwriter_lru_multiplier to 2.0,
> as 1.0 seemed to be leaving too many writes to the backends in my
> testing.

The data I've collected since originally submitting the patch agrees that
2.0 is probably a better default as well.

I should have time to take an initial stab this week at updating the
documentation to reflect what's now been committed, and to see how this
stacks on top of HOT when running pgbench on my test system.

--
* Greg Smith gsmith(at)gregsmith(dot)com http://www.gregsmith.com Baltimore, MD