Just-in-time Background Writer Patch+Test Results

From: Greg Smith <gsmith(at)gregsmith(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Just-in-time Background Writer Patch+Test Results
Date: 2007-09-06 03:31:56
Message-ID: Pine.GSO.4.64.0709052324020.25284@westnet.com

Tom gets credit for naming the attached patch, which is my latest attempt to
finalize what has been called the "Automatic adjustment of
bgwriter_lru_maxpages" patch for 8.3; that's not what it does anymore but
that's where it started.

Background on testing
---------------------

I decided to use pgbench for running my tests. The scripting framework to
collect all that data and usefully summarize it is now available as
pgbench-tools-0.2 at
http://www.westnet.com/~gsmith/content/postgresql/pgbench-tools.htm

I hope to expand and actually document use of pgbench-tools in the future but
didn't want to hold the rest of this up on that work. That page includes basic
information about what my testing environment was and why I felt this was an
appropriate way to test background writer efficiency.

Quite a bit of raw data for all of the test sets summarized here is at
http://www.westnet.com/~gsmith/content/bgwriter/

The patches attached to this message are also available at:
http://www.westnet.com/~gsmith/content/postgresql/buf-alloc-2.patch
http://www.westnet.com/~gsmith/content/postgresql/jit-cleaner.patch
(This is my second attempt to send this message; I don't know why the
earlier one failed. I'm using gzip'd patches for this one, and hopefully
there won't be a dupe.)

Baseline test results
---------------------

The first patch to apply is the latest buf-alloc-2, attached to this message,
which adds counters to pgstat_bgwriter for everything the background writer is
doing. Here's what we get out of the standard 8.3 background writer before and
after applying that patch, at various settings:

info | set | tps | cleaner_pct
------------------------------------+-----+------+-------------
HEAD nobgwriter | 5 | 994 |
HEAD+buf-alloc-2 nobgwriter | 6 | 1012 | 0
HEAD+buf-alloc-2 LRU=0.5%/500 | 16 | 974 | 15.94
HEAD+buf-alloc-2 LRU=5%/500 | 19 | 983 | 98.47
HEAD+buf-alloc-2 LRU=10%/500 | 7 | 997 | 99.95

cleaner_pct is the percentage of writes done by the BGW LRU cleaner relative
to a total that includes the client backend writes. Writes done by checkpoints
are not included in this summary computation; it just shows the balance of
backend vs. BGW writes.
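
In terms of the new counters, the computation is simply:

cleaner_pct = 100 * buffers_clean / (buffers_clean + buffers_backend)

(using buffers_clean and buffers_backend as placeholder names for the LRU
cleaner and backend write counters).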

The /500 means bgwriter_lru_maxpages=500, which I already knew was about as
many pages as this server ever dirties in a 200ms cycle. Without the
buf-alloc-2 patch I don't get statistics on the LRU cleaner; I include that
number as a baseline just to suggest that the buf-alloc-2 patch itself isn't
pulling down results.

Here we see that in order to get most of the writes to happen via the LRU
cleaner rather than having the backends handle them, you'd need to play with
the settings until the bgwriter_lru_percent was somewhere between 5% and 10%,
and it seems obvious that doing this doesn't improve the TPS results. The
margin of error here is big enough that I consider all these basically the same
performance. The question then is how to get this high level of writes by the
background writer automatically, without having to know what percentage to
scan; I wanted to remove bgwriter_lru_percent, while still keeping
bgwriter_lru_maxpages strictly as a way to throttle overall BGW activity.

First JIT Implementation
------------------------

The method I described in my last message on this topic (
http://archives.postgresql.org/pgsql-hackers/2007-08/msg00887.php ) implemented
a weighted moving average of how many pages were allocated, and based on
feedback from that I improved the code to allow a multiplier factor on top of
that. Here's the summary of those results:

info | set | tps | cleaner_pct
------------------------------------+-----+------+-------------
jit cleaner multiplier=1.0/500 | 9 | 981 | 94.3
jit cleaner multiplier=2.0/500 | 8 | 1005 | 99.78
jit multiplier=1.0/100 | 10 | 985 | 68.14

That's pretty good. As long as maxpages is set intelligently, it gets most of
the writes even with the multiplier of 1.0, and cranking it up to the 2.0
suggested by the original Itagaki Takahiro patch gets nearly all of them.
Again, there's really no change in throughput from any of this.
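
To make the moving average concrete, the core of the idea reduces to
something like this sketch (illustrative names, not necessarily what's in
the patch):

/*
 * Fold the latest per-cycle allocation count into a smoothed estimate,
 * weighting the newest sample at 1/smoothing_samples, then scale by the
 * multiplier to decide how many clean buffers to aim for next cycle.
 */
static float smoothed_alloc = 0.0;
static const int smoothing_samples = 16;

static int
estimate_upcoming_allocs(int recent_alloc, float multiplier)
{
    smoothed_alloc += ((float) recent_alloc - smoothed_alloc) /
        smoothing_samples;
    return (int) (smoothed_alloc * multiplier);
}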

Coping with idle periods
------------------------

While I was basically happy with these results, the data Kevin Grittner
submitted in response to my last call for commentary left me concerned. While
the JIT approach works fine as long as your system is active, it does
absolutely nothing if the system is idle. I noticed that a lot of the writes
that were being done by the client backends were after idle periods where the
JIT writer just didn't react fast enough during the ramp-up. For example, if
the system went from idle for a while to full-speed just as the 200ms sleep
started, by the time the BGW woke up again the backends could already have
needed to write many buffers themselves.

Ideally, idle periods should be used to slowly trickle dirty pages out, so that
there are fewer of them hanging around when a checkpoint shows up, or so that
reusable pages are already available. The question then is how fast to go about
that trickle. Heikki's background writer tests and my own suggest that if you
make the rate during quiet periods too high, you'll clog the underlying buffers
with some writes that end up being duplicated and lower overall efficiency.
But all of those tests had the background writer going at a constant and
relatively high speed.

I wanted to keep the ability to scan the entire buffer cache, using the latest
idea of never looking at the same buffer twice, but to do that slowly when idle
and using the JIT rate otherwise. This is sort of a hybrid of the old LRU
cleaner behavior (scan a fixed %) at a low speed with the new approach (scan
based on allocations, however many of them there are). I started with the old
default of 0.5% used by bgwriter_lru_percent (a tunable already removed by the
patch at this point) with logic to tack that onto the JIT intelligently and got
these results:

info | set | tps | cleaner_pct
------------------------------------+-----+------+-------------
jit multiplier=1.0 min scan=0.5% | 13 | 882 | 100
jit multiplier=1.5 min scan=0.5% | 12 | 871 | 100
jit multiplier=2.0 min scan=0.5% | 11 | 910 | 100
jit multiplier=1.0 min scan=0.25% | 14 | 982 | 98.34

It's nice to see fully 100% of the buffers written by the cleaner with the
hybrid approach; I feel that validates my idea that just a bit more work needs
to be done during idle periods to completely fix the issue with it not reacting
fast enough during the idle/full speed transition. But look at the drop in
TPS. While I'm willing to say a couple of percent change isn't significant in
a pgbench result, those <900 results are clearly bad. This is crossing that
line where inefficient writes are being done. I'm happier with the result
using the smaller min scan=0.25% even though it doesn't quite get every write
that way.
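
Structurally the hybrid is just a floor underneath the JIT estimate,
something like this (again a sketch with invented names, reusing the
estimate function sketched earlier; Max() is the backend's usual macro):

/*
 * Scan at least the fixed minimum fraction of the pool each cycle, or
 * however many buffers the JIT estimate calls for, whichever is larger.
 */
int jit_target = estimate_upcoming_allocs(recent_alloc, multiplier);
int min_target = (int) (NBuffers * (min_scan_pct / 100.0));
int scan_target = Max(jit_target, min_target);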

Making percentage independent of delay
--------------------------------------

But a new problem here is that if you lower bgwriter_delay, the minimum scan
percentage needs to drop too, and my goal was to reduce the number of tunables
people need to tinker with. Assuming you're not stopped by the maxpages
parameter, with the default delay=200ms a scan that hits 0.5% each time will
scan 5*0.5%=2.5% of the buffer cache per second, which means it will take 40
seconds to scan the entire pool. Using 0.25% means 80 seconds per full scan.
I improved the overall algorithm a bit and decided to set this parameter in an
alternate way: by how long it should take to creep through the entire buffer
cache if the JIT code is idle. I decided I liked 120 seconds as the value for
that parameter, which is a slower rate than any of the above but still a
reasonable one for a typical application. Here's what the results look like
using that approach:

info | set | tps | cleaner_pct
------------------------------------+-----+------+-------------
jit multiplier=1.0 scan_whole=120s | 18 | 970 | 99.99
jit multiplier=1.5 scan_whole=120s | 15 | 995 | 99.93
jit multiplier=2.0 scan_whole=120s | 17 | 981 | 99.98

Now here are results I'm happy with. The TPS results are almost unchanged from
where we started, with minimal inefficient writes, but almost all the
writes are being done by the cleaner process. The results appear much less
sensitive to what you set the multiplier to. And unless you use an unreasonably
low value for maxpages (which will quickly become obvious if you monitor
pg_stat_bgwriter and look for maxwritten_clean increasing fast), you'll get a
complete scan of the buffer cache within 2 minutes even if there's no system
activity. But once that's done, until more buffers are allocated the code
won't even look at the buffer cache again (as opposed to the current code,
which is always looking at buffers and acquiring locks even if nothing is going
on).
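
The conversion from scan_whole_pool_seconds into a per-cycle scan floor is
simple arithmetic against the delay; roughly (sketch only, not the patch's
exact code):

/*
 * How many cleaner cycles fit into the target scan time, and therefore
 * how many buffers must be covered per cycle to creep through the whole
 * pool in scan_whole_pool_seconds while otherwise idle.  At the default
 * delay=200ms and 120s this works out to NBuffers/600 per cycle.
 */
int cycles_per_pool_scan = (int) (scan_whole_pool_seconds * 1000.0 /
    BgWriterDelay);
int min_scan_per_cycle = NBuffers / cycles_per_pool_scan;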

I think I can safely say there is a level of intelligence going into what the
LRU background writer does with this patch that has never been applied to this
problem before. There have been a lot of good ideas thrown out in this area,
but it took a hybrid approach that included and carefully balanced all of them
to actually get results that I felt were usable. What I don't know is whether
that will also be true for other testers.

Patch review
------------

The attached jit-cleaner.patch implements this approach, and if you just want
to look at the main code involved without having to apply the patch you can
browse the BgBufferSync function in bufmgr.c starting around line 1120 at
http://www.westnet.com/~gsmith/content/postgresql/bufmgr.c

There is lots of debugging of internals dumped into the logs if you toggle on
#define BGW_DEBUG. The gross summary of the two most important things that
show what the code is doing is logged at DEBUG1 (but should probably be pushed
lower before committing).

This code is as good as you're going to get from me before the 8.3 close. I
could do some small rewriting and certainly can document all this further as
part of getting this patch moved toward commit, but I'm out of resources to
do too much more here. Along with the big question of whether this whole idea
is worth following at all as part of 8.3, here are the remaining small
questions I feel review feedback would be valuable on related to my specific
code:

-The way I'm getting the passes number back from the freelist.c strategy code
seems like it will eventually overflow the long I'm using for the intermediate
results when I execute statements like this:

strategy_position=(long)strategy_passes * NBuffers + strategy_buf_id;

I'm not sure if the code would be better if I were to use a 64-bit integer for
strategy_position instead, or if I should just rewrite the code to separate out
the passes multiplication--which will make it less elegant to read but should
make overflow issues go away.
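
For comparison, the 64-bit version would look something like this (uint64
being the usual in-tree typedef; untested sketch):

/* Widening the intermediate makes wraparound a non-issue for any
 * plausible NBuffers and pass count. */
uint64 strategy_position = (uint64) strategy_passes * NBuffers +
    strategy_buf_id;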

-Heikki didn't like the way I pass information back from SyncOneBuffer to
the background writer. The bitmask approach I'm using adds flexibility for
writing more intelligent background writers in the future. In the past I have
written more complicated ones than any of the approaches mentioned here, using
things like the usage_count information returned, but the simpler
implementation here ignores that. I could simplify this interface if I had
to, but I like what I've done as a solid structure for future coding as it's
written right now.
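
To show what I mean, the bitmask is along these lines (flag names
illustrative):

/*
 * Flags SyncOneBuffer can OR together in its return value, so the caller
 * learns both whether a write happened and whether the buffer was
 * reusable; more bits (usage_count details, say) can be added later
 * without changing the interface.
 */
#define BUF_WRITTEN   0x01
#define BUF_REUSABLE  0x02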

-There are two magic constants in the code:

int smoothing_samples = 16;
float scan_whole_pool_seconds = 120.0;

I believe I've done enough testing recently and in the past to say these are
reasonable numbers for most installations, and high-throughput systems are
going to care more about tuning the multiplier GUC than either of these. In
the interest of having fewer knobs people can fool with and break, I personally
don't feel like these constants need to be exposed for tuning purposes; they
don't have a significant impact on how the underlying model works. Determining
whether these should be exposed as GUC tunables is certainly an open question
though.

-I bumped the default for bgwriter_lru_maxpages to 100 so that typical low-end
systems should get a self-tuning LRU background writer out of the box
in 8.3. This is a big change from the 5 that was used in the older releases.
If you keep everything at the defaults this represents a maximum theoretical
write rate for the BGW of 4MB/s, which isn't very much relative to modern
hardware.
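
For anyone checking that math: 100 pages per round * 5 rounds/second (at
the default delay=200ms) * 8KB per page = 4000KB/s, or roughly 4MB/s.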

--
* Greg Smith gsmith(at)gregsmith(dot)com http://www.gregsmith.com Baltimore, MD

Attachment Content-Type Size
jit-cleaner.patch.gz application/octet-stream 6.5 KB
buf-alloc-2.patch.gz application/octet-stream 4.6 KB
