Count backend self-sync calls

From: Greg Smith <greg(at)2ndquadrant(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Count backend self-sync calls
Date: 2010-11-14 22:07:53
Message-ID: 4CE05DB9.9080309@2ndquadrant.com
Lists: pgsql-hackers

The attached patch adds a new field to pg_stat_bgwriter, counting the
number of times backends execute their own fsync calls. Normally, when
a backend needs to fsync data, it passes a request to the background
writer, which then absorbs the call into its own queue of work to do.
However, under some types of heavy system load, the associated queue can
fill. When this happens, backends are forced to do their own fsync
call. This is potentially much worse than when they do a regular write.
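
To make that failure mode concrete, here's a minimal standalone sketch of
the request-or-fallback pattern. It is not the actual md.c/bgwriter code;
the queue is just a plain array, and QUEUE_SIZE and the function names are
invented for the example:

/*
 * Toy model of the mechanism described above: try to hand the fsync off
 * to a bounded shared queue, and if that fails, do it yourself.  In a
 * real setup something else would drain pending_fds; here it only exists
 * to show the "queue full" case.
 */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>

#define QUEUE_SIZE 8            /* the real queue scales with shared_buffers */

static int pending_fds[QUEUE_SIZE];
static int pending_count = 0;

/* Stand-in for forwarding a request to the background writer. */
static int
forward_fsync_request(int fd)
{
    if (pending_count >= QUEUE_SIZE)
        return 0;               /* queue full; caller must sync on its own */
    pending_fds[pending_count++] = fd;
    return 1;
}

int
main(void)
{
    char        buf[8192] = {0};
    int         fd = open("scratch.dat", O_CREAT | O_WRONLY | O_TRUNC, 0600);

    if (fd < 0)
        return 1;
    if (write(fd, buf, sizeof(buf)) < 0)
        return 1;

    if (!forward_fsync_request(fd))
    {
        /* this is the case the new buffers_backend_sync counter tracks */
        fprintf(stderr, "unable to forward fsync request, executing directly\n");
        fsync(fd);
    }
    close(fd);
    return 0;
}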

The really nasty situation is when the background writer is busy because
it's executing a checkpoint. In that case, it's possible for the
backend fsync calls to start competing with the ones the background
writer is trying to get done, causing the checkpoint sync phase to
execute slower than it should. I've seen the sync phase take over 45
minutes on a really busy server once it got into this condition, where
hundreds of clients doing their own backend fsync calls were fighting
against the checkpoint fsync work. With this patch, you can observe
that happening as an upwards spike in
pg_stat_bgwriter.buffers_backend_sync, which as documented is an
inclusive subset of the total shown in buffers_backend.
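
If you want to watch for that spike as it happens, a quick polling loop is
enough. This is only a sketch, not part of the patch: it assumes a default
libpq connection, an arbitrary 10 second sampling interval, and of course
the new buffers_backend_sync column this patch adds:

/*
 * Sample pg_stat_bgwriter periodically so you can watch
 * buffers_backend_sync climb while a checkpoint sync phase runs.
 */
#include <stdio.h>
#include <unistd.h>
#include <libpq-fe.h>

int
main(void)
{
    PGconn     *conn = PQconnectdb("");     /* default connection parameters */
    int         i;

    if (PQstatus(conn) != CONNECTION_OK)
    {
        fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
        return 1;
    }

    for (i = 0; i < 60; i++)                /* roughly 10 minutes of samples */
    {
        PGresult   *res = PQexec(conn,
            "SELECT now(), buffers_backend, buffers_backend_sync "
            "FROM pg_stat_bgwriter");

        if (PQresultStatus(res) == PGRES_TUPLES_OK)
            printf("%s  buffers_backend=%s  buffers_backend_sync=%s\n",
                   PQgetvalue(res, 0, 0),
                   PQgetvalue(res, 0, 1),
                   PQgetvalue(res, 0, 2));
        PQclear(res);
        sleep(10);
    }

    PQfinish(conn);
    return 0;
}

Build it with something like "cc watch_sync.c -lpq" (plus whatever -I/-L
flags your libpq install needs); the file name is made up.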

While it takes a busier system than I can usefully simulate here to show
a really bad situation, I'm able to see some of these unabsorbed backend
fsync calls when initializing a pgbench database, which proves they do
happen in the lab. The attached test program takes a pgbench scale
factor as its input. It then creates a pgbench database
(deleting any existing pgbench database, so watch out for that) and
shows the values accumulated in pg_stat_bgwriter during that period.
Here's an example, using the script's default scale of 100 on a server
with 8GB of RAM and fake fsync (the hard drives are lying about it):

-[ RECORD 1 ]--------+-----------------------------
now | 2010-11-14 16:08:41.36421-05
...
Initializing pgbench
-[ RECORD 1 ]--------+------------------------------
now | 2010-11-14 16:09:46.713693-05
checkpoints_timed | 0
checkpoints_req | 0
buffers_checkpoint | 0
buffers_clean | 0
maxwritten_clean | 0
buffers_backend | 654716
buffers_backend_sync | 90
buffers_alloc | 803

This is with default sizing for memory structures. As you increase
shared_buffers, one of the queues involved here increases
proportionately, making it less likely to run into this problem. That
just changes it into the kind of problem I've only seen on a larger
system with a difficult-to-simulate workload. The production system
(running a web application) that was getting hammered with this problem,
and that prompted writing the patch, had shared_buffers=4GB at the time.
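
To put numbers on that proportionality: if I'm reading the current code
right, the request queue gets one slot per shared buffer, so the stock
32MB of shared_buffers works out to 32MB / 8kB = 4096 slots, while
shared_buffers=4GB gives 524288 of them. That 4096 is the same ceiling
that shows up in the "Absorbing 4096 fsync requests" log line below.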

The patch also adds some logging to the innards involved here, to help
with understanding problems in this area. I don't think that should be
in the version committed as-is; we may want to drop the logging level, or
disable it in regular builds, since it sits somewhere that generates a
lot of log data and adds overhead. It is nice for now, though, as it
lets you get an idea how much fsync work *is* being absorbed by the BGW,
as well as showing which relation is suffering from this issue.
Example of both those things, with the default config for everything
except log_checkpoints (on) and log_min_messages (debug1):

DEBUG: Absorbing 4096 fsync requests
DEBUG: Absorbing 150 fsync requests
DEBUG: Unable to forward fsync request, executing directly
CONTEXT: writing block 158638 of relation base/16385/16398

Here 4096 is the most entries the BGW will ever absorb at once, and all
90 of the missed sync calls are logged so you can see what files they
came from.

As a high-level commentary about this patch, I'm not sure what most end
users will ever do with this data. At the same time, I wasn't sure what
a typical end user would do with anything else in pg_stat_bgwriter
either when it was added, and it turns out the answer is "wait for
people who know the internals to write things that monitor it". For
example, Magnus has updated recent versions of the Munin plug-in for
PostgreSQL to usefully graph pg_stat_bgwriter data over time. As I have
some data suggesting that checkpoint problems on Linux in particular are
getting worse as total system memory increases, I expect that having an
easy way to instrument for this particular problem will become
correspondingly more important in the future too.

--
Greg Smith 2ndQuadrant US greg(at)2ndQuadrant(dot)com Baltimore, MD
PostgreSQL Training, Services and Support www.2ndQuadrant.us

Attachment Content-Type Size
backend-syncs-v2.patch text/x-patch 13.3 KB
fsync-stress-v2.sh application/x-sh 349 bytes
