Re: Improvement of checkpoint IO scheduler for stable transaction responses

From: Greg Smith <greg(at)2ndQuadrant(dot)com>
To: KONDO Mitsumasa <kondo(dot)mitsumasa(at)lab(dot)ntt(dot)co(dot)jp>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Improvement of checkpoint IO scheduler for stable transaction responses
Date: 2013-07-19 13:48:17
Message-ID: 51E943A1.9030702@2ndQuadrant.com
Lists: pgsql-hackers

On 7/19/13 3:53 AM, KONDO Mitsumasa wrote:
> Recently, users who think system availability is important use a
> synchronous replication cluster.

If your argument for why it's OK to ignore bounding crash recovery on
the master is that it's possible to fail over to a standby, I don't
think that is acceptable. PostgreSQL users certainly won't like it.

> I especially want you to read the points at lines 631, 651, and 656.
> MAX_WRITEBACK_PAGES is 1024 (1024 * 4096 bytes).

You should read http://www.westnet.com/~gsmith/content/linux-pdflush.htm
to see that everything you're telling me about the writeback code and its
congestion logic was already known to me back in 2007. The situation is
even worse than you describe, because this section of Linux has gone
through multiple major revisions since then. You can't just say "here is
the writeback source code"; you have to reference each of the commonly
deployed versions of the writeback feature to tell how this is going to
play out if released. There are four major ones I pay attention to: the
old kernel style as seen in RHEL5/2.6.18--that's what my 2007 paper
discussed--the similar code but with very different defaults in 2.6.22,
the writeback method/tuning in RHEL6/Debian Squeeze/2.6.32, and then the
newer kernels. (The newer ones split into a few branches too; I haven't
mapped those as carefully yet.)

If you tried to model your feature on Linux's approach here, what that
means is that the odds of an ugly feedback loop are even higher. You're
increasing the feedback on what's already a bad situation, one that
triggers trouble for people in the field. When Linux's congestion logic
causes checkpoint I/O spikes to get worse than they otherwise might be,
people panic because it seems like the checkpoints have stopped
altogether. There are examples of what really bad checkpoints look like
in
http://www.2ndquadrant.com/static/2quad/media/pdfs/talks/WriteStuff-PGCon2011.pdf
if you want to see some. That's the talk I did around the same time I
was trying out spreading the database fsync calls out over a longer
period.

When I did that, checkpoints became even less predictable, and that was
a major reason why I rejected the approach. I think your suggestion will
have the same problem. You just aren't generating test cases with really
large write workloads yet to see it. You also don't yet seem to
appreciate that exceeding the checkpoint timeout is a very bad thing.

> In addition, if you set a large value of checkpoint_timeout or
> checkpoint_completion_target, you have said that performance is
> improved, but is it true in all cases?

The timeout, yes. Throughput is always improved by increasing
checkpoint_timeout. Fewer checkpoints per unit of time increases
efficiency. Fewer writes of the most heavily accessed buffers happen per
transaction. It is faster because you are doing less work, which on
average is always faster than doing more work. And doing less work
usually beats doing more work, even more work done smartly.

If you want to see how much work per transaction a test is doing, track
the number of buffers written at the beginning/end of your test via
pg_stat_bgwriter. Tests that delay checkpoints will show a lower total
number of writes per transaction. That seems more efficient, but it's
efficiency mainly gained by ignoring checkpoint_timeout.
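
To make that concrete, here is a minimal sketch of the before/after
snapshot approach, assuming psycopg2 and a 9.x-era pg_stat_bgwriter
layout; the column choices and the use of xact_commit from
pg_stat_database as the transaction count are my assumptions, not
something your test harness has to match:

import psycopg2

SNAPSHOT_SQL = """
SELECT buffers_checkpoint, buffers_clean, buffers_backend,
       (SELECT sum(xact_commit)::bigint FROM pg_stat_database) AS commits
FROM pg_stat_bgwriter
"""

def snapshot(conn):
    # One row of cumulative counters; sample it before and after the run.
    cur = conn.cursor()
    cur.execute(SNAPSHOT_SQL)
    ckpt, clean, backend, commits = cur.fetchone()
    cur.close()
    return {"ckpt": ckpt, "clean": clean, "backend": backend,
            "commits": commits}

def writes_per_transaction(before, after):
    xacts = after["commits"] - before["commits"]
    writes = sum(after[k] - before[k] for k in ("ckpt", "clean", "backend"))
    return writes / float(xacts) if xacts else 0.0

if __name__ == "__main__":
    conn = psycopg2.connect("dbname=postgres")
    before = snapshot(conn)
    # ... run the benchmark here ...
    after = snapshot(conn)
    print("buffers written per transaction: %.2f"
          % writes_per_transaction(before, after))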

> When the checkpoint completion target is actually enlarged, performance
> may fall in some cases. I think this is because the last fsync becomes
> heavy owing to the slow writes.

I think you're confusing throughput and latency here. Increasing the
checkpoint timeout, or to a lesser extent the completion target, on
average increases throughput. It results in less work, and the
more/less work amount is much more important than worrying about
scheduler details. No matter how efficient a given write is, whether
you've sorted it across elevator horizon boundary A or boundary B, it's
better not to do it at all.

But having fewer checkpoints sometimes makes latency worse too.
Whether latency or throughput is considered the more important thing is
a very complicated question. Having checkpoint_completion_target as the
knob to control the latency/throughput trade-off hasn't worked out very
well. No one has done a really comprehensive look at this trade-off
since the 8.3 development cycle. I got halfway through it for 9.1; then
we figured out that the fsync queue filling up was actually responsible
for most of my result variation, and Robert fixed that. It was a big
enough change that I had to throw out all my earlier data as no longer
relevant.

By the way: if you have a theory like "the last fsync having become
heavy" for why something is happening, measure it. Set log_min_messages
to debug2 and you'll get details about every single fsync in your logs.
I did that for all my tests that led me to conclude fsync delaying on
its own didn't help that problem. I was measuring my theories as
directly as possible.
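
For example, here is a rough sketch of how to pull those numbers back
out of the logs afterwards. The regex targets the "checkpoint sync:
number=... file=... time=... msec" DEBUG line; the exact wording varies
a bit between versions, so treat it as a starting point to adjust
against your own log output:

import re
import sys

FSYNC_RE = re.compile(
    r"checkpoint sync: number=(\d+) file=(\S+) time=([\d.]+) msec")

def fsync_times(path):
    # Yield (file, milliseconds) for every fsync the checkpointer logged.
    with open(path) as f:
        for line in f:
            m = FSYNC_RE.search(line)
            if m:
                yield m.group(2), float(m.group(3))

if __name__ == "__main__":
    times = sorted(ms for _, ms in fsync_times(sys.argv[1]))
    if times:
        print("fsyncs=%d max=%.1f ms 90th=%.1f ms"
              % (len(times), times[-1], times[int(len(times) * 0.9)]))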

> I would like you to give me an itemized list of what would count as
> proof of my patch, because the DBT-2 benchmark takes a lot of time,
> about 3 - 4 hours per test setting.

That's great, but to add some perspective here I have spent over 1 year
of my life running tests like this. The development cycle to do
something useful in this area is normally measured in months of machine
time running benchmarks, not hours or days. You're doing well so far,
but you're just getting started.

My itemized list is simple: throw out all results where the checkpoint
end goes more than 5% beyond its target. When that happens, no matter
what you think is causing your gain, I will assume it's actually fewer
total writes that are improving things.
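
Here is one possible way to automate that filter, assuming
log_checkpoints is on and applying the 5% rule to the total= time
reported by each "checkpoint complete" line, measured against
checkpoint_timeout * checkpoint_completion_target; how exactly the
target is defined is my assumption here, so adapt it as needed:

import re
import sys

TOTAL_RE = re.compile(r"checkpoint complete:.*total=([\d.]+) s")

def overruns(path, timeout_s=300.0, completion_target=0.9, slack=1.05):
    # Return the elapsed times of checkpoints that blew past the target.
    target = timeout_s * completion_target * slack
    with open(path) as f:
        found = [TOTAL_RE.search(line) for line in f]
    return [float(m.group(1)) for m in found
            if m and float(m.group(1)) > target]

if __name__ == "__main__":
    bad = overruns(sys.argv[1])
    print("%d checkpoint(s) ran more than 5%% past their target" % len(bad))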

I'm willing to consider an optional, sloppy checkpoint approach that
uses heavy load to adjust how often checkpoints happen. But if we're
going to do that, it has to be extremely clear that the reason for the
gain is the checkpoint spacing--and there is going to be a crash
recovery time penalty paid for it. And this patch is not how I would do
that.

It's not really clear yet where the gains you're seeing are really
coming from. If you re-run all your tests with pg_stat_bgwriter
before/after snapshots, log every fsync call, and then build some tools
to analyze the fsync call latency, you'll have enough data to talk
about this usefully. That's what I consider the bare minimum evidence
for considering a change here. I have all of those features in
pgbench-tools with checkpoint logging turned way up, but they're not
all in the dbt2 toolset yet as far as I know.

--
Greg Smith 2ndQuadrant US greg(at)2ndQuadrant(dot)com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
