Re: Spread checkpoint sync

From: Greg Smith <greg(at)2ndquadrant(dot)com>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Spread checkpoint sync
Date: 2010-11-21 16:37:26
Message-ID: 4CE94AC6.4040409@2ndquadrant.com
Lists: pgsql-hackers

Jeff Janes wrote:
> And for very large memory
> systems, even 1% may be too much to cache (dirty*_ratio can only be
> set in integer percent points), so recent kernels introduced
> dirty*_bytes parameters. I like these better because they do what
> they say. With the dirty*_ratio, I could never figure out what it was
> a ratio of, and the results were unpredictable without extensive
> experimentation.
>

Right, you can't set dirty_background_ratio low enough to make this
problem go away. Even attempts to set it to 1%, back when that was
the right size for it, seem to be defeated by other mechanisms within
the kernel. Last time I looked at the related source code, it seemed
the "congestion control" logic that kicks in to throttle writes was a
likely suspect. This is why I'm not real optimistic about newer
mechanisms like the dirty_background_bytes added in 2.6.29 helping here,
as that just gives a way to set lower values; the same basic logic
is under the hood.
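
To put rough numbers on the integer-percent limitation, here is a minimal
sketch of the arithmetic. The memory size and RAID cache size are made-up
illustrative figures, and the function simply takes a percentage of whatever
memory figure you hand it; which memory figure the kernel actually uses is
exactly the thing Jeff says he could never pin down, so treat that as an
assumption.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2

def background_threshold_bytes(mem_bytes, ratio_percent):
    """Dirty bytes allowed before background writeback starts.

    ratio_percent must be a whole number, mirroring the integer-only
    dirty*_ratio settings. mem_bytes is whatever memory figure the
    percentage is assumed to apply to (an assumption of this sketch).
    """
    if not isinstance(ratio_percent, int) or not 0 <= ratio_percent <= 100:
        raise ValueError("ratio_percent must be an integer from 0 to 100")
    return mem_bytes * ratio_percent // 100

# Hypothetical server: 128 GiB of memory, 512 MiB battery-backed RAID cache.
mem = 128 * GIB
raid_cache = 512 * MIB

# The smallest non-zero ratio is 1%, which is still larger than the cache.
smallest = background_threshold_bytes(mem, 1)
print(f"1% of memory: {smallest / MIB:.0f} MiB, "
      f"RAID cache: {raid_cache / MIB:.0f} MiB")
```

A byte-based setting lets you ask for a threshold below that 1% floor, but as
noted above it only changes the number, not the writeback logic behind it.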

Like Jeff, I've never seen dirty_expire_centisecs help at all, possibly
due to the same congestion mechanism.

> Yes, but how much work do we want to put into redoing the checkpoint
> logic so that the sysadmin on a particular OS and configuration and FS
> can avoid having to change the kernel parameters away from their
> defaults? (Assuming of course I am correctly understanding the
> problem, always a dangerous assumption.)
>

I've been trying to make this problem go away using just the kernel
tunables available since 2006. I adjusted them carefully on the server
that ran into this problem so badly that it motivated the submitted
patch, months before this issue got bad. It didn't help. Maybe if they
were running a later kernel that supported dirty_background_bytes that
would have worked better. During the last few years, the only thing
that has consistently helped in every case is the checkpoint spreading
logic that went into 8.3. I no longer expect that the kernel developers
will ever make this problem go away given the way checkpoints are written
out right now, whereas the last good PostgreSQL work in this area
definitely helped.

The basic premise of the current checkpoint code is that if you write
all of the buffers out early enough, by the time the syncs execute, enough
of the data should have gone out that those don't take very long to
process. That was usually true for the last few years, on systems with
a battery-backed cache; the amount of memory cached by the OS was
small relative to the RAID cache size. That's not the case
anymore, and that divergence is growing bigger.
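
As a toy illustration of that premise (not the PostgreSQL code, just the
write-then-sync shape of it), the sketch below writes a set of buffers to
files first and only afterwards calls fsync on each file, timing the syncs.
The function name, file names, and buffer sizes are all hypothetical.

```python
import os
import tempfile
import time

def toy_checkpoint(buffers_by_file, directory, write_pause=0.0):
    """Write every buffer first, then fsync each file and time the syncs.

    buffers_by_file maps a file name to a list of bytes objects.
    Returns a dict of file name -> seconds spent in fsync.
    """
    open_files = {}
    # Phase 1: hand all the data to the OS, optionally spread out over time.
    for name, buffers in buffers_by_file.items():
        f = open(os.path.join(directory, name), "wb")
        open_files[name] = f
        for buf in buffers:
            f.write(buf)
            f.flush()  # push to the OS cache; this does not force it to disk
            time.sleep(write_pause)
    # Phase 2: force everything to stable storage, one file after another.
    sync_seconds = {}
    for name, f in open_files.items():
        start = time.monotonic()
        os.fsync(f.fileno())
        sync_seconds[name] = time.monotonic() - start
        f.close()
    return sync_seconds

with tempfile.TemporaryDirectory() as d:
    timings = toy_checkpoint({"rel_a": [b"x" * 8192] * 4,
                              "rel_b": [b"y" * 8192]}, d)
    for name, seconds in timings.items():
        print(f"{name}: fsync took {seconds:.6f}s")
```

How long phase 2 takes depends on how much of the phase 1 data the OS has
already pushed out by then, which is the bet the premise makes. Note that the
second loop here runs its syncs back to back with no other work in between.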

The idea that the checkpoint sync code can run in a relatively tight
loop, without stopping to do the normal background writer cleanup work,
is also busted by that observation.

--
Greg Smith 2ndQuadrant US greg(at)2ndQuadrant(dot)com Baltimore, MD
PostgreSQL Training, Services and Support www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books
