Re: Spread checkpoint sync

From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Greg Smith <greg(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Spread checkpoint sync
Date: 2010-11-20 23:21:48
Message-ID: AANLkTinN3+z83Dsjaca7ELPTxLYdkPuWTnDt3eaMViO7@mail.gmail.com
Lists: pgsql-hackers

On Mon, Nov 15, 2010 at 6:15 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Sun, Nov 14, 2010 at 6:48 PM, Greg Smith <greg(at)2ndquadrant(dot)com> wrote:
>> The second issue is that the delay between sync calls is currently
>> hard-coded, at 3 seconds.  I believe the right path here is to consider the
>> current checkpoint_completion_target to still be valid, then work back from
>> there.  That raises the question of what percentage of the time writes
>> should now be compressed into relative to that, to leave some time to spread
>> the sync calls.  If we're willing to say "writes finish in first 1/2 of
>> target, syncs execute in second 1/2", that I could implement that here.
>>  Maybe that ratio needs to be another tunable.  Still thinking about that
>> part, and it's certainly open to community debate.

I would speculate that the answer is likely to be nearly binary: the
best option would either be to do the writes as fast as possible and
spread out the fsyncs, or to spread out the writes and do the fsyncs
as fast as possible, depending on the system setup.
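
To make that contrast concrete, here is a rough sketch of the two
extremes; write_dirty_buffers_for() and checkpoint_sleep() are made-up
stand-ins for illustration, not anything in the current checkpointer:

#include <unistd.h>

/*
 * Hypothetical helpers, just for the sketch: issue the buffer writes
 * for one segment file, and nap a bit to spread the work out.
 */
static void
write_dirty_buffers_for(int segfd)
{
    (void) segfd;               /* pretend the write()s happened here */
}

static void
checkpoint_sleep(int msec)
{
    usleep((useconds_t) msec * 1000);
}

/* Option 1: do the writes as fast as possible, spread out the fsyncs */
static void
checkpoint_fast_writes(int *segfds, int nsegs, int sync_nap_msec)
{
    int         i;

    for (i = 0; i < nsegs; i++)
        write_dirty_buffers_for(segfds[i]);

    for (i = 0; i < nsegs; i++)
    {
        fsync(segfds[i]);
        checkpoint_sleep(sync_nap_msec);
    }
}

/* Option 2: spread out the writes, do the fsyncs back to back */
static void
checkpoint_fast_syncs(int *segfds, int nsegs, int write_nap_msec)
{
    int         i;

    for (i = 0; i < nsegs; i++)
    {
        write_dirty_buffers_for(segfds[i]);
        checkpoint_sleep(write_nap_msec);
    }

    for (i = 0; i < nsegs; i++)
        fsync(segfds[i]);
}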

>> The thing to realize
>> that complicates the design is that the actual sync execution may take a
>> considerable period of time.  It's much more likely for that to happen than
>> in the case of an individual write, as the current spread checkpoint does,
>> because those are usually cached.  In the spread sync case, it's easy for
>> one slow sync to make the rest turn into ones that fire in quick succession,
>> to make up for lost time.
>
> I think the behavior of file systems and operating systems is highly
> relevant here.  We seem to have a theory that allowing a delay between
> the write and the fsync should give the OS a chance to start writing
> the data out,

I thought that the theory was that doing too many fsyncs in short
order can lead to some kind of starvation of other IO.

If the theory is that we want to wait between writes and fsyncs, then
the current behavior is probably the best: spreading out the writes
and then doing all the syncs at the end gives the longest delay
between an average write and the sync of the file it was written to.
Or, spread the writes out over 150 seconds, sleep for 140 seconds,
then do the fsyncs. But I don't think that that is the theory.
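
For scale, the write-to-fsync gap under those two schedules works out
roughly as follows (purely illustrative numbers, assuming the writes
are spread uniformly over the interval):

#include <stdio.h>

int
main(void)
{
    double      spread_secs = 150.0;    /* writes spread over 150 s */
    double      sleep_secs = 140.0;     /* extra sleep before the syncs */

    /*
     * With the writes spread uniformly over spread_secs and every
     * fsync issued at the very end, the average write has waited
     * about spread_secs / 2 before its file is synced; sleeping
     * before the syncs adds that sleep to every write's wait.
     */
    printf("sync right after the spread: avg wait %.0f s\n",
           spread_secs / 2.0);
    printf("sleep first, then sync:      avg wait %.0f s\n",
           spread_secs / 2.0 + sleep_secs);
    return 0;
}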

> but do we have any evidence indicating whether and under
> what circumstances that actually occurs?  For example, if we knew that
> it's important to wait at least 30 s but waiting 60 s is no better,
> that would be useful information.
>
> Another question I have is about how we're actually going to know when
> any given fsync can be performed.  For any given segment, there are a
> certain number of pages A that are already dirty at the start of the
> checkpoint.

Dirty in the shared buffer pool, or dirty in the OS cache?

> Then there are a certain number of additional pages B
> that are going to be written out during the checkpoint.  If it so
> happens that B = 0, we can call fsync() at the beginning of the
> checkpoint without losing anything (in fact, we gain something: any
> pages dirtied by cleaning scans or backend writes during the
> checkpoint won't need to hit the disk;

Aren't the pages written out by cleaning scans and backend writes
while the checkpoint is occurring exactly what you defined to be page
set B, which you then supposed to be zero?

> and if the filesystem dumps
> more of its cache than necessary on fsync, we may as well take that
> hit before dirtying a bunch more stuff).  But if B > 0, then we
> shouldn't attempt the fsync() until we've written them all; otherwise
> we'll end up having to fsync() that segment twice.
>
> Doing all the writes and then all the fsyncs meets this requirement
> trivially, but I'm not so sure that's a good idea.  For example, given
> files F1 ... Fn with dirty pages needing checkpoint writes, we could
> do the following: first, do any pending fsyncs for files not among F1
> .. Fn; then, write all pages for F1 and fsync, write all pages for F2
> and fsync, write all pages for F3 and fsync, etc.  This might seem
> dumb because we're not really giving the OS a chance to write anything
> out before we fsync, but think about the ext3 case where the whole
> filesystem cache gets flushed anyway.  It's much better to dump the
> cache at the beginning of the checkpoint and then again after every
> file than it is to spew many GB of dirty stuff into the cache and then
> drop the hammer.

But the kernel has knobs to prevent that from happening:
dirty_background_ratio, dirty_ratio, dirty_background_bytes (on newer
kernels), and dirty_expire_centisecs. Don't these knobs work? Also,
ext3 is supposed to do a journal commit every 5 seconds under default
mount conditions.

Cheers,

Jeff
