Re: Spread checkpoint sync

From: Greg Smith <greg(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Spread checkpoint sync
Date: 2010-11-21 21:54:00
Message-ID: 4CE994F8.8020800@2ndquadrant.com
Lists: pgsql-hackers

Robert Haas wrote:
> Doing all the writes and then all the fsyncs meets this requirement
> trivially, but I'm not so sure that's a good idea. For example, given
> files F1 ... Fn with dirty pages needing checkpoint writes, we could
> do the following: first, do any pending fsyncs for files not among F1
> .. Fn; then, write all pages for F1 and fsync, write all pages for F2
> and fsync, write all pages for F3 and fsync, etc. This might seem
> dumb because we're not really giving the OS a chance to write anything
> out before we fsync, but think about the ext3 case where the whole
> filesystem cache gets flushed anyway.

I'm not horribly interested in optimizing for the ext3 case per se, as I
consider that filesystem fundamentally broken from the perspective of
its ability to deliver low-latency here. I wouldn't want a patch that
improved behavior on filesystems with granular fsync to make the ext3
situation worse. That's as much as I'd want the design to lean toward
considering its quirks. Jeff Janes made a case downthread for "why not
make it the admin/OS's job to worry about this?" In cases where there
is a reasonable solution available, in the form of "switch to XFS or
ext4", I'm happy to take that approach.

Let me throw some numbers out to give a better idea of the shape and
magnitude of the problem case I've been working on here. In the
situation that leads to the near hour-long sync phase I've seen,
checkpoints will start with about a 3GB backlog of data in the kernel
write cache to deal with. That's about 4% of RAM, just under the 5%
threshold set by dirty_background_ratio. Whether or not the 256MB write
cache on the controller is also filled is a relatively minor detail I
can't monitor easily. The checkpoint itself? <250MB each time.

This proportion is why I didn't think to follow the alternate path of
worrying about spacing the write and fsync calls out differently. I
shrank shared_buffers down to make the actual checkpoints smaller, which
helped to some degree; that's what got them down to smaller than the
RAID cache size. But the amount of data cached by the operating system
is the real driver of total sync time here. Whether or not you include
all of the writes from the checkpoint itself before you start calling
fsync didn't actually matter very much; in the case I've been chasing,
those are getting cached anyway. The write storm from the fsync calls
themselves forcing things out seems to be the driver on I/O spikes,
which is why I started with spacing those out.
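
To make "spacing those out" concrete, here is a minimal sketch in
Python rather than the actual patch. The helper name spread_fsync and
its parameters are hypothetical; it only shows the shape of the idea:
fsync each file in turn and pause between calls so the kernel is not
asked to flush everything at once.

```python
import os
import time


def spread_fsync(paths, delay_seconds, sleep=time.sleep):
    """fsync each file in turn, pausing between calls.

    Hypothetical helper for illustration only; not the code in the patch.
    """
    synced = []
    for i, path in enumerate(paths):
        fd = os.open(path, os.O_RDWR)
        try:
            os.fsync(fd)  # force this file's cached writes to storage
        finally:
            os.close(fd)
        synced.append(path)
        # Pause between files, but not after the last one.
        if i < len(paths) - 1:
            sleep(delay_seconds)
    return synced
```

The sleep function is a parameter only so the pause can be stubbed out
when trying this on a handful of scratch files.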

Writes go out at a rate of around 5MB/s, so clearing the 3GB backlog
takes a minimum of 10 minutes of real time. There are about 300 1GB
relation files involved in the case I've been chasing. This is where
the 3 second delay number came from; 300 files, 3 seconds each, 900
seconds = 15 minutes of sync spread. You can turn that math around to
figure out how much delay per relation you can afford while still
keeping checkpoints to a planned end time, which the patch I submitted
doesn't do yet.
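
The arithmetic above, including the turned-around version, can be
sketched like this. The function names are made up for illustration,
and the 10-minute sync budget in the last line is an assumed figure,
not something from the patch.

```python
GB = 1024 ** 3
MB = 1024 ** 2


def backlog_drain_seconds(backlog_bytes, write_rate_bytes_per_s):
    """Minimum time to write out a backlog at a steady rate."""
    return backlog_bytes / write_rate_bytes_per_s


def sync_spread_seconds(n_files, delay_per_file_s):
    """Total time added by sleeping a fixed delay for each file."""
    return n_files * delay_per_file_s


def affordable_delay_seconds(sync_budget_s, n_files):
    """The math turned around: delay per file that fits a time budget."""
    return sync_budget_s / n_files


drain = backlog_drain_seconds(3 * GB, 5 * MB)  # 614.4 s, about 10 minutes
spread = sync_spread_seconds(300, 3)           # 900 s = 15 minutes
delay = affordable_delay_seconds(600, 300)     # 2.0 s, for an assumed 600 s budget
```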

Ultimately what I want to do here is some sort of smarter write-behind
sync operation, perhaps with a LRU on relations with pending fsync
requests. The idea would be to sync relations that haven't been touched
in a while in advance of the checkpoint even. I think that's similar to
the general idea Robert is suggesting here, to get some sync calls
flowing before all of the checkpoint writes have happened. I think that
the final sync calls will need to get spread out regardless, and since
doing that requires only a fairly small amount of code, that's why we
started with that.
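
One possible shape for that LRU, as a rough sketch only. Nothing like
this exists in the patch; the class name PendingSyncLRU, its methods,
and the idle threshold are all assumptions. It tracks relations with
pending fsync requests by last write time and hands back the ones that
have been idle long enough to sync ahead of the checkpoint. It assumes
the timestamps passed in never go backwards.

```python
from collections import OrderedDict


class PendingSyncLRU:
    """Relations with pending fsync requests, least recently written first.

    Hypothetical structure for illustration; assumes `now` values passed
    to note_write() are non-decreasing.
    """

    def __init__(self):
        self._last_write = OrderedDict()  # relation -> time of last write

    def note_write(self, relation, now):
        # A new write makes this relation the most recently touched.
        self._last_write[relation] = now
        self._last_write.move_to_end(relation)

    def pop_idle(self, now, min_idle_seconds):
        """Remove and return relations untouched for min_idle_seconds."""
        idle = []
        while self._last_write:
            relation, last = next(iter(self._last_write.items()))
            if now - last < min_idle_seconds:
                break  # everything after this was touched more recently
            self._last_write.popitem(last=False)
            idle.append(relation)
        return idle

    def pending(self):
        return list(self._last_write)
```

Whatever pop_idle() returns would be synced early; anything still
pending at checkpoint time gets the spread-out sync described above.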

--
Greg Smith 2ndQuadrant US greg(at)2ndQuadrant(dot)com Baltimore, MD
PostgreSQL Training, Services and Support www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books
