checkpoint writeback via sync_file_range

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: pgsql-hackers(at)postgresql(dot)org, Greg Smith <greg(at)2ndquadrant(dot)com>
Subject: checkpoint writeback via sync_file_range
Date: 2012-01-11 02:14:31
Message-ID: CA+TgmoaHu1zuNohoE=cEP0nSc+0wtuRSyEAj_Af2XhxU+ry6-w@mail.gmail.com
Lists: pgsql-hackers

Greg Smith muttered a while ago about wanting to do something with
sync_file_range to improve checkpoint behavior on Linux. I thought he
was talking about trying to sync only the range of blocks known to be
dirty, which didn't seem like a very exciting idea, but after looking
at the man page for sync_file_range, I think I understand what he was
really going for: sync_file_range allows you to hint the Linux kernel
that you'd like it to clean a certain set of pages. I further recall
from Greg's previous comments that in the scenarios he's seen,
checkpoint I/O spikes are caused not so much by the data written out
by the checkpoint itself but from the other dirty data in the kernel
buffer cache. Based on that, I whipped up the attached patch, which,
if sync_file_range is available, simply iterates through everything
that will eventually be fsync'd before beginning the write phase and
tells the Linux kernel to put them all under write-out.
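
For anyone reading along without the patch in front of them, the shape of the
idea is just a pass over the files we already know will be fsync'd later,
hinting writeback on each. A minimal sketch, not the actual patch code (the
helper name and the fd array are hypothetical):

    #define _GNU_SOURCE
    #include <fcntl.h>          /* sync_file_range(), SYNC_FILE_RANGE_WRITE */

    /*
     * Hypothetical sketch: before the checkpoint write phase, ask the kernel
     * to start asynchronous writeback on every file that will be fsync'd
     * later.  offset = 0 with nbytes = 0 means "from offset to end of file".
     */
    static void
    hint_pending_writebacks(const int *pending_fds, int npending)
    {
        for (int i = 0; i < npending; i++)
        {
            /* Start write-out of dirty pages; don't wait for completion. */
            (void) sync_file_range(pending_fds[i], 0, 0, SYNC_FILE_RANGE_WRITE);
        }
    }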

I don't know that I have a suitable place to test this, and I'm not
quite sure what a good test setup would look like either, so while
I've tested that this appears to issue the right kernel calls, I am
not sure whether it actually fixes the problem case. But here's the
patch, anyway.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment: writeback-v1.patch (application/octet-stream, 13.2 KB)

From: Greg Smith <greg(at)2ndQuadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: checkpoint writeback via sync_file_range
Date: 2012-01-11 04:38:12
Message-ID: 4F0D1234.1020300@2ndQuadrant.com
Lists: pgsql-hackers

On 1/10/12 9:14 PM, Robert Haas wrote:
> Based on that, I whipped up the attached patch, which,
> if sync_file_range is available, simply iterates through everything
> that will eventually be fsync'd before beginning the write phase and
> tells the Linux kernel to put them all under write-out.

I hadn't really thought of using it that way. The kernel expects that
when this is called the normal way, you're going to track exactly which
segments you want it to sync. And that data isn't really passed through
the fsync absorption code yet; the list of things to fsync has already
lost that level of detail.

What you're doing here doesn't care though, and I hadn't considered that
SYNC_FILE_RANGE_WRITE could be used that way on my last pass through its
docs. Used this way, it's basically fsync without the wait or
guarantee; it just tries to push what's already dirty further ahead of
the write queue than those writes would otherwise be.
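
For reference, the flag combinations from the sync_file_range man page that
make that distinction concrete look like this (the wrapper names are just for
illustration, nothing from the patch):

    #define _GNU_SOURCE
    #include <fcntl.h>

    /*
     * "Nudge only", as the patch uses it: start writeback of currently dirty
     * pages in the range and return without waiting for them to reach disk.
     */
    static void
    nudge_writeback(int fd)
    {
        (void) sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WRITE);
    }

    /*
     * The closest sync_file_range gets to fsync-like behavior: wait for any
     * writeback already in progress, start new writeback, then wait for that
     * too.  Per the man page it still flushes no file metadata, so it is not
     * a durability guarantee.
     */
    static void
    flush_and_wait(int fd)
    {
        (void) sync_file_range(fd, 0, 0,
                               SYNC_FILE_RANGE_WAIT_BEFORE |
                               SYNC_FILE_RANGE_WRITE |
                               SYNC_FILE_RANGE_WAIT_AFTER);
    }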

One idea I was thinking about here was building a little hash table
inside of the fsync absorb code, tracking how many absorb operations
have happened for whatever the most popular relation files are. The
idea is that we might say "use sync_file_range every time <N> calls for
a relation have come in", just to keep from ever accumulating too many
writes to any one file before trying to nudge some of it out of there.
The bat that keeps hitting me in the head here is that right now, a
single fsync might have a full 1GB of writes to flush out, perhaps
because it extended a table and then wrote more than that to it. And in
everything but an SSD or giant SAN cache situation, 1GB of I/O is just
too much to fsync at a time without the OS choking a little on it.
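
Very roughly, and with invented names rather than anything from the real fsync
absorb code, that counter idea might look like this (the threshold is an
arbitrary example value):

    #define _GNU_SOURCE
    #include <fcntl.h>

    #define NUDGE_THRESHOLD 1000    /* arbitrary example value for <N> */

    /* Hypothetical per-file entry for the little hash table. */
    typedef struct FileAbsorbCount
    {
        int fd;             /* open descriptor for the relation segment */
        int nabsorbed;      /* absorb operations since the last nudge */
    } FileAbsorbCount;

    /*
     * Called each time an fsync request for this file is absorbed; every
     * NUDGE_THRESHOLD requests, push what's dirty so far toward writeback
     * without waiting for it.
     */
    static void
    absorb_one_request(FileAbsorbCount *entry)
    {
        if (++entry->nabsorbed >= NUDGE_THRESHOLD)
        {
            (void) sync_file_range(entry->fd, 0, 0, SYNC_FILE_RANGE_WRITE);
            entry->nabsorbed = 0;
        }
    }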

> I don't know that I have a suitable place to test this, and I'm not
> quite sure what a good test setup would look like either, so while
> I've tested that this appears to issue the right kernel calls, I am
> not sure whether it actually fixes the problem case.

I'll put this into my testing queue after the upcoming CF starts.

--
Greg Smith 2ndQuadrant US greg(at)2ndQuadrant(dot)com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Greg Smith <greg(at)2ndquadrant(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: checkpoint writeback via sync_file_range
Date: 2012-01-11 09:28:11
Message-ID: CA+U5nM+2DwGECG3O0BAihqH8eEhegk-W9kp2Co4yh2u1o4iGBA@mail.gmail.com
Lists: pgsql-hackers

On Wed, Jan 11, 2012 at 4:38 AM, Greg Smith <greg(at)2ndquadrant(dot)com> wrote:
> On 1/10/12 9:14 PM, Robert Haas wrote:
>>
>> Based on that, I whipped up the attached patch, which,
>> if sync_file_range is available, simply iterates through everything
>> that will eventually be fsync'd before beginning the write phase and
>> tells the Linux kernel to put them all under write-out.
>
>
> I hadn't really thought of using it that way.  The kernel expects that when
> this is called the normal way, you're going to track exactly which segments
> you want it to sync.  And that data isn't really passed through the fsync
> absorption code yet; the list of things to fsync has already lost that level
> of detail.
>
> What you're doing here doesn't care though, and I hadn't considered that
> SYNC_FILE_RANGE_WRITE could be used that way on my last pass through its
> docs.  Used this way, it's basically fsync without the wait or guarantee; it
> just tries to push what's already dirty further ahead of the write queue
> than those writes would otherwise be.

I don't think this will help at all; I think it will just make things worse.

The problem comes from hammering the fsyncs one after the other. What
this patch does is initiate all of the fsyncs at the same time, so it
will max out the disks even more because this will hit all disks all
at once.

It does open the door to various other uses, so I think this work will
be useful.

> One idea I was thinking about here was building a little hash table inside
> of the fsync absorb code, tracking how many absorb operations have happened
> for whatever the most popular relation files are.  The idea is that we might
> say "use sync_file_range every time <N> calls for a relation have come in",
> just to keep from ever accumulating too many writes to any one file before
> trying to nudge some of it out of there. The bat that keeps hitting me in
> the head here is that right now, a single fsync might have a full 1GB of
> writes to flush out, perhaps because it extended a table and then wrote more
> than that to it.  And in everything but an SSD or giant SAN cache situation,
> 1GB of I/O is just too much to fsync at a time without the OS choking a
> little on it.

A better idea. Seems like it should be easy enough to keep a counter.

I see some other uses around large writes also.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


From: Florian Weimer <fweimer(at)bfk(dot)de>
To: Greg Smith <greg(at)2ndQuadrant(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: checkpoint writeback via sync_file_range
Date: 2012-01-11 09:33:47
Message-ID: 82mx9u4m84.fsf@mid.bfk.de
Lists: pgsql-hackers

* Greg Smith:

> One idea I was thinking about here was building a little hash table
> inside of the fsync absorb code, tracking how many absorb operations
> have happened for whatever the most popular relation files are. The
> idea is that we might say "use sync_file_range every time <N> calls
> for a relation have come in", just to keep from ever accumulating too
> many writes to any one file before trying to nudge some of it out of
> there. The bat that keeps hitting me in the head here is that right
> now, a single fsync might have a full 1GB of writes to flush out,
> perhaps because it extended a table and then wrote more than that to
> it. And in everything but an SSD or giant SAN cache situation, 1GB of
> I/O is just too much to fsync at a time without the OS choking a
> little on it.

Isn't this pretty much like tuning vm.dirty_bytes? We generally set it
to pretty low values, and that seems to help smooth out the checkpoints.

--
Florian Weimer <fweimer(at)bfk(dot)de>
BFK edv-consulting GmbH http://www.bfk.de/
Kriegsstraße 100 tel: +49-721-96201-1
D-76133 Karlsruhe fax: +49-721-96201-99


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Greg Smith <greg(at)2ndquadrant(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: checkpoint writeback via sync_file_range
Date: 2012-01-11 12:41:35
Message-ID: CA+U5nMJz4dc3TxNPHjnSidzXm1CoLM3YnxwbgqUseYWT2h+nxw@mail.gmail.com
Lists: pgsql-hackers

On Wed, Jan 11, 2012 at 9:28 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:

> It does open the door to various other uses, so I think this work will
> be useful.

Yes, I think this would allow a better design for the checkpointer.

Checkpoint scan will collect buffers to write for checkpoint and sort
them by fileid, like Koichi/Itagaki already suggested.

We then do all the writes for a particular file, then issue a
background sync_file_range, then sleep a little. Loop. At end of loop,
collect up and close the sync_file_range calls with a
SYNC_FILE_RANGE_WAIT_AFTER.

So we're interleaving the writes and fsyncs throughout the whole
checkpoint, not bursting the fsyncs at the end.
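
In sketch form, under the assumption that buffers have already been collected
and sorted (FileToSync, write_buffers_for_file() and the nap interval are
invented names for illustration, not existing checkpointer code):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <unistd.h>

    typedef struct FileToSync
    {
        int fd;
    } FileToSync;

    /* Assumed to exist elsewhere: writes this file's checkpoint buffers. */
    extern void write_buffers_for_file(FileToSync *f);

    static void
    checkpoint_interleaved(FileToSync *files, int nfiles, useconds_t nap_usecs)
    {
        for (int i = 0; i < nfiles; i++)
        {
            write_buffers_for_file(&files[i]);

            /* start background writeback of what we just wrote, no waiting */
            (void) sync_file_range(files[i].fd, 0, 0, SYNC_FILE_RANGE_WRITE);

            usleep(nap_usecs);      /* spread the I/O across the checkpoint */
        }

        /* end of checkpoint: now wait for all of that writeback to finish */
        for (int i = 0; i < nfiles; i++)
            (void) sync_file_range(files[i].fd, 0, 0,
                                   SYNC_FILE_RANGE_WAIT_BEFORE |
                                   SYNC_FILE_RANGE_WRITE |
                                   SYNC_FILE_RANGE_WAIT_AFTER);
    }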

With that design we would just have a continuous checkpoint, rather
than a checkpoint_completion_target of 0.5 or 0.9.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


From: Andres Freund <andres(at)anarazel(dot)de>
To: pgsql-hackers(at)postgresql(dot)org
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Greg Smith <greg(at)2ndquadrant(dot)com>
Subject: Re: checkpoint writeback via sync_file_range
Date: 2012-01-11 12:46:29
Message-ID: 201201111346.30167.andres@anarazel.de
Lists: pgsql-hackers

On Wednesday, January 11, 2012 03:14:31 AM Robert Haas wrote:
> Greg Smith muttered a while ago about wanting to do something with
> sync_file_range to improve checkpoint behavior on Linux. I thought he
> was talking about trying to sync only the range of blocks known to be
> dirty, which didn't seem like a very exciting idea, but after looking
> at the man page for sync_file_range, I think I understand what he was
> really going for: sync_file_range allows you to hint the Linux kernel
> that you'd like it to clean a certain set of pages. I further recall
> from Greg's previous comments that in the scenarios he's seen,
> checkpoint I/O spikes are caused not so much by the data written out
> by the checkpoint itself but from the other dirty data in the kernel
> buffer cache. Based on that, I whipped up the attached patch, which,
> if sync_file_range is available, simply iterates through everything
> that will eventually be fsync'd before beginning the write phase and
> tells the Linux kernel to put them all under write-out.
I played around with this before and my problem was that sync_file_range is not
really a hint. It actually starts writeback *directly* and only returns when
the I/O is placed inside the queue (at least that's the way it was back then),
which very quickly leads to it blocking all the time...

Andres


From: Andres Freund <andres(at)anarazel(dot)de>
To: pgsql-hackers(at)postgresql(dot)org
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Greg Smith <greg(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>
Subject: Re: checkpoint writeback via sync_file_range
Date: 2012-01-11 12:47:39
Message-ID: 201201111347.40173.andres@anarazel.de
Lists: pgsql-hackers

On Wednesday, January 11, 2012 10:28:11 AM Simon Riggs wrote:
> On Wed, Jan 11, 2012 at 4:38 AM, Greg Smith <greg(at)2ndquadrant(dot)com> wrote:
> > On 1/10/12 9:14 PM, Robert Haas wrote:
> >> Based on that, I whipped up the attached patch, which,
> >> if sync_file_range is available, simply iterates through everything
> >> that will eventually be fsync'd before beginning the write phase and
> >> tells the Linux kernel to put them all under write-out.
> >
> > I hadn't really thought of using it that way. The kernel expects that
> > when this is called the normal way, you're going to track exactly which
> > segments you want it to sync. And that data isn't really passed through
> > the fsync absorption code yet; the list of things to fsync has already
> > lost that level of detail.
> >
> > What you're doing here doesn't care though, and I hadn't considered that
> > SYNC_FILE_RANGE_WRITE could be used that way on my last pass through its
> > docs. Used this way, it's basically fsync without the wait or guarantee;
> > it just tries to push what's already dirty further ahead of the write
> > queue than those writes would otherwise be.
>
> I don't think this will help at all, I think it will just make things
> worse.
>
> The problem comes from hammering the fsyncs one after the other. What
> this patch does is initiate all of the fsyncs at the same time, so it
> will max out the disks even more because this will hit all disks all
> at once.
The advantage of using sync_file_range that way is that it starts writeout but
doesn't cause queue drains/barriers/whatever to be issued, which can be quite a
significant speed gain. In theory.

Andres


From: Andres Freund <andres(at)anarazel(dot)de>
To: pgsql-hackers(at)postgresql(dot)org
Cc: Florian Weimer <fweimer(at)bfk(dot)de>, Greg Smith <greg(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>
Subject: Re: checkpoint writeback via sync_file_range
Date: 2012-01-11 12:51:38
Message-ID: 201201111351.38738.andres@anarazel.de
Lists: pgsql-hackers

On Wednesday, January 11, 2012 10:33:47 AM Florian Weimer wrote:
> * Greg Smith:
> > One idea I was thinking about here was building a little hash table
> > inside of the fsync absorb code, tracking how many absorb operations
> > have happened for whatever the most popular relation files are. The
> > idea is that we might say "use sync_file_range every time <N> calls
> > for a relation have come in", just to keep from ever accumulating too
> > many writes to any one file before trying to nudge some of it out of
> > there. The bat that keeps hitting me in the head here is that right
> > now, a single fsync might have a full 1GB of writes to flush out,
> perhaps because it extended a table and then wrote more than that to
> it. And in everything but an SSD or giant SAN cache situation, 1GB of
> > I/O is just too much to fsync at a time without the OS choking a
> > little on it.
>
> Isn't this pretty much like tuning vm.dirty_bytes? We generally set it
> to pretty low values, and that seems to help smooth out the checkpoints.
If done correctly, in a much more invasive way, you could issue sync_file_range
calls only for the areas of the file where checkpoint writes need to happen and
leave out e.g. hint-bit-only changes, which could help reduce the cost of
checkpoints.
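
As a sketch of what a per-range version could look like (the helper is
hypothetical; 8192 is just PostgreSQL's default block size):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdint.h>

    #define BLCKSZ 8192             /* PostgreSQL's default block size */

    /*
     * Hypothetical: hint writeback for a single checkpoint-dirtied block,
     * leaving pages that were dirtied only by hint-bit changes alone.
     */
    static void
    hint_block_writeback(int fd, uint32_t blocknum)
    {
        (void) sync_file_range(fd, (off_t) blocknum * BLCKSZ, BLCKSZ,
                               SYNC_FILE_RANGE_WRITE);
    }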

Andres


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Greg Smith <greg(at)2ndquadrant(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: checkpoint writeback via sync_file_range
Date: 2012-01-11 13:39:15
Message-ID: CA+TgmobXuvgwNpp3y0vMf6_1n_wDO3SV=DuZC75KM0avEkZ5PA@mail.gmail.com
Lists: pgsql-hackers

On Tue, Jan 10, 2012 at 11:38 PM, Greg Smith <greg(at)2ndquadrant(dot)com> wrote:
> What you're doing here doesn't care though, and I hadn't considered that
> SYNC_FILE_RANGE_WRITE could be used that way on my last pass through its
> docs.  Used this way, it's basically fsync without the wait or guarantee; it
> just tries to push what's already dirty further ahead of the write queue
> than those writes would otherwise be.

Well, my goal was to make sure they got into the write queue rather
than just sitting in memory while the kernel twiddles its thumbs. My
hope is that the kernel is smart enough that, when you put something
under write-out, the kernel writes it out as quickly as it can without
causing too much degradation in foreground activity. If that turns
out to be an incorrect assumption, we'll need a different approach,
but I thought it might be worth trying something simple first and
seeing what happens.

> One idea I was thinking about here was building a little hash table inside
> of the fsync absorb code, tracking how many absorb operations have happened
> for whatever the most popular relation files are.  The idea is that we might
> say "use sync_file_range every time <N> calls for a relation have come in",
> just to keep from ever accumulating too many writes to any one file before
> trying to nudge some of it out of there. The bat that keeps hitting me in
> the head here is that right now, a single fsync might have a full 1GB of
> writes to flush out, perhaps because it extended a table and then wrote more
> than that to it.  And in everything but an SSD or giant SAN cache situation,
> 1GB of I/O is just too much to fsync at a time without the OS choking a
> little on it.

That's not a bad idea, but there's definitely some potential downside:
you might end up reducing write-combining quite significantly if
you keep pushing things out to files when it isn't really needed yet.
I was aiming to only push things out when we're 100% sure that they're
going to have to be fsync'd, and certainly any already-written buffers
that are in the OS cache at the start of a checkpoint fall into that
category. That having been said, experimental evidence is king.

> I'll put this into my testing queue after the upcoming CF starts.

Thanks!

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Greg Smith <greg(at)2ndQuadrant(dot)com>
To: Florian Weimer <fweimer(at)bfk(dot)de>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: checkpoint writeback via sync_file_range
Date: 2012-01-11 14:12:30
Message-ID: 4F0D98CE.4000607@2ndQuadrant.com
Lists: pgsql-hackers

On 1/11/12 4:33 AM, Florian Weimer wrote:
> Isn't this pretty much like tuning vm.dirty_bytes? We generally set it
> to pretty low values, and that seems to help smooth out the checkpoints.

When I experimented with dropping the actual size of the cache,
checkpoint spikes improved, but things like VACUUM ran terribly slow.
On a typical medium to large server nowadays (let's say 16GB+),
PostgreSQL needs to have gigabytes of write cache for good performance.

What we're aiming for here is to keep the benefits of having that much write
cache, while allowing checkpoint-related work to send increasingly
strong suggestions about the ordering of what it needs written soon. There are
basically three primary states on Linux to be concerned about here:

Dirty: in the cache via standard write
   |
   v   pdflush does writeback at 5 or 10% dirty  ||  sync_file_range push
   |
Writeback
   |
   v   write happens in the background  ||  fsync call
   |
Stored on disk

The systems with bad checkpoint problems will typically have gigabytes of
"Dirty" data, which is necessary for good performance. The kernel is very lazy
about pushing things toward "Writeback" though. Getting the oldest portions
of the outstanding writes into the Writeback queue more aggressively
should make the eventual fsync less likely to block.

--
Greg Smith 2ndQuadrant US greg(at)2ndQuadrant(dot)com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


From: Greg Smith <greg(at)2ndQuadrant(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: pgsql-hackers(at)postgresql(dot)org, Robert Haas <robertmhaas(at)gmail(dot)com>
Subject: Re: checkpoint writeback via sync_file_range
Date: 2012-01-11 14:20:09
Message-ID: 4F0D9A99.6010108@2ndQuadrant.com
Lists: pgsql-hackers

On 1/11/12 7:46 AM, Andres Freund wrote:
> I played around with this before and my problem was that sync_file_range is not
> really a hint. It actually starts writeback *directly* and only returns when
> the I/O is placed inside the queue (at least that's the way it was back then),
> which very quickly leads to it blocking all the time...

Right, you're answering one of Robert's questions here: yes, once
something is pushed toward writeback, it moves toward an actual write
extremely fast. And the writeback queue can fill itself. But we don't
really care if this blocks. There's a checkpointer process, it will be
doing this work, and it has no other responsibilities anymore (as of
9.2, which is why some of these approaches suddenly become practical).
It's going to get blocked waiting for things sometimes, the way it
already does rarely when it writes, and often when it calls fsync.

--
Greg Smith 2ndQuadrant US greg(at)2ndQuadrant(dot)com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


From: Andres Freund <andres(at)anarazel(dot)de>
To: pgsql-hackers(at)postgresql(dot)org
Cc: Greg Smith <greg(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>
Subject: Re: checkpoint writeback via sync_file_range
Date: 2012-01-11 14:25:13
Message-ID: 201201111525.13818.andres@anarazel.de
Lists: pgsql-hackers

On Wednesday, January 11, 2012 03:20:09 PM Greg Smith wrote:
> On 1/11/12 7:46 AM, Andres Freund wrote:
> > I played around with this before and my problem was that sync_file_range
> > is not really a hint. It actually starts writeback *directly* and only
> > returns when the I/O is placed inside the queue (at least that's the way
> > it was back then), which very quickly leads to it blocking all the
> > time...
>
> Right, you're answering one of Robert's questions here: yes, once
> something is pushed toward writeback, it moves toward an actual write
> extremely fast. And the writeback queue can fill itself. But we don't
> really care if this blocks. There's a checkpointer process, it will be
> doing this work, and it has no other responsibilities anymore (as of
> 9.2, which is why some of these approaches suddenly become practical).
> It's going to get blocked waiting for things sometimes, the way it
> already does rarely when it writes, and often when it calls fsync.
We do care, imo. The heavy pressure of putting it directly in the writeback queue
leads to less efficient I/O because quite often it won't reorder sensibly with
other I/O anymore and the like. At least that was my experience using it
in another application.
Lots of that changed with Linux 3.2 (a near-complete rewrite of the writeback
mechanism), so a bit of that might be moot anyway.

I definitely agree that 9.2 opens new possibilities there.

Andres


From: Greg Smith <greg(at)2ndQuadrant(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: pgsql-hackers(at)postgresql(dot)org, Robert Haas <robertmhaas(at)gmail(dot)com>
Subject: Re: checkpoint writeback via sync_file_range
Date: 2012-01-13 03:26:12
Message-ID: 4F0FA454.3030604@2ndQuadrant.com
Lists: pgsql-hackers

On 1/11/12 9:25 AM, Andres Freund wrote:
> The heavy pressure of putting it directly in the writeback queue
> leads to less efficient I/O because quite often it won't reorder sensibly with
> other I/O anymore and the like. At least that was my experience using it
> in another application.

Sure, this is one of the things I was cautioning about in the Double
Writes thread, with VACUUM being the worst such case I've measured.

The thing to realize here is that the data we're talking about must be
flushed to disk in the near future. And Linux will happily cache
gigabytes of it. Right now, the database asks for that to be forced to
disk via fsync, which means in chunks that can be as large as a gigabyte.

Let's say we have a traditional storage array and there's competing
activity. 10MB/s would be a good random I/O write rate in that
situation. A single fsync that forces 1GB out at that rate will take
*100 seconds*. And I've seen exactly that when trying it--about 80
seconds is my current worst checkpoint stall ever.

And we don't have a latency vs. throughput knob any finer than that. If
one is added, and you turn it too far toward latency, throughput is
going to tank for the reasons you've also seen. Less reordering,
elevator sorting, and write combining. If the database isn't going to
micro-manage the writes, it needs to give the OS room to do that work
for it.

The most popular OS level approach to adjusting for this trade-off seems
to be "limit the cache size". That hasn't worked out very well when
I've tried it, again getting back to not having enough working room for
writes queued to reorganize them usefully. One theory I've considered
is that we might improve the VACUUM side of that using the same
auto-tuning approach that's been applied to two other areas now: scale
the maximum size of the ring buffers based on shared_buffers. I'm not
real confident in that idea though, because ultimately it won't change
the rate at which dirty buffers from VACUUM are evicted--and that's the
source of the bottleneck in that area.

There is one piece of information the database knows, but it isn't
communicating well to the OS yet. I could do a better job of advising
how to prioritize the writes that must happen soon--but not necessarily
right now. Yes, forcing them into write-back will be counterproductive
from a throughput perspective. The longer they sit at the "Dirty" cache
level above that, the better the odds they'll be done efficiently. But
this is the checkpoint process we're talking about here. It's going to
force the information to disk soon regardless. An intermediate step
pushing to write-back should give the OS a bit more room to move around
than fsync does, making the potential for a latency gain here seem quite
real. We'll see how the benchmarking goes.

--
Greg Smith 2ndQuadrant US greg(at)2ndQuadrant(dot)com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Greg Smith <greg(at)2ndquadrant(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers(at)postgresql(dot)org, Robert Haas <robertmhaas(at)gmail(dot)com>
Subject: Re: checkpoint writeback via sync_file_range
Date: 2012-01-13 17:08:51
Message-ID: CAMkU=1yRO-a9i-OoVYe8xj00J_LgO34DXPgnysqT3U3xGbnb_A@mail.gmail.com
Lists: pgsql-hackers

On Thu, Jan 12, 2012 at 7:26 PM, Greg Smith <greg(at)2ndquadrant(dot)com> wrote:
> On 1/11/12 9:25 AM, Andres Freund wrote:
>>
>> The heavy pressure of putting it directly in the writeback queue
>> leads to less efficient I/O because quite often it won't reorder sensibly
>> with other I/O anymore and the like. At least that was my experience
>> using it in another application.
>
>
> Sure, this is one of the things I was cautioning about in the Double Writes
> thread, with VACUUM being the worst such case I've measured.
>
> The thing to realize here is that the data we're talking about must be
> flushed to disk in the near future.  And Linux will happily cache gigabytes
> of it.  Right now, the database asks for that to be forced to disk via
> fsync, which means in chunks that can be as large as a gigabyte.
>
> Let's say we have a traditional storage array and there's competing
> activity.  10MB/s would be a good random I/O write rate in that situation.
>  A single fsync that forces 1GB out at that rate will take *100 seconds*.
>  And I've seen exactly that when trying it--about 80 seconds is my current
> worst checkpoint stall ever.
>
> And we don't have a latency vs. throughput knob any finer than that.  If one
> is added, and you turn it too far toward latency, throughput is going to
> tank for the reasons you've also seen.  Less reordering, elevator sorting,
> and write combining.  If the database isn't going to micro-manage the
> writes, it needs to give the OS room to do that work for it.

Are there any I/O benchmarking tools out there that benchmark the
effects of reordering, elevator sorting, write combining, etc.?

What I've seen is basically either "completely sequential" or
"completely random" with not much in between.

Cheers,

Jeff