ext4 finally doing the right thing

From: Greg Smith <greg(at)2ndquadrant(dot)com>
To: pgsql-performance(at)postgresql(dot)org
Subject: ext4 finally doing the right thing
Date: 2010-01-16 03:05:49
Message-ID: 4B512D0D.4030909@2ndquadrant.com
Lists: pgsql-performance

A few months ago the worst of the bugs in the ext4 fsync code started
clearing up, with
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=5f3481e9a80c240f169b36ea886e2325b9aeb745
as a particularly painful one. That made it into the 2.6.32 kernel
released last month. Some interesting benchmark news today suggests a
version of ext4 that might actually work for databases is showing up in
early packaged distributions:

http://www.phoronix.com/scan.php?page=article&item=ubuntu_lucid_alpha2&num=3

Along with that comes the massive performance drop that results from a
working fsync. See
http://www.phoronix.com/scan.php?page=article&item=linux_perf_regressions&num=2
for background about this topic from when the issue was discovered:

"[This change] is required for safe behavior with volatile write caches
on drives. You could mount with -o nobarrier and [the performance drop]
would go away, but a sequence like write->fsync->lose power->reboot may
well find your file without the data that you synced, if the drive had
write caches enabled. If you know you have no write cache, or that it
is safely battery backed, then you can mount with -o nobarrier, and not
incur this penalty."
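Whether a given mount is running in that unsafe-but-fast mode can be read off its option string in /proc/mounts; a small helper sketch (option names as accepted by ext3/ext4):

```python
def barriers_disabled(mount_opts):
    """True if an ext3/ext4 mount-option string disables write barriers,
    i.e. the unsafe-with-volatile-caches mode described above."""
    opts = mount_opts.split(",")
    return "nobarrier" in opts or "barrier=0" in opts

print(barriers_disabled("rw,relatime,barrier=1,data=ordered"))  # False
print(barriers_disabled("rw,noatime,nobarrier"))                # True
```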

The pgbench TPS figure Phoronix has been reporting has always been a
fictitious one resulting from unsafe write caching. With 2.6.32
released with ext4 defaulting to proper behavior on fsync, that's going
to make for a very interesting change. On one side, we might finally be
able to use regular drives with their caches turned on safely, taking
advantage of the cache for other writes while doing the right thing with
the database writes. On the other, anyone who believed the fictitious
numbers before is in for a rude surprise and will think there's a
massive regression here. There's some potential for this to show
PostgreSQL in a bad light, when people discover they really only can get
~100 commits/second out of cheap hard drives and assume the database is
to blame. Interesting times.
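The ~100 commits/second ceiling falls straight out of drive mechanics; a rough back-of-the-envelope sketch, assuming each synchronous commit has to wait about one full platter rotation:

```python
# Upper bound on fsync'd commits/second for a single commodity drive
# with honest write caching: roughly one commit per platter rotation.

def max_commit_rate(rpm):
    """Rotations per second ~= ceiling on synchronous commits/second."""
    return rpm / 60.0

print(max_commit_rate(7200))   # common desktop drive -> 120.0
print(max_commit_rate(5400))   # laptop drive -> 90.0
```

Which is why "~100 commits/second out of cheap hard drives" is about the honest best case, not a database problem.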

--
Greg Smith 2ndQuadrant Baltimore, MD
PostgreSQL Training, Services and Support
greg(at)2ndQuadrant(dot)com www.2ndQuadrant.com


From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Greg Smith <greg(at)2ndquadrant(dot)com>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: ext4 finally doing the right thing
Date: 2010-01-19 23:28:34
Message-ID: 1263943714.13109.24.camel@monkey-cat.sm.truviso.com
Lists: pgsql-performance

On Fri, 2010-01-15 at 22:05 -0500, Greg Smith wrote:
> A few months ago the worst of the bugs in the ext4 fsync code started
> clearing up, with
> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=5f3481e9a80c240f169b36ea886e2325b9aeb745
> as a particularly painful one.

Wow, thanks for the heads-up!

> On one side, we might finally be
> able to use regular drives with their caches turned on safely, taking
> advantage of the cache for other writes while doing the right thing with
> the database writes.

That could be good news. What's your opinion on the practical
performance impact? If it doesn't need to be fsync'd, the kernel
probably shouldn't have written it to the disk yet anyway, right (I'm
assuming here that the OS buffer cache is much larger than the disk
write cache)?

Regards,
Jeff Davis


From: Greg Smith <greg(at)2ndquadrant(dot)com>
To: Jeff Davis <pgsql(at)j-davis(dot)com>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: ext4 finally doing the right thing
Date: 2010-01-20 21:18:48
Message-ID: 4B577338.4090608@2ndquadrant.com
Lists: pgsql-performance

Jeff Davis wrote:
>
>> On one side, we might finally be
>> able to use regular drives with their caches turned on safely, taking
>> advantage of the cache for other writes while doing the right thing with
>> the database writes.
>>
>
> That could be good news. What's your opinion on the practical
> performance impact? If it doesn't need to be fsync'd, the kernel
> probably shouldn't have written it to the disk yet anyway, right (I'm
> assuming here that the OS buffer cache is much larger than the disk
> write cache)?
>

I know they just tweaked this area recently so this may be a bit out of
date, but kernels starting with 2.6.22 allow you to get up to 10% of
memory dirty before getting really aggressive about writing things out,
with writes starting to go heavily at 5%. So even with a 1GB server,
you could easily find 100MB of data sitting in the kernel buffer cache
ahead of a database write that needs to hit disc. Once you start
considering the case with modern hardware, where even my desktop has 8GB
of RAM and most serious servers I see have 32GB, you can easily have
gigabytes of such data queued in front of the write that now needs to
hit the platter.
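The figures above can be sketched out directly, assuming the 5%/10% defaults (vm.dirty_background_ratio and vm.dirty_ratio in the 2.6.22-era kernels being discussed):

```python
# How much dirty data the kernel will let accumulate before forcing
# writeback, at the 2.6.22-era defaults discussed above:
#   vm.dirty_background_ratio = 5   (background writeback kicks in)
#   vm.dirty_ratio            = 10  (writers themselves get throttled)

def dirty_limits(ram_gb, background_ratio=5, hard_ratio=10):
    """Return (background, hard) dirty-data limits in MB."""
    ram_mb = ram_gb * 1024
    return (ram_mb * background_ratio // 100, ram_mb * hard_ratio // 100)

print(dirty_limits(1))    # 1GB server  -> (51, 102): the ~100MB above
print(dirty_limits(32))   # 32GB server -> (1638, 3276): gigabytes queued
```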

The dream is that a proper barrier implementation will then shuffle your
important write to the front of that queue, without waiting for
everything else to clear first. The exact performance impact depends on
how many non-database writes happen. But even on a dedicated database
disk, it should still help because there are plenty of non-sync'd writes
coming out the background writer via its routine work and the checkpoint
writes. And the ability to fully utilize the write cache on the
individual drives, on commodity hardware, without risking database
corruption would make life a lot easier.

--
Greg Smith 2ndQuadrant Baltimore, MD
PostgreSQL Training, Services and Support
greg(at)2ndQuadrant(dot)com www.2ndQuadrant.com


From: Greg Stark <stark(at)mit(dot)edu>
To: Greg Smith <greg(at)2ndquadrant(dot)com>
Cc: pgsql-performance(at)postgresql(dot)org, Jeff Davis <pgsql(at)j-davis(dot)com>
Subject: Re: ext4 finally doing the right thing
Date: 2010-01-21 05:15:40
Message-ID: 407d949e1001202115k72e98b8eg9b6aebc127319328@mail.gmail.com
Lists: pgsql-performance

That doesn't sound right. The kernel having 10% of memory dirty doesn't mean
there's a queue you have to jump at all. You don't get into any queue until
the kernel initiates write-out which will be based on the usage counters --
basically a lru. fsync and cousins like sync_file_range and
posix_fadvise(DONT_NEED) initiate write-out right away.

How many pending write-out requests for how much data the kernel should keep
active is another question but I imagine it has more to do with storage
hardware than how much memory your system has. And for most hardware it's
probably on the order of megabytes or less.
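The calls named above look like this from user space; a minimal sketch, assuming Linux (os.posix_fadvise is only available there and on other POSIX systems):

```python
import os
import tempfile

# Dirty data sits in the page cache until write-out is initiated.
# fsync() forces it out synchronously; posix_fadvise(DONTNEED) hints
# that the kernel should write the pages back and drop them.
fd, path = tempfile.mkstemp()
try:
    os.write(fd, b"commit record\n")   # dirty page in the OS cache only
    os.fsync(fd)                       # initiate write-out, wait for it
    # A hint (not a guarantee) that write-back can start and pages
    # can be evicted:
    os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
finally:
    os.close(fd)
    os.unlink(path)
```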

greg



From: Greg Smith <greg(at)2ndquadrant(dot)com>
To: Greg Stark <stark(at)mit(dot)edu>
Cc: pgsql-performance(at)postgresql(dot)org, Jeff Davis <pgsql(at)j-davis(dot)com>
Subject: Re: ext4 finally doing the right thing
Date: 2010-01-21 05:58:13
Message-ID: 4B57ECF5.7050502@2ndquadrant.com
Lists: pgsql-performance

Greg Stark wrote:
>
> That doesn't sound right. The kernel having 10% of memory dirty
> doesn't mean there's a queue you have to jump at all. You don't get
> into any queue until the kernel initiates write-out which will be
> based on the usage counters -- basically a lru. fsync and cousins like
> sync_file_range and posix_fadvise(DONT_NEED) initiate write-out
> right away.
>

Most safe ways ext3 knows how to initiate a write-out on something that
must go (because it's gotten an fsync on data there) require flushing
every outstanding write to that filesystem along with it. So as soon as
a single WAL write shows up, bam! The whole cache is emptied (or at
least everything associated with that filesystem), and the caller who
asked for that little write is stuck waiting for everything to clear
before their fsync returns success.
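One way to observe this behavior is to dirty a large unrelated file and then time the fsync of a tiny one on the same filesystem; a sketch (paths are throwaway stand-ins, and on a well-behaved filesystem the small fsync should stay fast):

```python
import os
import tempfile
import time

def timed_fsync(path, data=b"x" * 512):
    """Write a tiny record and time how long its fsync takes."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
    try:
        os.write(fd, data)
        start = time.monotonic()
        os.fsync(fd)   # on the ext3 behavior described above, this can
                       # drag all the filesystem's dirty data out with it
        elapsed = time.monotonic() - start
    finally:
        os.close(fd)
    return elapsed

# Dirty a large unrelated file first, then fsync a tiny one:
with tempfile.TemporaryDirectory() as d:
    with open(os.path.join(d, "big"), "wb") as big:
        big.write(b"\0" * (64 * 1024 * 1024))   # 64MB of dirty pages
    t = timed_fsync(os.path.join(d, "small"))
    print("small fsync took %.3f s" % t)
```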

This particular issue absolutely killed Firefox when they switched to
using SQLite not too long ago; high-level discussion at
http://shaver.off.net/diary/2008/05/25/fsyncers-and-curveballs/ and
confirmation/discussion of the issue on lkml at
https://kerneltrap.org/mailarchive/linux-fsdevel/2008/5/26/1941354 .

Note the comment from the first article saying "those delays can be 30
seconds or more". On multiple occasions, I've measured systems with
dozens of disks in a high-performance RAID1+0 with battery-backed
controller that could grind to a halt for 10, 20, or more seconds in
this situation, when running pgbench on a big database. As was the case
on the latest one I saw, if you've got 32GB of RAM and have let 3.2GB of
random I/O from background writer/checkpoint writes back up because
Linux has been lazy about getting to them, that takes a while to clear
no matter how good the underlying hardware.

Write barriers were supposed to improve all this when added to ext3, but
they just never seemed to work right for many people. After reading
that lkml thread, among others, I know I was left not trusting anything
beyond the simplest path through this area of the filesystem. Slow is
better than corrupted.

So the good news I was relaying is that it looks like this finally works
on ext4, giving it the behavior you described and expected, but that
behavior hasn't actually been there until now. I was hoping someone with more free
time than me might be interested to go investigate further if I pointed
the advance out. I'm stuck with too many production systems to play
with new kernels at the moment, but am quite curious.

--
Greg Smith 2ndQuadrant Baltimore, MD
PostgreSQL Training, Services and Support
greg(at)2ndQuadrant(dot)com www.2ndQuadrant.com


From: Greg Stark <stark(at)mit(dot)edu>
To: Greg Smith <greg(at)2ndquadrant(dot)com>
Cc: pgsql-performance(at)postgresql(dot)org, Greg Stark <stark(at)mit(dot)edu>, Jeff Davis <pgsql(at)j-davis(dot)com>
Subject: Re: ext4 finally doing the right thing
Date: 2010-01-21 11:13:42
Message-ID: 407d949e1001210313w1668d7e2jaee3b4d7984a059@mail.gmail.com
Lists: pgsql-performance

Both of those refer to the *drive* cache.

greg



From: Aidan Van Dyk <aidan(at)highrise(dot)ca>
To: Greg Smith <greg(at)2ndquadrant(dot)com>
Cc: Greg Stark <stark(at)mit(dot)edu>, pgsql-performance(at)postgresql(dot)org, Jeff Davis <pgsql(at)j-davis(dot)com>
Subject: Re: ext4 finally doing the right thing
Date: 2010-01-21 13:51:29
Message-ID: 20100121135129.GQ18076@oak.highrise.ca
Lists: pgsql-performance

* Greg Smith <greg(at)2ndquadrant(dot)com> [100121 00:58]:
> Greg Stark wrote:
>>
>> That doesn't sound right. The kernel having 10% of memory dirty
>> doesn't mean there's a queue you have to jump at all. You don't get
>> into any queue until the kernel initiates write-out which will be
>> based on the usage counters -- basically a lru. fsync and cousins like
>> sync_file_range and posix_fadvise(DONT_NEED) initiate write-out
>> right away.
>>
>
> Most safe ways ext3 knows how to initiate a write-out on something that
> must go (because it's gotten an fsync on data there) requires flushing
> every outstanding write to that filesystem along with it. So as soon as
> a single WAL write shows up, bam! The whole cache is emptied (or at
> least everything associated with that filesystem), and the caller who
> asked for that little write is stuck waiting for everything to clear
> before their fsync returns success.

Sure, if your WAL is on the same FS as your data, you're going to get
hit, and *especially* on ext3...

But, I think that's one of the reasons people usually recommend putting
WAL separate. Even if it's just another partition on the same (set of)
disk(s), you get the benefit of not having to wait for all the dirty
ext3 pages from your whole database FS to be flushed before the WAL write
can complete on its own FS.
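Relocating the WAL that way is usually done by moving the pg_xlog directory onto the other filesystem and leaving a symlink behind, with the server stopped; a sketch using throwaway directories standing in for the real data directory and WAL mount point:

```python
import os
import shutil
import tempfile

def relocate_pg_xlog(data_dir, wal_mount):
    """Move pg_xlog onto another filesystem and symlink it back.
    Only safe with the PostgreSQL server stopped."""
    src = os.path.join(data_dir, "pg_xlog")
    dst = os.path.join(wal_mount, "pg_xlog")
    shutil.move(src, dst)
    os.symlink(dst, src)

# Demonstration with stand-in directories (not a real cluster):
base = tempfile.mkdtemp()
data_dir = os.path.join(base, "data")
os.makedirs(os.path.join(data_dir, "pg_xlog"))
wal_mount = os.path.join(base, "wal_disk")
os.makedirs(wal_mount)
relocate_pg_xlog(data_dir, wal_mount)
print(os.path.islink(os.path.join(data_dir, "pg_xlog")))   # True
```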

a.

--
Aidan Van Dyk Create like a god,
aidan(at)highrise(dot)ca command like a king,
http://www.highrise.ca/ work like a slave.


From: Florian Weimer <fweimer(at)bfk(dot)de>
To: Greg Smith <greg(at)2ndquadrant(dot)com>
Cc: Greg Stark <stark(at)mit(dot)edu>, pgsql-performance(at)postgresql(dot)org, Jeff Davis <pgsql(at)j-davis(dot)com>
Subject: Re: ext4 finally doing the right thing
Date: 2010-01-21 14:04:25
Message-ID: 824omflfkm.fsf@mid.bfk.de
Lists: pgsql-performance

* Greg Smith:

> Note the comment from the first article saying "those delays can be 30
> seconds or more". On multiple occasions, I've measured systems with
> dozens of disks in a high-performance RAID1+0 with battery-backed
> controller that could grind to a halt for 10, 20, or more seconds in
> this situation, when running pgbench on a big database.

We see that quite a bit, too (we're still on ext3, mostly 2.6.26ish
kernels). It seems that the most egregious issues (which even trigger
the two minute kernel hangcheck timer) are related to CFQ. We don't
see it on systems we have switched to the deadline I/O scheduler. But
data on this is a bit sketchy.
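The active scheduler for a device is the bracketed entry in /sys/block/<dev>/queue/scheduler; a small helper to read that format (the device path is a hypothetical example):

```python
def active_scheduler(contents):
    """Parse the /sys/block/<dev>/queue/scheduler file format, where
    the active scheduler is bracketed, e.g. 'noop [deadline] cfq'."""
    for token in contents.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    return None

print(active_scheduler("noop anticipatory deadline [cfq]"))   # cfq
# Switching (as root) is then just:
#   echo deadline > /sys/block/sda/queue/scheduler
```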

--
Florian Weimer <fweimer(at)bfk(dot)de>
BFK edv-consulting GmbH http://www.bfk.de/
Kriegsstraße 100 tel: +49-721-96201-1
D-76133 Karlsruhe fax: +49-721-96201-99


From: Greg Smith <greg(at)2ndquadrant(dot)com>
To: Aidan Van Dyk <aidan(at)highrise(dot)ca>
Cc: Greg Stark <stark(at)mit(dot)edu>, pgsql-performance(at)postgresql(dot)org, Jeff Davis <pgsql(at)j-davis(dot)com>
Subject: Re: ext4 finally doing the right thing
Date: 2010-01-21 14:49:05
Message-ID: 4B586961.20401@2ndquadrant.com
Lists: pgsql-performance

Aidan Van Dyk wrote:
> Sure, if your WAL is on the same FS as your data, you're going to get
> hit, and *especially* on ext3...
>
> But, I think that's one of the reasons people usually recommend putting
> WAL separate.

Separate disks can actually concentrate the problem. The writes to the
data disk by checkpoints will also have fsync behind them eventually, so
splitting out the WAL means you just push the big write backlog to a
later point. So the performance dives come less frequently, but are
sometimes bigger. All of the systems I was mentioning seeing >10 second pauses on
had a RAID-1 pair of WAL disks split from the main array.

--
Greg Smith 2ndQuadrant Baltimore, MD
PostgreSQL Training, Services and Support
greg(at)2ndQuadrant(dot)com www.2ndQuadrant.com


From: Aidan Van Dyk <aidan(at)highrise(dot)ca>
To: Greg Smith <greg(at)2ndquadrant(dot)com>
Cc: Greg Stark <stark(at)mit(dot)edu>, pgsql-performance(at)postgresql(dot)org, Jeff Davis <pgsql(at)j-davis(dot)com>
Subject: Re: ext4 finally doing the right thing
Date: 2010-01-21 15:05:10
Message-ID: 20100121150510.GB19549@oak.highrise.ca
Lists: pgsql-performance

* Greg Smith <greg(at)2ndquadrant(dot)com> [100121 09:49]:
> Aidan Van Dyk wrote:
>> Sure, if your WAL is on the same FS as your data, you're going to get
>> hit, and *especially* on ext3...
>>
>> But, I think that's one of the reasons people usually recommend putting
>> WAL separate.
>
> Separate disks can actually concentrate the problem. The writes to the
> data disk by checkpoints will also have fsync behind them eventually, so
> splitting out the WAL means you just push the big write backlog to a
> later point. So less frequently performance dives, but sometimes
> bigger. All of the systems I was mentioning seeing >10 second pauses on
> had a RAID-1 pair of WAL disks split from the main array.

That's right, so with the WAL split off on its own disk, you don't wait
on "WAL" for your checkpoint/data syncs, but you can build up a huge
wait in the queue for main data (which can even block reads).

Having WAL on the main disk means that (for most ext3), you sometimes
have WAL writes taking longer, but the WAL fsyncs are keeping the
backlog "down" in the main data area too.

Now, with ext4 moving to full barrier/fsync support, we could get to the
point where WAL in the main data FS can mimic the state where WAL is
separate, namely that WAL writes can "jump the queue" and be written
without waiting for the data pages to be flushed down to disk, but also
that you'll get the big backlog of data pages to flush when
the first fsyncs on big data files start coming from checkpoints...

a.
--
Aidan Van Dyk Create like a god,
aidan(at)highrise(dot)ca command like a king,
http://www.highrise.ca/ work like a slave.


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "Greg Smith" <greg(at)2ndquadrant(dot)com>, "Aidan Van Dyk" <aidan(at)highrise(dot)ca>
Cc: "Jeff Davis" <pgsql(at)j-davis(dot)com>,"Greg Stark" <stark(at)mit(dot)edu>, <pgsql-performance(at)postgresql(dot)org>
Subject: Re: ext4 finally doing the right thing
Date: 2010-01-21 15:54:26
Message-ID: 4B582452020000250002E95B@gw.wicourts.gov
Lists: pgsql-performance

>Aidan Van Dyk <aidan(at)highrise(dot)ca> wrote:
> But, I think that's one of the reasons people usually recommend
> putting WAL separate. Even if it's just another partition on the
> same (set of) disk(s), you get the benefit of not having to wait
> for all the dirty ext3 pages from your whole database FS to be
> flushed before the WAL write can complete on its own FS.

[slaps forehead]

I've been puzzling about why we're getting timeouts on one of two
apparently identical (large) servers. We forgot to move the pg_xlog
directory to the separate mount point we created for it on the same
RAID. I didn't think to check that until I saw your post.

-Kevin


From: Pierre Frédéric Caillaud <lists(at)peufeu(dot)com>
To: pgsql-performance(at)postgresql(dot)org
Subject: Re: ext4 finally doing the right thing
Date: 2010-01-21 16:36:47
Message-ID: op.u6v5rlilcke6l8@soyouz
Lists: pgsql-performance


> Now, with ext4 moving to full barrier/fsync support, we could get to the
> point where WAL in the main data FS can mimic the state where WAL is
> separate, namely that WAL writes can "jump the queue" and be written
> without waiting for the data pages to be flushed down to disk, but also
> that you'll get the big backlog of data pages to flush when
> the first fsyncs on big data files start coming from checkpoints...

Does postgres write something to the logfile whenever an fsync() takes
a suspiciously long amount of time?


From: Greg Smith <greg(at)2ndquadrant(dot)com>
To: Pierre Frédéric Caillaud <lists(at)peufeu(dot)com>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: ext4 finally doing the right thing
Date: 2010-01-22 00:13:01
Message-ID: 4B58ED8D.2040804@2ndquadrant.com
Lists: pgsql-performance

Pierre Frédéric Caillaud wrote:
>
> Does postgres write something to the logfile whenever a fsync()
> takes a suspiciously long amount of time ?

Not specifically. If you're logging statements that take a while, you
can see this indirectly, as commits that just take much longer than usual.

If you turn on log_checkpoints, the "sync time" is broken out for you;
problems in this area can show up there too.
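Pulling those sync times back out of the logs is a quick regex job; a sketch, with the sample line hedged against the 8.3-era log_checkpoints format (write=/sync=/total= fields at the end):

```python
import re

# Matches the sync field of a log_checkpoints line, e.g.
# "... write=215.3 s, sync=12.4 s, total=230.1 s"
SYNC_RE = re.compile(r"sync=([\d.]+) s")

def slow_syncs(log_lines, threshold_s=5.0):
    """Return the checkpoint sync durations exceeding threshold_s."""
    times = []
    for line in log_lines:
        m = SYNC_RE.search(line)
        if m and float(m.group(1)) > threshold_s:
            times.append(float(m.group(1)))
    return times

sample = ["checkpoint complete: wrote 3410 buffers (10.4%); "
          "write=215.3 s, sync=12.4 s, total=230.1 s"]
print(slow_syncs(sample))   # [12.4]
```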

--
Greg Smith 2ndQuadrant Baltimore, MD
PostgreSQL Training, Services and Support
greg(at)2ndQuadrant(dot)com www.2ndQuadrant.com