New wal_sync_method for Darwin?

Lists: pgsql-hackers
From: Peter Bierman <bierman(at)apple(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Fwd: Apple Darwin disabled fsync?
Date: 2005-02-20 02:43:14
Message-ID: a06010200be3da9564694@[17.202.21.231]

>Date: Sat, 19 Feb 2005 17:59:21 -0800
>From: Dominic Giampaolo <dbg(at)apple(dot)com>
>Subject: Re: bad fsync? (A.M.)
>To: darwin-dev(at)lists(dot)apple(dot)com
>
>>MySQL makes the following claim at:
>>http://dev.mysql.com/doc/mysql/en/news-4-1-9.html
>>
>>"InnoDB: Use the fcntl() file flush method on Mac OS X versions 10.3
>>and up. Apple had disabled fsync() in Mac OS X for internal disk
>>drives, which caused corruption at power outages."
>>
>>First of all, is this accurate? A pointer to some docs or a tech note
>>on this would be helpful.
>>
>The comments about fsync() are wrong...
>
>On MacOS X, fsync() always has and always will flush all file data
>from host memory to the drive on which the file resides. The behavior
>of fsync() on MacOS X is the same as it is on every other version of
>Unix since the dawn of time (well, since the introduction of fsync
>anyway :-).
>
>I believe that what the above comment refers to is the fact that
>fsync() is not sufficient to guarantee that your data is on stable
>storage and on MacOS X we provide a fcntl(), called F_FULLFSYNC,
>to ask the drive to flush all buffered data to stable storage.
>
>Let me explain in more detail. With fsync() even though the OS
>writes the data through to the disk and the disk says "yes I wrote
>the data", the data is not actually on permanent storage. Unless
>you explicitly disable it, all disks have a write buffer which holds
>data you've written. The disk buffers the data you wrote until it
>decides to flush it to the platters (and the writes may not be in
>the order you wrote them). If you lose power or the system crashes
>before the data is written, you can wind up in a situation where only
>some of your data is actually on disk. What is worse is that even if
>you write blocks A, B and C, call fsync() and then write block D you
>may find after rebooting that blocks A and D are on disk but B and C
>are not (in fact any ordering of A, B, C, and D is possible).
>
>While this may seem like a rare case it is not. In fact if you sit
>down and pull the plug on a system you can make it happen in one or
>two plug pulls. I have even gone so far as to watch this behavior
>with a logic analyzer on the ATA bus: I saw the data for two writes
>come across the ATA cable, the drive replied and said the writes were
>successful and then when we rebooted the data from the second write
>was correct on disk but the data from the first write was not.
>
>To deal with this we introduced the F_FULLFSYNC fcntl which will ask
>the drive to flush all of its buffered data to disk. When an app
>needs to guarantee that data is on disk it should use F_FULLFSYNC.
>In most cases you do not need such a heavy handed operation and
>fsync() is good enough. But in an app like a database, it is
>essential if you want transactional integrity.
>
>Now, a little bit more detail: on ATA drives we implement F_FULLFSYNC
>with the FLUSH_TRACK_CACHE command. All drives sold by Apple will
>honor this command. Unfortunately quite a few firewire drive vendors
>disable this command and do not pass it to the drive. This means that
>most external firewire drives are not reliable if you lose power or
>the system crashes. We can't work around that unless we ask the drive
>to disable the write cache completely (which hurts performance quite
>badly -- and even that may not be enough as some drives will ignore
>that request too).
>
>So in summary, I believe that the comments in the MySQL news posting
>are slightly confused. On MacOS X fsync() behaves the same as it does
>on all Unices. That's not good enough if you really care about data
>integrity and so we also provide the F_FULLFSYNC fcntl. As far as I
>know, MacOS X is the only OS to provide this feature for apps that
>need to truly guarantee their data is on disk.
>
>Hope this clears things up.
>
>--dominic


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Peter Bierman <bierman(at)apple(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Fwd: Apple Darwin disabled fsync?
Date: 2005-02-20 05:38:49
Message-ID: 11497.1108877929@sss.pgh.pa.us

Peter Bierman <bierman(at)apple(dot)com> writes:
>> I believe that what the above comment refers to is the fact that
>> fsync() is not sufficient to guarantee that your data is on stable
>> storage and on MacOS X we provide a fcntl(), called F_FULLFSYNC,
>> to ask the drive to flush all buffered data to stable storage.

I've been looking for documentation on this without a lot of luck
("man fcntl" on OS X 10.3.8 has certainly never heard of it).
It's not completely clear whether this subsumes fsync() or whether
you're supposed to fsync() and then use the fcntl.

Also, isn't it fundamentally at the wrong level? One would suppose that
the drive flush operation is going to affect everything the drive
currently has queued, not just the one file. That makes it difficult
if not impossible to use efficiently.

regards, tom lane


From: Greg Stark <gsstark(at)mit(dot)edu>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Fwd: Apple Darwin disabled fsync?
Date: 2005-02-20 17:42:01
Message-ID: 87zmxzuo0m.fsf@stark.xeocode.com


Peter Bierman <bierman(at)apple(dot)com> writes:

> > In most cases you do not need such a heavy handed operation and fsync() is
> > good enough.

Really? Can you think of a single application for which this definition of
fsync is useful?

Kernel buffers are transparent to the application, just as the disk buffer is.
It doesn't matter to an application whether the data is sitting in a kernel
buffer, or a buffer in the disk, it's equivalent. If fsync doesn't guarantee
the writes actually end up on non-volatile disk then as far as the application
is concerned it's just an expensive noop.

--
greg


From: Peter Bierman <bierman(at)apple(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Fwd: Apple Darwin disabled fsync?
Date: 2005-02-21 02:12:25
Message-ID: a06010200be3eebfde545@[17.202.21.231]

At 12:38 AM -0500 2/20/05, Tom Lane wrote:
>Dominic Giampaolo <dbg(at)apple(dot)com> writes:
>>> I believe that what the above comment refers to is the fact that
>>> fsync() is not sufficient to guarantee that your data is on stable
>>> storage and on MacOS X we provide a fcntl(), called F_FULLFSYNC,
>>> to ask the drive to flush all buffered data to stable storage.
>
>I've been looking for documentation on this without a lot of luck
>("man fcntl" on OS X 10.3.8 has certainly never heard of it).
>It's not completely clear whether this subsumes fsync() or whether
>you're supposed to fsync() and then use the fcntl.

My understanding is that you're supposed to fsync() and then use the
fcntl, but I'm not the filesystems expert. (Dominic, who wrote the
original message that I forwarded, is.)

I've filed a bug report asking for better documentation about this to
be placed in the fsync man page. <radar://4012378>

>Also, isn't it fundamentally at the wrong level? One would suppose that
>the drive flush operation is going to affect everything the drive
>currently has queued, not just the one file. That makes it difficult
>if not impossible to use efficiently.

I think the intent is to make the fcntl more accurate in time, as the
ability to do so appears in hardware.

One of the advantages Apple has is the ability to set very specific
requirements for our hardware. So if a block specific flush command
becomes part of the ATA spec, Apple can require vendors to support
it, and support it correctly, before using those drives.

On the other hand, as Dominic described, once the hardware is
external (like a firewire enclosure), we lose that leverage.

At 12:42 PM -0500 2/20/05, Greg Stark wrote:
>Dominic Giampaolo <dbg(at)apple(dot)com> writes:
>
>> > In most cases you do not need such a heavy handed operation and fsync() is
>> > good enough.
>
>Really? Can you think of a single application for which this definition of
>fsync is useful?
>
>Kernel buffers are transparent to the application, just as the disk buffer is.
>It doesn't matter to an application whether the data is sitting in a kernel
>buffer, or a buffer in the disk, it's equivalent. If fsync doesn't guarantee
>the writes actually end up on non-volatile disk then as far as the application
>is concerned it's just an expensive noop.

I think the intent of fsync() is closer to what you describe, but the
convention is that fsync() hands responsibility to the disk hardware.
That's how every other Unix seems to handle fsync() too. This gives
you good performance, and if you combine a smart fsync()ing
application with reliable storage hardware (like an XServe RAID that
battery backs its own write caches), you get the best combination.

If you know you have unreliable hardware, and critical reliability
issues, then you can use the fcntl, which seems to be more control
than other OSes give.

-pmb


From: Greg Stark <gsstark(at)mit(dot)edu>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Fwd: Apple Darwin disabled fsync?
Date: 2005-02-21 03:50:35
Message-ID: 87ll9ivaes.fsf@stark.xeocode.com


Peter Bierman <bierman(at)apple(dot)com> writes:

> I think the intent of fsync() is closer to what you describe, but the
> convention is that fsync() hands responsibility to the disk hardware.

The "convention" was also that the hardware didn't confirm the command until
it had actually been executed...

None of this matters to the application. A specification for fsync(2) that
says it forces the data to be shuffled around under the hood but fundamentally
doesn't change the semantics (that the data isn't guaranteed to be in
non-volatile storage) means that fsync didn't really do anything.

--
greg


From: "Jim C(dot) Nasby" <decibel(at)decibel(dot)org>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Fwd: Apple Darwin disabled fsync?
Date: 2005-02-22 05:37:41
Message-ID: 20050222053741.GL86914@decibel.org

On Sun, Feb 20, 2005 at 10:50:35PM -0500, Greg Stark wrote:
>
> Peter Bierman <bierman(at)apple(dot)com> writes:
>
> > I think the intent of fsync() is closer to what you describe, but the
> > convention is that fsync() hands responsibility to the disk hardware.
>
> The "convention" was also that the hardware didn't confirm the command until
> it had actually been executed...
>
> None of this matters to the application. A specification for fsync(2) that
> says it forces the data to be shuffled around under the hood but fundamentally
> doesn't change the semantics (that the data isn't guaranteed to be in
> non-volatile storage) means that fsync didn't really do anything.

The real issue is this isn't specific to OS X. I know FreeBSD enables
write-caching on IDE drives by default, and I suspect linux does as
well. It's probably worth adding a big, fat WARNING in the docs in
strategic places about this.
--
Jim C. Nasby, Database Consultant decibel(at)decibel(dot)org
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"


From: Chris Campbell <chris(at)bignerdranch(dot)com>
To: pg_hackers(at)postgresql(dot)org
Subject: New wal_sync_method for Darwin?
Date: 2005-04-14 20:28:50
Message-ID: 7988ba02eb01d8532576baa979a52be4@bignerdranch.com

I think we should add a new wal_sync_method that will use Darwin's
F_FULLFSYNC fcntl().

From <sys/fcntl.h>:

#define F_FULLFSYNC 51 /* fsync + ask the drive to flush to the media */

This fcntl() will basically perform an fsync() on the file, then flush
the write cache of the disk.
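
A minimal sketch of what such a call could look like, assuming the
request is issued as fcntl(fd, F_FULLFSYNC, 0) and that platforms
without the constant simply fall back to fsync(). The helper name
full_fsync() is hypothetical and is not taken from the PostgreSQL
source:

```c
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper, for illustration only.
 *
 * Where F_FULLFSYNC is defined (Darwin), ask the kernel to fsync the file
 * and then have the drive flush its write cache.  Elsewhere, fall back to
 * a plain fsync().  Returns 0 on success, -1 on failure (errno is set by
 * the underlying call).
 */
static int
full_fsync(int fd)
{
#ifdef F_FULLFSYNC
	/* fcntl() returns -1 on error; normalize any other value to 0 */
	return (fcntl(fd, F_FULLFSYNC, 0) == -1) ? -1 : 0;
#else
	/* Assumption: with no F_FULLFSYNC, fsync() is the best available */
	return fsync(fd);
#endif
}
```

The #ifdef doubles as the compile-time check: on a platform whose
headers do not define F_FULLFSYNC, the fcntl() branch is never compiled.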

I'll attempt to work up the patch. It should be trivial. Might need
some help on the configure tests though (it should #include
<sys/fcntl.h> and make sure F_FULLFSYNC is defined).

What's an appropriate name? It seems equivalent to
"fsync_writethrough". I suggest "fsync_full", "fsync_flushdisk", or
something. Is there a reason we're not indicating the supported
platform in the name of the method? Would "fsync_darwinfull" be better?
Let users know that it's only available for Darwin? Should we do the
same thing with win32-specific methods?

I think fsync() and F_FULLFSYNC should both be available as
options on Darwin. Currently in the code, "fsync" and
"fsync_writethrough" set sync_method to SYNC_METHOD_FSYNC, so there's
no way to distinguish between them.
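
To illustrate the distinction, here is a rough sketch in which each
setting maps to its own value, so the code issuing the sync can tell
them apart. All identifiers (DemoSyncMethod, demo_parse_sync_method,
demo_issue_sync) are made up for this example, and "fsync_full" is just
one of the candidate names; none of this is the existing PostgreSQL code:

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical enum: one distinct value per user-visible setting. */
typedef enum
{
	DEMO_SYNC_FSYNC,			/* plain fsync() */
	DEMO_SYNC_FSYNC_FULL,		/* fcntl(F_FULLFSYNC), Darwin only */
	DEMO_SYNC_UNKNOWN			/* unrecognized or unsupported name */
} DemoSyncMethod;

/* Map a setting string to a method; unsupported names are rejected. */
static DemoSyncMethod
demo_parse_sync_method(const char *name)
{
	if (strcmp(name, "fsync") == 0)
		return DEMO_SYNC_FSYNC;
#ifdef F_FULLFSYNC
	/* Only offered where the headers define F_FULLFSYNC */
	if (strcmp(name, "fsync_full") == 0)
		return DEMO_SYNC_FSYNC_FULL;
#endif
	return DEMO_SYNC_UNKNOWN;
}

/* Issue the sync for the chosen method; 0 on success, -1 on failure. */
static int
demo_issue_sync(DemoSyncMethod method, int fd)
{
	switch (method)
	{
		case DEMO_SYNC_FSYNC:
			return fsync(fd);
		case DEMO_SYNC_FSYNC_FULL:
#ifdef F_FULLFSYNC
			return (fcntl(fd, F_FULLFSYNC, 0) == -1) ? -1 : 0;
#else
			break;				/* not available on this platform */
#endif
		default:
			break;
	}
	return -1;					/* unknown or unsupported method */
}
```

With separate values, "fsync" keeps the semantics it has on other
platforms and the full flush remains an explicit opt-in.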

Unsure which one would be the best default. fsync() matches the
semantics on other platforms. And conscientious users could specify the
F_FULLFSYNC fcntl() method if they want to make sure the data gets past
the drive's write cache.

Comments?

Thanks!

- Chris