Re: Final(?) proposal for wal_sync_method changes

Lists: pgsql-hackers
From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql-hackers(at)postgreSQL(dot)org
Subject: Final(?) proposal for wal_sync_method changes
Date: 2010-12-07 16:24:14
Message-ID: 23293.1291739054@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

After reviewing the two ongoing threads about fixing the wal_sync_method
fiasco, I think there is general agreement on these two points:

1. open_datasync shouldn't be the default choice
2. O_DIRECT shouldn't be forcibly bundled in with O_DSYNC/O_SYNC

What I suggest we do about the latter is to invent two new
wal_sync_method values,
open_datasync_direct
open_sync_direct
which are defined only on platforms that define O_DIRECT (and O_DSYNC
or O_SYNC respectively). That puts it in the hands of the DBA whether
we try to use O_DIRECT or not. We'll still keep the hard-wired
optimization of disabling O_DIRECT when archiving or walreceiver are
active.

Dropping open_datasync as the first-choice default is something we have
to back-patch, but I'm less sure about it being a good idea to
back-patch the rearrangement of O_DIRECT management. Somebody who'd
explicitly specified open_sync or open_datasync as wal_sync_method
would find its behavior changing under him, which might be bad.

Comments?

regards, tom lane


From: Andres Freund <andres(at)anarazel(dot)de>
To: pgsql-hackers(at)postgresql(dot)org
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: Final(?) proposal for wal_sync_method changes
Date: 2010-12-07 17:11:46
Message-ID: 201012071811.46740.andres@anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tuesday 07 December 2010 17:24:14 Tom Lane wrote:
> After reviewing the two ongoing threads about fixing the wal_sync_method
> fiasco, I think there is general agreement on these two points:
>
> 1. open_datasync shouldn't be the default choice
> 2. O_DIRECT shouldn't be forcibly bundled in with O_DSYNC/O_SYNC
>
> What I suggest we do about the latter is to invent two new
> wal_sync_method values,
> open_datasync_direct
> open_sync_direct
> which are defined only on platforms that define O_DIRECT (and O_DSYNC
> or O_SYNC respectively). That puts it in the hands of the DBA whether
> we try to use O_DIRECT or not. We'll still keep the hard-wired
> optimization of disabling O_DIRECT when archiving or walreceiver are
> active.
>
> Dropping open_datasync as the first-choice default is something we have
> to back-patch, but I'm less sure about it being a good idea to
> back-patch the rearrangement of O_DIRECT management. Somebody who'd
> explicitly specified open_sync or open_datasync as wal_sync_method
> would find its behavior changing under him, which might be bad.
I vote for changing the order but not doing the O_DIRECT stuff on the
backbranches.

As I am not seeing myself or clients of mine ever using any O_*SYNC variant I
am not strongly opionated about what to do there. But I guess adding those two
variants is not really much work.

Thanks,

Andres


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Final(?) proposal for wal_sync_method changes
Date: 2010-12-07 18:31:22
Message-ID: 4CFE7D7A.2010700@agliodbs.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 12/07/2010 08:24 AM, Tom Lane wrote:
> Dropping open_datasync as the first-choice default is something we have
> to back-patch, but I'm less sure about it being a good idea to
> back-patch the rearrangement of O_DIRECT management. Somebody who'd
> explicitly specified open_sync or open_datasync as wal_sync_method
> would find its behavior changing under him, which might be bad.

I agree for the backpatch that we should just swap to fdatasync as
default, and should not attempt to add the extra options.

In addition to the concerns above, adding new GUCS values in an update
release is something we should only do if required for a critical
security or data-loss bug. And this is neither.

--
-- Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Final(?) proposal for wal_sync_method changes
Date: 2010-12-07 22:28:25
Message-ID: 29734.1291760905@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Josh Berkus <josh(at)agliodbs(dot)com> writes:
> I agree for the backpatch that we should just swap to fdatasync as
> default, and should not attempt to add the extra options.

I noticed while updating the documentation for this that the current
documentation is a flat-out lie. It claims that the preference order
for wal_sync_method is
open_datasync
fdatasync
fsync_writethrough
fsync
open_sync
ie you get the first-listed method that is supported on a given
platform. But this is not so: actually, fsync_writethrough will
be selected as default ONLY on Windows. There are other platforms
where the option exists, OS X being the one I have at hand. The
misstatement is masked on OS X because it also has open_datasync
and fdatasync. But since we are about to delete open_datasync from
the list, it's possible there are platforms where it will be exposed.

Oh, and just to add insult to injury, the above is what config.sgml
says, but postgresql.conf.sample says something different.

So I'm wondering whether we should correct the code to match the docs,
or vice versa. The former would just be a matter of saying
#elif defined(HAVE_FSYNC_WRITETHROUGH)
#define DEFAULT_SYNC_METHOD SYNC_METHOD_FSYNC_WRITETHROUGH
in place of
#elif defined(HAVE_FSYNC_WRITETHROUGH_ONLY)
#define DEFAULT_SYNC_METHOD SYNC_METHOD_FSYNC_WRITETHROUGH
To do the latter we'd have to say something like "The default is
fdatasync if it exists, else fsync, except on Windows where it is
fsync_writethrough".

Changing the code would result in a sudden, massive performance change
if there are any platforms for which fsync_writethrough exists but not
fdatasync. But I'm not sure if there are any.

Another point here is that it's not clear why we're selecting a
known-to-be-insecure default on OS X (where in fact all methods except
fsync_writethrough fail to push data to disk). We've been around on
that before, of course, and maybe now is not the time to change it.

Thoughts?

regards, tom lane


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Final(?) proposal for wal_sync_method changes
Date: 2010-12-07 22:43:29
Message-ID: 4CFEB891.2030309@agliodbs.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 12/7/10 2:28 PM, Tom Lane wrote:
> Another point here is that it's not clear why we're selecting a
> known-to-be-insecure default on OS X (where in fact all methods except
> fsync_writethrough fail to push data to disk). We've been around on
> that before, of course, and maybe now is not the time to change it.

Because nobody sane uses OSX on the server?

--
-- Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Final(?) proposal for wal_sync_method changes
Date: 2010-12-07 23:11:22
Message-ID: 548.1291763482@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Josh Berkus <josh(at)agliodbs(dot)com> writes:
> On 12/7/10 2:28 PM, Tom Lane wrote:
>> Another point here is that it's not clear why we're selecting a
>> known-to-be-insecure default on OS X (where in fact all methods except
>> fsync_writethrough fail to push data to disk). We've been around on
>> that before, of course, and maybe now is not the time to change it.

> Because nobody sane uses OSX on the server?

Some of us would make the same remark about Windows. But we go out of
our way to provide a safe default on that platform anyhow.

regards, tom lane


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Final(?) proposal for wal_sync_method changes
Date: 2010-12-07 23:21:50
Message-ID: 4CFEC18E.9050204@dunslane.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 12/07/2010 06:11 PM, Tom Lane wrote:
> Josh Berkus<josh(at)agliodbs(dot)com> writes:
>> On 12/7/10 2:28 PM, Tom Lane wrote:
>>> Another point here is that it's not clear why we're selecting a
>>> known-to-be-insecure default on OS X (where in fact all methods except
>>> fsync_writethrough fail to push data to disk). We've been around on
>>> that before, of course, and maybe now is not the time to change it.
>> Because nobody sane uses OSX on the server?
> Some of us would make the same remark about Windows. But we go out of
> our way to provide a safe default on that platform anyhow.
>
>

In practice, though, Windows is used a lot on servers and OSX isn't.
That means we are probably going to have lots less push on this sort of
thing from the OSX community, which is not to say that we shouldn't try
to be just as safe on OSX as we try to be everywhere else.

cheers

andrew


From: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Final(?) proposal for wal_sync_method changes
Date: 2010-12-07 23:22:52
Message-ID: 1291764172.31995.32.camel@jd-desktop
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, 2010-12-07 at 18:11 -0500, Tom Lane wrote:
> Josh Berkus <josh(at)agliodbs(dot)com> writes:
> > On 12/7/10 2:28 PM, Tom Lane wrote:
> >> Another point here is that it's not clear why we're selecting a
> >> known-to-be-insecure default on OS X (where in fact all methods except
> >> fsync_writethrough fail to push data to disk). We've been around on
> >> that before, of course, and maybe now is not the time to change it.
>
> > Because nobody sane uses OSX on the server?
>
> Some of us would make the same remark about Windows. But we go out of
> our way to provide a safe default on that platform anyhow.

Not to mention the assertion that people don't use OSX on a server is
patently false. They don't use it on big servers, but it is very popular
for the SMB (big time capital S).

JD

--
PostgreSQL.org Major Contributor
Command Prompt, Inc: http://www.commandprompt.com/ - 509.416.6579
Consulting, Training, Support, Custom Development, Engineering
http://twitter.com/cmdpromptinc | http://identi.ca/commandprompt


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Final(?) proposal for wal_sync_method changes
Date: 2010-12-07 23:35:04
Message-ID: 940.1291764904@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

I wrote:
> Some of us would make the same remark about Windows. But we go out of
> our way to provide a safe default on that platform anyhow.

Oh, wait: that's not the case at all. As of recent releases, we support
open_datasync on Windows, and that's the default despite being unsafe.
You have to go and choose some nondefault drive setting to make it safe:
http://developer.postgresql.org/pgdocs/postgres/wal-reliability.html

So if we drop open_datasync from the preference list then Windows users
*will* see a sudden huge change in the default behavior.

Because open_datasync does now exist on Windows, this code in
xlogdefs.h:
#elif defined(HAVE_FSYNC_WRITETHROUGH_ONLY)
#define DEFAULT_SYNC_METHOD SYNC_METHOD_FSYNC_WRITETHROUGH
is actually dead code at the moment --- it can never be reached on any
platform.

I am unclear as to the reason why there is a test for
HAVE_FSYNC_WRITETHROUGH_ONLY in pg_fsync(). Perhaps that is also
leftover from a previous vision of how this all works? Or does an
fsync() call actually fail on Windows?

I am now tempted to suggest that HAVE_FSYNC_WRITETHROUGH_ONLY should go
away altogether. The documented and implemented behavior ought to be
that the default is "fdatasync if it exists, else fsync", full stop,
on every platform. On both Windows and OS X, you would need to switch
to fsync_writethrough or change OS-level options to get safe behavior;
which is the same as it is today.

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: jd(at)commandprompt(dot)com
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Final(?) proposal for wal_sync_method changes
Date: 2010-12-07 23:49:55
Message-ID: 1167.1291765795@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

"Joshua D. Drake" <jd(at)commandprompt(dot)com> writes:
> On Tue, 2010-12-07 at 18:11 -0500, Tom Lane wrote:
>> Josh Berkus <josh(at)agliodbs(dot)com> writes:
>>> Because nobody sane uses OSX on the server?

>> Some of us would make the same remark about Windows. But we go out of
>> our way to provide a safe default on that platform anyhow.

> Not to mention the assertion that people don't use OSX on a server is
> patently false. They don't use it on big servers, but it is very popular
> for the SMB (big time capital S).

Actually the previous discussions about this are coming back to me now.
With the current code, we don't actually guarantee safe flush behavior
by default on ANY of the common consumer platforms. In the Linux case
we can't, because we can't monkey with hdparm settings. (I think the
same is true on BSDen ... anybody know?) On Windows and OS X we default
to open_datasync despite its not being safe on either platform. We
previously debated switching those to fsync_writethrough which would
make them safe by default, but decided not to, partly on grounds of the
inevitable ZOMG ITS SLOW backlash and partly on grounds of keeping
cross-platform consistency.

I don't think it's a good idea to reopen the fsync_writethrough debate
right now, certainly not for something we're contemplating
back-patching. I think what we'd better do is ensure that that is
*not* selected as the default, on either Windows or OS X. So we need to
get rid of the HAVE_FSYNC_WRITETHROUGH_ONLY hack.

regards, tom lane


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Final(?) proposal for wal_sync_method changes
Date: 2010-12-07 23:52:45
Message-ID: 4CFEC8CD.9000606@agliodbs.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


> I am unclear as to the reason why there is a test for
> HAVE_FSYNC_WRITETHROUGH_ONLY in pg_fsync(). Perhaps that is also
> leftover from a previous vision of how this all works? Or does an
> fsync() call actually fail on Windows?

No, fsync responds fine. It just don't actually sync to disk.

--
-- Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com


From: Christophe Pettus <xof(at)thebuild(dot)com>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Bruce Momjian <bruce(at)momjian(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Final(?) proposal for wal_sync_method changes
Date: 2010-12-07 23:53:31
Message-ID: 26AAD43D-5512-4F32-BBA0-DEABE46CCDE1@thebuild.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


On Dec 7, 2010, at 2:43 PM, Josh Berkus wrote:
> Because nobody sane uses OSX on the server?

The XServe running 10.5 server and 9.0.1 at the other end of the office takes your remark personally. :)

--
-- Christophe Pettus
xof(at)thebuild(dot)com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Final(?) proposal for wal_sync_method changes
Date: 2010-12-07 23:58:44
Message-ID: 1358.1291766324@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Josh Berkus <josh(at)agliodbs(dot)com> writes:
>> I am unclear as to the reason why there is a test for
>> HAVE_FSYNC_WRITETHROUGH_ONLY in pg_fsync(). Perhaps that is also
>> leftover from a previous vision of how this all works? Or does an
>> fsync() call actually fail on Windows?

> No, fsync responds fine. It just don't actually sync to disk.

Right, which is also an accurate description of its behavior on OS X,
as well as Linux (if you didn't change hdparm settings). So the real
question here is what's the point of treating Windows differently.

regards, tom lane


From: Chris Browne <cbbrowne(at)acm(dot)org>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Final(?) proposal for wal_sync_method changes
Date: 2010-12-08 00:00:47
Message-ID: 87ipz5nnf4.fsf@cbbrowne.afilias-int.info
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

xof(at)thebuild(dot)com (Christophe Pettus) writes:
> On Dec 7, 2010, at 2:43 PM, Josh Berkus wrote:
>> Because nobody sane uses OSX on the server?
>
> The XServe running 10.5 server and 9.0.1 at the other end of the
> office takes your remark personally. :)

I'd heard that Apple had cancelled XServe. [Poking back at that...]

Yep, they won't be carrying anything for sale that's particularly
rack-mountable after next January. Not precisely "dead," but definitely
moving on smelling funny...
--
(format nil "~S(at)~S" "cbbrowne" "gmail.com")
http://www3.sympatico.ca/cbbrowne/emacs.html
Photons have mass? I didn't know they were catholic!


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Final(?) proposal for wal_sync_method changes
Date: 2010-12-08 00:03:49
Message-ID: 4CFECB65.8040601@agliodbs.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


> Right, which is also an accurate description of its behavior on OS X,
> as well as Linux (if you didn't change hdparm settings). So the real
> question here is what's the point of treating Windows differently.

So, sounds like we should continue treating fsync_writethrough the same
as we have been, and maybe add a doc patch covering some of the above?

--
-- Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Final(?) proposal for wal_sync_method changes
Date: 2010-12-08 00:13:34
Message-ID: 1623.1291767214@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Josh Berkus <josh(at)agliodbs(dot)com> writes:
>> Right, which is also an accurate description of its behavior on OS X,
>> as well as Linux (if you didn't change hdparm settings). So the real
>> question here is what's the point of treating Windows differently.

> So, sounds like we should continue treating fsync_writethrough the same
> as we have been, and maybe add a doc patch covering some of the above?

Yeah, this patch is shaping up to be about five lines of code change
and a hundred of docs ...

regards, tom lane


From: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
To: Chris Browne <cbbrowne(at)acm(dot)org>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Final(?) proposal for wal_sync_method changes
Date: 2010-12-08 00:32:19
Message-ID: 1291768339.31995.75.camel@jd-desktop
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, 2010-12-07 at 19:00 -0500, Chris Browne wrote:
> xof(at)thebuild(dot)com (Christophe Pettus) writes:
> > On Dec 7, 2010, at 2:43 PM, Josh Berkus wrote:
> >> Because nobody sane uses OSX on the server?
> >
> > The XServe running 10.5 server and 9.0.1 at the other end of the
> > office takes your remark personally. :)
>
> I'd heard that Apple had cancelled XServe. [Poking back at that...]
>
> Yep, they won't be carrying anything for sale that's particularly
> rack-mountable after next January. Not precisely "dead," but definitely
> moving on smelling funny...

A bit off topic but Apple is actually marketing OSX Server on the mini
(with RAID 1). Which honestly for 95% of the businesses out there would
make a very nice, reasonably performant database server for say....
PostBooks, or Drupal.

JD

> --
> (format nil "~S(at)~S" "cbbrowne" "gmail.com")
> http://www3.sympatico.ca/cbbrowne/emacs.html
> Photons have mass? I didn't know they were catholic!
>

--
PostgreSQL.org Major Contributor
Command Prompt, Inc: http://www.commandprompt.com/ - 509.416.6579
Consulting, Training, Support, Custom Development, Engineering
http://twitter.com/cmdpromptinc | http://identi.ca/commandprompt


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: jd(at)commandprompt(dot)com
Cc: Chris Browne <cbbrowne(at)acm(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Final(?) proposal for wal_sync_method changes
Date: 2010-12-08 00:39:39
Message-ID: 4CFED3CB.3000102@dunslane.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 12/07/2010 07:32 PM, Joshua D. Drake wrote:
> On Tue, 2010-12-07 at 19:00 -0500, Chris Browne wrote:
>> xof(at)thebuild(dot)com (Christophe Pettus) writes:
>>> On Dec 7, 2010, at 2:43 PM, Josh Berkus wrote:
>>>> Because nobody sane uses OSX on the server?
>>> The XServe running 10.5 server and 9.0.1 at the other end of the
>>> office takes your remark personally. :)
>> I'd heard that Apple had cancelled XServe. [Poking back at that...]
>>
>> Yep, they won't be carrying anything for sale that's particularly
>> rack-mountable after next January. Not precisely "dead," but definitely
>> moving on smelling funny...
> A bit off topic but Apple is actually marketing OSX Server on the mini
> (with RAID 1). Which honestly for 95% of the businesses out there would
> make a very nice, reasonably performant database server for say....
> PostBooks, or Drupal.
>

Given the constant overheating I have had on my mini, I wouldn't use it
to host anything much ;-) But I guess YMMV.

cheers

andrerw


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Final(?) proposal for wal_sync_method changes
Date: 2010-12-08 01:07:13
Message-ID: 2412.1291770433@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Josh Berkus <josh(at)agliodbs(dot)com> writes:
>>> I am unclear as to the reason why there is a test for
>>> HAVE_FSYNC_WRITETHROUGH_ONLY in pg_fsync(). Perhaps that is also
>>> leftover from a previous vision of how this all works? Or does an
>>> fsync() call actually fail on Windows?

>> No, fsync responds fine. It just don't actually sync to disk.

Sigh ... The closer I look at the Windows code path here, the more of an
inconsistent, badly documented spaghetti-heap it appears to be. So far
as a quick Google search unearths, there is no fsync() primitive on
Windows. What we have actually got is this gem in port/win32.h:

/*
* Even though we don't support 'fsync' as a wal_sync_method,
* we do fsync() a few other places where _commit() is just fine.
*/
#define fsync(fd) _commit(fd)

So actually, there is no difference between selecting fsync and
fsync_writethrough on Windows, this comment and the SGML documentation
to the contrary. Both settings result in invoking _commit() and
presumably are safe. One wonders why we bothered to invent a separate
fsync_writethrough setting on Windows.

What this means is that switching to a simple preference order
"fdatasync, then fsync" will result in choosing fsync on Windows (since
it hasn't got fdatasync), meaning _commit, meaning Windows users see
a behavioral change after all.

I'm afraid that if we don't want a major behavioral change, there's
no option except having a Windows-specific rule for the choice of
default. It'll have to be "fdatasync, then fsync, except on Windows
where open_datasync is the default". Grumble. But it's not like
Windows hasn't got a hundred other special cases already.

Would someone verify via pgbench or similar test (*not* test_fsync) that
on Windows, wal_sync_method = fsync or fsync_writethrough perform the
same (ie tps ~= disk rotation rate) while open_datasync is too fast to
be real? I'm losing confidence that I've found all the spaghetti ends
here, and I don't have a Windows setup to try it myself.

regards, tom lane


From: Greg Smith <greg(at)2ndquadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Final(?) proposal for wal_sync_method changes
Date: 2010-12-08 06:02:37
Message-ID: 4CFF1F7D.8060306@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Tom Lane wrote:
> So actually, there is no difference between selecting fsync and
> fsync_writethrough on Windows, this comment and the SGML documentation
> to the contrary. Both settings result in invoking _commit() and
> presumably are safe. One wonders why we bothered to invent a separate
> fsync_writethrough setting on Windows.
>

Quite; I documented some the details about mapping to _commit() long ago
at http://www.westnet.com/~gsmith/content/postgresql/TuningPGWAL.htm but
forgot to suggest fixing the mistakes in the docs afterwards (Windows is
not exactly my favorite platform).
http://archives.postgresql.org/pgsql-hackers/2005-08/msg00227.php
explains some of the history I think you're looking for here.

> Would someone verify via pgbench or similar test (*not* test_fsync) that
> on Windows, wal_sync_method = fsync or fsync_writethrough perform the
> same (ie tps ~= disk rotation rate) while open_datasync is too fast to
> be real? I'm losing confidence that I've found all the spaghetti ends
> here, and I don't have a Windows setup to try it myself.
>

I can look into this tomorrow. The laptop I posted Ubuntu/RHEL6
test_fsync numbers from before also boots into Vista, so I can compare
all those platforms on the same hardware. I just need to be aware of
the slightly different sequential speeds on each partition of the drive.

As far as your major battle plan goes, I think what we should do is find
the simplest possible patch that just fixes the newer Linux kernel
problem, preferrably without changing any other platform, then commit
that to HEAD and appropriate backports. Then the larger O_DIRECT
remapping can proceed forward after that, along with cleanup to the
writethrough choices and unifying test_fsync against the server. I
wrote a patch that shuffled around a lot of this code last night, but
the first thing I coded was junk because of some mistaken assumptions.
Have been coming to same realizations about how messy this really is you
have.

--
Greg Smith 2ndQuadrant US greg(at)2ndQuadrant(dot)com Baltimore, MD
PostgreSQL Training, Services and Support www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books


From: Magnus Hagander <magnus(at)hagander(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Final(?) proposal for wal_sync_method changes
Date: 2010-12-08 06:48:56
Message-ID: AANLkTikGVPfa9KTGdGhEGarw3-2gM8vrdRmBTTTRhfmj@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Dec 8, 2010 at 02:07, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Josh Berkus <josh(at)agliodbs(dot)com> writes:
>>>> I am unclear as to the reason why there is a test for
>>>> HAVE_FSYNC_WRITETHROUGH_ONLY in pg_fsync().  Perhaps that is also
>>>> leftover from a previous vision of how this all works?  Or does an
>>>> fsync() call actually fail on Windows?
>
>>> No, fsync responds fine.  It just don't actually sync to disk.

First of all a warning - I'm writing this on way too little sleep :-)
Blame pgday.eu...

> Sigh ... The closer I look at the Windows code path here, the more of an
> inconsistent, badly documented spaghetti-heap it appears to be.  So far
> as a quick Google search unearths, there is no fsync() primitive on
> Windows.  What we have actually got is this gem in port/win32.h:

Correct.

> /*
>  *      Even though we don't support 'fsync' as a wal_sync_method,
>  *      we do fsync() a few other places where _commit() is just fine.
>  */
> #define fsync(fd) _commit(fd)
>
> So actually, there is no difference between selecting fsync and
> fsync_writethrough on Windows, this comment and the SGML documentation
> to the contrary.  Both settings result in invoking _commit() and
> presumably are safe.  One wonders why we bothered to invent a separate
> fsync_writethrough setting on Windows.

IIRC, using _commit(fd) *is* fsync_writethrough. That's what we
shipped with. It even writes through the cache on a RAID controller
that has BBU'ed write-cache. We had to implement the *other* options
in order to "lower" the safety (it doesn't actually lower the safety
*if* you have a BBU, which is a very good use-case for those options)

> What this means is that switching to a simple preference order
> "fdatasync, then fsync" will result in choosing fsync on Windows (since
> it hasn't got fdatasync), meaning _commit, meaning Windows users see
> a behavioral change after all.

_commit() != fsync()

I think this is the discussion and subsequent changes:

http://archives.postgresql.org/pgsql-patches/2005-03/msg00230.php

> Would someone verify via pgbench or similar test (*not* test_fsync) that
> on Windows, wal_sync_method = fsync or fsync_writethrough perform the
> same (ie tps ~= disk rotation rate) while open_datasync is too fast to
> be real?  I'm losing confidence that I've found all the spaghetti ends
> here, and I don't have a Windows setup to try it myself.

Please note that if you're re-verifying this, verify both on crappy
disk *and* on a proper BBU'ed RAID-controller. The reason for this
originally was that we performed about the same in those two, wihch
made no sense...

Merlin, IIRC you did a lot of the testing around this - do you recall
any more details?

--
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Magnus Hagander <magnus(at)hagander(dot)net>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Final(?) proposal for wal_sync_method changes
Date: 2010-12-08 14:40:04
Message-ID: 14328.1291819204@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Magnus Hagander <magnus(at)hagander(dot)net> writes:
> On Wed, Dec 8, 2010 at 02:07, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> [ win32.h says ]
>> #define fsync(fd) _commit(fd)

>> What this means is that switching to a simple preference order
>> "fdatasync, then fsync" will result in choosing fsync on Windows (since
>> it hasn't got fdatasync), meaning _commit, meaning Windows users see
>> a behavioral change after all.

> _commit() != fsync()

Um, the macro quoted above makes them the same, no? One of us
is confused.

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Magnus Hagander <magnus(at)hagander(dot)net>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Final(?) proposal for wal_sync_method changes
Date: 2010-12-08 21:18:37
Message-ID: 21250.1291843117@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Given my concerns around exactly what is going on in the Windows code,
I'm now afraid to mess with an all-platforms change to fdatasync as the
preferred default; if we do that it should probably just be in HEAD not
the back branches. So I've come around to the idea that Marti's
proposal of a PLATFORM_DEFAULT_SYNC_METHOD symbol is the best way.
(One reason for adopting that rather than some other way is that it
seems quite likely we'll end up needing it for Windows.)

I haven't touched the documentation yet, but attached is a proposed
code patch against HEAD. This forces the default to fdatasync on Linux,
and makes some cosmetic cleanups around the HAVE_FSYNC_WRITETHROUGH_ONLY
confusion.

regards, tom lane

Attachment Content-Type Size
sync-changes-code-only.patch text/x-patch 3.5 KB

From: Greg Smith <greg(at)2ndquadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Magnus Hagander <magnus(at)hagander(dot)net>, Josh Berkus <josh(at)agliodbs(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Final(?) proposal for wal_sync_method changes
Date: 2010-12-10 02:12:05
Message-ID: 4D018C75.7000104@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Since any Windows refactoring has been postponed for now (I'll get back
to performance checks on that platform later), during my testing time
this week instead I did a round of pre-release review of the change Tom
has now committed. All looks good to me, including the docs changes.

I confirmed that:

-Ubuntu system with an older kernel still has the same wal_sync_method
(fdatasync) and performance after pulling the update
-RHEL6 system changes as planned from using open_datasync to fdatasync
once I updated to a HEAD after the commit

On the RHEL6 system, I also checked the commit rate using pgbench with
the attached INSERT only script, rather than relying on test_fsync.
This is 7200 RPM drive, so theoretical max of 120 commits/second, on
ext4; this is the same test setup I described in more detail back in
http://archives.postgresql.org/message-id/4CE2EBF8.4040602@2ndquadrant.com

$ psql -c "show wal_sync_method"
wal_sync_method
-----------------
fdatasync
(1 row)

$ pgbench -i -s 10 pgbench

[gsmith(at)meddle ~]$ pgbench -s 10 -f insert.sql -c 1 -T 60 pgbench
starting vacuum...end.
transaction type: Custom query
scaling factor: 10
query mode: simple
number of clients: 1
number of threads: 1
duration: 60 s
number of transactions actually processed: 6733
tps = 112.208795 (including connections establishing)
tps = 112.216904 (excluding connections establishing)

And then manually switched over to test performance of the troublesome
old default:

[gsmith(at)meddle ~]$ psql -c "show wal_sync_method"
wal_sync_method
-----------------
open_datasync

[gsmith(at)meddle ~]$ pgbench -s 10 -f insert.sql -c 1 -T 60 pgbench
starting vacuum...end.
transaction type: Custom query
scaling factor: 10
query mode: simple
number of clients: 1
number of threads: 1
duration: 60 s
number of transactions actually processed: 6672
tps = 111.185802 (including connections establishing)
tps = 111.195089 (excluding connections establishing)

This is interesting, because test_fsync consistently reported a rate of
about half this when using open_datasync instead of the equal
performance I'm getting from the database. I'll see if I can reproduce
that further, but it's no reason to be concerned about the change that's
been made I think. Just more evidence that test_fsync has quirks left
to be sorted out. But that's not backbranch material, it should be part
of 9.1 only refactoring, already in progress via the patch Josh
submitted. There's a bit of time left to get that done.

--
Greg Smith 2ndQuadrant US greg(at)2ndQuadrant(dot)com Baltimore, MD
PostgreSQL Training, Services and Support www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books

Attachment Content-Type Size
insert.sql text/x-sql 302 bytes

From: Josh Berkus <josh(at)agliodbs(dot)com>
To: Greg Smith <greg(at)2ndquadrant(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Magnus Hagander <magnus(at)hagander(dot)net>, Bruce Momjian <bruce(at)momjian(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Final(?) proposal for wal_sync_method changes
Date: 2010-12-10 02:47:46
Message-ID: 4D0194D2.3030102@agliodbs.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Greg,

> This is interesting, because test_fsync consistently reported a rate of
> about half this when using open_datasync instead of the equal
> performance I'm getting from the database. I'll see if I can reproduce
> that further, but it's no reason to be concerned about the change that's
> been made I think. Just more evidence that test_fsync has quirks left
> to be sorted out. But that's not backbranch material, it should be part
> of 9.1 only refactoring, already in progress via the patch Josh
> submitted. There's a bit of time left to get that done.

Did you rerun test_sync with O_DIRECT entabled, using my patch? The
figures you had from test_fsync earlier were without O_DIRECT.

--
-- Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com


From: Greg Smith <greg(at)2ndquadrant(dot)com>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Magnus Hagander <magnus(at)hagander(dot)net>, Bruce Momjian <bruce(at)momjian(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Final(?) proposal for wal_sync_method changes
Date: 2010-12-10 02:50:56
Message-ID: 4D019590.7090005@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Josh Berkus wrote:
> Did you rerun test_sync with O_DIRECT entabled, using my patch? The
> figures you had from test_fsync earlier were without O_DIRECT.
>

No--I was just focused on testing the changes that had already been
committed. The use of O_DIRECT in the server but not test_fsync could
very well be the reason for the difference; don't know yet. I'm trying
to get through this CF before I start getting distracted by newer
patches, I'll get to yours soon I hope.

--
Greg Smith 2ndQuadrant US greg(at)2ndQuadrant(dot)com Baltimore, MD
PostgreSQL Training, Services and Support www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books


From: Magnus Hagander <magnus(at)hagander(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Final(?) proposal for wal_sync_method changes
Date: 2010-12-11 13:27:37
Message-ID: AANLkTinZR_N4HgEJ2SH7ZWx-w6Qvrzwm__yAL__f8=rg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Dec 8, 2010 at 15:40, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Magnus Hagander <magnus(at)hagander(dot)net> writes:
>> On Wed, Dec 8, 2010 at 02:07, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>>> [ win32.h says ]
>>> #define fsync(fd) _commit(fd)
>
>>> What this means is that switching to a simple preference order
>>> "fdatasync, then fsync" will result in choosing fsync on Windows (since
>>> it hasn't got fdatasync), meaning _commit, meaning Windows users see
>>> a behavioral change after all.
>
>> _commit() != fsync()
>
> Um, the macro quoted above makes them the same, no?  One of us
> is confused.

Uh, yeah. Sorry, that was the unclear:ness from being too preoccupied
with the conference.. Pretty sure I'm the confused one.
.
_commit() is definitely the same as fsync() on the API level. And this
correspond to fsync_writethrough, not fsync, when you talk about the
wal_sync_method parameter. It will always sync through the write
cache, even if it's hardware BBU'ed cache.

--
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: Greg Smith <greg(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Magnus Hagander <magnus(at)hagander(dot)net>, pgsql-hackers(at)postgresql(dot)org
Subject: New test_fsync messages for direct I/O
Date: 2011-01-16 00:15:04
Message-ID: 201101160015.p0G0F4113329@momjian.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Josh Berkus wrote:
> Greg,
>
> > This is interesting, because test_fsync consistently reported a rate of
> > about half this when using open_datasync instead of the equal
> > performance I'm getting from the database. I'll see if I can reproduce
> > that further, but it's no reason to be concerned about the change that's
> > been made I think. Just more evidence that test_fsync has quirks left
> > to be sorted out. But that's not backbranch material, it should be part
> > of 9.1 only refactoring, already in progress via the patch Josh
> > submitted. There's a bit of time left to get that done.
>
> Did you rerun test_sync with O_DIRECT entabled, using my patch? The
> figures you had from test_fsync earlier were without O_DIRECT.

I have modified test_fsync with the attached, applied patch to report
cases where we are testing without O_DIRECT when only O_DIRECT would be
used by the server, and cases where O_DIRECT fails because of the file
system type. Josh Berkus wanted the first case kept in case we decide
to offer non-direct-io options on machines that support direct i/o.

The new messages are:

* This non-direct I/O option is not used by Postgres.

** This file system and its mount options do not support direct
I/O, e.g. ext4 in journaled mode.

You can see the first one below in my output from Ubuntu:

$ ./test_fsync
Ops-per-test = 2000

Simple non-sync'ed write:
8k write 58.175 ops/sec

Compare file sync methods using one write:
(in wal_sync_method preference order, except fdatasync
is Linux's default)
open_datasync n/a
8k write, fdatasync 68.425 ops/sec
8k write, fsync 63.932 ops/sec
fsync_writethrough n/a
open_sync 8k write* 73.785 ops/sec
open_sync 8k direct I/O write 82.929 ops/sec
* This non-direct I/O option is not used by Postgres.

Compare file sync methods using two writes:
(in wal_sync_method preference order, except fdatasync
is Linux's default)
open_datasync n/a
8k write, 8k write, fdatasync 42.728 ops/sec
8k write, 8k write, fsync 43.625 ops/sec
fsync_writethrough n/a
2 open_sync 8k writes* 37.150 ops/sec
2 open_sync 8k direct I/O writes 43.722 ops/sec
* This non-direct I/O option is not used by Postgres.

Compare open_sync with different sizes:
(This is designed to compare the cost of one large
sync'ed write and two smaller sync'ed writes.)
open_sync 16k write 46.428 ops/sec
2 open_sync 8k writes 38.703 ops/sec

Test if fsync on non-write file descriptor is honored:
(If the times are similar, fsync() can sync data written
on a different descriptor.)
8k write, fsync, close 65.744 ops/sec
8k write, close, fsync 63.077 ops/sec

I believe test_fsync now matches the backend code. If we decide to
change things, it can be adjusted.

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +

Attachment Content-Type Size
/rtmp/stars text/x-diff 3.7 KB