We really ought to do something about O_DIRECT and data=journalled on ext4

Lists: pgsql-hackers
From: Josh Berkus <josh(at)agliodbs(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgreSQL(dot)org>
Subject: We really ought to do something about O_DIRECT and data=journalled on ext4
Date: 2010-12-01 02:55:58
Message-ID: 4CF5B93E.50107@agliodbs.com

Hackers,

Some of you might already be aware that this combination produces a
fatal startup crash in PostgreSQL:

1. Create an ext3 or ext4 partition and mount it with data=journal on a
server with Linux kernel 2.6.30 or later.
2. Initdb a PGDATA on that partition
3. Start PostgreSQL with the default config from that PGDATA

This was reported a ways back:
https://bugzilla.redhat.com/show_bug.cgi?format=multiple&id=567113

To explain: opening a file with O_DIRECT on an ext3 or ext4 partition
mounted with data=journal fails, and PostgreSQL crashes at startup as a
result. However, recent Linux kernels now report support for O_DIRECT when
we compile PostgreSQL, so we use it by default. This results in a "crash
by default" situation on new Linux versions if anyone sets data=journal.

We just encountered this again with another user. With RHEL6 out now,
this seems likely to become a fairly common crash report.

Apparently, testing for O_DIRECT at compile time isn't adequate. Ideas?

--
-- Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: We really ought to do something about O_DIRECT and data=journalled on ext4
Date: 2010-12-01 03:09:01
Message-ID: 12232.1291172941@sss.pgh.pa.us
Lists: pgsql-hackers

Josh Berkus <josh(at)agliodbs(dot)com> writes:
> Apparently, testing for O_DIRECT at compile time isn't adequate. Ideas?

We should wait for the outcome of the discussion about whether to change
the default wal_sync_method before worrying about this.

regards, tom lane


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: We really ought to do something about O_DIRECT and data=journalled on ext4
Date: 2010-12-01 03:13:11
Message-ID: 4CF5BD47.1040501@agliodbs.com
Lists: pgsql-hackers

On 11/30/10 7:09 PM, Tom Lane wrote:
> Josh Berkus <josh(at)agliodbs(dot)com> writes:
>> Apparently, testing for O_DIRECT at compile time isn't adequate. Ideas?
>
> We should wait for the outcome of the discussion about whether to change
> the default wal_sync_method before worrying about this.

Are we considering backporting that change?

If so, this would be another argument in favor of changing the default.

--
-- Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: We really ought to do something about O_DIRECT and data=journalled on ext4
Date: 2010-12-01 03:25:22
Message-ID: 4CF5C022.4050302@dunslane.net
Lists: pgsql-hackers

On 11/30/2010 10:09 PM, Tom Lane wrote:
> Josh Berkus<josh(at)agliodbs(dot)com> writes:
>> Apparently, testing for O_DIRECT at compile time isn't adequate. Ideas?
> We should wait for the outcome of the discussion about whether to change
> the default wal_sync_method before worrying about this.
>
>

Tom,

we've just had a significant PGX customer encounter this with the latest
Postgres on Redhat's freshly released flagship product. Presumably the
default wal_sync_method will only change prospectively. But this will
feel to every user out there who encounters it like a bug in our code,
and it needs attention. It was darn difficult to diagnose, and many
people will just give up in disgust if they encounter it.

cheers

andrew


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: We really ought to do something about O_DIRECT and data=journalled on ext4
Date: 2010-12-01 04:17:05
Message-ID: 28907.1291177025@sss.pgh.pa.us
Lists: pgsql-hackers

Andrew Dunstan <andrew(at)dunslane(dot)net> writes:
> On 11/30/2010 10:09 PM, Tom Lane wrote:
>> We should wait for the outcome of the discussion about whether to change
>> the default wal_sync_method before worrying about this.

> we've just had a significant PGX customer encounter this with the latest
> Postgres on Redhat's freshly released flagship product. Presumably the
> default wal_sync_method will only change prospectively.

I don't think so. The fact that Linux is changing underneath us is a
compelling reason for back-patching a change here. Our older branches
still have to be able to run on modern OS versions. I'm also fairly
unclear on what you think a fix would look like if it's not effectively
a change in the default.

(Hint: this *will* be changing, one way or another, in Red Hat's version
of 8.4, since that's what RH is shipping in RHEL6.)

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: We really ought to do something about O_DIRECT and data=journalled on ext4
Date: 2010-12-01 05:31:34
Message-ID: 6263.1291181494@sss.pgh.pa.us
Lists: pgsql-hackers

Josh Berkus <josh(at)agliodbs(dot)com> writes:
> On 11/30/10 7:09 PM, Tom Lane wrote:
>> Josh Berkus <josh(at)agliodbs(dot)com> writes:
>>> Apparently, testing for O_DIRECT at compile time isn't adequate. Ideas?
>>
>> We should wait for the outcome of the discussion about whether to change
>> the default wal_sync_method before worrying about this.

> Are we considering backporting that change?

> If so, this would be another argument in favor of changing the default.

Well, no, actually it's the same (only) argument. We'd never consider
back-patching such a change if our hand weren't being forced by kernel
changes :-(

As things stand, though, I think the only thing that's really open for
discussion is how wide to make the scope of the default-change: should
we just do it across the board, or try to limit it to some subset of the
platforms where open_datasync is currently the default. And that's a
decision that ought to be informed by some performance testing.

regards, tom lane


From: Dimitri Fontaine <dimitri(at)2ndQuadrant(dot)fr>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: We really ought to do something about O_DIRECT and data=journalled on ext4
Date: 2010-12-01 10:35:50
Message-ID: 874oax6ard.fsf@hi-media-techno.com
Lists: pgsql-hackers

Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:
> As things stand, though, I think the only thing that's really open for
> discussion is how wide to make the scope of the default-change: should
> we just do it across the board, or try to limit it to some subset of the
> platforms where open_datasync is currently the default. And that's a
> decision that ought to be informed by some performance testing.

Maybe I have a distorted view of the situation, having hit the
problem after an Ubuntu upgrade, but it really does not look like a
performance issue to me.

PANIC: could not open file "pg_xlog/000000010000000000000001" (log file 0, segment 1): Invalid argument

It took me quite some time to be able to start my development cluster
again and validate some new patch to send to the list.

Now I understand that you want to test the other alternatives before
choosing among those which work, but my opinion is that it should be fixed
in HEAD before next alpha, or even ASAP. It could be that a HINT here
would be enough for contributors not to lose too much time. It could read:

HINT: if you're running Linux, please try changing wal_sync_method;
open_datasync is no longer reliable on recent kernels. An example of a
trustworthy setting is fdatasync.

Regards,
--
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support


From: Marti Raudsepp <marti(at)juffo(dot)org>
To: Dimitri Fontaine <dimitri(at)2ndquadrant(dot)fr>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: We really ought to do something about O_DIRECT and data=journalled on ext4
Date: 2010-12-01 11:46:45
Message-ID: AANLkTim4o_+odFB4SqDkRrUEgZpRiXSvnP0twh1641CS@mail.gmail.com
Lists: pgsql-hackers

On Wed, Dec 1, 2010 at 12:35, Dimitri Fontaine <dimitri(at)2ndquadrant(dot)fr> wrote:
> PANIC:  could not open file "pg_xlog/000000010000000000000001" (log file 0, segment 1): Invalid argument

+1 I got the same error when trying to get PostgreSQL working on tmpfs
and gave up.

> Now I understand that you want to test the other alternatives before
> choosing among those which work, but my opinion is that it should be fixed
> in HEAD before next alpha, or even ASAP.

It's queued for this month's commitfest, so things are moving.

https://commitfest.postgresql.org/action/patch_view?id=432

Regards,
Marti


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: We really ought to do something about O_DIRECT and data=journalled on ext4
Date: 2010-12-01 13:50:36
Message-ID: AANLkTimvsoBFnLj97x3BGYrVfaXH4MxDnD+BAXirVJ_=@mail.gmail.com
Lists: pgsql-hackers

On Wed, Dec 1, 2010 at 12:31 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Josh Berkus <josh(at)agliodbs(dot)com> writes:
>> On 11/30/10 7:09 PM, Tom Lane wrote:
>>> Josh Berkus <josh(at)agliodbs(dot)com> writes:
>>>> Apparently, testing for O_DIRECT at compile time isn't adequate.  Ideas?
>>>
>>> We should wait for the outcome of the discussion about whether to change
>>> the default wal_sync_method before worrying about this.
>
>> Are we considering backporting that change?
>
>> If so, this would be another argument in favor of changing the default.
>
> Well, no, actually it's the same (only) argument.  We'd never consider
> back-patching such a change if our hand weren't being forced by kernel
> changes :-(
>
> As things stand, though, I think the only thing that's really open for
> discussion is how wide to make the scope of the default-change: should
> we just do it across the board, or try to limit it to some subset of the
> platforms where open_datasync is currently the default.  And that's a
> decision that ought to be informed by some performance testing.

If we could get a clear idea of what performance testing needs to be
done, I suspect we could find some people willing to do it. What do
you think would be useful?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: We really ought to do something about O_DIRECT and data=journalled on ext4
Date: 2010-12-01 14:00:13
Message-ID: 4CF654ED.2010806@dunslane.net
Lists: pgsql-hackers

On 11/30/2010 11:17 PM, Tom Lane wrote:
> Andrew Dunstan<andrew(at)dunslane(dot)net> writes:
>> On 11/30/2010 10:09 PM, Tom Lane wrote:
>>> We should wait for the outcome of the discussion about whether to change
>>> the default wal_sync_method before worrying about this.
>> we've just had a significant PGX customer encounter this with the latest
>> Postgres on Redhat's freshly released flagship product. Presumably the
>> default wal_sync_method will only change prospectively.
> I don't think so. The fact that Linux is changing underneath us is a
> compelling reason for back-patching a change here. Our older branches
> still have to be able to run on modern OS versions. I'm also fairly
> unclear on what you think a fix would look like if it's not effectively
> a change in the default.
>
> (Hint: this *will* be changing, one way or another, in Red Hat's version
> of 8.4, since that's what RH is shipping in RHEL6.)
>
>

Well, my initial idea was that if PG_O_DIRECT is non-zero, we should
test at startup time if we can use it on the WAL file system and inhibit
its use if not.
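
A minimal sketch of the startup-time check described above, written in Python rather than the backend's C to keep it short. The function name `o_direct_usable` and the probe file name are hypothetical and not taken from the PostgreSQL source. It assumes that a filesystem which cannot honor O_DIRECT rejects the open() with EINVAL, which matches the "Invalid argument" in the PANIC reported in this thread.

```python
import errno
import os
import tempfile

# os.O_DIRECT exists only on platforms that define the flag (e.g. Linux).
# Falling back to 0 mirrors the "PG_O_DIRECT is non-zero" condition.
O_DIRECT = getattr(os, "O_DIRECT", 0)


def o_direct_usable(directory):
    """Return True if a file in `directory` can be opened with O_DIRECT."""
    if O_DIRECT == 0:
        return False  # the platform does not define the flag at all
    probe = os.path.join(directory, "o_direct_probe.tmp")  # hypothetical name
    try:
        fd = os.open(probe, os.O_RDWR | os.O_CREAT | O_DIRECT, 0o600)
    except OSError as exc:
        if exc.errno == errno.EINVAL:
            return False  # the filesystem rejected the flag
        raise  # permissions, missing directory, etc. are real errors
    else:
        os.close(fd)
        return True
    finally:
        # The probe file can exist even when the open itself failed.
        try:
            os.unlink(probe)
        except FileNotFoundError:
            pass


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print("O_DIRECT usable in", tmp, "->", o_direct_usable(tmp))
```

The result only describes the filesystem at the moment of the call; a later remount with different options could change the answer.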

Incidentally, I notice it's not used at all in test_fsync.c - should it
not be?

cheers

andrew


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: We really ought to do something about O_DIRECT and data=journalled on ext4
Date: 2010-12-01 17:58:24
Message-ID: 4CF68CC0.2080404@agliodbs.com
Lists: pgsql-hackers

Tom,

> Well, no, actually it's the same (only) argument. We'd never consider
> back-patching such a change if our hand weren't being forced by kernel
> changes :-(

I think we have to back-patch the change. The way it is now, a DBA who
thinks they are doing normal sensible configuration can cause PostgreSQL
to fail to restart. Imagine this scenario, for example:

1) DBA, using PostgreSQL 8.3, gets worried about possible disk issues
2) DBA changes their single Ext3/4 partition to "data=journal"
3) DBA restarts system
4) PostgreSQL won't start
5) DBA thrashes around for a few hours while the site is down
6) DBA gets fired and the new DBA migrates to some other DBMS.

I simply can't think of *anywhere* we could put the information about
opensync and Linux/Ext which would be prominent enough to avoid the
above scenario. And per replies, a lot of people have hit this issue
already.

It's a bug and it's our bug. Back when we added O_DIRECT, we assumed
that support for O_DIRECT/opensync could be determined on an OS/kernel
basis, because that was the information we had. Now it turns out that
support can vary *by filesystem* and *between remounts*. We didn't have
any way of knowing different back in 2004, but that doesn't mean we
don't need to fix our mistaken assumption now.

Ideally, we would change our code to test support for O_DIRECT on
startup, rather than at compile time, and backport *that*.

--
-- Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: We really ought to do something about O_DIRECT and data=journalled on ext4
Date: 2010-12-01 18:09:05
Message-ID: 19015.1291226945@sss.pgh.pa.us
Lists: pgsql-hackers

Josh Berkus <josh(at)agliodbs(dot)com> writes:
> It's a bug and it's our bug.

No, it's a filesystem bug that this particular filesystem doesn't
support a perfectly reasonable combination of options, and doesn't
even fail gracefully as it could easily do. But assigning blame
doesn't help much.

> Back when we added O_DIRECT, we assumed
> that support for O_DIRECT/opensync could be determined on an OS/kernel
> basis, because that was the information we had. Now it turns out that
> support can vary *by filesystem* and *between remounts*. We didn't have
> any way of knowing different back in 2004, but that doesn't mean we
> don't need to fix our mistaken assumption now.

> Ideally, we would change our code to test support for O_DIRECT on
> startup, rather than at compile time, and backport *that*.

I'm not convinced that a startup-time test would be enough either,
since as you note a remount might be enough to change the situation.

I think the best answer is to get out of the business of using
O_DIRECT by default, especially seeing that available evidence
suggests it might not be a performance win anyway.

regards, tom lane


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: We really ought to do something about O_DIRECT and data=journalled on ext4
Date: 2010-12-01 18:19:40
Message-ID: 4CF691BC.4050602@agliodbs.com
Lists: pgsql-hackers


> I think the best answer is to get out of the business of using
> O_DIRECT by default, especially seeing that available evidence
> suggests it might not be a performance win anyway.

Well, we don't have any performance evidence ... there's an issue with
the fsync-test script which causes it not to use O_DIRECT.

However, we haven't seen any evidence for benefits on any production
filesystem, either. So given the lack of evidence of performance
benefit, combined with the definite evidence of related failures, I
agree that simply disabling O_DIRECT by default would be a good way to
solve this.

It might be nice to add new sync_method options, "osync_odirect" and
"odatasync_odirect" for DBAs who think they know enough to tune with
non-defaults.
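
To make the proposal concrete, here is a hypothetical sketch of how such option names could map to open() flags. Only the names "osync_odirect" and "odatasync_odirect" come from the message; the flag combinations, the assumption that the existing open_sync and open_datasync methods would then stop adding O_DIRECT, and the helper `open_flags_for` are illustrative guesses, not PostgreSQL's implementation.

```python
import os

# Flags missing on a platform fall back to 0 so the table still builds.
O_DIRECT = getattr(os, "O_DIRECT", 0)
O_DSYNC = getattr(os, "O_DSYNC", 0)
O_SYNC = getattr(os, "O_SYNC", 0)

# Extra open() flags per sync method (assumed mapping, see lead-in).
SYNC_METHOD_OPEN_FLAGS = {
    "fsync": 0,                               # sync via fsync() after writing
    "fdatasync": 0,                           # sync via fdatasync() after writing
    "open_sync": O_SYNC,                      # synchronous writes, no O_DIRECT
    "open_datasync": O_DSYNC,                 # synchronous data writes, no O_DIRECT
    "osync_odirect": O_SYNC | O_DIRECT,       # explicit opt-in to direct I/O
    "odatasync_odirect": O_DSYNC | O_DIRECT,  # explicit opt-in to direct I/O
}


def open_flags_for(method):
    """Return the extra open() flags for a sync method name."""
    try:
        return SYNC_METHOD_OPEN_FLAGS[method]
    except KeyError:
        raise ValueError(f"unknown sync method: {method!r}") from None
```

Under this reading, direct I/O is requested only when the DBA names one of the `*_odirect` methods explicitly.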

--
-- Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com


From: Andres Freund <andres(at)anarazel(dot)de>
To: pgsql-hackers(at)postgresql(dot)org
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Josh Berkus <josh(at)agliodbs(dot)com>
Subject: Re: We really ought to do something about O_DIRECT and data=journalled on ext4
Date: 2010-12-01 18:41:46
Message-ID: 201012011941.46965.andres@anarazel.de
Lists: pgsql-hackers

On Wednesday 01 December 2010 19:09:05 Tom Lane wrote:
> Josh Berkus <josh(at)agliodbs(dot)com> writes:
> > It's a bug and it's our bug.
>
> No, it's a filesystem bug that this particular filesystem doesn't
> support a perfectly reasonable combination of options, and doesn't
> even fail gracefully as it could easily do. But assigning blame
> doesn't help much.
I wouldn't call it a reasonable combination - promising fs-level
data-journaling (data=journal) and O_DIRECT are not really compatible with
each other...

Andres


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: We really ought to do something about O_DIRECT and data=journalled on ext4
Date: 2010-12-01 18:53:09
Message-ID: 19633.1291229589@sss.pgh.pa.us
Lists: pgsql-hackers

Josh Berkus <josh(at)agliodbs(dot)com> writes:
> It might be nice to add new sync_method options, "osync_odirect" and
> "odatasync_odirect" for DBAs who think they know enough to tune with
> non-defaults.

That would have the benefit that we'd not have to argue with people
who liked the current behavior (assuming there are any). I'm not
sure there's much technical advantage, but from a political standpoint
it might be the easiest sort of change to push through.

However, this doesn't really address the question of what a sensible
choice of default is. If there's little evidence about whether the
current flavor of open_datasync is really the fastest way, there's
none whatsoever that establishes open_datasync_without_o_direct
being a sane choice of default.

regards, tom lane


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: We really ought to do something about O_DIRECT and data=journalled on ext4
Date: 2010-12-01 19:00:25
Message-ID: 4CF69B49.2000904@agliodbs.com
Lists: pgsql-hackers


> However, this doesn't really address the question of what a sensible
> choice of default is. If there's little evidence about whether the
> current flavor of open_datasync is really the fastest way, there's
> none whatsoever that establishes open_datasync_without_o_direct
> being a sane choice of default.

No, I'd switch to fdatasync. That's the performance that most people
are familiar with anyway, since it was all that Linux supported before.

--
-- Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: pgsql-hackers(at)postgresql(dot)org, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Josh Berkus <josh(at)agliodbs(dot)com>
Subject: Re: We really ought to do something about O_DIRECT and data=journalled on ext4
Date: 2010-12-01 19:03:38
Message-ID: 4CF69C0A.8020701@dunslane.net
Lists: pgsql-hackers

On 12/01/2010 01:41 PM, Andres Freund wrote:
> On Wednesday 01 December 2010 19:09:05 Tom Lane wrote:
>> Josh Berkus<josh(at)agliodbs(dot)com> writes:
>>> It's a bug and it's our bug.
>> No, it's a filesystem bug that this particular filesystem doesn't
>> support a perfectly reasonable combination of options, and doesn't
>> even fail gracefully as it could easily do. But assigning blame
>> doesn't help much.
> I wouldn't call it a reasonable combination - promising fs-level
> data-journaling (data=journal) and O_DIRECT are not really compatible with
> each other...
>
>

OK, but how is an application supposed to know that data journaling is
set? Postgres doesn't even look at the FS type, let alone the mount
options. From the app's POV it's perfectly reasonable. If the OS is
going to provide the API, it should expect people to use it.

cheers

andrew


From: Greg Smith <greg(at)2ndquadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: We really ought to do something about O_DIRECT and data=journalled on ext4
Date: 2010-12-01 22:48:05
Message-ID: 4CF6D0A5.8080501@2ndquadrant.com
Lists: pgsql-hackers

Tom Lane wrote:
> I think the best answer is to get out of the business of using
> O_DIRECT by default, especially seeing that available evidence
> suggests it might not be a performance win anyway.
>

I was concerned that open_datasync might be doing a better job of
forcing data out of drive write caches. But the tests I've done on
RHEL6 so far suggest that's not true; the write guarantees seem to be
the same as when using fdatasync. And there's certainly one performance
regression possible going from fdatasync to open_datasync, the case
where you're overflowing wal_buffers before you actually commit.
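
The regression case can be illustrated with a rough timing sketch, assuming a Linux system where `os.O_DSYNC` and `os.fdatasync` are available. With O_DSYNC every write waits for the data to be flushed, while the fdatasync style buffers all writes and flushes once at the end. The helper `time_writes` is hypothetical, O_DIRECT is deliberately left out, and the numbers depend entirely on the hardware and filesystem, so this is no substitute for test_fsync.

```python
import os
import tempfile
import time


def time_writes(path, blocks, use_o_dsync, block_size=8192):
    """Write `blocks` blocks to `path` and return the elapsed seconds.

    use_o_dsync=True  : open with O_DSYNC, so each write is synchronous.
    use_o_dsync=False : plain writes followed by a single fdatasync().
    """
    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC
    if use_o_dsync:
        flags |= os.O_DSYNC
    buf = b"\0" * block_size
    fd = os.open(path, flags, 0o600)
    try:
        start = time.perf_counter()
        for _ in range(blocks):
            os.write(fd, buf)
        if not use_o_dsync:
            os.fdatasync(fd)  # one flush covers all the buffered writes
        return time.perf_counter() - start
    finally:
        os.close(fd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "wal_probe")
        print("O_DSYNC per write:", time_writes(target, 64, True))
        print("single fdatasync: ", time_writes(target, 64, False))
```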

Below is a test of the troublesome behavior on the same RHEL6 system I
gave test_fsync performance test results from at
http://archives.postgresql.org/message-id/4CE2EBF8.4040602@2ndquadrant.com

This confirms that the problem is the kernel now defining O_DSYNC as
available without actually supporting it when the filesystem runs in
journaled mode. That's clearly a kernel bug and no fault of PostgreSQL;
it has just never been exposed in a default configuration before. The
Red Hat bugzilla report seems a bit unclear about what's going on here,
so it may be worth updating it to note the underlying cause.

Regardless, I'm now leaning heavily toward the idea of avoiding
open_datasync by default given this bug, and backpatching that change to
at least 8.4. I'll do some more database-level performance tests here
just as a final sanity check on that. My gut feel is now that we'll
eventually be taking something like Marti's patch, adding some more
documentation around it, and applying that to HEAD as well as some
number of back branches.

$ mount | head -n 1
/dev/sda7 on / type ext4 (rw)
$ cat $PGDATA/postgresql.conf | grep wal_sync_method
#wal_sync_method = fdatasync # the default is the first option
$ pg_ctl start
server starting
LOG: database system was shut down at 2010-12-01 17:20:16 EST
LOG: database system is ready to accept connections
LOG: autovacuum launcher started
$ psql -c "show wal_sync_method"
wal_sync_method
-----------------
open_datasync

[Edit /etc/fstab, change mount options to be "data=journal" and reboot]

$ mount | grep journal
/dev/sda7 on / type ext4 (rw,data=journal)
$ cat postgresql.conf | grep wal_sync_method
#wal_sync_method = fdatasync # the default is the first option
$ pg_ctl start
server starting
LOG: database system was shut down at 2010-12-01 12:14:50 EST
PANIC: could not open file "pg_xlog/000000010000000000000001" (log file 0, segment 1): Invalid argument
LOG: startup process (PID 2690) was terminated by signal 6: Aborted
LOG: aborting startup due to startup process failure
$ pg_ctl stop

$ vi $PGDATA/postgresql.conf
$ cat $PGDATA/postgresql.conf | grep wal_sync_method
wal_sync_method = fdatasync # the default is the first option
$ pg_ctl start
server starting
LOG: database system was shut down at 2010-12-01 12:14:40 EST
LOG: database system is ready to accept connections
LOG: autovacuum launcher started

--
Greg Smith 2ndQuadrant US greg(at)2ndQuadrant(dot)com Baltimore, MD
PostgreSQL Training, Services and Support www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: We really ought to do something about O_DIRECT and data=journalled on ext4
Date: 2010-12-02 23:58:36
Message-ID: 201012022358.oB2NwaH24037@momjian.us
Lists: pgsql-hackers

Andrew Dunstan wrote:
>
>
> On 11/30/2010 11:17 PM, Tom Lane wrote:
> > Andrew Dunstan<andrew(at)dunslane(dot)net> writes:
> >> On 11/30/2010 10:09 PM, Tom Lane wrote:
> >>> We should wait for the outcome of the discussion about whether to change
> >>> the default wal_sync_method before worrying about this.
> >> we've just had a significant PGX customer encounter this with the latest
> >> Postgres on Redhat's freshly released flagship product. Presumably the
> >> default wal_sync_method will only change prospectively.
> > I don't think so. The fact that Linux is changing underneath us is a
> > compelling reason for back-patching a change here. Our older branches
> > still have to be able to run on modern OS versions. I'm also fairly
> > unclear on what you think a fix would look like if it's not effectively
> > a change in the default.
> >
> > (Hint: this *will* be changing, one way or another, in Red Hat's version
> > of 8.4, since that's what RH is shipping in RHEL6.)
> >
> >
>
> Well, my initial idea was that if PG_O_DIRECT is non-zero, we should
> test at startup time if we can use it on the WAL file system and inhibit
> its use if not.
>
> Incidentally, I notice it's not used at all in test_fsync.c - should it
> not be?

test_fsync certainly should be using PG_O_DIRECT in the same places the
backend does. Once we decide how to handle PG_O_DIRECT, I will modify
test_fsync to match.

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: Andrew Dunstan <andrew(at)dunslane(dot)net>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: We really ought to do something about O_DIRECT and data=journalled on ext4
Date: 2010-12-03 19:55:23
Message-ID: 4CF94B2B.7060001@agliodbs.com
Lists: pgsql-hackers

All,

So, I've been doing some reading about this issue, and I think
regardless of what other changes we make we should never enable O_DIRECT
automatically on Linux, and it was a mistake for us to do so in the
first place.

First, in the Linux docs for open():

=========

In summary, O_DIRECT is a potentially powerful tool that should be used
with caution. It is recommended that applications treat use of O_DIRECT
as a performance option which is disabled by default.

=========

Second, Linus has a quote about O_DIRECT that I think should serve as an
indicator to us that directIO will never be beneficial-by-default on
Linux, and might even someday be desupported:

============

The right way to do it is to just not use O_DIRECT.

The whole notion of "direct IO" is totally braindamaged. Just say no.

This is your brain: O
This is your brain on O_DIRECT: .

Any questions?

I should have fought back harder. There really is no valid reason for EVER
using O_DIRECT. You need a buffer whatever IO you do, and it might as well
be the page cache. There are better ways to control the page cache than
play games and think that a page cache isn't necessary.

So don't use O_DIRECT. Use things like madvise() and posix_fadvise()
instead.

Linus
=============

--
-- Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: We really ought to do something about O_DIRECT and data=journalled on ext4
Date: 2010-12-03 20:02:24
Message-ID: 4CF94CD0.1040004@enterprisedb.com
Lists: pgsql-hackers

On 03.12.2010 21:55, Josh Berkus wrote:
> All,
>
> So, I've been doing some reading about this issue, and I think
> regardless of what other changes we make we should never enable O_DIRECT
> automatically on Linux, and it was a mistake for us to do so in the
> first place.
>
> First, in the Linux docs for open():

The quote on that man page is hilarious:

"The thing that has always disturbed me about O_DIRECT is that
the whole interface is just stupid, and was probably designed by
a deranged monkey on some serious mind-controlling substances."
-- Linus

I agree we should not enable it by default. If it's faster in some
circumstances, the admin is free to do the research and enable it, but
defaults need to be safe above all.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Greg Smith <greg(at)2ndquadrant(dot)com>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: We really ought to do something about O_DIRECT and data=journalled on ext4
Date: 2010-12-06 20:53:42
Message-ID: 1909.1291668822@sss.pgh.pa.us
Lists: pgsql-hackers

Greg Smith <greg(at)2ndquadrant(dot)com> writes:
> Regardless, I'm now leaning heavily toward the idea of avoiding
> open_datasync by default given this bug, and backpatching that change to
> at least 8.4. I'll do some more database-level performance tests here
> just as a final sanity check on that. My gut feel is now that we'll
> eventually be taking something like Marti's patch, adding some more
> documentation around it, and applying that to HEAD as well as some
> number of back branches.

I think we have got consensus that (1) open_datasync should not be the
default on Linux, and (2) this change needs to be back-patched. What
is not clear to me is whether we have consensus to change the option
preference order globally, or restrict the change to just be effective
on Linux. The various testing that's been reported so far is all for
Linux and thus doesn't directly address the question of whether other
kernels will have similar performance properties. However, it seems
reasonable to me to suppose that open_datasync could only be a win in
very restricted scenarios and thus shouldn't be a preferred default.
Also, I dread trying to document the behavior if the preference order
becomes platform-dependent.

With the holidays fast approaching, our window to do something about
this in a timely fashion grows short. If we don't schedule update
releases to be made this week, I think we're looking at not getting the
updates out till after New Year's. Do we want to wait that long? Is
anyone actually planning to do performance testing that would prove
anything about non-Linux platforms?

regards, tom lane


From: Greg Smith <greg(at)2ndquadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: We really ought to do something about O_DIRECT and data=journalled on ext4
Date: 2010-12-06 23:56:26
Message-ID: 4CFD782A.5020205@2ndquadrant.com
Lists: pgsql-hackers

Tom Lane wrote:
> The various testing that's been reported so far is all for
> Linux and thus doesn't directly address the question of whether other
> kernels will have similar performance properties.

Survey of some popular platforms:

Linux: don't want O_DIRECT by default for reliability reasons, and
there's no clear performance win in the default config with small
wal_buffers

Solaris: O_DIRECT doesn't work; there's another API, but support for it
has never been added; see
http://blogs.sun.com/jkshah/entry/postgresql_wal_sync_method_and

Windows: Small reported gains for O_DIRECT, i.e. 10% at
http://archives.postgresql.org/pgsql-hackers/2007-03/msg01615.php

FreeBSD: It probably works there, but I've never seen good performance
tests of it on this platform.

Mac OS X: Like Solaris, there's a similar mechanism but it's not
O_DIRECT; see
http://stackoverflow.com/questions/2299402/how-does-one-do-raw-io-on-mac-os-x-ie-equivalent-to-linuxs-o-direct-flag
for notes about the F_NOCACHE feature used. Same basic situation as
Solaris; there's an API, but PostgreSQL doesn't use it yet.

So my guess is that some small percentage of Windows users might notice
a change here, and some testing on FreeBSD would be useful too. That's
about it for platforms that I think anybody needs to worry about.
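
The survey above can be partly reproduced on a given machine by probing which primitives the local headers expose. Below is a minimal sketch in Python, which mirrors the C constants through its `os` and `fcntl` modules. The function name `probe_sync_features` is hypothetical, and as this thread shows, a flag being defined says nothing about whether it is safe on a particular filesystem.

```python
import os

try:
    import fcntl  # not available on Windows
except ImportError:
    fcntl = None


def probe_sync_features():
    """Report which sync-related primitives this platform's headers expose.

    Presence only means the symbol is defined; it does not prove the
    feature works (the ext4 data=journal case is exactly that trap).
    """
    return {
        "O_DSYNC": hasattr(os, "O_DSYNC"),
        "O_SYNC": hasattr(os, "O_SYNC"),
        "O_DIRECT": hasattr(os, "O_DIRECT"),
        "fdatasync": hasattr(os, "fdatasync"),
        "fsync": hasattr(os, "fsync"),
        # Mac OS X cache-bypass mechanism mentioned in the survey
        "F_NOCACHE": fcntl is not None and hasattr(fcntl, "F_NOCACHE"),
    }


if __name__ == "__main__":
    for name, present in sorted(probe_sync_features().items()):
        print(f"{name:10s} {'yes' if present else 'no'}")
```

This is the same kind of compile-time evidence configure collects, so it shares the weakness discussed at the top of the thread: it cannot detect a flag that is defined but fails at run time.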

--
Greg Smith 2ndQuadrant US greg(at)2ndQuadrant(dot)com Baltimore, MD
PostgreSQL Training, Services and Support www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books


From: Steve Singer <ssinger(at)ca(dot)afilias(dot)info>
To: Greg Smith <greg(at)2ndquadrant(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: We really ought to do something about O_DIRECT and data=journalled on ext4
Date: 2010-12-07 00:43:15
Message-ID: 4CFD8323.6030204@ca.afilias.info
Lists: pgsql-hackers

On 10-12-06 06:56 PM, Greg Smith wrote:
> Tom Lane wrote:
>> The various testing that's been reported so far is all for
>> Linux and thus doesn't directly address the question of whether other
>> kernels will have similar performance properties.
>
> Survey of some popular platforms:
>

<snip>

> So my guess is that some small percentage of Windows users might notice
> a change here, and some testing on FreeBSD would be useful too. That's
> about it for platforms that I think anybody needs to worry about.

If you tell me which options to pgbench and which .conf file settings
you'd like to see I can probably arrange to run some tests on AIX.

>
> --
> Greg Smith 2ndQuadrant US greg(at)2ndQuadrant(dot)com Baltimore, MD
> PostgreSQL Training, Services and Support www.2ndQuadrant.us
> "PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books
>


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Greg Smith <greg(at)2ndquadrant(dot)com>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: We really ought to do something about O_DIRECT and data=journalled on ext4
Date: 2010-12-07 01:34:33
Message-ID: 5875.1291685673@sss.pgh.pa.us
Lists: pgsql-hackers

Greg Smith <greg(at)2ndquadrant(dot)com> writes:
> So my guess is that some small percentage of Windows users might notice
> a change here, and some testing on FreeBSD would be useful too. That's
> about it for platforms that I think anybody needs to worry about.

To my mind, O_DIRECT is not really the key issue here, it's whether to
prefer O_DSYNC or fdatasync. I looked back in the archives, and I think
that the main reason we prefer O_DSYNC when available is the results
I got here:

http://archives.postgresql.org/pgsql-hackers/2001-03/msg00381.php

which demonstrated a performance benefit on HPUX 10.20, though with a
test tool much more primitive than test_fsync. I still have that
machine, although the disk that was in it at the time died awhile back.
What's in there now is a Seagate ST336607LW spinning at 10000 RPM (166
rev/sec) and today I get numbers like this from test_fsync:

Simple write:
    8k write                         28331.020/second

Compare file sync methods using one write:
    open_datasync 8k write             161.190/second
    open_sync 8k write                 156.478/second
    8k write, fdatasync                 54.302/second
    8k write, fsync                     51.810/second

Compare file sync methods using two writes:
    2 open_datasync 8k writes           81.702/second
    2 open_sync 8k writes               80.172/second
    8k write, 8k write, fdatasync       40.829/second
    8k write, 8k write, fsync           39.836/second

Compare open_sync with different sizes:
    open_sync 16k write                 80.192/second
    2 open_sync 8k writes               78.018/second

Test if fsync on non-write file descriptor is honored:
(If the times are similar, fsync() can sync data written
on a different descriptor.)
    8k write, fsync, close              52.527/second
    8k write, close, fsync              54.092/second

So *on that rather ancient platform* there's a measurable performance
benefit to O_DSYNC, but this seems to be largely because fdatasync is
stubbed to fsync in userspace rather than because fdatasync wouldn't
be a better idea in the abstract. Also, a lot of the argument against
fsync at the time was that it forced the kernel to iterate through all
the buffers for the WAL file to see if any were dirty. I would imagine
that modern kernels are a tad smarter about that; and even if they
aren't, the CPU speed versus disk speed tradeoff has changed enough
since 2001 that iterating through 16MB of buffers isn't as interesting
as it was then.
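
The figures above can be read against the drive's rotation rate: when each synced write has to wait for the same spot on the platter to come around, a single spinning disk cannot complete much more than one such write per revolution. Here is a rough sketch of that arithmetic using the numbers reported in this message. The helper names and the "revolutions per write" reading are illustrative assumptions, not something test_fsync computes.

```python
def revs_per_second(rpm):
    """Rotation rate of the platter in revolutions per second."""
    return rpm / 60.0


def revs_per_op(rpm, ops_per_second):
    """Approximate platter revolutions spent per synced write."""
    return revs_per_second(rpm) / ops_per_second


RPM = 10000  # Seagate ST336607LW, per the message above

# Rates copied from the one-write section of the test_fsync output above.
results = {
    "open_datasync": 161.190,
    "open_sync": 156.478,
    "fdatasync": 54.302,
    "fsync": 51.810,
}

if __name__ == "__main__":
    for method, rate in results.items():
        print(f"{method:14s} {revs_per_op(RPM, rate):.2f} revolutions per write")
```

By this measure the open_* methods cost roughly one revolution per write, while fdatasync and fsync both cost roughly three. That fits the observation that fdatasync behaves like fsync on this platform.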

So to my mind, switching to the preference order fdatasync,
fsync_writethrough, fsync seems like the thing to do. Since we assume
fsync is always available, that means that O_DSYNC/O_SYNC will not be
the defaults on any platform.
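
The proposed rule can be sketched as a simple first-match selection over the methods a platform turned out to support. This is an illustration only, not PostgreSQL's actual code: the function name `default_sync_method` and the idea of passing in a set of method names are assumptions made for the example.

```python
# Proposed preference order from the message above; fsync is assumed
# to exist everywhere, so the search always finds something.
PREFERENCE = ("fdatasync", "fsync_writethrough", "fsync")


def default_sync_method(available):
    """Return the first preferred method present in `available`.

    `available` is a set of method names that the platform probes found.
    """
    for method in PREFERENCE:
        if method in available:
            return method
    raise ValueError("fsync is assumed to be available on every platform")


if __name__ == "__main__":
    # A platform offering everything except fsync_writethrough
    print(default_sync_method({"open_datasync", "open_sync", "fdatasync", "fsync"}))
    # A platform offering only open_sync and fsync
    print(default_sync_method({"open_sync", "fsync"}))
```

Because open_datasync and open_sync never appear in the preference tuple, they can only be chosen by explicit configuration, which is the point of the proposal.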

regards, tom lane


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: We really ought to do something about O_DIRECT and data=journalled on ext4
Date: 2010-12-07 02:00:24
Message-ID: 4CFD9538.8080509@agliodbs.com
Lists: pgsql-hackers

Steve,

> If you tell me which options to pgbench and which .conf file settings
> you'd like to see I can probably arrange to run some tests on AIX.

Compile and run test_fsync in PGSRC/src/tools/fsync.
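
For anyone who cannot build the C tool, the kind of measurement test_fsync makes can be approximated in a few lines. The following is a rough Python sketch, not a port of test_fsync: the function names, loop count, and temporary file location are all assumptions, and interpreter overhead means the absolute numbers are not comparable to the C tool's.

```python
import os
import tempfile
import time

BLOCK = b"\0" * 8192  # one 8k block, the unit test_fsync writes


def ops_per_second(sync, loops=100, flags=0):
    """Time `loops` rewrites of one 8k block, calling `sync(fd)` after each."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    fd = os.open(path, os.O_RDWR | flags)
    try:
        start = time.perf_counter()
        for _ in range(loops):
            os.lseek(fd, 0, os.SEEK_SET)
            os.write(fd, BLOCK)
            sync(fd)
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
        os.unlink(path)
    return loops / elapsed


def run_all(loops=100):
    """Measure each method this platform exposes and skip the rest."""
    results = {"fsync": ops_per_second(os.fsync, loops)}
    if hasattr(os, "fdatasync"):
        results["fdatasync"] = ops_per_second(os.fdatasync, loops)
    if hasattr(os, "O_SYNC"):
        # open_sync: the flag makes write() itself synchronous
        results["open_sync"] = ops_per_second(lambda fd: None, loops, os.O_SYNC)
    if hasattr(os, "O_DSYNC"):
        results["open_datasync"] = ops_per_second(lambda fd: None, loops, os.O_DSYNC)
    return results


if __name__ == "__main__":
    for name, rate in run_all().items():
        print(f"{name:14s} {rate:10.3f}/second")
```

Note that the default temporary directory may be a memory-backed filesystem, in which case every method will look implausibly fast; point `tempfile` at the disk under test before drawing conclusions.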

--
-- Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: Greg Smith <greg(at)2ndquadrant(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: We really ought to do something about O_DIRECT and data=journalled on ext4
Date: 2010-12-07 02:04:02
Message-ID: 4CFD9612.9030006@agliodbs.com
Lists: pgsql-hackers


> Mac OS X: Like Solaris, there's a similar mechanism but it's not
> O_DIRECT; see
> http://stackoverflow.com/questions/2299402/how-does-one-do-raw-io-on-mac-os-x-ie-equivalent-to-linuxs-o-direct-flag
> for notes about the F_NOCACHE feature used. Same basic situation as
> Solaris; there's an API, but PostgreSQL doesn't use it yet.

Actually, on OSX 10.5.8, o_dsync and fdatasync aren't even available.
From my run, it looks like even so regular fsync might be better than
open_sync. Results from a MacBook:

Sidney-Stratton:fsync josh$ ./test_fsync
Loops = 10000

Simple write:
    8k write                          2121.004/second

Compare file sync methods using one write:
    (open_datasync unavailable)
    open_sync 8k write                1993.833/second
    (fdatasync unavailable)
    8k write, fsync                   1878.154/second

Compare file sync methods using two writes:
    (open_datasync unavailable)
    2 open_sync 8k writes             1005.009/second
    (fdatasync unavailable)
    8k write, 8k write, fsync         1709.862/second

Compare open_sync with different sizes:
    open_sync 16k write               1728.803/second
    2 open_sync 8k writes              969.416/second

Test if fsync on non-write file descriptor is honored:
(If the times are similar, fsync() can sync data written
on a different descriptor.)
    8k write, fsync, close            1772.572/second
    8k write, close, fsync            1939.897/second

--
-- Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: Greg Smith <greg(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: We really ought to do something about O_DIRECT and data=journalled on ext4
Date: 2010-12-07 02:05:28
Message-ID: AANLkTi=nf+Mdn7bdoziVowXuMZrNPzfwb3ZncF-Q98QC@mail.gmail.com
Lists: pgsql-hackers

On Mon, Dec 6, 2010 at 9:04 PM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
>
>> Mac OS X:  Like Solaris, there's a similar mechanism but it's not
>> O_DIRECT; see
>> http://stackoverflow.com/questions/2299402/how-does-one-do-raw-io-on-mac-os-x-ie-equivalent-to-linuxs-o-direct-flag
>> for notes about the F_NOCACHE  feature used.  Same basic situation as
>> Solaris; there's an API, but PostgreSQL doesn't use it yet.
>
> Actually, on OSX 10.5.8, o_dsync and fdatasync aren't even available.
> From my run, it looks like even so regular fsync might be better than
> open_sync.

But I think you need to use fsync_writethrough if you actually want durability.
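
On Mac OS X, plain fsync() hands the data to the drive, which may still be holding it in its write cache; the write-through variant additionally asks the drive to flush, via fcntl() with F_FULLFSYNC. Here is a minimal sketch of that distinction, assuming Python's `fcntl` module exposes `F_FULLFSYNC` on the platform. The helper name `sync_writethrough` is hypothetical, and falling back to plain fsync elsewhere is a choice made for this example.

```python
import os
import tempfile

try:
    import fcntl  # not available on Windows
except ImportError:
    fcntl = None


def sync_writethrough(fd):
    """Flush `fd` through the drive cache where the platform allows it.

    Returns the name of the mechanism actually used.
    """
    full = getattr(fcntl, "F_FULLFSYNC", None) if fcntl else None
    if full is not None:
        fcntl.fcntl(fd, full)  # ask the drive itself to flush (Mac OS X)
        return "F_FULLFSYNC"
    os.fsync(fd)  # fallback: durability then depends on drive cache settings
    return "fsync"


if __name__ == "__main__":
    with tempfile.TemporaryFile() as f:
        f.write(b"\0" * 8192)
        f.flush()
        print(sync_writethrough(f.fileno()))
```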

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, Greg Smith <greg(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: We really ought to do something about O_DIRECT and data=journalled on ext4
Date: 2010-12-07 02:10:13
Message-ID: 6815.1291687813@sss.pgh.pa.us
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> On Mon, Dec 6, 2010 at 9:04 PM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
>> Actually, on OSX 10.5.8, o_dsync and fdatasync aren't even available.
>> From my run, it looks like even so regular fsync might be better than
>> open_sync.

> But I think you need to use fsync_writethrough if you actually want durability.

Yeah. Unless your laptop contains an SSD, those numbers are garbage on
their face. So that's another problem with test_fsync: it omits
fsync_writethrough.
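
The reasoning is the rotational bound used earlier in this thread (10000 RPM is about 166 revolutions per second): a rotating disk that honors every sync cannot report more synced writes per second than it has revolutions. A quick check of the MacBook figures follows, assuming a 5400 RPM laptop drive. The drive speed was not stated in the report, so that figure and the helper names are purely assumptions.

```python
def max_synced_writes_per_second(rpm):
    """Upper bound for writes that each wait for the platter: one per revolution."""
    return rpm / 60.0


def looks_cached(ops_per_second, rpm):
    """True if the rate is impossible for a rotating disk honoring each sync."""
    return ops_per_second > max_synced_writes_per_second(rpm)


ASSUMED_RPM = 5400  # assumption: typical laptop drive, not stated in the report

if __name__ == "__main__":
    # "8k write, fsync" from the MacBook run quoted above
    print(max_synced_writes_per_second(ASSUMED_RPM))  # 90.0
    print(looks_cached(1878.154, ASSUMED_RPM))        # True
```

Even a much faster drive would not change the verdict, since 1878 synced writes per second would require a platter spinning at over 100,000 RPM.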

regards, tom lane


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Greg Smith <greg(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: We really ought to do something about O_DIRECT and data=journalled on ext4
Date: 2010-12-07 02:12:25
Message-ID: 4CFD9809.20608@agliodbs.com
Lists: pgsql-hackers

On 12/6/10 6:10 PM, Tom Lane wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> On Mon, Dec 6, 2010 at 9:04 PM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
>>> Actually, on OSX 10.5.8, o_dsync and fdatasync aren't even available.
>>> From my run, it looks like even so regular fsync might be better than
>>> open_sync.
>
>> But I think you need to use fsync_writethrough if you actually want durability.
>
> Yeah. Unless your laptop contains an SSD, those numbers are garbage on
> their face. So that's another problem with test_fsync: it omits
> fsync_writethrough.

Yeah, the issue with test_fsync appears to be that it's designed to work
without os-specific switches no matter what, not to accurately reflect
how we access wal.

I'll see if I can do better.

--
-- Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: We really ought to do something about O_DIRECT and data=journalled on ext4
Date: 2010-12-07 02:32:48
Message-ID: 4CFD9CD0.6010809@agliodbs.com
Lists: pgsql-hackers

All,

Geirth's results from his FreeBSD 7.1 server using 8.4's test_fsync:

Simple write timing:
    write                     0.007081

Compare fsync times on write() and non-write() descriptor:
If the times are similar, fsync() can sync data written
on a different descriptor.
    write, fsync, close       5.937933
    write, close, fsync       8.056394

Compare one o_sync write to two:
    one 16k o_sync write      7.366927
    two 8k o_sync writes     15.299300

Compare file sync methods with one 8k write:
    (o_dsync unavailable)
    open o_sync, write        7.512682
    (fdatasync unavailable)
    write, fsync              5.856480

Compare file sync methods with two 8k writes:
    (o_dsync unavailable)
    open o_sync, write       15.472910
    (fdatasync unavailable)
    write, fsync              5.880319

... again, open_sync does not look very impressive.

--
-- Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com


From: Steve Singer <ssinger(at)ca(dot)afilias(dot)info>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: We really ought to do something about O_DIRECT and data=journalled on ext4
Date: 2010-12-07 18:16:57
Message-ID: 4CFE7A19.3080306@ca.afilias.info
Lists: pgsql-hackers

On 10-12-06 09:00 PM, Josh Berkus wrote:
> Steve,
>
>> If you tell me which options to pgbench and which .conf file settings
>> you'd like to see I can probably arrange to run some tests on AIX.
>
> Compile and run test_fsync in PGSRC/src/tools/fsync.
>

Attached are runs against two different disk sub-systems from a server
running AIX 5.3.

The first one is against the local disks

Loops = 10000

Simple write:
    8k write                         60812.454/second

Compare file sync methods using one write:
    open_datasync 8k write             162.160/second
    open_sync 8k write                 158.472/second
    8k write, fdatasync                158.157/second
    8k write, fsync                     45.382/second

Compare file sync methods using two writes:
    2 open_datasync 8k writes           79.472/second
    2 open_sync 8k writes               80.095/second
    8k write, 8k write, fdatasync      159.268/second
    8k write, 8k write, fsync           44.725/second

Compare open_sync with different sizes:
    open_sync 16k write                162.017/second
    2 open_sync 8k writes               79.709/second

Test if fsync on non-write file descriptor is honored:
(If the times are similar, fsync() can sync data written
on a different descriptor.)
    8k write, fsync, close              45.361/second
    8k write, close, fsync              36.311/second

================================

The below profile is from the same machine using an IBM DS 6800 SAN for
storage.

Loops = 10000

Simple write:
    8k write                         75933.027/second

Compare file sync methods using one write:
    open_datasync 8k write            2762.801/second
    open_sync 8k write                2453.822/second
    8k write, fdatasync               2867.331/second
    8k write, fsync                   1094.048/second

Compare file sync methods using two writes:
    2 open_datasync 8k writes         1287.845/second
    2 open_sync 8k writes             1332.084/second
    8k write, 8k write, fdatasync     1966.411/second
    8k write, 8k write, fsync         1048.354/second

Compare open_sync with different sizes:
    open_sync 16k write               2281.425/second
    2 open_sync 8k writes             1401.561/second

Test if fsync on non-write file descriptor is honored:
(If the times are similar, fsync() can sync data written
on a different descriptor.)
    8k write, fsync, close            1298.404/second
    8k write, close, fsync            1188.582/second
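
With this many runs in the thread, it can help to rank the methods mechanically rather than by eye. Below is a small sketch that parses the "label rate/second" lines of test_fsync output and sorts them. The regular expression and function name are assumptions made for this example, and the sample is the one-write section of the SAN run above.

```python
import re

# A label, some whitespace, then a number followed by "/second".
LINE = re.compile(r"^\s*(?P<label>.*\S)\s+(?P<rate>\d+(?:\.\d+)?)/second\s*$")


def parse_rates(text):
    """Map each 'label rate/second' line of test_fsync output to a float."""
    rates = {}
    for line in text.splitlines():
        m = LINE.match(line)
        if m:
            rates[m.group("label")] = float(m.group("rate"))
    return rates


# One-write section of the SAN run above.
san_one_write = """\
open_datasync 8k write 2762.801/second
open_sync 8k write 2453.822/second
8k write, fdatasync 2867.331/second
8k write, fsync 1094.048/second
"""

if __name__ == "__main__":
    rates = parse_rates(san_one_write)
    for label, rate in sorted(rates.items(), key=lambda kv: kv[1], reverse=True):
        print(f"{rate:10.3f}  {label}")
```

On this sample fdatasync ranks first, slightly ahead of open_datasync, with plain fsync well behind.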


From: Marti Raudsepp <marti(at)juffo(dot)org>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Greg Smith <greg(at)2ndquadrant(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: We really ought to do something about O_DIRECT and data=journalled on ext4
Date: 2010-12-08 15:14:18
Message-ID: AANLkTin4vUCB9Ssb6Ywi80_fHR8dLPz=PqsS72aEjp05@mail.gmail.com
Lists: pgsql-hackers

On Tue, Dec 7, 2010 at 03:34, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> To my mind, O_DIRECT is not really the key issue here, it's whether to
> prefer O_DSYNC or fdatasync.

Since different platforms implement these primitives differently, and
it's not always clear from the header file definitions which options
are actually implemented, how about simply hard-coding a default value
for each platform?

1. This would be quite straightforward to code and document (a table
of platforms and their default wal_sync_method setting)

2. The best performing (or safest) method can be chosen on every
platform. From the above discussion it seems that Windows and OSX
should default to fsync_writethrough even if other methods are
available

3. It would pre-empt similar surprises if other platforms change their
header files, like what happened on Linux now.

Sounds like the simple and foolproof solution.

Regards,
Marti


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Marti Raudsepp <marti(at)juffo(dot)org>
Cc: Greg Smith <greg(at)2ndquadrant(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: We really ought to do something about O_DIRECT and data=journalled on ext4
Date: 2010-12-08 15:36:21
Message-ID: 15299.1291822581@sss.pgh.pa.us
Lists: pgsql-hackers

Marti Raudsepp <marti(at)juffo(dot)org> writes:
> On Tue, Dec 7, 2010 at 03:34, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> To my mind, O_DIRECT is not really the key issue here, it's whether to
>> prefer O_DSYNC or fdatasync.

> Since different platforms implement these primitives differently, and
> it's not always clear from the header file definitions which options
> are actually implemented, how about simply hard-coding a default value
> for each platform?

There's not a fixed finite list of "platforms we support". In general
we prefer to avoid designing things that way at all. If we have to have
specific exceptions for specific platforms, we grin and bear it, but for
the most part behavioral differences ought to be driven by configure's
probes for platform features.

regards, tom lane


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Robert Haas <robertmhaas(at)gmail(dot)com>, Greg Smith <greg(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: We really ought to do something about O_DIRECT and data=journalled on ext4
Date: 2011-03-11 01:25:52
Message-ID: 201103110125.p2B1Prd19420@momjian.us
Lists: pgsql-hackers

Josh Berkus wrote:
> On 12/6/10 6:10 PM, Tom Lane wrote:
> > Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> >> On Mon, Dec 6, 2010 at 9:04 PM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
> >>> Actually, on OSX 10.5.8, o_dsync and fdatasync aren't even available.
> >>> From my run, it looks like even so regular fsync might be better than
> >>> open_sync.
> >
> >> But I think you need to use fsync_writethrough if you actually want durability.
> >
> > Yeah. Unless your laptop contains an SSD, those numbers are garbage on
> > their face. So that's another problem with test_fsync: it omits
> > fsync_writethrough.
>
> Yeah, the issue with test_fsync appears to be that it's designed to work
> without os-specific switches no matter what, not to accurately reflect
> how we access wal.

I have now modified pg_test_fsync to use O_DIRECT for O_SYNC/O_FSYNC,
and O_DSYNC, if supported, so it now matches how we use WAL (except we
don't use O_DIRECT when in 'archive' and 'hot standby' mode). Applied
patch attached.
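
The rule described above can be sketched as a small flag-selection function. This is an illustration under stated assumptions, not the applied patch: the function name `wal_open_flags` is hypothetical, and the level names 'minimal', 'archive', and 'hot_standby' are assumed to correspond to the modes mentioned.

```python
import os

# Each flag falls back to 0 where the platform does not define it.
O_DIRECT = getattr(os, "O_DIRECT", 0)
O_SYNC = getattr(os, "O_SYNC", 0)
O_DSYNC = getattr(os, "O_DSYNC", 0)


def wal_open_flags(sync_method, wal_level="minimal"):
    """Pick open() flags for a WAL-like file under the rule described above.

    O_DIRECT is added only for the open_* methods, and only when the
    level is neither 'archive' nor 'hot_standby'.
    """
    flags = os.O_RDWR
    if sync_method == "open_sync":
        flags |= O_SYNC
    elif sync_method == "open_datasync":
        flags |= O_DSYNC
    else:
        return flags  # fdatasync / fsync variants sync explicitly after write
    if wal_level == "minimal":
        flags |= O_DIRECT
    return flags


if __name__ == "__main__":
    for method in ("open_datasync", "open_sync", "fdatasync"):
        for level in ("minimal", "archive"):
            print(f"{method:14s} {level:8s} {wal_open_flags(method, level):#o}")
```

Skipping O_DIRECT in the archive and hot standby cases keeps freshly written WAL in the kernel cache, where whatever ships it elsewhere will read it back shortly afterwards.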

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +

Attachment Content-Type Size
/rtmp/fsync.diff text/x-diff 8.0 KB