Re: Use of O_DIRECT only for open_* sync options

Lists: pgsql-hackers
From: Bruce Momjian <bruce(at)momjian(dot)us>
To: PostgreSQL-development <pgsql-hackers(at)postgreSQL(dot)org>
Subject: Use of O_DIRECT only for open_* sync options
Date: 2011-01-19 18:53:14
Message-ID: 201101191853.p0JIrEn15002@momjian.us

Is there a reason we only use O_DIRECT with open_* sync options?
xlogdefs.h says:

/*
* Because O_DIRECT bypasses the kernel buffers, and because we never
* read those buffers except during crash recovery, it is a win to use
* it in all cases where we sync on each write(). We could allow O_DIRECT
* with fsync(), but because skipping the kernel buffer forces writes out
* quickly, it seems best just to use it for O_SYNC. It is hard to imagine
* how fsync() could be a win for O_DIRECT compared to O_SYNC and O_DIRECT.
* Also, O_DIRECT is never enough to force data to the drives, it merely
* tries to bypass the kernel cache, so we still need O_SYNC or fsync().
*/

This seems wrong because fsync() can win if there are two writes before
the sync call. Can kernels not issue fsync() if the write was O_DIRECT?
If that is the cause, we should document it.
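
A minimal sketch of the two-writes case, assuming a POSIX system. The
function names and the 8192-byte block size are made up for illustration
and are not taken from the PostgreSQL source. The first function issues two
write() calls and syncs once; the second opens with O_SYNC, so each write()
is its own sync point.

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

#define BLOCK_SIZE 8192         /* illustrative block size (assumption) */

/* Two buffered writes followed by a single fsync() covering both. */
int write_two_then_fsync(const char *path)
{
    char buf[BLOCK_SIZE];
    int  rc = 0;
    int  fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);

    if (fd < 0)
        return -1;
    memset(buf, 'a', sizeof buf);

    for (int i = 0; i < 2 && rc == 0; i++)
        if (write(fd, buf, sizeof buf) != (ssize_t) sizeof buf)
            rc = -1;

    if (rc == 0 && fsync(fd) != 0)      /* one sync call for two writes */
        rc = -1;
    if (close(fd) != 0)
        rc = -1;
    return rc;
}

/* Same two writes, but O_SYNC makes every write() synchronous. */
int write_two_osync(const char *path)
{
    char buf[BLOCK_SIZE];
    int  rc = 0;
    int  fd = open(path, O_WRONLY | O_CREAT | O_TRUNC | O_SYNC, 0600);

    if (fd < 0)
        return -1;
    memset(buf, 'b', sizeof buf);

    for (int i = 0; i < 2 && rc == 0; i++)
        if (write(fd, buf, sizeof buf) != (ssize_t) sizeof buf)
            rc = -1;            /* each successful write was already synced */

    if (close(fd) != 0)
        rc = -1;
    return rc;
}
```

Timing the two functions on the same device is one way to see whether the
single fsync() actually wins; the result depends on the kernel, the
filesystem, and the drive's write cache.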

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Use of O_DIRECT only for open_* sync options
Date: 2011-01-20 02:12:29
Message-ID: AANLkTinz7CtONGSoSCh+dPg0i5a6b3juSkOcvdtoAP6F@mail.gmail.com

On Wed, Jan 19, 2011 at 1:53 PM, Bruce Momjian <bruce(at)momjian(dot)us> wrote:
> Is there a reason we only use O_DIRECT with open_* sync options?
> xlogdefs.h says:
>
> /*
>  *  Because O_DIRECT bypasses the kernel buffers, and because we never
>  *  read those buffers except during crash recovery, it is a win to use
>  *  it in all cases where we sync on each write().  We could allow O_DIRECT
>  *  with fsync(), but because skipping the kernel buffer forces writes out
>  *  quickly, it seems best just to use it for O_SYNC.  It is hard to imagine
>  *  how fsync() could be a win for O_DIRECT compared to O_SYNC and O_DIRECT.
>  *  Also, O_DIRECT is never enough to force data to the drives, it merely
>  *  tries to bypass the kernel cache, so we still need O_SYNC or fsync().
>  */
>
> This seems wrong because fsync() can win if there are two writes before
> the sync call.

Well, the comment does say "...in all cases where we sync on each
write()". But that's certainly not true of WAL, so I dunno.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Greg Smith <greg(at)2ndquadrant(dot)com>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: PostgreSQL-development <pgsql-hackers(at)postgreSQL(dot)org>
Subject: Re: Use of O_DIRECT only for open_* sync options
Date: 2011-01-23 13:43:11
Message-ID: 4D3C306F.8030209@2ndquadrant.com

Bruce Momjian wrote:
> xlogdefs.h says:
>
> /*
> * Because O_DIRECT bypasses the kernel buffers, and because we never
> * read those buffers except during crash recovery, it is a win to use
> * it in all cases where we sync on each write(). We could allow O_DIRECT
> * with fsync(), but because skipping the kernel buffer forces writes out
> * quickly, it seems best just to use it for O_SYNC. It is hard to imagine
> * how fsync() could be a win for O_DIRECT compared to O_SYNC and O_DIRECT.
> * Also, O_DIRECT is never enough to force data to the drives, it merely
> * tries to bypass the kernel cache, so we still need O_SYNC or fsync().
> */
>
> This seems wrong because fsync() can win if there are two writes before
> the sync call. Can kernels not issue fsync() if the write was O_DIRECT?
> If that is the cause, we should document it.
>

The comment does look busted, because you did imagine exactly a case
where they might be combined. The only incompatibility that I'm aware
of is that O_DIRECT requires reads and writes to be aligned properly, so
you can't use it in random application code unless it's aware of that.
O_DIRECT and fsync are compatible; for example, MySQL allows combining
the two: http://dev.mysql.com/doc/refman/5.1/en/innodb-parameters.html

(That whole bit of documentation around innodb_flush_method includes
some very interesting observations around O_DIRECT actually)
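
The alignment requirement and the O_DIRECT plus fsync() combination can be
sketched roughly as follows. This assumes Linux with glibc. The function
name and the 4096/8192 sizes are illustrative assumptions, and the real
alignment rule depends on the device and filesystem. Where O_DIRECT is not
defined, the sketch degrades to ordinary buffered writes.

```c
#define _GNU_SOURCE             /* exposes O_DIRECT on Linux/glibc */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#ifndef O_DIRECT
#define O_DIRECT 0              /* no direct I/O here; use buffered writes */
#endif

#define ALIGNMENT 4096          /* assumed to satisfy the device's rule */
#define CHUNK     8192          /* write size, a multiple of ALIGNMENT */

/*
 * Two aligned O_DIRECT writes followed by one fsync().
 * Returns 0 on success, 1 if the filesystem rejects O_DIRECT (EINVAL),
 * and -1 on any other error.
 */
int direct_writes_then_fsync(const char *path)
{
    void *buf = NULL;
    int   rc = 0;
    int   fd;

    assert(CHUNK % ALIGNMENT == 0);

    fd = open(path, O_WRONLY | O_CREAT | O_TRUNC | O_DIRECT, 0600);
    if (fd < 0)
        return errno == EINVAL ? 1 : -1;

    /* The buffer address must be aligned, not just the length. */
    if (posix_memalign(&buf, ALIGNMENT, CHUNK) != 0) {
        close(fd);
        return -1;
    }
    memset(buf, 'x', CHUNK);

    /* Sequential writes land at offsets 0 and CHUNK, both aligned. */
    for (int i = 0; i < 2 && rc == 0; i++) {
        ssize_t n = write(fd, buf, CHUNK);

        if (n != CHUNK)
            rc = (n < 0 && errno == EINVAL) ? 1 : -1;
    }

    /* O_DIRECT alone does not force data to the drive, so sync once. */
    if (rc == 0 && fsync(fd) != 0)
        rc = -1;

    free(buf);
    if (close(fd) != 0 && rc == 0)
        rc = -1;
    return rc;
}
```

A return value of 1 only means that this filesystem refused direct I/O. It
does not say anything about whether the combination is a performance win.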

I'm starting to consider the idea that much of the performance gain
seen on earlier systems with O_DIRECT was because it decreased CPU usage
shuffling things into the OS cache, rather than its impact on avoiding
pollution of said cache. On Linux for example, its main accomplishment
is described like this: "File I/O is done directly to/from user space
buffers."
http://www.kernel.org/doc/man-pages/online/pages/man2/open.2.html The
earliest paper on the implementation suggests a big decrease in CPU
overhead from that:
http://www.ukuug.org/events/linux2001/papers/html/AArcangeli-o_direct.html

Impossible to guess whether that's more true ("CPU cache pollution is a
bigger problem now") or less true ("drives are much slower relative to
CPUs now") today. I'm trying to remain agnostic and let the benchmarks
offer an opinion instead.

--
Greg Smith 2ndQuadrant US greg(at)2ndQuadrant(dot)com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Greg Smith <greg(at)2ndquadrant(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgreSQL(dot)org>
Subject: Re: Use of O_DIRECT only for open_* sync options
Date: 2011-01-25 00:19:45
Message-ID: 201101250019.p0P0Jj900973@momjian.us

Greg Smith wrote:
> Bruce Momjian wrote:
> > xlogdefs.h says:
> >
> > /*
> > * Because O_DIRECT bypasses the kernel buffers, and because we never
> > * read those buffers except during crash recovery, it is a win to use
> > * it in all cases where we sync on each write(). We could allow O_DIRECT
> > * with fsync(), but because skipping the kernel buffer forces writes out
> > * quickly, it seems best just to use it for O_SYNC. It is hard to imagine
> > * how fsync() could be a win for O_DIRECT compared to O_SYNC and O_DIRECT.
> > * Also, O_DIRECT is never enough to force data to the drives, it merely
> > * tries to bypass the kernel cache, so we still need O_SYNC or fsync().
> > */
> >
> > This seems wrong because fsync() can win if there are two writes before
> > the sync call. Can kernels not issue fsync() if the write was O_DIRECT?
> > If that is the cause, we should document it.
> >
>
> The comment does look busted, because you did imagine exactly a case
> where they might be combined. The only incompatibility that I'm aware
> of is that O_DIRECT requires reads and writes to be aligned properly, so
> you can't use it in random application code unless it's aware of that.
> O_DIRECT and fsync are compatible; for example, MySQL allows combining
> the two: http://dev.mysql.com/doc/refman/5.1/en/innodb-parameters.html
>
> (That whole bit of documentation around innodb_flush_method includes
> some very interesting observations around O_DIRECT actually)
>
> I'm starting to consider the idea that much of the performance gain
> seen on earlier systems with O_DIRECT was because it decreased CPU usage
> shuffling things into the OS cache, rather than its impact on avoiding
> pollution of said cache. On Linux for example, its main accomplishment
> is described like this: "File I/O is done directly to/from user space
> buffers."
> http://www.kernel.org/doc/man-pages/online/pages/man2/open.2.html The
> earliest paper on the implementation suggests a big decrease in CPU
> overhead from that:
> http://www.ukuug.org/events/linux2001/papers/html/AArcangeli-o_direct.html
>
> Impossible to guess whether that's more true ("CPU cache pollution is a
> bigger problem now") or less true ("drives are much slower relative to
> CPUs now") today. I'm trying to remain agnostic and let the benchmarks
> offer an opinion instead.

Agreed. Perhaps we need a separate setting to turn direct I/O on and
off, and decouple wal_sync_method and direct I/O.
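
A rough sketch of what that decoupling could look like, assuming Linux with
glibc. The SyncMethod enum, the direct_io flag, and wal_open_flags() are
hypothetical names for illustration only and do not correspond to existing
PostgreSQL code or settings.

```c
#define _GNU_SOURCE             /* exposes O_DIRECT on Linux/glibc */
#include <fcntl.h>
#include <stdbool.h>

#ifndef O_DIRECT
#define O_DIRECT 0              /* platform without direct I/O */
#endif

/* Hypothetical sync methods, loosely mirroring wal_sync_method values. */
typedef enum
{
    SYNC_FSYNC,
    SYNC_FDATASYNC,
    SYNC_OPEN_SYNC,
    SYNC_OPEN_DATASYNC
} SyncMethod;

/*
 * Compute the extra open() flags for a WAL file. The sync method and the
 * direct I/O switch are independent inputs, so O_DIRECT is no longer tied
 * to the open_* methods.
 */
int wal_open_flags(SyncMethod method, bool direct_io)
{
    int flags = 0;

    switch (method)
    {
        case SYNC_OPEN_SYNC:
            flags |= O_SYNC;
            break;
        case SYNC_OPEN_DATASYNC:
            flags |= O_DSYNC;
            break;
        case SYNC_FSYNC:
        case SYNC_FDATASYNC:
            break;              /* synced later by an explicit call */
    }

    if (direct_io)
        flags |= O_DIRECT;

    return flags;
}
```

With this shape, wal_open_flags(SYNC_FSYNC, true) yields O_DIRECT with no
sync flag, which is the combination the current coupling rules out.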

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Greg Smith <greg(at)2ndquadrant(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgreSQL(dot)org>
Subject: Re: Use of O_DIRECT only for open_* sync options
Date: 2011-03-11 11:47:21
Message-ID: 201103111147.p2BBlLN29891@momjian.us

Greg Smith wrote:
> Bruce Momjian wrote:
> > xlogdefs.h says:
> >
> > /*
> > * Because O_DIRECT bypasses the kernel buffers, and because we never
> > * read those buffers except during crash recovery, it is a win to use
> > * it in all cases where we sync on each write(). We could allow O_DIRECT
> > * with fsync(), but because skipping the kernel buffer forces writes out
> > * quickly, it seems best just to use it for O_SYNC. It is hard to imagine
> > * how fsync() could be a win for O_DIRECT compared to O_SYNC and O_DIRECT.
> > * Also, O_DIRECT is never enough to force data to the drives, it merely
> > * tries to bypass the kernel cache, so we still need O_SYNC or fsync().
> > */
> >
> > This seems wrong because fsync() can win if there are two writes before
> > the sync call. Can kernels not issue fsync() if the write was O_DIRECT?
> > If that is the cause, we should document it.
> >
>
> The comment does look busted, because you did imagine exactly a case
> where they might be combined. The only incompatibility that I'm aware
> of is that O_DIRECT requires reads and writes to be aligned properly, so
> you can't use it in random application code unless it's aware of that.
> O_DIRECT and fsync are compatible; for example, MySQL allows combining
> the two: http://dev.mysql.com/doc/refman/5.1/en/innodb-parameters.html

C comment updated in git head:

* Because O_DIRECT bypasses the kernel buffers, and because we never
* read those buffers except during crash recovery or if wal_level != minimal,
* it is a win to use it in all cases where we sync on each write(). We could
* allow O_DIRECT with fsync(), but it is unclear if fsync() could process
* writes not buffered in the kernel. Also, O_DIRECT is never enough to force
* data to the drives, it merely tries to bypass the kernel cache, so we still
* need O_SYNC/O_DSYNC.

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +