Re: [PATCHES] O_DIRECT for WAL writes

From: ITAGAKI Takahiro <itagaki(dot)takahiro(at)lab(dot)ntt(dot)co(dot)jp>
To: pgsql-patches(at)postgresql(dot)org
Subject: O_DIRECT for WAL writes
Date: 2005-05-26 08:04:01
Message-ID: 20050526155748.3D95.ITAGAKI.TAKAHIRO@lab.ntt.co.jp
Lists: pgsql-hackers pgsql-patches

This patch ticks off the following TODO item:
Consider use of open/fcntl(O_DIRECT) to minimize OS caching, especially for WAL writes.

The patch adds a new choice "open_direct" to wal_sync_method.
It uses O_DIRECT flags for WAL writes, like O_SYNC.

I sent a similar patch before
(http://candle.pha.pa.us/mhonarc/patches2/msg00131.html),
but I found that writing multiple pages in one write() is not always
necessary, because most disks have a write-back cache. So I kept only the
O_DIRECT routines in this patch, which makes it less invasive to the
existing source code.

---
ITAGAKI Takahiro
NTT Cyber Space Laboratories

Attachment Content-Type Size
xlog.diff application/octet-stream 3.6 KB

From: Neil Conway <neilc(at)samurai(dot)com>
To: ITAGAKI Takahiro <itagaki(dot)takahiro(at)lab(dot)ntt(dot)co(dot)jp>
Cc: pgsql-patches(at)postgresql(dot)org
Subject: Re: O_DIRECT for WAL writes
Date: 2005-05-26 14:15:07
Message-ID: 4295D9EB.4000703@samurai.com

ITAGAKI Takahiro wrote:
> The patch adds a new choice "open_direct" to wal_sync_method.
> It uses O_DIRECT flags for WAL writes, like O_SYNC.

Have you looked at what the performance difference of this option is?
For example, these benchmark results seem to indicate that an older
version of the patch is not a performance win, at least for OSDL's workload:

http://www.mail-archive.com/pgsql-patches(at)postgresql(dot)org/msg07186.html

Is this data still applicable to the revised patch?

-Neil


From: ITAGAKI Takahiro <itagaki(dot)takahiro(at)lab(dot)ntt(dot)co(dot)jp>
To: Neil Conway <neilc(at)samurai(dot)com>
Cc: pgsql-patches(at)postgresql(dot)org
Subject: Re: O_DIRECT for WAL writes
Date: 2005-05-30 01:59:59
Message-ID: 20050530094517.3DD8.ITAGAKI.TAKAHIRO@lab.ntt.co.jp

Neil Conway <neilc(at)samurai(dot)com> wrote:

> > The patch adds a new choice "open_direct" to wal_sync_method.
> Have you looked at what the performance difference of this option is?

Yes, I've tested with pgbench and dbt2, and performance improved on both.
The two results are as follows:

1. pgbench -s 100 on one Pentium4, 1GB mem, 2 ATA disks, Linux 2.6.8
(attached image)
  tps  | wal_sync_method
-------+------------------------------------------------
 147.0 | open_direct + write multipage (previous patch)
 147.2 | open_direct (this patch)
 109.9 | open_sync

2. dbt2 100WH on two opterons, 8GB mem, 12 SATA-RAID disks, Linux 2.4.20
  tpm   | wal_sync_method
--------+--------------------------
 1183.9 | open_direct (this patch)
  911.3 | fsync

> http://www.mail-archive.com/pgsql-patches(at)postgresql(dot)org/msg07186.html
> Is this data still applicable to the revised patch?

Direct I/O might be good on some machines and bad on others.
That data is another reason I revised the patch:
if you don't use open_direct, the WAL writer behaves much as it did before.

However, performance did not drop, at least in my benchmarks.
I have no idea why the above data was bad...

---
ITAGAKI Takahiro
NTT Cyber Space Laboratories

Attachment Content-Type Size
image/png 26.4 KB

From: Neil Conway <neilc(at)samurai(dot)com>
To: ITAGAKI Takahiro <itagaki(dot)takahiro(at)lab(dot)ntt(dot)co(dot)jp>
Cc: pgsql-patches(at)postgresql(dot)org
Subject: Re: O_DIRECT for WAL writes
Date: 2005-05-30 06:29:40
Message-ID: 1117434580.23266.31.camel@localhost.localdomain

On Mon, 2005-05-30 at 10:59 +0900, ITAGAKI Takahiro wrote:
> Yes, I've tested pgbench and dbt2 and their performances have improved.
> The two results are as follows:
>
> 1. pgbench -s 100 on one Pentium4, 1GB mem, 2 ATA disks, Linux 2.6.8
> (attached image)
> tps | wal_sync_method
> -------+-------------------------------------------------------
> 147.0 | open_direct + write multipage (previous patch)
> 147.2 | open_direct (this patch)
> 109.9 | open_sync

I'm surprised this makes as much of a difference as that benchmark would
suggest. I wonder if we're benchmarking the right thing, though: is
opening a file with O_DIRECT sufficient to ensure that a write(2) does
not return until the data has hit disk? (As would be the case with
O_SYNC.) O_DIRECT means the OS will attempt to minimize caching, but
that is not necessarily the same thing: for example, I can imagine an
implementation in which the kernel would submit the appropriate I/O to
the disk when it sees a write(2) on a file opened with O_DIRECT, but
then let the write(2) return before getting confirmation from the disk
that the I/O has succeeded or failed. From googling, the MySQL
documentation for innodb_flush_method notes:

This option is only relevant on Unix systems. If set to
fdatasync, InnoDB uses fsync() to flush both the data and log
files. If set to O_DSYNC, InnoDB uses O_SYNC to open and flush
the log files, but uses fsync() to flush the datafiles. If
O_DIRECT is specified (available on some GNU/Linux versions
starting from MySQL 4.0.14), InnoDB uses O_DIRECT to open the
datafiles, and uses fsync() to flush both the data and log
files.

That would suggest O_DIRECT by itself is not sufficient to force a flush
to disk -- if anyone has some more definitive evidence that would be
welcome.

Anyway, if the above is true, we'll need to use O_DIRECT as well as one
of the existing wal_sync_methods.
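
To make that combination concrete, here is a minimal C sketch (not taken
from the patch) that opens a file with O_DIRECT and still calls fsync()
before treating the write as durable. The helper name write_block_durably,
the SKETCH_ALIGN value of 4096, and the fallback to buffered I/O when
open() rejects O_DIRECT with EINVAL are all assumptions made for
illustration.

```c
#define _GNU_SOURCE             /* exposes O_DIRECT on Linux/glibc */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#ifndef O_DIRECT
#define O_DIRECT 0              /* platform without O_DIRECT: buffered I/O */
#endif

#define SKETCH_ALIGN 4096       /* assumed alignment, not autodetected */

/*
 * Write one aligned block and force it to stable storage.
 * Returns 0 on success, -1 on error.  Hypothetical helper.
 */
int
write_block_durably(const char *path, const void *block, size_t len)
{
    int     fd;
    ssize_t n;

    /* O_DIRECT generally wants the buffer address and length aligned. */
    assert(((uintptr_t) block % SKETCH_ALIGN) == 0);
    assert((len % SKETCH_ALIGN) == 0);

    fd = open(path, O_WRONLY | O_CREAT | O_DIRECT, 0600);
    if (fd < 0 && errno == EINVAL)
        fd = open(path, O_WRONLY | O_CREAT, 0600);  /* fs rejects O_DIRECT */
    if (fd < 0)
        return -1;

    /* O_DIRECT only bypasses the OS cache; fsync() asks for durability. */
    n = write(fd, block, len);
    if (n != (ssize_t) len || fsync(fd) != 0)
    {
        close(fd);
        return -1;
    }
    return close(fd);
}
```

The point of the sketch is only that the two mechanisms are independent:
O_DIRECT changes how data reaches the device, and the explicit fsync() is
what requests the flush.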

BTW, from the patch:

+ /* TODO: Aligment depends on OS and filesystem. */
+ #define O_DIRECT_BUFFER_ALIGN 4096

I suppose there's no reasonable way to autodetect this, so we'll need to
expose it as a GUC variable (or perhaps a configure option), which is a
bit unfortunate.
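
Whatever the alignment value turns out to be, the buffer handed to
write() has to start on that boundary. A small C sketch of one way to
get such a buffer from plain malloc() follows; align_up, alloc_aligned
and SKETCH_BUFFER_ALIGN are hypothetical names, and 4096 is just the
placeholder value from the patch, not a detected one.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Assumed value, mirroring the patch's O_DIRECT_BUFFER_ALIGN placeholder. */
#define SKETCH_BUFFER_ALIGN 4096

/* Round n up to the next multiple of align (align must be a power of two). */
uintptr_t
align_up(uintptr_t n, uintptr_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (n + align - 1) & ~(align - 1);
}

/*
 * Return a pointer to `size` usable bytes starting at an aligned address.
 * *raw receives the pointer that must later be passed to free().
 */
void *
alloc_aligned(size_t size, size_t align, void **raw)
{
    /* Over-allocate so an aligned start always fits inside the block. */
    *raw = malloc(size + align);
    if (*raw == NULL)
        return NULL;
    return (void *) align_up((uintptr_t) *raw, align);
}
```

On systems that provide it, posix_memalign() does the same job in one
call; the manual version is shown because it makes the rounding explicit.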

-Neil


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Neil Conway <neilc(at)samurai(dot)com>
Cc: ITAGAKI Takahiro <itagaki(dot)takahiro(at)lab(dot)ntt(dot)co(dot)jp>, pgsql-patches(at)postgresql(dot)org
Subject: Re: O_DIRECT for WAL writes
Date: 2005-05-30 06:52:09
Message-ID: 6455.1117435929@sss.pgh.pa.us

Neil Conway <neilc(at)samurai(dot)com> writes:
> I wonder if we're benchmarking the right thing, though: is
> opening a file with O_DIRECT sufficient to ensure that a write(2) does
> not return until the data has hit disk?

Some googling suggests so, eg
http://www.die.net/doc/linux/man/man2/open.2.html

There are several useful tidbits about O_DIRECT on that page,
including this quote:

> "The thing that has always disturbed me about O_DIRECT is that the whole
> interface is just stupid, and was probably designed by a deranged monkey
> on some serious mind-controlling substances." -- Linus

Somehow I find that less than confidence-building...

regards, tom lane


From: Neil Conway <neilc(at)samurai(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: ITAGAKI Takahiro <itagaki(dot)takahiro(at)lab(dot)ntt(dot)co(dot)jp>, pgsql-patches(at)postgresql(dot)org
Subject: Re: O_DIRECT for WAL writes
Date: 2005-05-30 07:04:41
Message-ID: 1117436681.23266.41.camel@localhost.localdomain

On Mon, 2005-05-30 at 02:52 -0400, Tom Lane wrote:
> Some googling suggests so, eg
> http://www.die.net/doc/linux/man/man2/open.2.html

Well, that claims that "data is guaranteed to have been transferred",
but transferred to *where* is the question :) Transferring data to the
disk's buffers and then not asking for the buffer to be flushed is not
sufficient, for example. IMHO the fact that InnoDB uses both O_DIRECT
and fsync() is more convincing. I'm still looking for a definitive
answer, though.

The other question is whether these semantics are identical among the
various O_DIRECT implementations (e.g. Linux, FreeBSD, AIX, IRIX, and
others).

-Neil


From: Ron Mayer <rm_pg(at)cheapcomplexdevices(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: O_DIRECT for WAL writes
Date: 2005-05-30 08:04:48
Message-ID: 429AC920.6080809@cheapcomplexdevices.com

Tom Lane wrote:
> Neil Conway <neilc(at)samurai(dot)com> writes:
>>is opening a file with O_DIRECT sufficient to ensure that
>>a write(2) does not return until the data has hit disk?
>
> Some googling suggests so, eg
> http://www.die.net/doc/linux/man/man2/open.2.html

Really? On that page I read:
"O_DIRECT...at the completion of the read(2) or write(2)
system call, data is guaranteed to have been transferred."
which sounds to me like transferred to the device's cache
but not necessarily flushed through the device's cache.
It says nothing about physical media. That wording feels
different to me from O_SYNC which reads:
"O_SYNC will block the calling process until the data has
been physically written to the underlying hardware."
which does suggest to me that it writes to physical media.
Or am I reading that wrong?

PS: I've gotten way out of my depth here, but...

...attempting to browse the Linux source(!!)

Looking at the O_SYNC stuff in ext3:
http://lxr.linux.no/source/fs/ext3/file.c#L67
it looks like in this conditional:
    if (file->f_flags & O_SYNC) {
        ...
        goto force_commit;
    }
the goto branch calls ext3_force_commit() in much the
same way that it seems fsync() does here:
http://lxr.linux.no/source/fs/ext3/fsync.c#L71
so I believe O_SYNC does at least as much as fsync().

However I can't find O_DIRECT anywhere in the ext3 stuff,
so if it does work it's less obvious how or if it could.

Moreover I see O_SYNC used in lots of places:
http://lxr.linux.no/ident?i=O_SYNC
including fs/ext3/, and I don't
see O_DIRECT in nearly as many places:
http://lxr.linux.no/ident?i=O_DIRECT
It looks like reiserfs and xfs look at O_DIRECT,
but ext3 doesn't appear to unless it's somewhere
outside the fs/ext3 directory.

PPS: Of course not even fsync() flushed correctly until very recent kernels:
http://hardware.slashdot.org/comments.pl?sid=149349&cid=12519114
In that article Jeff Garzik (the linux SATA driver guy) suggests
that until very recent kernels ext3 did not have write barrier
support that issues the FLUSH CACHE (IDE) or SYNCHRONIZE CACHE
(SCSI) commands even on fsync.

PPPS: No, I don't understand the kernel - I'm just showing what quick
grep commands showed without any deep understanding.


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Neil Conway <neilc(at)samurai(dot)com>
Cc: ITAGAKI Takahiro <itagaki(dot)takahiro(at)lab(dot)ntt(dot)co(dot)jp>, pgsql-patches(at)postgresql(dot)org
Subject: Re: O_DIRECT for WAL writes
Date: 2005-05-30 15:24:54
Message-ID: 13664.1117466694@sss.pgh.pa.us

Neil Conway <neilc(at)samurai(dot)com> writes:
> On Mon, 2005-05-30 at 02:52 -0400, Tom Lane wrote:
> Well, that claims that "data is guaranteed to have been transferred",
> but transferred to *where* is the question :)

Oh, I see what you are worried about. I think you are right: what the
doc promises is only that the DMA transfer has finished (ie, it's safe
to scribble on your buffer again). So you'd still need an fsync;
which makes O_DIRECT orthogonal to wal_sync_method rather than a
valid choice for it. (Hm, I wonder if specifying both O_DIRECT and
O_SYNC works ...)

> The other question is whether these semantics are identical among the
> various O_DIRECT implementations (e.g. Linux, FreeBSD, AIX, IRIX, and
> others).

Wouldn't count on it :-(. One thing I'm particularly worried about is
buffer cache consistency: does the kernel guarantee to flush any buffers
it has that overlap the O_DIRECT write operation? Without this, an
application reading the WAL using normal non-O_DIRECT I/O might see the
wrong data; which is bad news for PITR.

regards, tom lane


From: Neil Conway <neilc(at)samurai(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: ITAGAKI Takahiro <itagaki(dot)takahiro(at)lab(dot)ntt(dot)co(dot)jp>, pgsql-patches(at)postgresql(dot)org
Subject: Re: O_DIRECT for WAL writes
Date: 2005-05-31 01:08:27
Message-ID: 1117501707.6678.18.camel@localhost.localdomain

On Mon, 2005-05-30 at 11:24 -0400, Tom Lane wrote:
> Wouldn't count on it :-(. One thing I'm particularly worried about is
> buffer cache consistency: does the kernel guarantee to flush any buffers
> it has that overlap the O_DIRECT write operation?

At least on Linux I believe the kernel guarantees consistency between
O_DIRECT and non-O_DIRECT operations. From googling, it seems AIX also
provides consistency, albeit not for free[1]:

To avoid consistency issues, if there are multiple calls to open
a file and one or more of the calls did not specify O_DIRECT and
another open specified O_DIRECT, the file stays in the normal
cached I/O mode. Similarly, if the file is mapped into memory
through the shmat() or mmap() system calls, it stays in normal
cached mode. If the last conflicting, non-direct access is
eliminated, then the file system will move the file into direct
I/O mode (either by using the close(), munmap(), or shmdt()
subroutines). Changing from normal mode to direct I/O mode can
be expensive because all modified pages in memory will have to
be flushed to disk at that point.

-Neil

[1]
http://publib16.boulder.ibm.com/pseries/en_US/aixbman/prftungd/diskperf9.htm


From: Mary Edie Meredith <maryedie(at)osdl(dot)org>
To: Neil Conway <neilc(at)samurai(dot)com>
Cc: ITAGAKI Takahiro <itagaki(dot)takahiro(at)lab(dot)ntt(dot)co(dot)jp>, pgsql-patches(at)postgresql(dot)org
Subject: Re: O_DIRECT for WAL writes
Date: 2005-06-02 00:08:14
Message-ID: 1117670894.2922.339.camel@localhost

On Mon, 2005-05-30 at 16:29 +1000, Neil Conway wrote:
> On Mon, 2005-05-30 at 10:59 +0900, ITAGAKI Takahiro wrote:
> > Yes, I've tested pgbench and dbt2 and their performances have improved.
> > The two results are as follows:
> >
> > 1. pgbench -s 100 on one Pentium4, 1GB mem, 2 ATA disks, Linux 2.6.8
> > (attached image)
> > tps | wal_sync_method
> > -------+-------------------------------------------------------
> > 147.0 | open_direct + write multipage (previous patch)
> > 147.2 | open_direct (this patch)
> > 109.9 | open_sync
>
> I'm surprised this makes as much of a difference as that benchmark would
> suggest. I wonder if we're benchmarking the right thing, though: is
> opening a file with O_DIRECT sufficient to ensure that a write(2) does
> not return until the data has hit disk? (As would be the case with
> O_SYNC.) O_DIRECT means the OS will attempt to minimize caching, but
> that is not necessarily the same thing: for example, I can imagine an
> implementation in which the kernel would submit the appropriate I/O to
> the disk when it sees a write(2) on a file opened with O_DIRECT, but
> then let the write(2) return before getting confirmation from the disk
> that the I/O has succeeded or failed. From googling, the MySQL
> documentation for innodb_flush_method notes:
>
> This option is only relevant on Unix systems. If set to
> fdatasync, InnoDB uses fsync() to flush both the data and log
> files. If set to O_DSYNC, InnoDB uses O_SYNC to open and flush
> the log files, but uses fsync() to flush the datafiles. If
> O_DIRECT is specified (available on some GNU/Linux versions
> starting from MySQL 4.0.14), InnoDB uses O_DIRECT to open the
> datafiles, and uses fsync() to flush both the data and log
> files.
>
> That would suggest O_DIRECT by itself is not sufficient to force a flush
> to disk -- if anyone has some more definitive evidence that would be
> welcome.

I know I'm late to this discussion, and I haven't made it all the way
through this thread to see if your questions on Linux writes were
resolved. If you are still interested, I recommend reading a very good
one-page description of reliable writes buried in the Data Center Linux
Goals and Capabilities document. It is on page 159 of the document; the
item is "R.ReliableWrites" in this _giant_ PDF file (do a wget and open
it locally; don't try to read it directly):

http://www.osdlab.org/lab_activities/data_center_linux/DCL_Goals_Capabilities_1.1.pdf

The information came from me interviewing Daniel McNeil, an OSDL
Engineer who wrote and tested much of the Linux async IO code, after I
was similarly confused about when a write is "guaranteed". Reliable
writes, as you can imagine, are very important to Data Center folks,
which is how it happens to be in this document.

Hope this helps.
>
> Anyway, if the above is true, we'll need to use O_DIRECT as well as one
> of the existing wal_sync_methods.
>
> BTW, from the patch:
>
> + /* TODO: Aligment depends on OS and filesystem. */
> + #define O_DIRECT_BUFFER_ALIGN 4096
>
> I suppose there's no reasonable way to autodetect this, so we'll need to
> expose it as a GUC variable (or perhaps a configure option), which is a
> bit unfortunate.
>
> -Neil
--
Mary Edie Meredith
maryedie(at)osdl(dot)org
503-906-1942
Data Center Linux Initiative Manager
Open Source Development Labs


From: Neil Conway <neilc(at)samurai(dot)com>
To: maryedie(at)osdl(dot)org
Cc: ITAGAKI Takahiro <itagaki(dot)takahiro(at)lab(dot)ntt(dot)co(dot)jp>, pgsql-patches(at)postgresql(dot)org
Subject: Re: O_DIRECT for WAL writes
Date: 2005-06-02 01:39:25
Message-ID: 1117676365.6678.101.camel@localhost.localdomain

On Wed, 2005-06-01 at 17:08 -0700, Mary Edie Meredith wrote:
> I know I'm late to this discussion, and I haven't made it all the way
> through this thread to see if your questions on Linux writes were
> resolved. If you are still interested, I recommend read a very good
> one page description of reliable writes buried in the Data Center Linux
> Goals and Capabilities document.

This suggests that on Linux a write() on a file opened with O_DIRECT has
the same synchronization guarantees as a write() on a file opened with
O_SYNC, which is precisely the opposite of what was concluded down
thread. So now I'm more confused :)

(Regardless of behavior on Linux, I would guess O_DIRECT doesn't behave
this way on all platforms -- for example, FreeBSD's open(2) manpage does
not mention I/O synchronization when referring to O_DIRECT. So even if
we can skip the fsync() with O_DIRECT on Linux, I doubt we'll be able to
do that on all platforms.)

-Neil


From: Mary Edie Meredith <maryedie(at)osdl(dot)org>
To: Neil Conway <neilc(at)samurai(dot)com>
Cc: ITAGAKI Takahiro <itagaki(dot)takahiro(at)lab(dot)ntt(dot)co(dot)jp>, pgsql-patches(at)postgresql(dot)org
Subject: Re: O_DIRECT for WAL writes
Date: 2005-06-02 18:49:28
Message-ID: 1117738168.2922.411.camel@localhost

On Thu, 2005-06-02 at 11:39 +1000, Neil Conway wrote:
> On Wed, 2005-06-01 at 17:08 -0700, Mary Edie Meredith wrote:
> > I know I'm late to this discussion, and I haven't made it all the way
> > through this thread to see if your questions on Linux writes were
> > resolved. If you are still interested, I recommend read a very good
> > one page description of reliable writes buried in the Data Center Linux
> > Goals and Capabilities document.
>
> This suggests that on Linux a write() on a file opened with O_DIRECT has
> the same synchronization guarantees as a write() on a file opened with
> O_SYNC, which is precisely the opposite of what was concluded down
> thread. So now I'm more confused :)
>
> (Regardless of behavior on Linux, I would guess O_DIRECT doesn't behave
> this way on all platforms -- for example, FreeBSD's open(2) manpage does
> not mention I/O synchronization when referring to O_DIRECT. So even if
> we can skip the fsync() with O_DIRECT on Linux, I doubt we'll be able to
> do that on all platforms.)

My understanding is that O_DIRECT means "direct" as in "no buffering by
the OS", which implies that if you write from your buffer, the write is
not going to return unless the OS thinks the write is completed (or
unless you are using async IO). Otherwise you might reuse your buffer
(there _is_ no other buffer, after all), and if the first write were still
incomplete when you refilled the buffer for another, the first write might
go out from your buffer with the wrong data.

Now if you want to avoid _waiting_ for the write to complete, you need to
employ async io, which is why most databases that support direct io for
their datafiles also have implemented some form of async io as well
(either via OS calls or some built-in mechanism as is the case with
SAP-DB). With AIO you have to manage your buffers so that you reuse them
only when you are notified the IO is completed. Historically this was
done with raw datafiles, but currently (at least for Linux) you can also
do this with files. For logging, though, I think you want synchronous
IO to guarantee order.

The cool thing about buffering the datafile data yourself is that _you_
(the database engine) can control what stays in (shared) memory and what
does not. You can add configuration options or add intelligence, so
that frequently used data (like hot indexes) can stay in memory
indefinitely. The OS can never do that so specifically. In addition,
you can avoid having data from table scans overwrite hot objects. Of
course, at the moment you are discussing the use for logging, but there
should be benefits to extending this to datafiles as well, assuming you
also implement async io.

Bottom line: if you do not implement direct/async IO so that you
optimize caching of hot database objects and minimize memory utilization
of objects used once, you are probably leaving performance on the table
for datafiles.

Daniel is on vacation, but I will ask him to confirm once he returns.
>
> -Neil
>
--
Mary Edie Meredith
maryedie(at)osdl(dot)org
503-906-1942
Data Center Linux Initiative Manager
Open Source Development Labs


From: Neil Conway <neilc(at)samurai(dot)com>
To: maryedie(at)osdl(dot)org
Cc: ITAGAKI Takahiro <itagaki(dot)takahiro(at)lab(dot)ntt(dot)co(dot)jp>, pgsql-patches(at)postgresql(dot)org
Subject: Re: O_DIRECT for WAL writes
Date: 2005-06-03 00:37:39
Message-ID: 1117759059.22984.17.camel@localhost.localdomain

On Thu, 2005-06-02 at 11:49 -0700, Mary Edie Meredith wrote:
> My understanding is that O_DIRECT means "direct" as in "no buffering by
> the OS" which implies that if you write from your buffer, the write is
> not going to return unless the OS thinks the write is completed

Right, I think that's definitely the case. The question is whether a
write() under O_DIRECT will also flush the disk's write cache -- i.e.
when the write() completes, we need it to be durable over a spontaneous
power loss. fsync() or O_SYNC should provide this (modulo braindamaged
IDE hardware), but I wouldn't be surprised if O_DIRECT by itself will
not (otherwise you would hurt the performance of applications using
O_DIRECT that don't need these durability guarantees).

> Bottom line: if you do not implement direct/async IO so that you
> optimize caching of hot database objects and minimize memory utilization
> of objects used once, you are probably leaving performance on the table
> for datafiles.

Absolutely -- patches are welcome :) I agree async IO + O_DIRECT in some
form would be interesting, but the changes required are far from trivial
-- my guess is there are lower hanging fruit.

-Neil


From: Mary Edie Meredith <maryedie(at)osdl(dot)org>
To: Neil Conway <neilc(at)samurai(dot)com>
Cc: ITAGAKI Takahiro <itagaki(dot)takahiro(at)lab(dot)ntt(dot)co(dot)jp>, pgsql-patches(at)postgresql(dot)org
Subject: Re: O_DIRECT for WAL writes
Date: 2005-06-03 16:43:13
Message-ID: 1117816993.2922.523.camel@localhost

On Fri, 2005-06-03 at 10:37 +1000, Neil Conway wrote:
> On Thu, 2005-06-02 at 11:49 -0700, Mary Edie Meredith wrote:
> > My understanding is that O_DIRECT means "direct" as in "no buffering by
> > the OS" which implies that if you write from your buffer, the write is
> > not going to return unless the OS thinks the write is completed
>
> Right, I think that's definitely the case. The question is whether a
> write() under O_DIRECT will also flush the disk's write cache -- i.e.
> when the write() completes, we need it to be durable over a spontaneous
> power loss. fsync() or O_SYNC should provide this (modulo braindamaged
> IDE hardware), but I wouldn't be surprised if O_DIRECT by itself will
> not (otherwise you would hurt the performance of applications using
> O_DIRECT that don't need these durability guarantees).

My understanding is that for Linux, with respect to "guaranteed writes",
a write with the fd opened as O_DIRECT behaves the _same_ as a
write/fsync on an fd opened without O_DIRECT, i.e. whether the write
completes all the way to the disk itself depends on when the particular
device responds to those equivalent sequences.

Quoting from the Capabilities Document: "'Guarantee a write completion'
means the operating system has issued a write to the I/O subsystem, and
the device has returned an affirmative response. Once an affirmative
response is sent, recovery from power down without data loss is the
responsibility of the I/O subsystem." Don't most disk drives have a
battery backup so that they can flush their cache if power is lost? Ditto
for disk arrays with fancier cache and write-back set on (not advised
for the paranoid).

Looking at this from another angle, is there really any way that you can
say a write is truly guaranteed in the event of a failure? I think in
the end to be safe, you cannot. That's why (and I'm not telling you
anything new) there is no substitute for backups and log archiving for
databases. Databases must be able to recognize the last _good_
transaction logged and roll forward to that from the backup (including
detecting partial writes to the log). I'm sure the PostgreSQL community
has worked hard to do the equivalent of that within the PostgreSQL
architecture.
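
As a rough illustration of "detecting partial writes to the log", the
following C sketch frames each record with a length and a checksum and
stops scanning at the first record that does not verify. This is a
generic technique shown under assumed names (crc32_bytes, log_append,
log_find_end) and an invented record layout; it is not PostgreSQL's
actual WAL format.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Bitwise CRC-32 (reflected polynomial 0xEDB88320); slow but short. */
uint32_t
crc32_bytes(const unsigned char *p, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;

    for (size_t i = 0; i < len; i++)
    {
        crc ^= p[i];
        for (int b = 0; b < 8; b++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}

/* Hypothetical layout: 4-byte length, 4-byte CRC of payload, payload. */
#define REC_HDR 8

/* Append one record at offset off; returns the new end offset. */
size_t
log_append(unsigned char *logbuf, size_t off, const void *payload, uint32_t len)
{
    uint32_t crc = crc32_bytes(payload, len);

    memcpy(logbuf + off, &len, 4);
    memcpy(logbuf + off + 4, &crc, 4);
    memcpy(logbuf + off + REC_HDR, payload, len);
    return off + REC_HDR + len;
}

/* Scan from the start; return the offset just past the last good record. */
size_t
log_find_end(const unsigned char *logbuf, size_t size)
{
    size_t off = 0;

    while (size - off >= REC_HDR)
    {
        uint32_t len, crc;

        memcpy(&len, logbuf + off, 4);
        memcpy(&crc, logbuf + off + 4, 4);
        if (len == 0 || len > size - off - REC_HDR)
            break;              /* end of log, or record runs past the end */
        if (crc32_bytes(logbuf + off + REC_HDR, len) != crc)
            break;              /* torn or corrupted payload */
        off += REC_HDR + len;
    }
    return off;
}
```

Recovery would replay records only up to the offset log_find_end()
returns, so a write that was cut short by a power loss is ignored rather
than applied.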

>
> > Bottom line: if you do not implement direct/async IO so that you
> > optimize caching of hot database objects and minimize memory utilization
> > of objects used once, you are probably leaving performance on the table
> > for datafiles.
>
> Absolutely -- patches are welcome :)
How about testing patches (--:

> I agree async IO + O_DIRECT in some
> form would be interesting, but the changes required are far from trivial
> -- my guess is there are lower hanging fruit.
Since the log has to be sequential, I think you are on the right track!

Believe me, I didn't mean to imply that it is trivial to implement. For
those databases that have async/direct, the functionality appeared over
a span of several major versions. I just thought I detected an opinion
that it would not help. Sorry for the misunderstanding. I absolutely
don't mean to sound critical. At OSDL we have the greatest respect for
the PostgreSQL community.

>
> -Neil

--
Mary Edie Meredith
maryedie(at)osdl(dot)org
503-906-1942
Data Center Linux Initiative Manager
Open Source Development Labs


From: Bruno Wolff III <bruno(at)wolff(dot)to>
To: Mary Edie Meredith <maryedie(at)osdl(dot)org>
Cc: Neil Conway <neilc(at)samurai(dot)com>, ITAGAKI Takahiro <itagaki(dot)takahiro(at)lab(dot)ntt(dot)co(dot)jp>, pgsql-patches(at)postgresql(dot)org
Subject: Re: O_DIRECT for WAL writes
Date: 2005-06-03 19:24:51
Message-ID: 20050603192451.GA25970@wolff.to

On Fri, Jun 03, 2005 at 09:43:13 -0700,
Mary Edie Meredith <maryedie(at)osdl(dot)org> wrote:
>
> Looking at this from another angle, is there really any way that you can
> say a write is truly guaranteed in the event of a failure? I think in
> the end to be safe, you cannot. That's why (and I'm not telling you
> anything new) there is no substitute for backups and log archiving for
> databases. Databases must be able to recognize the last _good
> transaction logged and roll forward to that from the backup (including
> detecting partial writes to the log). I'm sure the PostgreSQL community
> has worked hard to do the equivalent of that within the PostgreSQL
> architecture.

Some assumptions are made about the order in which blocks are written to
the disk. If these assumptions do not hold, you may not be able to recover
using the WAL and may have to fall back to your last consistent
snapshot.


From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: ITAGAKI Takahiro <itagaki(dot)takahiro(at)lab(dot)ntt(dot)co(dot)jp>
Cc: Neil Conway <neilc(at)samurai(dot)com>, pgsql-patches(at)postgresql(dot)org
Subject: Re: O_DIRECT for WAL writes
Date: 2005-06-04 16:52:55
Message-ID: 200506041652.j54Gqtq04034@candle.pha.pa.us


I think the conclusion from the discussion is that O_DIRECT is in
addition to the sync method, rather than in place of it, because
O_DIRECT doesn't have the same media write guarantees as fsync(). Would
you update the patch to do O_DIRECT in addition to O_SYNC or fsync() and
see if there is a performance win?

Thanks.

---------------------------------------------------------------------------

ITAGAKI Takahiro wrote:
> Neil Conway <neilc(at)samurai(dot)com> wrote:
>
> > > The patch adds a new choice "open_direct" to wal_sync_method.
> > Have you looked at what the performance difference of this option is?
>
> Yes, I've tested pgbench and dbt2 and their performances have improved.
> The two results are as follows:
>
> 1. pgbench -s 100 on one Pentium4, 1GB mem, 2 ATA disks, Linux 2.6.8
> (attached image)
> tps | wal_sync_method
> -------+-------------------------------------------------------
> 147.0 | open_direct + write multipage (previous patch)
> 147.2 | open_direct (this patch)
> 109.9 | open_sync
>
> 2. dbt2 100WH on two opterons, 8GB mem, 12 SATA-RAID disks, Linux 2.4.20
> tpm | wal_sync_method
> --------+------------------------------------------------------
> 1183.9 | open_direct (this patch)
> 911.3 | fsync
>
>
>
> > http://www.mail-archive.com/pgsql-patches(at)postgresql(dot)org/msg07186.html
> > Is this data still applicable to the revised patch?
>
> Direct-IO might be good on some machines, and bad on others.
> This data is another reason that I revised the patch;
> If you don't use open_direct, WAL writer behaves quite similarly to former.
>
> However, the performances did not go down at least on my benchmarks.
> I have no idea why the above data was bad...
>
> ---
> ITAGAKI Takahiro
> NTT Cyber Space Laboratories
>

[ Attachment, skipping... ]


--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073


From: ITAGAKI Takahiro <itagaki(dot)takahiro(at)lab(dot)ntt(dot)co(dot)jp>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: [PATCHES] O_DIRECT for WAL writes
Date: 2005-06-21 05:43:57
Message-ID: 20050621100918.43D3.ITAGAKI.TAKAHIRO@lab.ntt.co.jp
Lists: pgsql-hackers pgsql-patches

Hi all,
O_DIRECT for WAL writes was discussed at
http://archives.postgresql.org/pgsql-patches/2005-06/msg00064.php
but there are some items I would still like to discuss, so I am
re-posting it to HACKERS.

Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> wrote:

> I think the conclusion from the discussion is that O_DIRECT is in
> addition to the sync method, rather than in place of it, because
> O_DIRECT doesn't have the same media write guarantees as fsync(). Would
> you update the patch to do O_DIRECT in addition to O_SYNC or fsync() and
> see if there is a performance win?

I tested two combinations,
- fsync_direct: O_DIRECT+fsync()
- open_direct: O_DIRECT+O_SYNC
to compare them with O_DIRECT on my Linux machine.
The pgbench results still show a performance win:

scale | DBsize | open_sync | fsync=false   | O_DIRECT only | fsync_direct  | open_direct
------+--------+-----------+---------------+---------------+---------------+--------------
   10 | 150MB  | 252.6 tps | 263.5(+ 4.3%) | 253.4(+ 0.3%) | 253.6(+ 0.4%) | 253.3(+ 0.3%)
  100 | 1.5GB  | 102.7 tps | 117.8(+14.7%) | 147.6(+43.7%) | 148.9(+45.0%) | 150.8(+46.8%)
60 runs * pgbench -c 10 -t 1000
on one Pentium4, 1GB mem, 2 ATA disks, Linux 2.6.8

O_DIRECT, fsync_direct and open_direct show the same performance trend.
There was a win at scale=100, but none at scale=10, which is a fully
in-memory benchmark.

The following items still need discussion:
- Are their names appropriate?
  Simplify to 'direct'?
- Are both fsync_direct and open_direct necessary?
  MySQL seems to use only the O_DIRECT+fsync() combination.
- Is it OK to set the direct I/O buffer alignment to BLCKSZ?
  This is a simple way to choose an alignment that suits many environments.
  If it is not enough, BLCKSZ itself would also be a problem for direct I/O.
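
To make the alignment question concrete, here is a minimal sketch of the O_DIRECT+fsync() combination with a BLCKSZ-sized, page-aligned buffer. It is not taken from the patch: it is written in Python rather than C for brevity, the function name write_block_direct is hypothetical, BLCKSZ is assumed to be 8192, and the fallback to a buffered write is an assumption added so the sketch also runs where O_DIRECT is unavailable or refused.

```python
import mmap
import os

BLCKSZ = 8192  # assumed block size, used here as the alignment unit


def write_block_direct(path, payload):
    """Write one BLCKSZ block with O_DIRECT + fsync() (hypothetical helper).

    Falls back to a plain buffered write where O_DIRECT does not exist or
    is rejected by the filesystem. Returns (bytes_written, used_direct).
    """
    if len(payload) > BLCKSZ:
        raise ValueError("payload does not fit in one block")

    # An anonymous mmap starts on a page boundary, so the buffer address is
    # aligned; BLCKSZ is assumed to be a multiple of the device sector size.
    buf = mmap.mmap(-1, BLCKSZ)
    try:
        buf[:len(payload)] = payload
        o_direct = getattr(os, "O_DIRECT", 0)  # 0 where the flag is missing
        attempts = [o_direct, 0] if o_direct else [0]
        for extra in attempts:
            try:
                fd = os.open(path, os.O_WRONLY | os.O_CREAT | extra, 0o600)
            except OSError:
                continue  # the filesystem refused this set of flags
            try:
                written = os.write(fd, buf)
                os.fsync(fd)  # O_DIRECT alone does not guarantee media writes
                return written, bool(extra)
            except OSError:
                continue  # write rejected; retry without O_DIRECT
            finally:
                os.close(fd)
        raise OSError("could not write block to " + path)
    finally:
        buf.close()
```

The point of the sketch is that both the buffer address and the transfer length are whole multiples of the block size, which is what aligning to BLCKSZ would buy. Whether that is sufficient on a given platform is exactly the open question above.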

BTW, IMHO the major benefit of direct I/O is saving memory. O_DIRECT gives
a hint that the OS should not cache WAL files. Without direct I/O, the OS might
make an effort to cache WAL files, which will never be used, and might evict
cached data files to make room.

---
ITAGAKI Takahiro
NTT Cyber Space Laboratories


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: ITAGAKI Takahiro <itagaki(dot)takahiro(at)lab(dot)ntt(dot)co(dot)jp>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: [PATCHES] O_DIRECT for WAL writes
Date: 2005-06-21 13:23:39
Message-ID: 1856.1119360219@sss.pgh.pa.us
Lists: pgsql-hackers pgsql-patches

ITAGAKI Takahiro <itagaki(dot)takahiro(at)lab(dot)ntt(dot)co(dot)jp> writes:
> I tested two combinations,
> - fsync_direct: O_DIRECT+fsync()
> - open_direct: O_DIRECT+O_SYNC
> to compare them with O_DIRECT on my Linux machine.
> The pgbench results still show a performance win:

> scale | DBsize | open_sync | fsync=false   | O_DIRECT only | fsync_direct  | open_direct
> ------+--------+-----------+---------------+---------------+---------------+--------------
>    10 | 150MB  | 252.6 tps | 263.5(+ 4.3%) | 253.4(+ 0.3%) | 253.6(+ 0.4%) | 253.3(+ 0.3%)
>   100 | 1.5GB  | 102.7 tps | 117.8(+14.7%) | 147.6(+43.7%) | 148.9(+45.0%) | 150.8(+46.8%)
> 60 runs * pgbench -c 10 -t 1000
> on one Pentium4, 1GB mem, 2 ATA disks, Linux 2.6.8

Unfortunately, I cannot believe these numbers --- the near equality of
fsync off and fsync on means there is something very wrong with the
measurements. What I suspect is that your ATA drives are doing write
caching and thus the "fsyncs" are not really waiting for I/O at all.

regards, tom lane


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: ITAGAKI Takahiro <itagaki(dot)takahiro(at)lab(dot)ntt(dot)co(dot)jp>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: [PATCHES] O_DIRECT for WAL writes
Date: 2005-06-21 17:19:02
Message-ID: 200506211019.02716.josh@agliodbs.com
Lists: pgsql-hackers pgsql-patches

Takahiro,

> scale | DBsize | open_sync | fsync=false   | O_DIRECT only | fsync_direct  | open_direct
> ------+--------+-----------+---------------+---------------+---------------+--------------
>    10 | 150MB  | 252.6 tps | 263.5(+ 4.3%) | 253.4(+ 0.3%) | 253.6(+ 0.4%) | 253.3(+ 0.3%)
>   100 | 1.5GB  | 102.7 tps | 117.8(+14.7%) | 147.6(+43.7%) | 148.9(+45.0%) | 150.8(+46.8%)
> 60 runs * pgbench -c 10 -t 1000
> on one Pentium4, 1GB mem, 2 ATA disks, Linux 2.6.8

This looks pretty good. I'd like to try it out on some of our tests.
Will get back to you on this, but it looks to me like the O_DIRECT
results are good enough to consider accepting the patch.

What filesystem and mount options did you use for this test?

> - Are both fsync_direct and open_direct necessary?
> MySQL seems to use only O_DIRECT+fsync() combination.

MySQL doesn't support as many operating systems as we do. What OSes and
versions will support O_DIRECT?

--
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco


From: Greg Stark <gsstark(at)mit(dot)edu>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: ITAGAKI Takahiro <itagaki(dot)takahiro(at)lab(dot)ntt(dot)co(dot)jp>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [PATCHES] O_DIRECT for WAL writes
Date: 2005-06-22 19:25:00
Message-ID: 87mzpiyz1v.fsf@stark.xeocode.com
Lists: pgsql-hackers pgsql-patches

Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:

> Unfortunately, I cannot believe these numbers --- the near equality of
> fsync off and fsync on means there is something very wrong with the
> measurements. What I suspect is that your ATA drives are doing write
> caching and thus the "fsyncs" are not really waiting for I/O at all.

I wonder whether it would make sense to have an automatic test for this
problem. I suspect there are lots of installations out there whose admins
don't realize that their hardware is doing this to them.

It shouldn't be too hard to test a few hundred or even a few thousand fsyncs
and calculate the seek time. If it implies a rotational speed over 15kRPM then
you know the drive is lying and the data storage is unreliable.
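
A minimal sketch of the arithmetic behind such a check, assuming each fsync of the same small file has to wait for about one platter revolution. The function names and the max_plausible_rpm parameter are illustrative only and are not part of any PostgreSQL tool:

```python
def implied_rpm(fsync_count, elapsed_seconds):
    """Rotational speed the drive would need if every fsync waited one revolution."""
    if fsync_count <= 0 or elapsed_seconds <= 0:
        raise ValueError("need a positive count and a positive duration")
    # fsyncs per second, converted to revolutions per minute
    return fsync_count / elapsed_seconds * 60.0


def drive_probably_caches(fsync_count, elapsed_seconds, max_plausible_rpm=15000):
    """True when the fsync rate is faster than the assumed fastest real platter."""
    return implied_rpm(fsync_count, elapsed_seconds) > max_plausible_rpm
```

For instance, 1000 fsyncs completing in half a second would imply a 120000 RPM drive, so the writes are almost certainly being acknowledged from a cache. A battery-backed cache would produce the same reading, so the result is a hint rather than proof that the storage is unreliable.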

--
greg


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: ITAGAKI Takahiro <itagaki(dot)takahiro(at)lab(dot)ntt(dot)co(dot)jp>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [PATCHES] O_DIRECT for WAL writes
Date: 2005-06-22 19:50:04
Message-ID: 20689.1119469804@sss.pgh.pa.us
Lists: pgsql-hackers pgsql-patches

Greg Stark <gsstark(at)mit(dot)edu> writes:
> Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:
>> Unfortunately, I cannot believe these numbers --- the near equality of
>> fsync off and fsync on means there is something very wrong with the
>> measurements. What I suspect is that your ATA drives are doing write
>> caching and thus the "fsyncs" are not really waiting for I/O at all.

> I wonder whether it would make sense to have an automatic test for this
> problem. I suspect there are lots of installations out there whose admins
> don't realize that their hardware is doing this to them.

Not sure about "automatic", but a simple little test program to measure
the speed of rewriting/fsyncing a small test file would surely be a nice
thing to have.

The reason I question "automatic" is that you really want to test each
drive being used, if the system has more than one; but Postgres has no
idea what the actual hardware layout is, and so no good way to know what
needs to be tested.
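
Something along these lines might do as a starting point. It is only a sketch (in Python rather than C for brevity), the function name and its defaults are made up for illustration, and it measures whatever device holds the path it is given, which is why it has to be pointed at each drive separately:

```python
import os
import time


def measure_fsync_rate(path, iterations=200, block=b"x" * 512):
    """Rewrite the same small block, fsync each time, and return fsyncs per second."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
    try:
        start = time.perf_counter()
        for _ in range(iterations):
            os.lseek(fd, 0, os.SEEK_SET)  # always rewrite the same spot
            os.write(fd, block)
            os.fsync(fd)  # should not return until the drive has the data
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
    return iterations / elapsed if elapsed > 0 else float("inf")
```

Run it once per filesystem of interest, for example with a scratch file next to the WAL and another next to the data files. A rate far above what the platter speed allows suggests a write cache somewhere in the path.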

regards, tom lane


From: Curt Sampson <cjs(at)cynic(dot)net>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, ITAGAKI Takahiro <itagaki(dot)takahiro(at)lab(dot)ntt(dot)co(dot)jp>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [PATCHES] O_DIRECT for WAL writes
Date: 2005-06-23 03:14:07
Message-ID: Pine.NEB.4.62.0506231205490.12377@angelic.cynic.net
Lists: pgsql-hackers pgsql-patches

On Thu, 22 Jun 2005, Greg Stark wrote:

> Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:
>
>> Unfortunately, I cannot believe these numbers --- the near equality of
>> fsync off and fsync on means there is something very wrong with the
>> measurements. What I suspect is that your ATA drives are doing write
>> caching and thus the "fsyncs" are not really waiting for I/O at all.
>
> I wonder whether it would make sense to have an automatic test for this
> problem. I suspect there are lots of installations out there whose admins
> don't realize that their hardware is doing this to them.

But is it really a problem? I somewhere got the impression that some
drives, on power failure, will be able to keep going for long enough to
write out the cache and park the heads anyway. If so, the drive is still
guaranteeing the write.

But regardless, perhaps we can add some stuff to the various OSes'
startup scripts that could help with this. For example, in NetBSD you
can "dkctl <device> setcache r" for most any disk device (certainly all
SCSI and ATA) to enable the read cache and disable the write cache.

cjs
--
Curt Sampson <cjs(at)cynic(dot)net> +81 90 7737 2974 http://www.NetBSD.org
Make up enjoying your city life...produced by BIC CAMERA


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Curt Sampson <cjs(at)cynic(dot)net>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, ITAGAKI Takahiro <itagaki(dot)takahiro(at)lab(dot)ntt(dot)co(dot)jp>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [PATCHES] O_DIRECT for WAL writes
Date: 2005-06-23 03:51:34
Message-ID: 389.1119498694@sss.pgh.pa.us
Lists: pgsql-hackers pgsql-patches

Curt Sampson <cjs(at)cynic(dot)net> writes:
> But regardless, perhaps we can add some stuff to the various OSes'
> startup scripts that could help with this. For example, in NetBSD you
> can "dkctl <device> setcache r" for most any disk device (certainly all
> SCSI and ATA) to enable the read cache and disable the write cache.

[ shudder ] I can see the complaints now: "Merely starting up Postgres
cut my overall system performance by a factor of 10! I wasn't even
using it!! What a piece of junk!!!" I can hardly think of a better
way to drive away people with a marginal interest in the database...

This can *not* be default behavior, and unfortunately that limits its
value quite a lot.

regards, tom lane


From: Curt Sampson <cjs(at)cynic(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, ITAGAKI Takahiro <itagaki(dot)takahiro(at)lab(dot)ntt(dot)co(dot)jp>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [PATCHES] O_DIRECT for WAL writes
Date: 2005-06-23 03:54:15
Message-ID: Pine.NEB.4.62.0506231253040.12377@angelic.cynic.net
Lists: pgsql-hackers pgsql-patches

On Wed, 22 Jun 2005, Tom Lane wrote:

> [ shudder ] I can see the complaints now: "Merely starting up Postgres
> cut my overall system performance by a factor of 10!

Yeah, quite the scenario.

> This can *not* be default behavior, and unfortunately that limits its
> value quite a lot.

Indeed. Maybe it's best just to document this stuff for the various
OSes, and let the admins deal with configuring their machines.

But you know, it might be a reasonable option switch, or something.

cjs
--
Curt Sampson <cjs(at)cynic(dot)net> +81 90 7737 2974 http://www.NetBSD.org
Make up enjoying your city life...produced by BIC CAMERA


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Curt Sampson <cjs(at)cynic(dot)net>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, ITAGAKI Takahiro <itagaki(dot)takahiro(at)lab(dot)ntt(dot)co(dot)jp>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [PATCHES] O_DIRECT for WAL writes
Date: 2005-06-23 04:00:19
Message-ID: 455.1119499219@sss.pgh.pa.us
Lists: pgsql-hackers pgsql-patches

[ on the other point... ]

Curt Sampson <cjs(at)cynic(dot)net> writes:
> But is it really a problem? I somewhere got the impression that some
> drives, on power failure, will be able to keep going for long enough to
> write out the cache and park the heads anyway. If so, the drive is still
> guaranteeing the write.

If the drives worked that way, we'd not be seeing any problem, but we do
see problems. Without having a whole lot of data to back it up, I would
think that keeping the platter spinning is no problem (sheer rotational
inertia) but seeking to a lot of new tracks to write randomly-positioned
dirty sectors would require significant energy that just ain't there
once the power drops. I seem to recall reading that the seek actuators
eat the largest share of power in a running drive...

regards, tom lane


From: Gavin Sherry <swm(at)linuxworld(dot)com(dot)au>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Curt Sampson <cjs(at)cynic(dot)net>, Greg Stark <gsstark(at)mit(dot)edu>, ITAGAKI Takahiro <itagaki(dot)takahiro(at)lab(dot)ntt(dot)co(dot)jp>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [PATCHES] O_DIRECT for WAL writes
Date: 2005-06-23 04:11:58
Message-ID: Pine.LNX.4.58.0506231410270.20908@linuxworld.com.au
Lists: pgsql-hackers pgsql-patches

On Thu, 23 Jun 2005, Tom Lane wrote:

> [ on the other point... ]
>
> Curt Sampson <cjs(at)cynic(dot)net> writes:
> > But is it really a problem? I somewhere got the impression that some
> > drives, on power failure, will be able to keep going for long enough to
> > write out the cache and park the heads anyway. If so, the drive is still
> > guaranteeing the write.
>
> If the drives worked that way, we'd not be seeing any problem, but we do
> see problems. Without having a whole lot of data to back it up, I would
> think that keeping the platter spinning is no problem (sheer rotational
> inertia) but seeking to a lot of new tracks to write randomly-positioned
> dirty sectors would require significant energy that just ain't there
> once the power drops. I seem to recall reading that the seek actuators
> eat the largest share of power in a running drive...

I've seen discussion about disks behaving this way. There's no magic:
they're battery backed.

Thanks,

Gavin


From: Gregory Maxwell <gmaxwell(at)gmail(dot)com>
To: Gavin Sherry <swm(at)linuxworld(dot)com(dot)au>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Curt Sampson <cjs(at)cynic(dot)net>, Greg Stark <gsstark(at)mit(dot)edu>, ITAGAKI Takahiro <itagaki(dot)takahiro(at)lab(dot)ntt(dot)co(dot)jp>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [PATCHES] O_DIRECT for WAL writes
Date: 2005-06-23 04:25:34
Message-ID: e692861c05062221257727ae60@mail.gmail.com
Lists: pgsql-hackers pgsql-patches

On 6/23/05, Gavin Sherry <swm(at)linuxworld(dot)com(dot)au> wrote:

> > inertia) but seeking to a lot of new tracks to write randomly-positioned
> > dirty sectors would require significant energy that just ain't there
> > once the power drops. I seem to recall reading that the seek actuators
> > eat the largest share of power in a running drive...
>
> I've seen discussion about disks behaving this way. There's no magic:
> they're battery backed.

Nah, this isn't always the case. For example, some of the IBM Deskstars
had a few tracks at the start of the disk reserved. If the power
failed, the head retracted all the way and the drive used its rotational
energy to stay powered long enough to write out the cache. At startup the
drive would read it back in and finish flushing it.

.... unfortunately firmware bugs made it not always wait until the
head returned to the start to begin writing...

I'm not sure what other drives do this (er, well do it correctly :) ).


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Gavin Sherry <swm(at)linuxworld(dot)com(dot)au>
Cc: Curt Sampson <cjs(at)cynic(dot)net>, Greg Stark <gsstark(at)mit(dot)edu>, ITAGAKI Takahiro <itagaki(dot)takahiro(at)lab(dot)ntt(dot)co(dot)jp>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [PATCHES] O_DIRECT for WAL writes
Date: 2005-06-23 04:33:35
Message-ID: 794.1119501215@sss.pgh.pa.us
Lists: pgsql-hackers pgsql-patches

Gavin Sherry <swm(at)linuxworld(dot)com(dot)au> writes:
>> Curt Sampson <cjs(at)cynic(dot)net> writes:
>>> But is it really a problem? I somewhere got the impression that some
>>> drives, on power failure, will be able to keep going for long enough to
>>> write out the cache and park the heads anyway. If so, the drive is still
>>> guaranteeing the write.

> I've seen discussion about disks behaving this way. There's no magic:
> they're battery backed.

Oh, sure, then it's easy ;-)

The bottom line here seems to be the same as always: you can't run an
industrial strength database on piece-of-junk consumer grade hardware.
Our problem is that because the software is free, people expect to run
it on bottom-of-the-line Joe Bob's Bait And PC Shack hardware, and then
they blame us when they don't get the same results as the guy running
Oracle on million-dollar triply-redundant server hardware. Oh well.

regards, tom lane


From: Gavin Sherry <swm(at)linuxworld(dot)com(dot)au>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Curt Sampson <cjs(at)cynic(dot)net>, Greg Stark <gsstark(at)mit(dot)edu>, ITAGAKI Takahiro <itagaki(dot)takahiro(at)lab(dot)ntt(dot)co(dot)jp>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [PATCHES] O_DIRECT for WAL writes
Date: 2005-06-23 04:47:40
Message-ID: Pine.LNX.4.58.0506231446210.21186@linuxworld.com.au
Lists: pgsql-hackers pgsql-patches

On Thu, 23 Jun 2005, Tom Lane wrote:

> Gavin Sherry <swm(at)linuxworld(dot)com(dot)au> writes:
> >> Curt Sampson <cjs(at)cynic(dot)net> writes:
> >>> But is it really a problem? I somewhere got the impression that some
> >>> drives, on power failure, will be able to keep going for long enough to
> >>> write out the cache and park the heads anyway. If so, the drive is still
> >>> guaranteeing the write.
>
> > I've seen discussion about disks behaving this way. There's no magic:
> > they're battery backed.
>
> Oh, sure, then it's easy ;-)
>
> The bottom line here seems to be the same as always: you can't run an
> industrial strength database on piece-of-junk consumer grade hardware.
> Our problem is that because the software is free, people expect to run
> it on bottom-of-the-line Joe Bob's Bait And PC Shack hardware, and then
> they blame us when they don't get the same results as the guy running
> Oracle on million-dollar triply-redundant server hardware. Oh well.

If you ever need a second job, I recommend stand up comedy :-).

Gavin


From: Curt Sampson <cjs(at)cynic(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Gavin Sherry <swm(at)linuxworld(dot)com(dot)au>, Greg Stark <gsstark(at)mit(dot)edu>, ITAGAKI Takahiro <itagaki(dot)takahiro(at)lab(dot)ntt(dot)co(dot)jp>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [PATCHES] O_DIRECT for WAL writes
Date: 2005-06-23 05:04:59
Message-ID: Pine.NEB.4.62.0506231351370.12377@angelic.cynic.net
Lists: pgsql-hackers pgsql-patches

On Thu, 23 Jun 2005, Tom Lane wrote:

> The bottom line here seems to be the same as always: you can't run an
> industrial strength database on piece-of-junk consumer grade hardware.

Sure you can, though it may take several bits of piece-of-junk
consumer-grade hardware. It's far more about how you set up your system
and implement recovery policies than it is about hardware.

I ran an ISP back in the '90s on old PC junk, and we had far better
uptime than most of our competitors running on expensive Sun gear. One
ISP was completely out for half a day because the tech guy bent and
broke a hot-swappable circuit board while installing it, bringing down
the entire machine. (Pretty dumb of them to be running everything on a
single, irreplaceable "high-availability" system.)

> ...they blame us when they don't get the same results as the guy
> running Oracle on...

Now that phrase irritates me a bit. I've been using all this stuff for
a long time (Postgres on and off since QUEL, before SQL was dropped
in instead) and at this point, for the (perhaps slim) majority of
applications, I would say that PostgreSQL is a better database than
Oracle. It requires much, much less effort to get a system and its test
framework up and running under PostgreSQL than it does under Oracle,
PostgreSQL has far fewer stupid limitations, and in other areas, such
as performance, it competes reasonably well in a lot of cases. It's a
pretty impressive piece of work, thanks in large part to efforts put in
over the last few years.

cjs
--
Curt Sampson <cjs(at)cynic(dot)net> +81 90 7737 2974 http://www.NetBSD.org
Make up enjoying your city life...produced by BIC CAMERA


From: "Jim C(dot) Nasby" <decibel(at)decibel(dot)org>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, ITAGAKI Takahiro <itagaki(dot)takahiro(at)lab(dot)ntt(dot)co(dot)jp>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [PATCHES] O_DIRECT for WAL writes
Date: 2005-06-23 17:16:01
Message-ID: 20050623171601.GB89438@decibel.org
Lists: pgsql-hackers pgsql-patches

On Wed, Jun 22, 2005 at 03:50:04PM -0400, Tom Lane wrote:
> The reason I question "automatic" is that you really want to test each
> drive being used, if the system has more than one; but Postgres has no
> idea what the actual hardware layout is, and so no good way to know what
> needs to be tested.

Would testing in the WAL directory be sufficient? Or at least better
than nothing? Of course we could test in the database directories as
well, but you never know if stuff's been symlinked elsewhere... err, we
can test for that, no?

In any case, it seems like it'd be good to try to test and throw a
warning if the drive appears to be caching or if we think the test might
not cover everything (ie symlinks in the data directory).
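
As a rough illustration of the symlink check, and assuming a plain walk of the data directory is good enough, a sketch like the following would list every symbolic link it finds without following any of them. The function name find_symlinks is made up for this example, and it is written in Python rather than backend C:

```python
import os


def find_symlinks(data_dir):
    """Return the sorted paths under data_dir that are symbolic links."""
    found = []
    # followlinks=False: report links to directories but do not descend into them
    for root, dirs, files in os.walk(data_dir, followlinks=False):
        for name in dirs + files:
            path = os.path.join(root, name)
            if os.path.islink(path):
                found.append(path)
    return sorted(found)
```

Any path it returns points at storage that a test run only inside the data directory would not have exercised, which is the case where a warning seems appropriate.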
--
Jim C. Nasby, Database Consultant decibel(at)decibel(dot)org
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"


From: Douglas McNaught <doug(at)mcnaught(dot)org>
To: "Jim C(dot) Nasby" <decibel(at)decibel(dot)org>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Greg Stark <gsstark(at)mit(dot)edu>, ITAGAKI Takahiro <itagaki(dot)takahiro(at)lab(dot)ntt(dot)co(dot)jp>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [PATCHES] O_DIRECT for WAL writes
Date: 2005-06-23 18:18:54
Message-ID: m24qbpc4xd.fsf@Douglas-McNaughts-Powerbook.local
Lists: pgsql-hackers pgsql-patches

"Jim C. Nasby" <decibel(at)decibel(dot)org> writes:

> Would testing in the WAL directory be sufficient? Or at least better
> than nothing? Of course we could test in the database directories as
> well, but you never know if stuff's been symlinked elsewhere... err, we
> can test for that, no?
>
> In any case, it seems like it'd be good to try to test and throw a
> warning if the drive appears to be caching or if we think the test might
> not cover everything (ie symlinks in the data directory).

I think it would make more sense to write the test as a separate
utility program--then the sysadmin can check the disks he cares
about. I don't personally see the need to burden the backend with
this.

-Doug


From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, ITAGAKI Takahiro <itagaki(dot)takahiro(at)lab(dot)ntt(dot)co(dot)jp>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [PATCHES] O_DIRECT for WAL writes
Date: 2005-06-24 01:12:16
Message-ID: 200506240112.j5O1CGf12612@candle.pha.pa.us
Lists: pgsql-hackers pgsql-patches

Tom Lane wrote:
> Greg Stark <gsstark(at)mit(dot)edu> writes:
> > Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:
> >> Unfortunately, I cannot believe these numbers --- the near equality of
> >> fsync off and fsync on means there is something very wrong with the
> >> measurements. What I suspect is that your ATA drives are doing write
> >> caching and thus the "fsyncs" are not really waiting for I/O at all.
>
> > I wonder whether it would make sense to have an automatic test for this
> > problem. I suspect there are lots of installations out there whose admins
> > don't realize that their hardware is doing this to them.
>
> Not sure about "automatic", but a simple little test program to measure
> the speed of rewriting/fsyncing a small test file would surely be a nice
> thing to have.
>
> The reason I question "automatic" is that you really want to test each
> drive being used, if the system has more than one; but Postgres has no
> idea what the actual hardware layout is, and so no good way to know what
> needs to be tested.

Some folks have battery-backed cached controllers so they would appear
as not handling fsync when in fact they do.

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, ITAGAKI Takahiro <itagaki(dot)takahiro(at)lab(dot)ntt(dot)co(dot)jp>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [PATCHES] O_DIRECT for WAL writes
Date: 2005-06-24 01:28:09
Message-ID: 21035.1119576489@sss.pgh.pa.us
Lists: pgsql-hackers pgsql-patches

Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> writes:
> Tom Lane wrote:
>> The reason I question "automatic" is that you really want to test each
>> drive being used, if the system has more than one; but Postgres has no
>> idea what the actual hardware layout is, and so no good way to know what
>> needs to be tested.

> Some folks have battery-backed cached controllers so they would appear
> as not handling fsync when in fact they do.

Right, so something like refusing to start if we think fsync doesn't
work is probably not a hot idea. (Unless you want to provide a GUC
variable to override it...)

regards, tom lane


From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Gavin Sherry <swm(at)linuxworld(dot)com(dot)au>, Curt Sampson <cjs(at)cynic(dot)net>, Greg Stark <gsstark(at)mit(dot)edu>, ITAGAKI Takahiro <itagaki(dot)takahiro(at)lab(dot)ntt(dot)co(dot)jp>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [PATCHES] O_DIRECT for WAL writes
Date: 2005-06-24 01:54:57
Message-ID: 200506240154.j5O1svk18835@candle.pha.pa.us
Lists: pgsql-hackers pgsql-patches

Tom Lane wrote:
> Gavin Sherry <swm(at)linuxworld(dot)com(dot)au> writes:
> >> Curt Sampson <cjs(at)cynic(dot)net> writes:
> >>> But is it really a problem? I somewhere got the impression that some
> >>> drives, on power failure, will be able to keep going for long enough to
> >>> write out the cache and park the heads anyway. If so, the drive is still
> >>> guaranteeing the write.
>
> > I've seen discussion about disks behaving this way. There's no magic:
> > they're battery backed.
>
> Oh, sure, then it's easy ;-)
>
> The bottom line here seems to be the same as always: you can't run an
> industrial strength database on piece-of-junk consumer grade hardware.
> Our problem is that because the software is free, people expect to run
> it on bottom-of-the-line Joe Bob's Bait And PC Shack hardware, and then
> they blame us when they don't get the same results as the guy running
> Oracle on million-dollar triply-redundant server hardware. Oh well.

At least we have an FAQ on this:

<H3><A name="3.7">3.7</A>) What computer hardware should I use?</H3>

<P>Because PC hardware is mostly compatible, people tend to believe that
all PC hardware is of equal quality. It is not. ECC RAM, SCSI, and
quality motherboards are more reliable and have better performance than
less expensive hardware. PostgreSQL will run on almost any hardware,
but if reliability and performance are important it is wise to
research your hardware options thoroughly. Our email lists can be used
to discuss hardware options and tradeoffs.</P>

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073


From: ITAGAKI Takahiro <itagaki(dot)takahiro(at)lab(dot)ntt(dot)co(dot)jp>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: [PATCHES] O_DIRECT for WAL writes
Date: 2005-06-24 04:16:44
Message-ID: 20050624125750.4006.ITAGAKI.TAKAHIRO@lab.ntt.co.jp
Lists: pgsql-hackers pgsql-patches

Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> Unfortunately, I cannot believe these numbers --- the near equality of
> fsync off and fsync on means there is something very wrong with the
> measurements. What I suspect is that your ATA drives are doing write
> caching and thus the "fsyncs" are not really waiting for I/O at all.

I think direct I/O and the writeback cache should be considered separate issues.
I guess that direct I/O keeps the OS from caching WAL files, so that it can
use more memory to cache data files.

In my previous test, I had enabled the writeback cache on my drives
for performance. But I understand from the discussion that the cache
should be disabled for reliable writes.
Also, my checkpoint_segments setting might have been too large compared with
the default. So here are the new results:

    checkpoint_ | writeback |
    segments    | cache     | open_sync | fsync=false    | O_DIRECT only | fsync_direct  | open_direct
    ------------+-----------+-----------+----------------+---------------+---------------+--------------
[1] 48          | on        | 109.3 tps | 125.1(+ 11.4%) | 157.3(+44.0%) | 160.4(+46.8%) | 161.1(+47.5%)
[2] 3           | on        | 102.5 tps | 136.3(+ 33.0%) | 117.6(+14.7%) |               |
[3] 3           | off       |  38.2 tps | 138.8(+263.5%) |  38.6(+ 1.2%) |  38.5(+ 0.9%) |  38.5(+ 0.9%)

- 30 runs * pgbench -s 100 -c 10 -t 1000
- using 2 ATA disks:
  - hda (reiserfs) holds the system and WAL. The writeback cache is on in [1] and [2] and off in [3].
  - hdc (jfs) holds the database files. The writeback cache is always on.

---
ITAGAKI Takahiro
NTT Cyber Space Laboratories


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: ITAGAKI Takahiro <itagaki(dot)takahiro(at)lab(dot)ntt(dot)co(dot)jp>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: [PATCHES] O_DIRECT for WAL writes
Date: 2005-06-24 13:37:23
Message-ID: 26018.1119620243@sss.pgh.pa.us
Lists: pgsql-hackers pgsql-patches

ITAGAKI Takahiro <itagaki(dot)takahiro(at)lab(dot)ntt(dot)co(dot)jp> writes:
> ... So I'll post the new results:

>     checkpoint_ | writeback |
>     segments    | cache     | open_sync | fsync=false    | O_DIRECT only | fsync_direct  | open_direct
>     ------------+-----------+-----------+----------------+---------------+---------------+--------------
> [3] 3           | off       |  38.2 tps | 138.8(+263.5%) |  38.6(+ 1.2%) |  38.5(+ 0.9%) |  38.5(+ 0.9%)

Yeah, this is about what I was afraid of: if you're actually fsyncing
then you get at best one commit per disk revolution, and the negotiation
with the OS is down in the noise.
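The "one commit per disk revolution" bound can be put into numbers. The
sketch below assumes a single spindle and a rotation speed picked purely for
illustration (7200 RPM; the thread does not state the speed of the ATA disks
used in the benchmark). The function name is hypothetical.

```c
#include <assert.h>

/* Upper bound on synchronous commits per second for a single spindle,
 * assuming at best one commit completes per platter revolution. */
static double max_commits_per_sec(double rpm)
{
    return rpm / 60.0;   /* revolutions per minute -> revolutions per second */
}
```

Under that assumption a 7200 RPM drive cannot exceed 120 commits per second
per serialized writer, however cheap the system call path is made.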

At this point I'm inclined to reject the patch on the grounds that it
adds complexity and portability issues, without actually buying any
useful performance improvement. The write-cache-on numbers are not
going to be interesting to any serious user :-(

regards, tom lane


From: "Jim C(dot) Nasby" <decibel(at)decibel(dot)org>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Josh Berkus <josh(at)agliodbs(dot)com>
Cc: ITAGAKI Takahiro <itagaki(dot)takahiro(at)lab(dot)ntt(dot)co(dot)jp>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [PATCHES] O_DIRECT for WAL writes
Date: 2005-06-24 15:19:14
Message-ID: 20050624151914.GN89438@decibel.org
Lists: pgsql-hackers pgsql-patches

On Fri, Jun 24, 2005 at 09:37:23AM -0400, Tom Lane wrote:
> ITAGAKI Takahiro <itagaki(dot)takahiro(at)lab(dot)ntt(dot)co(dot)jp> writes:
> > ... So I'll post the new results:
>
> > checkpoint_ | writeback |
> > segments | cache | open_sync | fsync=false | O_DIRECT only | fsync_direct | open_direct
> > ------------+-----------+-----------+---------------+---------------+---------------+--------------
> > [3] 3 | off | 38.2 tps | 138.8(+263.5%)| 38.6(+ 1.2%) | 38.5(+ 0.9%) | 38.5(+ 0.9%)
>
> Yeah, this is about what I was afraid of: if you're actually fsyncing
> then you get at best one commit per disk revolution, and the negotiation
> with the OS is down in the noise.
>
> At this point I'm inclined to reject the patch on the grounds that it
> adds complexity and portability issues, without actually buying any
> useful performance improvement. The write-cache-on numbers are not
> going to be interesting to any serious user :-(

Is there anyone with a battery-backed RAID controller that could run
these tests? I suspect that in that case the differences might be closer
to 1 or 2 rather than 3, which would make the patch much more valuable.

Josh, is this something that could be done in the performance lab?
--
Jim C. Nasby, Database Consultant decibel(at)decibel(dot)org
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: "Jim C(dot) Nasby" <decibel(at)decibel(dot)org>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, ITAGAKI Takahiro <itagaki(dot)takahiro(at)lab(dot)ntt(dot)co(dot)jp>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [PATCHES] O_DIRECT for WAL writes
Date: 2005-06-24 16:21:56
Message-ID: 200506240921.57054.josh@agliodbs.com
Lists: pgsql-hackers pgsql-patches

Jim,

> Josh, is this something that could be done in the performance lab?

That's the idea. Sadly, OSDL's hardware has been having critical failures of
late (I'm still trying to get test results on the checkpointing thing) and
the GreenPlum machines aren't up yet.

I need to contact those folks in Brazil ...

--
Josh Berkus
Aglio Database Solutions
San Francisco


From: ITAGAKI Takahiro <itagaki(dot)takahiro(at)lab(dot)ntt(dot)co(dot)jp>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [PATCHES] O_DIRECT for WAL writes
Date: 2005-06-28 07:21:10
Message-ID: 20050628161732.402D.ITAGAKI.TAKAHIRO@lab.ntt.co.jp
Lists: pgsql-hackers pgsql-patches

Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> Yeah, this is about what I was afraid of: if you're actually fsyncing
> then you get at best one commit per disk revolution, and the negotiation
> with the OS is down in the noise.

If we disable the writeback cache and use open_sync, the per-page write
behavior of the WAL module shows up as poor results. O_DIRECT is similar
to O_DSYNC (at least on Linux), so its benefit disappears
behind the slow disk revolution.

In the current source, WAL is written as:
    for (i = 0; i < N; i++) { write(fd, &buffers[i], BLCKSZ); }
Is this intentional? Can we rewrite it as follows?
    write(fd, &buffers[0], N * BLCKSZ);

In order to achieve this, I wrote a 'gather-write' patch (xlog.gw.diff).
Aside from this, I'll also send the fixed direct I/O patch (xlog.dio.diff).
These two patches are independent, so either or both can be applied.
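The difference between the two write patterns can be sketched as follows.
This is a minimal illustration, not the code from xlog.gw.diff: the function
names are hypothetical, BLCKSZ is assumed to be 8192, and the pages are
assumed to sit contiguously in memory.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

#define BLCKSZ 8192   /* assumed block size */

/* Per-page variant: one write() call per page. With a synchronous open
 * mode, each call can wait for the disk separately. */
static ssize_t write_per_page(int fd, const char *buffers, int npages)
{
    ssize_t total = 0;
    for (int i = 0; i < npages; i++) {
        ssize_t n = write(fd, buffers + (size_t) i * BLCKSZ, BLCKSZ);
        if (n != BLCKSZ)
            return -1;          /* error or short write */
        total += n;
    }
    return total;
}

/* Gathered variant: all contiguous pages go out in a single write() call. */
static ssize_t write_gathered(int fd, const char *buffers, int npages)
{
    size_t len = (size_t) npages * BLCKSZ;
    ssize_t n = write(fd, buffers, len);
    return (n == (ssize_t) len) ? n : -1;   /* short writes treated as errors */
}
```

Both variants put the same bytes in the file; the gathered one only reduces
the number of system calls, and with them the number of separate waits for
the disk when the file is opened synchronously.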

I tested them on my machine; the results are as follows. They show that
combining direct I/O and gather-write is the best choice when the writeback
cache is off. Are these two patches worth trying if they are used together?

            | writeback | fsync= | fdata | open_ | fsync_ | open_
patch       | cache     | false  | sync  | sync  | direct | direct
------------+-----------+--------+-------+-------+--------+---------
direct io   | off       | 124.2  | 105.7 |  48.3 |  48.3  |  48.2
direct io   | on        | 129.1  | 112.3 | 114.1 | 142.9  | 144.5
gather-write| off       | 124.3  | 108.7 | 105.4 | (N/A)  | (N/A)
both        | off       | 131.5  | 115.5 | 114.4 | 145.4  | 145.2

- 20 runs * pgbench -s 100 -c 50 -t 200
- with tuning (wal_buffers=64, commit_delay=500, checkpoint_segments=8)
- using 2 ATA disks:
  - hda (reiserfs) holds the system and WAL.
  - hdc (jfs) holds the database files. writeback-cache is always on.

---
ITAGAKI Takahiro
NTT Cyber Space Laboratories

Attachment Content-Type Size
xlog.dio.diff application/octet-stream 4.5 KB
xlog.gw.diff application/octet-stream 7.4 KB

From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: ITAGAKI Takahiro <itagaki(dot)takahiro(at)lab(dot)ntt(dot)co(dot)jp>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [PATCHES] O_DIRECT for WAL writes
Date: 2005-07-02 20:16:47
Message-ID: 200507022016.j62KGlO07480@candle.pha.pa.us
Lists: pgsql-hackers pgsql-patches


These patches will require some refactoring and documentation, but I
will do that when I apply them.

Your patch has been added to the PostgreSQL unapplied patches list at:

http://momjian.postgresql.org/cgi-bin/pgpatches

It will be applied as soon as one of the PostgreSQL committers reviews
and approves it.

---------------------------------------------------------------------------

ITAGAKI Takahiro wrote:
> Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>
> > Yeah, this is about what I was afraid of: if you're actually fsyncing
> > then you get at best one commit per disk revolution, and the negotiation
> > with the OS is down in the noise.
>
> If we disable writeback-cache and use open_sync, the per-page writing
> behavior in WAL module will show up as bad result. O_DIRECT is similar
> to O_DSYNC (at least on linux), so that the benefit of it will disappear
> behind the slow disk revolution.
>
> In the current source, WAL is written as:
> for (i = 0; i < N; i++) { write(&buffers[i], BLCKSZ); }
> Is this intentional? Can we rewrite it as follows?
> write(&buffers[0], N * BLCKSZ);
>
> In order to achieve it, I wrote a 'gather-write' patch (xlog.gw.diff).
> Aside from this, I'll also send the fixed direct io patch (xlog.dio.diff).
> These two patches are independent, so they can be applied either or both.
>
>
> I tested them on my machine and the results as follows. It shows that
> direct-io and gather-write is the best choice when writeback-cache is off.
> Are these two patches worth trying if they are used together?
>
>
> | writeback | fsync= | fdata | open_ | fsync_ | open_
> patch | cache | false | sync | sync | direct | direct
> ------------+-----------+--------+-------+-------+--------+---------
> direct io | off | 124.2 | 105.7 | 48.3 | 48.3 | 48.2
> direct io | on | 129.1 | 112.3 | 114.1 | 142.9 | 144.5
> gather-write| off | 124.3 | 108.7 | 105.4 | (N/A) | (N/A)
> both | off | 131.5 | 115.5 | 114.4 | 145.4 | 145.2
>
> - 20runs * pgbench -s 100 -c 50 -t 200
> - with tuning (wal_buffers=64, commit_delay=500, checkpoint_segments=8)
> - using 2 ATA disks:
> - hda(reiserfs) includes system and wal.
> - hdc(jfs) includes database files. writeback-cache is always on.
>
> ---
> ITAGAKI Takahiro
> NTT Cyber Space Laboratories
>

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073


From: Mark Wong <markw(at)osdl(dot)org>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: "Jim C(dot) Nasby" <decibel(at)decibel(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, ITAGAKI Takahiro <itagaki(dot)takahiro(at)lab(dot)ntt(dot)co(dot)jp>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [PATCHES] O_DIRECT for WAL writes
Date: 2005-07-07 04:58:31
Message-ID: 20050706215831.018aba3d@localhost
Lists: pgsql-hackers pgsql-patches

On Fri, 24 Jun 2005 09:21:56 -0700
Josh Berkus <josh(at)agliodbs(dot)com> wrote:

> Jim,
>
> > Josh, is this something that could be done in the performance lab?
>
> That's the idea. Sadly, OSDL's hardware has been having critical failures of
> late (I'm still trying to get test results on the checkpointing thing) and
> the GreenPlum machines aren't up yet.

I'm on the verge of having a 4-way Opteron system with 4 Adaptec 2200s
SCSI controllers attached to eight 10-disk 36GB arrays ready. I believe
there are software tools that'll let you reconfigure the LUNs from Linux
so you wouldn't need physical access. Anyone want time on the system?

Mark


From: "Jeffrey W(dot) Baker" <jwb(at)gghcwest(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: ITAGAKI Takahiro <itagaki(dot)takahiro(at)lab(dot)ntt(dot)co(dot)jp>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [PATCHES] O_DIRECT for WAL writes
Date: 2005-07-14 00:33:45
Message-ID: 1121301225.20950.37.camel@toonses.gghcwest.com
Lists: pgsql-hackers pgsql-patches

On Fri, 2005-06-24 at 09:37 -0400, Tom Lane wrote:
> ITAGAKI Takahiro <itagaki(dot)takahiro(at)lab(dot)ntt(dot)co(dot)jp> writes:
> > ... So I'll post the new results:
>
> > checkpoint_ | writeback |
> > segments | cache | open_sync | fsync=false | O_DIRECT only | fsync_direct | open_direct
> > ------------+-----------+-----------+---------------+---------------+---------------+--------------
> > [3] 3 | off | 38.2 tps | 138.8(+263.5%)| 38.6(+ 1.2%) | 38.5(+ 0.9%) | 38.5(+ 0.9%)
>
> Yeah, this is about what I was afraid of: if you're actually fsyncing
> then you get at best one commit per disk revolution, and the negotiation
> with the OS is down in the noise.
>
> At this point I'm inclined to reject the patch on the grounds that it
> adds complexity and portability issues, without actually buying any
> useful performance improvement. The write-cache-on numbers are not
> going to be interesting to any serious user :-(

You mean not interesting to people without a UPS. Personally, I'd like
to realize a 50% boost in tps, which is what O_DIRECT buys according to
ITAGAKI Takahiro's posted results.

The batteries on a caching RAID controller can run for days at a
stretch. It's not as dangerous as people make it sound. And anyone
running PG on software RAID is crazy.

-jwb


From: "Jeffrey W(dot) Baker" <jwbaker(at)acm(dot)org>
To: pgsql-hackers(at)postgresql(dot)org
Cc: ITAGAKI Takahiro <itagaki(dot)takahiro(at)lab(dot)ntt(dot)co(dot)jp>
Subject: Re: [PATCHES] O_DIRECT for WAL writes
Date: 2005-07-14 17:30:39
Message-ID: 1121362239.20950.50.camel@toonses.gghcwest.com
Lists: pgsql-hackers pgsql-patches

On Fri, 2005-06-24 at 10:19 -0500, Jim C. Nasby wrote:
> On Fri, Jun 24, 2005 at 09:37:23AM -0400, Tom Lane wrote:
> > ITAGAKI Takahiro <itagaki(dot)takahiro(at)lab(dot)ntt(dot)co(dot)jp> writes:
> > > ... So I'll post the new results:
> >
> > > checkpoint_ | writeback |
> > > segments | cache | open_sync | fsync=false | O_DIRECT only | fsync_direct | open_direct
> > > ------------+-----------+-----------+---------------+---------------+---------------+--------------
> > > [3] 3 | off | 38.2 tps | 138.8(+263.5%)| 38.6(+ 1.2%) | 38.5(+ 0.9%) | 38.5(+ 0.9%)
> >
> > Yeah, this is about what I was afraid of: if you're actually fsyncing
> > then you get at best one commit per disk revolution, and the negotiation
> > with the OS is down in the noise.
> >
> > At this point I'm inclined to reject the patch on the grounds that it
> > adds complexity and portability issues, without actually buying any
> > useful performance improvement. The write-cache-on numbers are not
> > going to be interesting to any serious user :-(
>
> Is there anyone with a battery-backed RAID controller that could run
> these tests? I suspect that in that case the differences might be closer
> to 1 or 2 rather than 3, which would make the patch much more valuable.

I applied the O_DIRECT patch to 8.0.3 and I tested this on a
battery-backed RAID controller with 128MB of cache and 5 7200RPM SATA
disks. All caches are write-back. The xlog and data are on the same
JFS volume. pgbench was run with a scale factor of 1000 and 100000
total transactions. Clients varied from 10 to 100.

Clients | fsync | open_direct
------------------------------------
10      |  81   |  98 (+21%)
100     | 100   | 105 ( +5%)
------------------------------------

No problems were experienced. The patch seems to give a useful boost!

-jwb


From: Greg Stark <gsstark(at)mit(dot)edu>
To: "Jeffrey W(dot) Baker" <jwb(at)gghcwest(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, ITAGAKI Takahiro <itagaki(dot)takahiro(at)lab(dot)ntt(dot)co(dot)jp>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [PATCHES] O_DIRECT for WAL writes
Date: 2005-07-14 19:34:32
Message-ID: 87vf3dw59z.fsf@stark.xeocode.com
Lists: pgsql-hackers pgsql-patches


"Jeffrey W. Baker" <jwb(at)gghcwest(dot)com> writes:

> The batteries on a caching RAID controller can run for days at a
> stretch. It's not as dangerous as people make it sound. And anyone
> running PG on software RAID is crazy.

Get back to us after your first hardware failure when your vendor says the
power supply you need is on backorder and won't be available for 48 hours...

(And what's your problem with software raid anyways?)

--
greg


From: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: "Jeffrey W(dot) Baker" <jwb(at)gghcwest(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, ITAGAKI Takahiro <itagaki(dot)takahiro(at)lab(dot)ntt(dot)co(dot)jp>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [PATCHES] O_DIRECT for WAL writes
Date: 2005-07-14 19:44:18
Message-ID: 42D6C092.60503@commandprompt.com
Lists: pgsql-hackers pgsql-patches

Greg Stark wrote:
> "Jeffrey W. Baker" <jwb(at)gghcwest(dot)com> writes:
>
>
>>The batteries on a caching RAID controller can run for days at a
>>stretch. It's not as dangerous as people make it sound. And anyone
>>running PG on software RAID is crazy.
>
>
> Get back to us after your first hardware failure when your vendor says the
> power supply you need is on backorder and won't be available for 48 hours...
>
> (And what's your problem with software raid anyways?)

I would have to second that. Software raid works just fine.

Sincerely,

Joshua D. Drake

>

--
Your PostgreSQL solutions company - Command Prompt, Inc. 1.800.492.2240
PostgreSQL Replication, Consulting, Custom Programming, 24x7 support
Managed Services, Shared and Dedicated Hosting
Co-Authors: plPHP, plPerlNG - http://www.commandprompt.com/


From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: ITAGAKI Takahiro <itagaki(dot)takahiro(at)lab(dot)ntt(dot)co(dot)jp>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-patches <pgsql-patches(at)postgresql(dot)org>
Subject: Re: [HACKERS] O_DIRECT for WAL writes
Date: 2005-07-23 17:32:30
Message-ID: 200507231732.j6NHWVu04215@candle.pha.pa.us
Lists: pgsql-hackers pgsql-patches


I have modified and attached your patch for your review. I didn't see
any value in adding new wal_sync_method values because, to me, O_DIRECT is
basically just like O_SYNC except that it doesn't keep a copy of the buffer
in the kernel cache. If you are doing fsync(), I don't see how O_DIRECT
makes any sense, because O_DIRECT is writing to disk on every write, and
then what is the fsync() actually doing? This might explain why your
fsync/direct and open/direct performance numbers are almost identical.
Basically, if you are going to use O_DIRECT, why not use open_sync?

What I did was to add O_DIRECT unconditionally for all uses of O_SYNC
and O_DSYNC, so it is automatically used in those cases. And of course,
if your operating system doesn't support O_DIRECT, it isn't used.
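A rough sketch of that idea, for readers following along: O_DIRECT is OR-ed
into the synchronous open flags when the platform defines it, and silently
drops out otherwise. The macro and function names here are hypothetical and
are not taken from the attached patch.

```c
#define _GNU_SOURCE               /* glibc exposes O_DIRECT only with this */
#include <assert.h>
#include <fcntl.h>

/* Fall back to 0 so the flag can be OR-ed in unconditionally. */
#ifdef O_DIRECT
#define WAL_DIRECT_FLAG O_DIRECT
#else
#define WAL_DIRECT_FLAG 0         /* platform without O_DIRECT */
#endif

/* Synchronous open modes pick up direct I/O automatically. */
#define WAL_OPEN_SYNC_FLAGS (O_SYNC | WAL_DIRECT_FLAG)
#ifdef O_DSYNC
#define WAL_OPEN_DSYNC_FLAGS (O_DSYNC | WAL_DIRECT_FLAG)
#endif

/* Combine the caller's access mode with the open_sync-style flags. */
static int wal_open_sync_flags(int base_flags)
{
    return base_flags | WAL_OPEN_SYNC_FLAGS;
}
```

With this shape no caller needs its own #ifdef: on a platform without
O_DIRECT the result is simply the plain O_SYNC flag set.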

With your posted performance numbers, perhaps we should favor
wal_sync_method O_SYNC on platforms that have O_DIRECT even if we don't
support OPEN_DATASYNC, but I bet most platforms that have O_DIRECT also
have O_DSYNC. Perhaps some folks can run tests once the patch is
applied.

---------------------------------------------------------------------------

ITAGAKI Takahiro wrote:
> Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>
> > Yeah, this is about what I was afraid of: if you're actually fsyncing
> > then you get at best one commit per disk revolution, and the negotiation
> > with the OS is down in the noise.
>
> If we disable writeback-cache and use open_sync, the per-page writing
> behavior in WAL module will show up as bad result. O_DIRECT is similar
> to O_DSYNC (at least on linux), so that the benefit of it will disappear
> behind the slow disk revolution.
>
> In the current source, WAL is written as:
> for (i = 0; i < N; i++) { write(&buffers[i], BLCKSZ); }
> Is this intentional? Can we rewrite it as follows?
> write(&buffers[0], N * BLCKSZ);
>
> In order to achieve it, I wrote a 'gather-write' patch (xlog.gw.diff).
> Aside from this, I'll also send the fixed direct io patch (xlog.dio.diff).
> These two patches are independent, so they can be applied either or both.
>
>
> I tested them on my machine and the results as follows. It shows that
> direct-io and gather-write is the best choice when writeback-cache is off.
> Are these two patches worth trying if they are used together?
>
>
> | writeback | fsync= | fdata | open_ | fsync_ | open_
> patch | cache | false | sync | sync | direct | direct
> ------------+-----------+--------+-------+-------+--------+---------
> direct io | off | 124.2 | 105.7 | 48.3 | 48.3 | 48.2
> direct io | on | 129.1 | 112.3 | 114.1 | 142.9 | 144.5
> gather-write| off | 124.3 | 108.7 | 105.4 | (N/A) | (N/A)
> both | off | 131.5 | 115.5 | 114.4 | 145.4 | 145.2
>
> - 20runs * pgbench -s 100 -c 50 -t 200
> - with tuning (wal_buffers=64, commit_delay=500, checkpoint_segments=8)
> - using 2 ATA disks:
> - hda(reiserfs) includes system and wal.
> - hdc(jfs) includes database files. writeback-cache is always on.
>
> ---
> ITAGAKI Takahiro
> NTT Cyber Space Laboratories
>

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073

Attachment Content-Type Size
unknown_filename text/plain 12.6 KB

From: ITAGAKI Takahiro <itagaki(dot)takahiro(at)lab(dot)ntt(dot)co(dot)jp>
To: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>, pgsql-patches(at)postgresql(dot)org
Subject: Re: [HACKERS] O_DIRECT for WAL writes
Date: 2005-07-27 05:50:11
Message-ID: 20050727140214.460E.ITAGAKI.TAKAHIRO@lab.ntt.co.jp
Lists: pgsql-hackers pgsql-patches

Thanks for reviewing!
But the patch does not work on HEAD, because of the changes in BootStrapXLOG().
I'm sending the patch with a fix for it.

Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> wrote:

> If you are doing fsync(), I don't see how O_DIRECT
> makes any sense because O_DIRECT is writing to disk on every write, and
> then what is the fsync() actually doing.

It depends on the OS. The Linux man page says:
http://linux.com.hk/PenguinWeb/manpage.jsp?name=open&section=2
    File I/O is done directly to/from user space buffers. The I/O is
    synchronous, i.e., at the completion of the read(2) or write(2) system
    call, data is **guaranteed to have been transferred**.
But the FreeBSD man page says:
http://www.manpages.info/freebsd/open.2.html
    O_DIRECT may be used to minimize or eliminate the cache effects of
    reading and writing. The system will attempt to avoid caching the data
    you read or write. If it cannot avoid caching the data,
    it will **minimize the impact the data has on the cache**.

In my understanding, the completion of write() with O_DIRECT does not always
assure an actual write. So there may be a difference between O_DIRECT+O_SYNC
and O_DIRECT+fsync(), but I think that is rare.
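The two combinations being compared can be sketched like this. It is only an
illustration under assumptions: the function name is hypothetical, 8192 bytes
is assumed as both block size and buffer alignment, and the fallback to
buffered I/O when open() rejects O_DIRECT is a convenience of the sketch, not
something described in the thread.

```c
#define _GNU_SOURCE               /* glibc exposes O_DIRECT only with this */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BLOCK_SIZE 8192           /* assumed block size and buffer alignment */

#ifdef O_DIRECT
#define DIRECT_FLAG O_DIRECT
#else
#define DIRECT_FLAG 0             /* platform without O_DIRECT */
#endif

/* Write one aligned, zero-filled block to path.
 * sync_on_open != 0: rely on O_DIRECT|O_SYNC, no fsync() afterwards.
 * sync_on_open == 0: plain O_DIRECT write followed by an explicit fsync().
 * Returns 0 on success, -1 on error. */
static int write_block_durably(const char *path, int sync_on_open)
{
    int flags = O_WRONLY | O_CREAT | DIRECT_FLAG | (sync_on_open ? O_SYNC : 0);
    int fd = open(path, flags, 0600);
    if (fd < 0 && errno == EINVAL && DIRECT_FLAG != 0) {
        /* Some file systems reject O_DIRECT; retry with buffered I/O. */
        flags &= ~DIRECT_FLAG;
        fd = open(path, flags, 0600);
    }
    if (fd < 0)
        return -1;

    /* Direct I/O generally needs an aligned user buffer. */
    void *buf = NULL;
    if (posix_memalign(&buf, BLOCK_SIZE, BLOCK_SIZE) != 0) {
        close(fd);
        return -1;
    }
    memset(buf, 0, BLOCK_SIZE);

    int rc = -1;
    if (write(fd, buf, BLOCK_SIZE) == BLOCK_SIZE)
        rc = sync_on_open ? 0 : fsync(fd);

    free(buf);
    if (close(fd) != 0)
        rc = -1;
    return rc;
}
```

On a system where an O_DIRECT write() already implies a transfer to disk, the
two modes should behave almost alike; where it does not, only the O_SYNC or
fsync() part provides the guarantee.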

> What I did was to add O_DIRECT unconditionally for all uses of O_SYNC
> and O_DSYNC, so it is automatically used in those cases. And of course,
> if your operating system doesn't support O_DIRECT, it isn't used.

I agree with your approach, where O_DIRECT is used automatically.
I bet the combination of O_DIRECT and O_SYNC is always better than
using O_SYNC alone.

---
ITAGAKI Takahiro
NTT Cyber Space Laboratories

Attachment Content-Type Size
xlog.c.diff application/octet-stream 12.8 KB

From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: ITAGAKI Takahiro <itagaki(dot)takahiro(at)lab(dot)ntt(dot)co(dot)jp>
Cc: pgsql-patches(at)postgresql(dot)org
Subject: Re: [HACKERS] O_DIRECT for WAL writes
Date: 2005-07-27 13:38:52
Message-ID: 200507271338.j6RDcqY08532@candle.pha.pa.us
Lists: pgsql-hackers pgsql-patches

ITAGAKI Takahiro wrote:
> Thanks for reviewing!
> But the patch does not work on HEAD, because of the changes in BootStrapXLOG().
> I send the patch with a fix for it.

Thanks.

> > If you are doing fsync(), I don't see how O_DIRECT
> > makes any sense because O_DIRECT is writing to disk on every write, and
> > then what is the fsync() actually doing.
>
> It's depends on OSes. Manpage of Linux says,
> http://linux.com.hk/PenguinWeb/manpage.jsp?name=open&section=2
> File I/O is done directly to/from user space buffers. The I/O is
> synchronous, i.e., at the completion of the read(2) or write(2) system
> call, data is **guaranteed to have been transferred**.
> But manpage of FreeBSD says,
> http://www.manpages.info/freebsd/open.2.html
> O_DIRECT may be used to minimize or eliminate the cache effects of
> reading and writing. The system will attempt to avoid caching the data you
> read or write. If it cannot avoid caching the data,
> it will **minimize the impact the data has on the cache**.
>
> In my understanding, the completion of write() with O_DIRECT does not always
> assure an actual write. So there may be difference between O_DIRECT+O_SYNC
> and O_DIRECT+fsync(), but I think that is not very often.

Yes, I do remember that. I know we _need_ fsync when using O_DIRECT,
but the downside of O_DIRECT (force every write to disk) is the same as
O_SYNC, so it seems if we are using O_DIRECT, we might as well use
O_SYNC too and skip the fsync().

I will add a comment mentioning this.

> > What I did was to add O_DIRECT unconditionally for all uses of O_SYNC
> > and O_DSYNC, so it is automatically used in those cases. And of course,
> > if your operating system doesn't support O_DIRECT, it isn't used.
>
> I agree with your way, where O_DIRECT is automatically used.
> I bet the combination of O_DIRECT and O_SYNC is always better than
> the case O_SYNC only used.

OK.

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073


From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: ITAGAKI Takahiro <itagaki(dot)takahiro(at)lab(dot)ntt(dot)co(dot)jp>
Cc: pgsql-patches(at)postgresql(dot)org
Subject: Re: [HACKERS] O_DIRECT for WAL writes
Date: 2005-07-29 03:23:55
Message-ID: 200507290323.j6T3Ntb01826@candle.pha.pa.us
Lists: pgsql-hackers pgsql-patches


Patch applied. Thanks.

---------------------------------------------------------------------------

ITAGAKI Takahiro wrote:
> Thanks for reviewing!
> But the patch does not work on HEAD, because of the changes in BootStrapXLOG().
> I send the patch with a fix for it.
>
>
> Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> wrote:
>
> > If you are doing fsync(), I don't see how O_DIRECT
> > makes any sense because O_DIRECT is writing to disk on every write, and
> > then what is the fsync() actually doing.
>
> It's depends on OSes. Manpage of Linux says,
> http://linux.com.hk/PenguinWeb/manpage.jsp?name=open&section=2
> File I/O is done directly to/from user space buffers. The I/O is
> synchronous, i.e., at the completion of the read(2) or write(2) system
> call, data is **guaranteed to have been transferred**.
> But manpage of FreeBSD says,
> http://www.manpages.info/freebsd/open.2.html
> O_DIRECT may be used to minimize or eliminate the cache effects of
> reading and writing. The system will attempt to avoid caching the data you
> read or write. If it cannot avoid caching the data,
> it will **minimize the impact the data has on the cache**.
>
> In my understanding, the completion of write() with O_DIRECT does not always
> assure an actual write. So there may be difference between O_DIRECT+O_SYNC
> and O_DIRECT+fsync(), but I think that is not very often.
>
>
> > What I did was to add O_DIRECT unconditionally for all uses of O_SYNC
> > and O_DSYNC, so it is automatically used in those cases. And of course,
> > if your operating system doesn't support O_DIRECT, it isn't used.
>
> I agree with your way, where O_DIRECT is automatically used.
> I bet the combination of O_DIRECT and O_SYNC is always better than
> the case O_SYNC only used.
>
> ---
> ITAGAKI Takahiro
> NTT Cyber Space Laboratories
>

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073


From: Mark Wong <markw(at)osdl(dot)org>
To: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc: ITAGAKI Takahiro <itagaki(dot)takahiro(at)lab(dot)ntt(dot)co(dot)jp>, pgsql-patches(at)postgresql(dot)org, Daniel McNeil <daniel(at)osdl(dot)org>, Mark Haverkamp <markh(at)osdl(dot)org>
Subject: Re: [HACKERS] O_DIRECT for WAL writes
Date: 2005-08-06 21:04:19
Message-ID: 20050806210419.GA31044@osdl.org
Lists: pgsql-hackers pgsql-patches

Here are comments that Daniel McNeil made earlier, which I had neglected
to forward until now. I've cc'ed him and Mark Haverkamp, whom some of
you got to meet the other day.

Mark

-----

With O_DIRECT on Linux, when the write() returns the i/o has been
transferred to the disk.

Normally, this i/o will be DMAed directly from user-space to the
device. The current exception is when doing an O_DIRECT write to a
hole in a file. (If a program does a truncate() or lseek()/write()
that makes a file larger, the file system does not allocate space
between the old end of file and the new end of file.) An O_DIRECT
write to a hole like this requires the file system to allocate space,
but there is a race condition between the O_DIRECT write doing the
allocate and then the write to initialize the newly allocated data, and
any other process that attempts a buffered (page cache) read of the
same area in the file -- it was possible for the read to return data from
the allocated region before the O_DIRECT write(). The fix in Linux
is for the O_DIRECT write() to fall back to buffered i/o to do
the write() and flush the data from the page cache to the disk.
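One way for an application to stay clear of the hole case is to allocate
every block of the log file up front, so that later O_DIRECT writes only
overwrite existing blocks. The sketch below does this by writing zeros; the
function name and the 8192-byte block size are assumptions made for
illustration, and it is not taken from any particular database's source.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

#define BLOCK_SIZE 8192            /* assumed write unit */

/* Create path and fill it with zero blocks so that every block is
 * allocated (no holes), then fsync() so the allocation reaches disk.
 * size is assumed to be a multiple of BLOCK_SIZE.
 * Returns 0 on success, -1 on error. */
static int preallocate_log_file(const char *path, off_t size)
{
    static char zeros[BLOCK_SIZE];  /* static storage is zero-initialized */
    int fd = open(path, O_WRONLY | O_CREAT | O_EXCL, 0600);
    if (fd < 0)
        return -1;

    for (off_t done = 0; done < size; done += BLOCK_SIZE) {
        if (write(fd, zeros, BLOCK_SIZE) != BLOCK_SIZE) {
            close(fd);
            unlink(path);           /* do not leave a partial file behind */
            return -1;
        }
    }
    if (fsync(fd) != 0) {
        close(fd);
        return -1;
    }
    return close(fd);
}
```

Growing the file with truncate() instead would be cheaper but, as described
above, would leave a hole that the first direct write has to fill.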

A write() with O_DIRECT only means the data has been transferred to
the disk. Depending on the file system and mount options, it does
not mean the metadata for the file has been written to disk (see the
fsync man page). fsync() will guarantee the data and metadata have
been written to disk.

Lastly, if a disk has a write-back cache, an O_DIRECT write() does not
guarantee that the disk has put the data on the physical media.
I think some of the journaling file systems now support i/o barriers
on commit which will flush the disk write-back cache. (I'm still
looking at the kernel code to see how this is done).

Conclusion:

O_DIRECT + fsync() can make sense. It avoids the copying of data
to the page cache before being written and will also guarantee
that the file's metadata is also written to disk. It also
prevents the page cache from filling up with write data that
will never be read (I assume it is only read if a recovery
is necessary - which should be rare). It can also
help disks with a write-back cache when using journaling
file systems that use i/o barriers. You would want to use
large writes, since the kernel page cache won't be writing
multiple pages for you.

I need to look at the kernel code more to comment on O_DIRECT with
O_SYNC.

Questions:

Does the database transaction logger preallocate the log file?

Does the logger care about the order in which each write hits the disk?

Now someone else can comment on my comments.

Daniel


From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: Mark Wong <markw(at)osdl(dot)org>
Cc: ITAGAKI Takahiro <itagaki(dot)takahiro(at)lab(dot)ntt(dot)co(dot)jp>, pgsql-patches(at)postgresql(dot)org, Daniel McNeil <daniel(at)osdl(dot)org>, Mark Haverkamp <markh(at)osdl(dot)org>
Subject: Re: [HACKERS] O_DIRECT for WAL writes
Date: 2005-08-09 22:21:00
Message-ID: 200508092221.j79ML0j01069@candle.pha.pa.us
Lists: pgsql-hackers pgsql-patches

Mark Wong wrote:
> O_DIRECT + fsync() can make sense. It avoids the copying of data
> to the page cache before being written and will also guarantee
> that the file's metadata is also written to disk. It also
> prevents the page cache from filling up with write data that
> will never be read (I assume it is only read if a recovery
> is necessary - which should be rare). It can also
> helps disks with write back cache when using the journaling
> file system that use i/o barriers. You would want to use
> large writes, since the kernel page cache won't be writing
> multiple pages for you.

Right, but it seems O_DIRECT is pretty much the same as O_DIRECT with
O_DSYNC because the data is always written to disk on write(). Our
logic is that there is nothing for fdatasync to do in most cases after
using O_DIRECT, so the O_DIRECT/fdatasync() combination doesn't make
sense.

And FreeBSD, and perhaps others, need O_SYNC or fdatasync with O_DIRECT
because O_DIRECT doesn't force stuff to disk in all cases.

> I need to look at the kernel code more to comment on O_DIRECT with
> O_SYNC.
>
> Questions:
>
> Does the database transaction logger preallocate the log file?

Yes.

> Does the logger care about the order in which each write hits the disk?

Not really.
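
On the preallocation answer: a minimal sketch of what zero-filling a log
file up front can look like. This is not the actual xlog.c code; the helper
names preallocate_log and log_size and the 8192-byte block size are made up
for illustration. The point is that once the file already has its full
length, later log writes overwrite existing blocks instead of extending the
file.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

#define LOG_BLOCK 8192          /* assumed block size for this sketch */

/*
 * Hypothetical helper: create `path` and fill it with `nblocks` zeroed
 * blocks, then fsync so the new length is on disk before the file is used.
 * Returns 0 on success, -1 on failure.
 */
static int
preallocate_log(const char *path, size_t nblocks)
{
    char    zeros[LOG_BLOCK];
    size_t  i;
    int     fd;

    assert(path != NULL);

    fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;

    memset(zeros, 0, sizeof zeros);
    for (i = 0; i < nblocks; i++)
    {
        if (write(fd, zeros, sizeof zeros) != (ssize_t) sizeof zeros)
        {
            close(fd);
            return -1;
        }
    }

    /* full fsync here: the file length (metadata) has just changed */
    if (fsync(fd) != 0)
    {
        close(fd);
        return -1;
    }
    return close(fd);
}

/* Hypothetical helper: file size in bytes, or -1 if stat() fails */
static long
log_size(const char *path)
{
    struct stat st;

    return stat(path, &st) == 0 ? (long) st.st_size : -1;
}
```

With the file sized this way, a later sync only has to cover data blocks,
which is the case where a data-only sync like fdatasync() is enough.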

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073


From: Mark Wong <markw(at)osdl(dot)org>
To: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc: ITAGAKI Takahiro <itagaki(dot)takahiro(at)lab(dot)ntt(dot)co(dot)jp>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-patches <pgsql-patches(at)postgresql(dot)org>
Subject: Re: [HACKERS] O_DIRECT for WAL writes
Date: 2005-08-11 20:31:44
Message-ID: 200508112031.j7BKVFjA003387@smtp.osdl.org
Lists: pgsql-hackers pgsql-patches

Ok, I finally got a couple of tests done against CVS from Aug 3, 2005.
I'm not sure if I'm showing anything insightful though. I've learned
that fdatasync and O_DSYNC are simply fsync and O_SYNC respectively on
Linux, which you guys may have already known. There appears to be a
fair performance decrease in using open_sync. Just to double check, am
I correct in understanding only open_sync uses O_DIRECT?

fdatasync
http://www.testing.osdl.org/projects/dbt2dev/results/dev4-015/38/
5462 notpm

open_sync
http://www.testing.osdl.org/projects/dbt2dev/results/dev4-015/40/
4860 notpm

Mark


From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: Mark Wong <markw(at)osdl(dot)org>
Cc: ITAGAKI Takahiro <itagaki(dot)takahiro(at)lab(dot)ntt(dot)co(dot)jp>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-patches <pgsql-patches(at)postgresql(dot)org>
Subject: Re: [HACKERS] O_DIRECT for WAL writes
Date: 2005-08-11 20:36:10
Message-ID: 200508112036.j7BKaAX09363@candle.pha.pa.us
Lists: pgsql-hackers pgsql-patches

Mark Wong wrote:
> Ok, I finally got a couple of tests done against CVS from Aug 3, 2005.
> I'm not sure if I'm showing anything insightful though. I've learned
> that fdatasync and O_DSYNC are simply fsync and O_SYNC respectively on
> Linux, which you guys may have already known. There appears to be a

That is not what we thought for Linux, but many other OSes behave that
way.

> fair performance decrease in using open_sync. Just to double check, am
> I correct in understanding only open_sync uses O_DIRECT?

Right.

> fdatasync
> http://www.testing.osdl.org/projects/dbt2dev/results/dev4-015/38/
> 5462 notpm
>
> open_sync
> http://www.testing.osdl.org/projects/dbt2dev/results/dev4-015/40/
> 4860 notpm

Right now open_sync is our last choice, which seems to still be valid
for Linux, at least.

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073