Proposing correction to posix_fadvise() usage in xlog.c

Lists: pgsql-patches
From: "Mark Wong" <markwkm(at)gmail(dot)com>
To: pgsql-patches(at)postgresql(dot)org
Subject: Proposing correction to posix_fadvise() usage in xlog.c
Date: 2008-03-01 05:40:24
Message-ID: 70c01d1d0802292140h2358ad31g2122e972dfc080cd@mail.gmail.com

I believe I have a correction to the usage of posix_fadvise() in
xlog.c. Basically, posix_fadvise() is being called right before the
WAL segment file is closed, which effectively does nothing; the advice
needs to be given when the file is opened instead. This proposed
correction calls posix_fadvise() in three locations in order to make
sure POSIX_FADV_DONTNEED is set correctly, since there are three cases for
opening a WAL segment file for writing.
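
To make the placement concrete, here is a minimal sketch of giving the
advice right after the file is opened rather than just before close().
The helper name open_wal_segment_with_advice() is made up for this
illustration and is not one of the functions in xlog.c or in the
attached patch. The call is guarded because posix_fadvise() is not
available everywhere, and the sketch only shows where the call goes; it
does not demonstrate that the advice has any lasting effect, since how
the kernel treats it is platform-specific.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L	/* for posix_fadvise() under strict modes */
#endif
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>				/* write()/close() for callers */

/*
 * Hypothetical helper: open a WAL segment file for writing and issue
 * POSIX_FADV_DONTNEED immediately after the open.
 * Returns the file descriptor, or -1 if open() fails.
 */
int
open_wal_segment_with_advice(const char *path)
{
	int			fd;

	assert(path != NULL);

	fd = open(path, O_RDWR | O_CREAT, 0600);
	if (fd < 0)
		return -1;

#if defined(POSIX_FADV_DONTNEED)
	/*
	 * offset 0 with len 0 covers the whole file.  The result is ignored
	 * because this is only a hint to the kernel.
	 */
	(void) posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
#endif

	return fd;
}
```

The same call would have to be repeated at each of the three places
where a segment file is opened for writing.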

I'm hesitant to post any data I have because I only have a small PC
with a SATA drive in it. My hardware knowledge of SATA controllers
and drives is a little weak, but my testing with dbt-2 shows
performance dropping. I am guessing that SATA drives have write cache
enabled by default so it seems to make sense that using
POSIX_FADV_DONTNEED will cause writes to be slower by writing through
the disk cache. Again, assuming that is possible with SATA hardware.

If memory serves, one of the wins here is supposed to be that in a
scenario where we are not expecting to re-read writes to the WAL we
also do not want the writes to disk to flush out other data from the
operating system disk cache. But I'm not sure how best to test
correctness.

Anyway, I hope I'm not way off but I'm sure someone will correct me. :)

Regards,
Mark

Attachment Content-Type Size
pgsql-log-fadvise.patch application/octet-stream 2.3 KB

From: Greg Smith <gsmith(at)gregsmith(dot)com>
To: Mark Wong <markwkm(at)gmail(dot)com>
Cc: pgsql-patches(at)postgresql(dot)org
Subject: Re: Proposing correction to posix_fadvise() usage in xlog.c
Date: 2008-03-01 10:06:27
Message-ID: Pine.GSO.4.64.0803010438530.29368@westnet.com
Lists: pgsql-patches

On Fri, 29 Feb 2008, Mark Wong wrote:

> Basically posix_fadvise() is being called right before the WAL segment
> file is closed, which effectively doesn't do anything as opposed to when
> the file is opened. This proposed correction calls posix_fadvise() in
> three locations...

Actually, posix_fadvise is called nowhere; the one place it's referenced
is #ifdef'd out. There's a comment in the code just above there as to
why: posix_fadvise only works on a limited number of platforms and as far
as I know nobody has ever put the time into figuring out where it's safe
or not safe so that configure can be taught that. I think this may be a
dead item because most places where posix_fadvise works correctly, you can
use O_SYNC and get O_DIRECT right now to do the same thing.

> If memory serves, one of the wins here is supposed to be that in a
> scenario where we are not expecting to re-read writes to the WAL we also
> do not want the writes to disk to flush out other data from the
> operating system disk cache.

Right, but you can get that already with O_SYNC on platforms where
O_DIRECT is supported.
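
As a rough illustration of what that combination looks like, here is a
sketch that opens a file with O_SYNC and adds O_DIRECT where the
platform defines it, falling back to plain O_SYNC if the direct open
is refused. The names open_wal_sync_direct(), write_wal_block() and
WAL_BLOCK_SIZE are invented for this example and are not taken from
xlog.c. Note that O_DIRECT generally wants the buffer, length and
offset aligned, which is why the write goes through posix_memalign().

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE				/* glibc exposes O_DIRECT only with this */
#endif
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define WAL_BLOCK_SIZE 8192		/* assumed block size for this sketch */

/*
 * Hypothetical helper: open with O_SYNC, plus O_DIRECT where available.
 * *used_direct is set to 1 only if the O_DIRECT open succeeded.
 */
int
open_wal_sync_direct(const char *path, int *used_direct)
{
	int			flags = O_RDWR | O_CREAT | O_SYNC;
	int			fd = -1;

	assert(path != NULL && used_direct != NULL);
	*used_direct = 0;

#ifdef O_DIRECT
	fd = open(path, flags | O_DIRECT, 0600);
	if (fd >= 0)
		*used_direct = 1;
#endif
	if (fd < 0)
		fd = open(path, flags, 0600);	/* fall back to O_SYNC alone */
	return fd;
}

/*
 * Write one zero-padded block at offset 0 from an aligned buffer.
 * Returns 0 on success, -1 on failure.
 */
int
write_wal_block(int fd, const char *data, size_t len)
{
	void	   *buf;
	ssize_t		n;

	if (len > WAL_BLOCK_SIZE)
		return -1;
	if (posix_memalign(&buf, WAL_BLOCK_SIZE, WAL_BLOCK_SIZE) != 0)
		return -1;
	memset(buf, 0, WAL_BLOCK_SIZE);
	memcpy(buf, data, len);
	n = pwrite(fd, buf, WAL_BLOCK_SIZE, 0);
	free(buf);
	return (n == WAL_BLOCK_SIZE) ? 0 : -1;
}
```

With O_DIRECT in effect the written blocks bypass the operating system
cache, so they do not push other data out of it.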

There's a related TODO here which is to use directio(3C) on Solaris, which
Jignesh reports is needed instead of O_DIRECT to get the same behavior on
that platform.

> I am guessing that SATA drives have write cache enabled by default so it
> seems to make sense that using POSIX_FADV_DONTNEED will cause writes to
> be slower by writing through the disk cache.

I've never heard of a SATA drive that had its write cache disabled by
default. They're always on unless you force them off, and even then they
can turn themselves back on again if there's a device reset and you didn't
change the drive's default using the manufacturer's adjustment utility.

--
* Greg Smith gsmith(at)gregsmith(dot)com http://www.gregsmith.com Baltimore, MD