Re: Filesystem benchmarking for pg 8.3.3 server

From: Ron Mayer <rm_pg(at)cheapcomplexdevices(dot)com>
To: Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com>
Cc: Ron Mayer <rm_pg(at)cheapcomplexdevices(dot)com>, Greg Smith <gsmith(at)gregsmith(dot)com>, Henrik <henke(at)mac(dot)se>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Filesystem benchmarking for pg 8.3.3 server
Date: 2008-08-13 22:30:40
Message-ID: 48A36090.6070309@cheapcomplexdevices.com
Lists: pgsql-performance

Scott Marlowe wrote:
>IDE came up corrupted every single time.
Greg Smith wrote:
> you've drank the kool-aid ... completely
> ridiculous ...unsafe fsync ... md0 RAID-1
> array (aren't there issues with md and the barriers?)

Alright - I'll eat my words. Or mostly.

I still haven't found IDE drives that lie; but
from the testing I've done today, I'm starting to
think that:

1a) ext3 fsync() seems to lie badly.
1b) but ext3 can be tricked not to lie (but not
in the way you might think).
2a) md raid1 fsync() sometimes doesn't actually
sync
2b) I can't trick it not to.
3a) some IDE drives don't even pretend to support
letting you know when their cache is flushed
3b) but the kernel will happily tell you about
any such devices, including md raid ones.

In more detail. I tested on a number of systems
and disks including new (this year) and old (1997)
IDE drives; and EXT3 with and without the "barrier=1"
mount option.

First off - some IDE drives don't even support the
relatively recent ATA command that apparently lets
the software know when a cache flush is complete.
On those you will get messages like these in your
system logs:
%dmesg | grep 'disabling barriers'
JBD: barrier-based sync failed on md1 - disabling barriers
JBD: barrier-based sync failed on hda3 - disabling barriers
and
%hdparm -I /dev/hdf | grep FLUSH_CACHE_EXT
will not show you anything on those devices.
IMHO that's cool; and doesn't count as a lying IDE drive
since it didn't claim to support this.

Second of all - ext3 fsync() appears to me to
be *extremely* stupid. It only seems to do the
correct flushing (and waiting) for a drive's
cache to be flushed when a file's inode has changed.
For example, in the test program below, it will happily
do a real fsync (i.e. the program takes a couple of seconds
to run) so long as the "fchmod()" statements are in
there. It will *NOT* wait on my system if I comment those
fchmod()'s out. Sadly, I get the same behavior with and
without the ext3 barrier=1 mount option. :(
==========================================================
/*
** based on http://article.gmane.org/gmane.linux.file-systems/21373
** http://thread.gmane.org/gmane.linux.kernel/646040
*/
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>

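int main(int argc, char *argv[]) {
    if (argc < 2) {
        printf("usage: fs <filename>\n");
        exit(1);
    }
    int fd = open(argv[1], O_RDWR | O_CREAT | O_TRUNC, 0666);
    if (fd < 0) {
        perror("open");
        exit(1);
    }
    int i;
    for (i = 0; i < 100; i++) {
        char byte = 0;                       /* initialized so the write is well defined */
        pwrite(fd, &byte, 1, 0);             /* rewrite the first byte in place */
        fchmod(fd, 0644); fchmod(fd, 0664);  /* touch the inode; comment out to compare */
        fsync(fd);
    }
    close(fd);
    return 0;
}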
==========================================================
Since it does indeed wait when the inode's touched, I think
it suggests that it's not the hard drive that's lying, but
rather ext3.

So I take back what I said about linux and write barriers
being sane. They're not.

But AFAICT, all the (6 different) IDE drives I've seen work
as advertised, and the kernel happily spews boot
messages when it finds one that doesn't support knowing
when a cache flush finished.
